ETL
25 questions about ETL (Glue, EMR, Kinesis Data Firehose, Lambda, Athena, Lake Formation) for exam preparation.
A company needs to transform CSV files stored in S3 to Parquet format, apply data cleansing, and automatically catalog the schema. Which service provides this ETL capability with the LOWEST operational overhead?
Category: Design High-Performing Architectures
Explanation
Detailed breakdown of the correct answer
AWS Glue ETL
AWS Glue is a fully managed serverless ETL service that automates data discovery, preparation, and combination.
Glue can generate PySpark or Scala job scripts, runs jobs on managed Spark infrastructure, and records table metadata in the Glue Data Catalog (typically through crawlers). This makes it well suited to transformations where the team does not want to manage servers.
Therefore, the correct answer is: AWS Glue with ETL jobs and automatic Data Catalog.
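To make the "no servers to manage" point concrete, here is a minimal sketch of what defining such a job could look like. It only builds the keyword arguments in the shape expected by boto3's `glue.create_job` call and does not contact AWS. The helper name `build_glue_job_args`, the job name, the role ARN, the script location, and the chosen worker settings are all assumptions for illustration, not values from the question.

```python
def build_glue_job_args(job_name, role_arn, script_s3_uri, workers=2):
    """Return keyword arguments shaped for boto3's glue.create_job call.

    Hypothetical helper: it only assembles a dict, so it runs without AWS access.
    """
    if not script_s3_uri.startswith("s3://"):
        raise ValueError("the job script must be referenced by an s3:// URI")
    return {
        "Name": job_name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",            # Spark-based ETL job type
            "ScriptLocation": script_s3_uri,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",             # assumed version for this sketch
        "WorkerType": "G.1X",             # assumed worker size
        "NumberOfWorkers": workers,
        "DefaultArguments": {"--job-language": "python"},
    }


# Hypothetical names and ARN, used only to show the shape of the result.
job_args = build_glue_job_args(
    job_name="csv-to-parquet",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    script_s3_uri="s3://example-bucket/scripts/csv_to_parquet.py",
)
```

With AWS credentials configured, these arguments could be passed as `boto3.client("glue").create_job(**job_args)`. Note that the definition names a worker type and count but no cluster, instance, or operating system, which is where the lower operational overhead comes from.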
The option that says: EMR cluster with Spark is incorrect because it requires manually provisioning, configuring, and managing clusters, which means higher operational overhead than serverless Glue.
The option that says: Lambda functions is incorrect because Lambda has a 15-minute execution limit and limited memory (up to 10 GB), which makes it inadequate for ETL transformations of large files.
The option that says: EC2 with Python scripts is incorrect because it requires more management (patching, scaling, monitoring) and full ETL code development, and it does not provide an automatic Data Catalog.
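The Lambda limit mentioned above can be checked with simple arithmetic. The sketch below estimates whether a single invocation could finish a file within the 15-minute timeout. The function names and the 20 MB/s processing throughput are assumptions chosen for illustration; real throughput depends on the transformation and the configured memory.

```python
LAMBDA_TIMEOUT_SECONDS = 15 * 60  # maximum execution time of one Lambda invocation


def estimated_runtime_seconds(file_size_mb, throughput_mb_per_s):
    """Estimate processing time as file size divided by sustained throughput."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return file_size_mb / throughput_mb_per_s


def fits_in_lambda_timeout(file_size_mb, throughput_mb_per_s):
    """Return True if the estimated runtime stays within the Lambda timeout."""
    runtime = estimated_runtime_seconds(file_size_mb, throughput_mb_per_s)
    return runtime <= LAMBDA_TIMEOUT_SECONDS


ASSUMED_THROUGHPUT = 20  # MB/s, hypothetical figure for this sketch

small_file_ok = fits_in_lambda_timeout(500, ASSUMED_THROUGHPUT)        # 500 MB file
large_file_ok = fits_in_lambda_timeout(50 * 1024, ASSUMED_THROUGHPUT)  # 50 GB file
```

Under this assumed throughput, the 500 MB file takes about 25 seconds and fits, while the 50 GB file would need 2,560 seconds, well beyond the 900-second limit. This estimate considers time only; the memory limit is a separate constraint.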