⚙️ ETL and Big Data - Glue, EMR, Kinesis
Glue, EMR, Data Pipeline and data processing
⏱️ Estimated reading time: 25 minutes
AWS Glue - Serverless ETL
AWS Glue is a serverless ETL (Extract, Transform, Load) service to prepare and transform data for analysis.
Main Components:
Glue Data Catalog:
- Central metadata repository
- Stores schemas, tables, partitions, locations
- Used by Athena, Redshift Spectrum, EMR
- Automatic schema versioning
Glue Crawlers:
- Automatically scans data sources
- Infers schemas and creates tables in Data Catalog
- Detects partitions and updates metadata
- Supports S3, RDS, DynamoDB, JDBC sources
Glue ETL Jobs:
- Auto-generated Spark or Python code
- Transforms data between formats
- Auto-scales based on data volume
- Pay per second of DPU (Data Processing Unit) used
Features:
- Serverless: No infrastructure to manage
- Auto-scaling: Adjusts workers automatically
- Job bookmarks: Processes only new data
- Development: Visual ETL designer or code editor
🎯 Key Points
- ✅ Glue Data Catalog is AWS's central metadata store
- ✅ Crawlers run on-demand or on a schedule
- ✅ ETL jobs can be Spark (Scala/Python) or Python Shell
- ✅ Glue Studio offers a visual drag-and-drop interface
- ✅ Job bookmarks prevent re-processing of data
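The crawler workflow described above can also be driven from code. The sketch below is a minimal example using boto3; the crawler name, IAM role, database, and S3 path are illustrative assumptions, not values from this guide.
💻 Create and run a Glue crawler (boto3)
# Create a crawler over an S3 prefix, run it, and check its state
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="sales-crawler",                                   # hypothetical crawler name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # assumed existing IAM role
    DatabaseName="my_database",                             # target Data Catalog database
    Targets={"S3Targets": [{"Path": "s3://my-bucket/sales-csv/"}]},
    Schedule="cron(0 2 * * ? *)",                           # optional: daily at 02:00 UTC
)

glue.start_crawler(Name="sales-crawler")                    # run on demand as well

# Tables and partitions appear in the Data Catalog once the crawler finishes
print(glue.get_crawler(Name="sales-crawler")["Crawler"]["State"])  # RUNNING / READY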
💻 Glue job to transform data
# Glue ETL Job - Convert CSV to Parquet
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Read data from Data Catalog
datasource = glueContext.create_dynamic_frame.from_catalog(
database = "my_database",
table_name = "sales_csv"
)
# Transform: keep only rows with a positive quantity
transformed = datasource.filter(lambda x: x["quantity"] > 0)
# Write as Snappy-compressed Parquet to S3
glueContext.write_dynamic_frame.from_options(
frame = transformed,
connection_type = "s3",
connection_options = {"path": "s3://my-bucket/sales-parquet/"},
format = "parquet",
format_options = {"compression": "snappy"},
transformation_ctx = "write_parquet"
)
job.commit()
AWS Glue DataBrew
Glue DataBrew is a visual no-code data preparation service to clean and normalize data.
Features:
- Visual: Point-and-click no-code interface
- 250+ transformations: Cleaning, normalization, filtering
- Data profiling: Automatic data quality analysis
- Recipes: Reusable transformation sequences
- Preview: See results before applying to full dataset
Use Cases:
- Clean data before analysis or ML
- Normalize formats (dates, names, addresses)
- Detect and handle missing values
- Identify outliers and anomalies
- Prepare data without writing code
Differences from Glue ETL:
- DataBrew: Visual, no-code, for analysts
- Glue ETL: Code-based, more flexible, for engineers
- DataBrew generates Recipes that can be automated
- Both can integrate into pipelines
🎯 Key Points
- ✅ DataBrew is ideal for analysts without programming experience
- ✅ Data profiling automatically identifies quality issues
- ✅ Recipes can be exported and reused
- ✅ Supports sampling to work with subsets of large datasets
- ✅ Can write results to S3, Redshift, Data Catalog
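Recipe jobs built in the DataBrew console can also be triggered and monitored from code, which is how they are usually automated inside a pipeline. The sketch below assumes a recipe job already exists; its name is a placeholder.
💻 Run an existing DataBrew recipe job (boto3)
# Start a DataBrew recipe job and read back its state
import boto3

databrew = boto3.client("databrew")

run = databrew.start_job_run(Name="clean-sales-recipe-job")   # hypothetical job name

state = databrew.describe_job_run(
    Name="clean-sales-recipe-job",
    RunId=run["RunId"],
)["State"]
print(state)  # RUNNING, then SUCCEEDED once the output lands in S3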
Amazon EMR - Elastic MapReduce
Amazon EMR is a big data platform to process large amounts of data using open-source frameworks like Hadoop, Spark, Hive, Presto.
Supported Frameworks:
- Apache Spark: In-memory processing, ML, streaming
- Apache Hadoop: MapReduce, HDFS
- Apache Hive: SQL on Hadoop
- Apache HBase: Distributed NoSQL database
- Presto: Interactive SQL queries
- Flink, Phoenix, Zeppelin, Livy
Architecture:
- Master Node: Coordinates cluster, manages YARN ResourceManager
- Core Nodes: Execute tasks, store data in HDFS
- Task Nodes: Only execute tasks, no storage (optional)
Deployment Options:
- Transient clusters: Created, process data, terminated
- Long-running clusters: Stay active for continuous workloads
- EMR Serverless: No cluster management, auto-scaling
- EMR on EKS: Run Spark jobs on Kubernetes
Storage:
- HDFS: Local on Core Nodes (ephemeral)
- EMRFS: S3 as persistent storage (recommended)
- Instance Store: For temporary results
- EBS: For Core Nodes HDFS
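For the EMR Serverless deployment option listed above there is no cluster to create at all: you submit jobs to an application and AWS provisions capacity. The sketch below is a minimal boto3 example; the application ID, role ARN, and S3 paths are placeholders.
💻 Run a Spark job on EMR Serverless (boto3)
# Submit a Spark script to an existing EMR Serverless application
import boto3

emr_serverless = boto3.client("emr-serverless")

run = emr_serverless.start_job_run(
    applicationId="00abc123def456",                          # hypothetical application ID
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/spark-job.py",
            "entryPointArguments": ["s3://my-bucket/input/", "s3://my-bucket/output/"],
        }
    },
)

state = emr_serverless.get_job_run(
    applicationId="00abc123def456",
    jobRunId=run["jobRunId"],
)["jobRun"]["state"]
print(state)  # SUBMITTED -> RUNNING -> SUCCESS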
🎯 Key Points
- ✅ EMR is cheaper than running self-managed Spark/Hadoop on EC2
- ✅ Use S3 (EMRFS) for persistent storage, not HDFS
- ✅ Spot Instances are ideal for Task Nodes (no risk of HDFS data loss)
- ✅ EMR Notebooks for interactive development with Spark
- ✅ Auto-scaling can add/remove Task Nodes based on load
💻 Create and run jobs on EMR
# Create EMR cluster with Spark
aws emr create-cluster \
    --name "My-Spark-Cluster" \
    --release-label emr-6.10.0 \
    --applications Name=Spark Name=Hive \
    --instance-type m5.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --log-uri s3://my-bucket/emr-logs/ \
    --bootstrap-actions Path=s3://my-bucket/bootstrap.sh
# Add step (job) to cluster
aws emr add-steps \
    --cluster-id j-XXXXXXXXXXXXX \
    --steps Type=Spark,Name="Data processing",ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,--master,yarn,s3://my-bucket/spark-job.py,s3://my-bucket/input/,s3://my-bucket/output/]
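The add-steps command above submits s3://my-bucket/spark-job.py to the cluster. A script of that kind might look like the sketch below; the column name and filter logic are illustrative assumptions.
💻 Example spark-job.py submitted as an EMR step
# Read CSV from the input path, filter rows, and write Parquet to the output path
import sys
from pyspark.sql import SparkSession

input_path, output_path = sys.argv[1], sys.argv[2]   # passed via the step Args

spark = SparkSession.builder.appName("Data processing").getOrCreate()

df = spark.read.option("header", "true").csv(input_path)
df_clean = df.filter(df["quantity"] > 0)             # hypothetical column

# With EMRFS, Spark writes straight to S3 as persistent storage
df_clean.write.mode("overwrite").parquet(output_path)

spark.stop()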
Amazon Kinesis Data Streams
Kinesis Data Streams enables capturing, processing, and storing real-time data streams.
Key Concepts:
- Stream: Set of shards processing data
- Shard: Capacity unit (1MB/s in, 2MB/s out)
- Producer: Sends records to the stream (SDK, KPL, Kinesis Agent)
- Consumer: Reads records from the stream (Lambda, EC2, Firehose, Kinesis Data Analytics)
- Record: Data + Partition Key + Sequence Number
Capacity:
- Provisioned mode: Manually specify number of shards
- On-demand mode: Automatically scales based on throughput
- 1 shard: 1MB/s in, 2MB/s out, 1000 records/s
Retention:
- Default: 24 hours
- Extendible: up to 365 days
- Replay: Re-process historical data
Partition Key:
- Determines which shard gets the record
- Same partition key = same shard (order guaranteed)
- Distributing keys uniformly prevents hot shards
Use Cases:
- Real-time analytics (live dashboards)
- Log aggregation
- IoT data ingestion
- Clickstream analysis
🎯 Key Points
- ✅ Kinesis Data Streams is NOT serverless in provisioned mode (you provision shards)
- ✅ Data persists for the retention period (records are not deleted when consumed)
- ✅ Enhanced Fan-Out gives each consumer dedicated 2 MB/s per shard throughput
- ✅ KCL (Kinesis Client Library) simplifies consumption with checkpointing
- ✅ ProvisionedThroughputExceeded = need more shards or a better partition key
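The producer/consumer flow and the role of the partition key can be seen end to end with the SDK. The sketch below assumes a stream named "clickstream" already exists; real consumers usually rely on the KCL or Lambda rather than raw get_records calls.
💻 Produce and consume records with boto3
# Producer: the partition key decides the shard, so one user's events stay ordered
import json
import boto3

kinesis = boto3.client("kinesis")

kinesis.put_record(
    StreamName="clickstream",                                 # hypothetical stream
    Data=json.dumps({"user": "u-42", "action": "click"}).encode(),
    PartitionKey="u-42",                                      # same key -> same shard -> same order
)

# Consumer: read one shard starting from the oldest retained record
shard_id = kinesis.list_shards(StreamName="clickstream")["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="clickstream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

for record in kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]:
    print(record["SequenceNumber"], record["Data"])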
Amazon Kinesis Data Firehose
Kinesis Data Firehose is a fully managed service to load streaming data to storage and analytics destinations.
Features:
- Fully managed: No shard or scaling management
- Near real-time: minimum buffer of 60 seconds or 1 MB before delivery
- Auto-scaling: Handles any throughput
- Transformation: Lambda can transform records in transit
- Conversion: Can convert formats (JSON to Parquet/ORC)
Destinations:
- AWS: S3, Redshift, Amazon OpenSearch Service
- Third-party: Datadog, Splunk, New Relic, MongoDB
- HTTP Endpoints: Custom destinations
Buffer Configuration:
- Buffer size: 1MB - 128MB
- Buffer interval: 60s - 900s
- Firehose waits until size OR interval met (whichever first)
Transformations:
- Lambda can modify records before delivery
- Built-in conversion: JSON → Parquet/ORC
- Compression: GZIP, ZIP, Snappy
Difference from Kinesis Data Streams:
- Firehose = Delivery to destinations (load)
- Data Streams = Custom real-time processing
🎯 Key Points
- ✅ Firehose is serverless and fully managed
- ✅ Near real-time (not true real-time like Data Streams)
- ✅ Doesn't store data (only delivers to the destination)
- ✅ For Redshift: first writes to S3, then COPY to Redshift
- ✅ Failed records can be sent to an S3 backup bucket
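From a producer's point of view, Firehose is just an API call: you send records and the service handles buffering, transformation, and delivery. The sketch below assumes a delivery stream pointing at S3 already exists; the stream name is a placeholder.
💻 Send records to a Firehose delivery stream (boto3)
# Single records and a small batch; Firehose flushes when a buffer threshold is hit
import json
import boto3

firehose = boto3.client("firehose")

firehose.put_record(
    DeliveryStreamName="orders-to-s3",                        # hypothetical delivery stream
    Record={"Data": json.dumps({"order_id": 1, "total": 9.99}).encode() + b"\n"},
)

# PutRecordBatch accepts up to 500 records per call and is cheaper for many small events
events = [{"Data": json.dumps({"order_id": i}).encode() + b"\n"} for i in range(2, 10)]
response = firehose.put_record_batch(DeliveryStreamName="orders-to-s3", Records=events)
print(response["FailedPutCount"])                             # re-send any failed records yourself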
AWS Data Pipeline vs Glue vs EMR
AWS offers multiple services for data processing. It's important to know when to use each.
AWS Data Pipeline:
- Data workflow orchestration
- Moves and transforms data between AWS services
- Based on EC2 or EMR instances (not serverless)
- Schedule-based execution
- Use cases: Regular backups, legacy batch processing
AWS Glue:
- Serverless ETL
- Ideal for simple to medium transformations
- Central Data Catalog
- Auto-scaling, pay per second
- Use cases: Data lake preparation, format conversion
Amazon EMR:
- Big data processing with open-source frameworks
- Requires cluster management (not serverless)
- Maximum control and flexibility
- Use cases: Complex ML, massive processing with Spark/Hadoop
When to use each:
- Glue: SQL-like transformations, serverless ETL, moderate cost
- EMR: Complex Spark/Hadoop jobs, ML, full control
- Data Pipeline: Legacy orchestration, on-prem → AWS migrations
🎯 Key Points
- ✅ Glue is the simplest, serverless option for ETL
- ✅ EMR for workloads requiring the full Spark/Hadoop stack
- ✅ Data Pipeline is legacy; consider Step Functions + Glue instead
- ✅ For streaming: Kinesis Data Streams or Firehose
- ✅ For batch: Glue (serverless) or EMR (more powerful)
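As a concrete example of replacing Data Pipeline-style orchestration, a Step Functions task, Lambda, or simple scheduler only needs to start a Glue job run and wait for it to finish. The sketch below is a minimal version of that pattern; the job name and argument are placeholders.
💻 Trigger a Glue job run programmatically (boto3)
# Start a Glue ETL job and poll until it reaches a terminal state
import time
import boto3

glue = boto3.client("glue")

run = glue.start_job_run(
    JobName="csv-to-parquet",                     # hypothetical Glue job
    Arguments={"--input_date": "2024-01-01"},     # readable in the job via getResolvedOptions
)

while True:
    state = glue.get_job_run(JobName="csv-to-parquet", RunId=run["JobRunId"])["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)

print(state)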