3.7 Data Ingestion and Transformation — Kinesis, Glue, and EMR
Key Takeaways
- Kinesis Data Streams ingests real-time streaming data with sub-second latency; consumers process data in real time (Lambda, KCL applications).
- Kinesis Data Firehose delivers streaming data to destinations (S3, Redshift, OpenSearch, HTTP) with automatic batching and optional transformation via Lambda.
- AWS Glue is a serverless ETL service with a Data Catalog for metadata management, crawlers for schema discovery, and Spark-based jobs for data transformation.
- Amazon EMR runs Apache Spark, Hadoop, Hive, and Presto on managed clusters for big data processing at petabyte scale.
- Use Kinesis for real-time streaming; Glue for serverless ETL; EMR for complex big data processing with Apache frameworks.
Quick Answer: Kinesis Data Streams = real-time ingestion (you manage consumers). Kinesis Data Firehose = near-real-time delivery to S3/Redshift/OpenSearch (fully managed). Glue = serverless ETL + Data Catalog. EMR = managed Spark/Hadoop clusters for big data. Choose based on latency requirements and operational overhead tolerance.
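
The selection rule in the quick answer can be written as a small decision helper. This is only a rule-of-thumb sketch: the `Requirements` fields and the `suggest_service` function are hypothetical names introduced here, and real workloads usually weigh more factors (cost, data volume, team skills).

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    streaming: bool              # data arrives continuously, not in batches
    custom_consumers: bool       # you need your own real-time processing logic
    needs_cluster_control: bool  # custom software or non-Spark frameworks


def suggest_service(req: Requirements) -> str:
    """Map coarse requirements to the service this section recommends."""
    if req.streaming:
        # Real-time processing with your own consumers vs. managed delivery.
        if req.custom_consumers:
            return "Kinesis Data Streams"
        return "Kinesis Data Firehose"
    # Batch-style transformation: managed clusters vs. serverless ETL.
    if req.needs_cluster_control:
        return "Amazon EMR"
    return "AWS Glue"
```

For example, `suggest_service(Requirements(streaming=True, custom_consumers=False, needs_cluster_control=False))` returns `"Kinesis Data Firehose"`, matching the "load streaming data with minimal management" case.
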
Amazon Kinesis
Amazon Kinesis is a family of AWS services for collecting, processing, and analyzing streaming data in real time.
Kinesis Services Comparison
| Service | Purpose | Latency | Management |
|---|---|---|---|
| Kinesis Data Streams | Real-time data ingestion and processing | ~200 ms | You manage consumers |
| Kinesis Data Firehose | Delivery to destinations (S3, Redshift, etc.) | Near real-time (buffered; historically a 60-second minimum) | Fully managed, no consumers to build |
| Kinesis Data Analytics | Real-time analytics with SQL or Apache Flink | Seconds | Managed analytics engine |
| Kinesis Video Streams | Ingest and process video streams | Seconds | Managed video ingestion |
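
The latency gap between Data Streams and Firehose in the table above comes from buffering: Firehose accumulates records and delivers a batch once a size or time threshold is reached, whichever comes first. The sketch below models that rule in plain Python. The `DeliveryBuffer` class is hypothetical, and unlike the real service it only checks the timer when a new record arrives.

```python
class DeliveryBuffer:
    """Toy model of size-or-time batching (not the Firehose implementation)."""

    def __init__(self, max_bytes: int, max_seconds: float):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # time the first record of the batch arrived

    def add(self, record: bytes, now: float):
        """Buffer a record; return the batch if a threshold was hit, else None."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        size_hit = self.size >= self.max_bytes
        time_hit = now - self.opened_at >= self.max_seconds
        if size_hit or time_hit:
            return self.flush()
        return None

    def flush(self):
        """Hand back the buffered records and reset the buffer."""
        batch = self.records
        self.records, self.size, self.opened_at = [], 0, None
        return batch
```

A consumer reading directly from a stream sees each record as soon as it is written, whereas a destination behind this kind of buffer sees it only when the batch is delivered.
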
Kinesis Data Streams
| Feature | Detail |
|---|---|
| Shards | Each shard: 1 MB/s in, 2 MB/s out, 1000 records/s in |
| Retention | 24 hours (default) to 365 days |
| Consumers | Lambda, KCL applications, Kinesis Data Firehose, Kinesis Data Analytics |
| Ordering | Per-shard ordering using partition key |
| Replay | Consumers can replay data within retention period |
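
The per-shard limits in the table translate directly into a sizing calculation, and the partition key determines which shard a record lands on. The sketch below shows both. `shards_needed` and `shard_for_key` are hypothetical helpers; the key mapping assumes the stream's shards split the 128-bit MD5 hash space evenly, which holds for a freshly created stream but not necessarily after resharding.

```python
import hashlib
import math

# Per-shard limits taken from the table above.
MB_IN_PER_SHARD = 1.0
RECORDS_IN_PER_SHARD = 1000
MB_OUT_PER_SHARD = 2.0


def shards_needed(ingest_mb_s: float, records_s: float, egress_mb_s: float) -> int:
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,
        math.ceil(ingest_mb_s / MB_IN_PER_SHARD),
        math.ceil(records_s / RECORDS_IN_PER_SHARD),
        math.ceil(egress_mb_s / MB_OUT_PER_SHARD),
    )


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)  # 128-bit integer in [0, 2**128)
    return hash_value * shard_count // 2**128
```

For a workload of 5 MB/s in, 3,000 records/s, and 8 MB/s out, `shards_needed(5, 3000, 8)` gives 5, because ingest bandwidth is the binding limit. Since the same key always hashes to the same shard, records that share a partition key keep their relative order.
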
Kinesis Data Firehose
| Feature | Detail |
|---|---|
| Destinations | S3, Redshift, OpenSearch, Splunk, HTTP endpoints |
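| Buffer | Configurable buffer size (1-128 MB) and interval (0-900 seconds; 60 seconds was the minimum before zero buffering was introduced); delivery is triggered by whichever threshold is reached first |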
| Transformation | Optional Lambda function for data transformation |
| Compression | Optional, configured per delivery stream (GZIP, ZIP, Snappy) |
| No consumers | Fully managed — just configure source and destination |
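
The optional Lambda transformation in the table works on batches: Firehose passes base64-encoded records to the function and expects each one back with the same `recordId` and a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. A minimal sketch follows, assuming JSON payloads. The `event_type` and `processed` fields are hypothetical and stand in for whatever your data contains.

```python
import base64
import json


def lambda_handler(event, context):
    """Transform a Firehose batch: tag JSON records, drop heartbeats."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Undecodable records are returned unchanged and marked as failed.
            output.append({"recordId": record["recordId"],
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue

        if payload.get("event_type") == "heartbeat":  # hypothetical filter
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue

        payload["processed"] = True  # hypothetical enrichment
        # Newline-delimit the output so objects in S3 are easy to parse.
        encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": encoded.decode("utf-8")})
    return {"records": output}
```

Every input record must appear in the response, which is why dropped and failed records are still returned rather than skipped.
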
On the Exam: "Real-time processing of streaming data" → Kinesis Data Streams + Lambda consumer. "Load streaming data into S3 with minimal management" → Kinesis Data Firehose.
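
For the "Kinesis Data Streams + Lambda consumer" pattern, Lambda receives batches of stream records whose payload sits base64-encoded under `Records[].kinesis.data`. The handler below is a minimal sketch that counts page views per batch. The JSON payload shape and its `page` field are assumptions made for illustration.

```python
import base64
import json
from collections import Counter


def lambda_handler(event, context):
    """Count page views in one batch of Kinesis Data Streams records."""
    views = Counter()
    for record in event["Records"]:
        # Kinesis delivers the record payload base64-encoded.
        body = json.loads(base64.b64decode(record["kinesis"]["data"]))
        views[body["page"]] += 1  # "page" is a hypothetical payload field
    return dict(views)
```

In practice the handler would write its results somewhere durable (for example a database); returning the counts keeps the sketch self-contained.
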
AWS Glue
Glue is a serverless data integration service for ETL (Extract, Transform, Load).
Glue Components
| Component | Description |
|---|---|
| Data Catalog | Central metadata repository (databases, tables, schemas) |
| Crawlers | Automatically discover data schemas from S3, RDS, DynamoDB, etc. |
| ETL Jobs | Spark-based data transformation jobs (Python/Scala) |
| Triggers | Schedule or event-based job execution |
| Glue Studio | Visual ETL job designer |
| DataBrew | Visual data preparation (no code) |
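
To make the crawler and Data Catalog rows concrete, the sketch below infers column types from a few sample records and emits a table definition. It is a conceptual illustration only, not Glue's actual classifier logic. `infer_type` and `infer_table` are hypothetical helpers, and the output loosely mirrors the `Name` / `StorageDescriptor` / `Columns` layout of a Data Catalog table.

```python
def infer_type(values):
    """Pick the narrowest Hive-style type that fits every non-null value."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "string"
    if all(isinstance(v, bool) for v in non_null):
        return "boolean"
    # bool is a subclass of int in Python, so exclude it explicitly.
    if all(isinstance(v, int) and not isinstance(v, bool) for v in non_null):
        return "bigint"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        return "double"
    return "string"


def infer_table(name, location, records):
    """Build a catalog-style table definition from sample records."""
    column_names = []
    for record in records:
        for key in record:
            if key not in column_names:
                column_names.append(key)  # keep first-seen column order
    columns = [
        {"Name": key, "Type": infer_type([r.get(key) for r in records])}
        for key in column_names
    ]
    return {"Name": name,
            "StorageDescriptor": {"Location": location, "Columns": columns}}
```

Once a schema like this is registered in the catalog, query and ETL tools can read the data by table name instead of re-deriving its structure.
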
When to Use Glue
- Serverless ETL — No infrastructure to manage
- Data Catalog — Central schema registry for your data lake
- Schema discovery — Crawlers auto-detect data formats and schemas
- Spark-based processing — Complex transformations at scale
Amazon EMR (Elastic MapReduce)
EMR provides managed clusters running Apache big data frameworks.
Supported Frameworks
| Framework | Use Case |
|---|---|
| Apache Spark | Large-scale data processing, ML, streaming |
| Apache Hadoop | MapReduce batch processing |
| Apache Hive | SQL-like queries on large datasets |
| Presto | Interactive SQL queries across multiple data sources |
| Apache HBase | NoSQL database on top of HDFS |
| Apache Flink | Real-time stream processing |
EMR Node Types
| Node | Description |
|---|---|
| Primary | Manages the cluster, coordinates tasks |
| Core | Runs tasks AND stores data (HDFS) |
| Task | Runs tasks only (no data storage, can use Spot Instances) |
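
The node roles above map onto instance groups when a cluster is defined. The sketch below only builds the configuration as a Python list and makes no AWS call. The dictionary keys follow the `InstanceGroups` structure used by the EMR `RunJobFlow` API, where the primary node still carries the role value `MASTER`. The `build_instance_groups` helper and the `m5.xlarge` default are assumptions for illustration.

```python
def build_instance_groups(core_count: int, task_count: int,
                          instance_type: str = "m5.xlarge"):
    """Return instance group definitions for one primary, N core, M task nodes."""
    groups = [
        # Primary node coordinates the cluster; keep it On-Demand.
        {"Name": "Primary", "InstanceRole": "MASTER", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": 1},
        # Core nodes hold HDFS data, so losing them risks data loss.
        {"Name": "Core", "InstanceRole": "CORE", "Market": "ON_DEMAND",
         "InstanceType": instance_type, "InstanceCount": core_count},
    ]
    if task_count > 0:
        # Task nodes store no data, which makes them a good fit for Spot.
        groups.append(
            {"Name": "Task", "InstanceRole": "TASK", "Market": "SPOT",
             "InstanceType": instance_type, "InstanceCount": task_count})
    return groups
```

Placing only task nodes on Spot capacity means an interruption costs some recomputation but never removes HDFS blocks.
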
EMR vs. Glue
| Feature | EMR | Glue |
|---|---|---|
| Management | You manage clusters | Serverless |
| Frameworks | Spark, Hadoop, Hive, Presto, HBase, Flink | Spark (PySpark or Scala) |
| Control | Full (install custom software) | Limited |
| Cost | EC2 instances + EMR fee | Per DPU-hour |
| Best for | Complex big data pipelines | Simple ETL, data catalog |
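
The two cost models in the table can be compared with simple arithmetic. The functions below are a sketch: every price is a caller-supplied parameter because actual rates vary by region and instance type, and the numbers in the usage note are placeholders rather than published AWS prices.

```python
def glue_job_cost(dpus: int, hours: float, price_per_dpu_hour: float) -> float:
    """Glue bills per DPU-hour consumed by the job."""
    return dpus * hours * price_per_dpu_hour


def emr_cluster_cost(instances: int, hours: float,
                     ec2_price_per_hour: float,
                     emr_fee_per_hour: float) -> float:
    """EMR bills the EC2 instance price plus an EMR fee for each instance."""
    return instances * hours * (ec2_price_per_hour + emr_fee_per_hour)
```

With placeholder rates, a 10-DPU Glue job running 2 hours at 0.50 per DPU-hour costs `glue_job_cost(10, 2, 0.5)`, while a 4-node cluster running the same 2 hours at 0.25 for EC2 plus 0.25 for the EMR fee costs `emr_cluster_cost(4, 2, 0.25, 0.25)`. The comparison shifts in EMR's favor when a cluster stays busy, since Glue charges only while a job runs but EMR charges for as long as the cluster is up.
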
A company needs to continuously load clickstream data from their website into Amazon S3 with minimal operational overhead. Which service should they use?
A data engineering team needs to discover and catalog metadata (schemas, tables, partitions) from data stored across multiple S3 buckets and RDS databases. Which service should they use?