3.7 Data Ingestion and Transformation - Kinesis, Glue, and EMR
Key Takeaways
- Kinesis Data Streams ingests real-time data with ~200 ms latency; each shard accepts up to 1 MB/s or 1,000 records/s of writes (whichever limit is hit first) and serves 2 MB/s of reads. Records are retained for 24 hours by default, extendable up to 365 days.
- Kinesis Data Firehose is fully managed near-real-time delivery to S3, Redshift, OpenSearch, Splunk, or HTTP, buffering 1-128 MB or 60-900 seconds with optional Lambda transform.
- AWS Glue is serverless Spark ETL plus a Data Catalog and crawlers that auto-discover schemas - the low-ops choice for ETL and metadata.
- Amazon EMR runs managed clusters of open-source big-data frameworks (Spark, Hadoop, Hive, Presto, HBase, Flink) with primary/core/task nodes - task nodes can run on Spot to cut cost.
- Decision rule: real-time custom processing -> Kinesis Data Streams; managed delivery to a store -> Firehose; serverless ETL/catalog -> Glue; large custom big-data frameworks -> EMR.
Quick Answer: Kinesis Data Streams = real-time ingestion where you build/operate consumers. Kinesis Data Firehose = fully managed near-real-time delivery to S3/Redshift/OpenSearch. AWS Glue = serverless Spark ETL plus Data Catalog. Amazon EMR = managed Spark/Hadoop clusters for heavy, custom big-data work. Choose on latency and how much you want to operate.
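The decision rule can be written down as a small function, which makes the order of the questions explicit: first "is it streaming?", then "how much do you want to operate?". This is a minimal sketch; the `Workload` class and its field names are hypothetical and only exist to illustrate the rule.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload; field names are illustrative."""
    streaming: bool                # data arrives continuously rather than in batches
    custom_consumers: bool         # you need your own real-time processing logic
    needs_custom_frameworks: bool  # e.g. existing Hive/Presto jobs or custom software


def choose_service(w: Workload) -> str:
    """Apply the section's decision rule, streaming questions first."""
    if w.streaming:
        # Real-time custom processing -> Data Streams; plain delivery -> Firehose.
        return "Kinesis Data Streams" if w.custom_consumers else "Kinesis Data Firehose"
    # Batch side: heavy or custom frameworks -> EMR; otherwise serverless ETL -> Glue.
    return "Amazon EMR" if w.needs_custom_frameworks else "AWS Glue"
```

For example, `choose_service(Workload(streaming=True, custom_consumers=False, needs_custom_frameworks=False))` returns `"Kinesis Data Firehose"`.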
Amazon Kinesis
Kinesis is the AWS family for collecting, processing, and analyzing streaming data.
| Service | Purpose | Latency | Who builds consumers? |
|---|---|---|---|
| Kinesis Data Streams | Real-time ingestion + custom processing | ~200 ms | You (Lambda, KCL, etc.) |
| Kinesis Data Firehose | Managed delivery to destinations | ~60 s minimum | Nobody - fully managed |
| Kinesis Data Analytics | Real-time SQL / Apache Flink | Seconds | Managed engine |
| Kinesis Video Streams | Ingest/process video | Seconds | Managed video path |
Kinesis Data Streams details
| Feature | Detail |
|---|---|
| Shard capacity | 1 MB/s or 1,000 records/s in; 2 MB/s out per shard |
| Retention | 24 hours default, up to 365 days |
| Ordering | Guaranteed per shard via partition key |
| Consumers | Lambda, KCL apps, Firehose, Kinesis Data Analytics |
| Replay | Re-read any data still within the retention window |
Scale a stream by adding shards (resharding). Because ordering is per shard, choose a partition key that distributes load evenly while keeping related records together.
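To see why the partition key matters, it helps to model how a key lands on a shard. Kinesis hashes the partition key with MD5 into a 128-bit hash key, and each shard owns a contiguous range of that space. The sketch below assumes the hash space is split into equal ranges (a real stream reports each shard's actual range as `StartingHashKey`/`EndingHashKey`); the helper names are illustrative.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash-key range contains the key.

    Assumption: the hash space is divided into equal contiguous ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


def distribution(keys, shard_count: int) -> Counter:
    """Count how many records each shard would receive for the given keys."""
    return Counter(shard_for_key(k, shard_count) for k in keys)
```

Calling `distribution` with many distinct keys (for example one key per user) spreads records across all shards, while a single constant key sends every record to one shard, which then caps throughput at that shard's limits. Records that share a key always map to the same shard, which is what preserves their relative order.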
Kinesis Data Firehose details
| Feature | Detail |
|---|---|
| Destinations | S3, Redshift, OpenSearch, Splunk, HTTP endpoints |
| Buffering | 1-128 MB size or 60-900 second interval |
| Transformation | Optional Lambda to transform records in flight |
| Compression | GZIP, ZIP, Snappy on the way to S3 |
| Management | Fully managed - no consumers to build or scale |
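The buffering row is easier to reason about with a toy model: a batch is delivered when either the size threshold or the time interval is reached, whichever comes first. The simulation below is a simplified sketch of that rule, not Firehose's actual implementation; the function name and the tuple layout are assumptions made for illustration.

```python
def simulate_flushes(records, size_limit_bytes, interval_seconds):
    """Simulate size-or-interval buffering.

    records: iterable of (arrival_time_seconds, size_bytes), sorted by time.
    Returns a list of (flush_time, flushed_bytes, reason) tuples.
    """
    flushes = []
    buffered = 0
    opened_at = None  # arrival time of the first record in the current buffer
    for t, size in records:
        # Interval rule: a non-empty buffer whose interval expired is flushed first.
        if opened_at is not None and t - opened_at >= interval_seconds:
            flushes.append((opened_at + interval_seconds, buffered, "interval"))
            buffered, opened_at = 0, None
        if opened_at is None:
            opened_at = t
        buffered += size
        # Size rule: flush as soon as the threshold is reached.
        if buffered >= size_limit_bytes:
            flushes.append((t, buffered, "size"))
            buffered, opened_at = 0, None
    if opened_at is not None:
        # Whatever is left is delivered when its interval runs out.
        flushes.append((opened_at + interval_seconds, buffered, "interval"))
    return flushes
```

With a 1,000-byte limit and a 60-second interval, a burst of three 400-byte records is flushed by size on the third record, while a lone record that arrives later waits the full 60 seconds. That waiting time is the reason low-volume streams see delivery latency close to the configured interval.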
On the Exam: "Process streaming data in real time with custom logic" -> Data Streams + Lambda. "Load streaming data into S3/Redshift with minimal management" -> Firehose. When a question requires sub-second delivery, Firehose's 60-second buffering floor rules it out; use Data Streams instead.
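The optional Lambda transformation listed in the Firehose table follows a fixed contract: Firehose sends a batch of base64-encoded records, and the function must return every `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The handler below is a minimal sketch of that contract; the `event_type` and `page` fields and the heartbeat-filtering rule are hypothetical examples, not part of the service.

```python
import base64
import json


def lambda_handler(event, context):
    """Normalize clickstream records in flight and drop heartbeat events."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Malformed input: hand the record back untouched and mark it failed.
            output.append({"recordId": record_id, "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        if payload.get("event_type") == "heartbeat":  # hypothetical filter rule
            output.append({"recordId": record_id, "result": "Dropped",
                           "data": record["data"]})
            continue
        payload["page"] = str(payload.get("page", "")).lower()
        # A trailing newline keeps one JSON object per line in the delivered file.
        line = json.dumps(payload) + "\n"
        output.append({"recordId": record_id, "result": "Ok",
                       "data": base64.b64encode(line.encode("utf-8")).decode("ascii")})
    return {"records": output}
```

Returning one entry per incoming `recordId` matters: the transform is only a per-record hook, and batching, retries, and delivery stay with Firehose.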
AWS Glue
Glue is serverless data integration for ETL (extract, transform, load).
| Component | Description |
|---|---|
| Data Catalog | Central metadata store (databases, tables, schemas) shared by Athena, Redshift Spectrum, EMR |
| Crawlers | Auto-discover schemas/partitions from S3, RDS, DynamoDB, etc. |
| ETL jobs | Spark (PySpark/Scala) transformations, no servers to manage |
| Triggers | Schedule- or event-driven job orchestration |
| Glue Studio / DataBrew | Visual ETL authoring and no-code data prep |
Glue is the answer for serverless ETL and a central schema catalog for a data lake when you do not want to run clusters.
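As an illustration of how little there is to configure, the sketch below assembles the request for a crawler that scans S3 prefixes and JDBC sources (such as RDS) into one catalog database. The keys mirror the parameters of boto3's Glue `create_crawler` call; the helper function itself, and every name, ARN, and path passed to it, are hypothetical.

```python
def build_crawler_request(name, role_arn, database, s3_paths,
                          jdbc_targets=(), schedule=None):
    """Build keyword arguments for a Glue crawler definition.

    jdbc_targets: iterable of (connection_name, include_path) pairs.
    schedule: optional cron expression such as "cron(0 2 * * ? *)".
    """
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database that receives the tables
        "Targets": {
            "S3Targets": [{"Path": path} for path in s3_paths],
            "JdbcTargets": [
                {"ConnectionName": conn, "Path": path} for conn, path in jdbc_targets
            ],
        },
    }
    if schedule:
        request["Schedule"] = schedule
    return request
```

With credentials in place, the result could be passed on as `boto3.client("glue").create_crawler(**request)`; after the crawler runs, the discovered tables are visible to Athena and Redshift Spectrum through the Data Catalog.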
Amazon EMR (Elastic MapReduce)
EMR runs managed clusters of open-source big-data frameworks - Apache Spark, Hadoop, Hive, Presto, HBase, and Flink - at petabyte scale.
| Node type | Role |
|---|---|
| Primary | Coordinates the cluster and assigns work |
| Core | Runs tasks and stores data in HDFS |
| Task | Runs tasks only, no HDFS - safe to run on Spot |
Running task nodes on Spot Instances is a common cost-optimization pattern because losing a stateless task node does not lose HDFS data.
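The node-type split maps directly onto the instance-group definitions that EMR's `run_job_flow` API accepts. The sketch below builds such a list with only the task group on Spot. Note that the API's `InstanceRole` value for the primary node is `MASTER`; the instance type and the helper function are illustrative assumptions.

```python
def build_instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Return EMR instance groups with Spot used only for stateless task nodes."""
    def group(name, role, market, count):
        return {
            "Name": name,
            "InstanceRole": role,   # MASTER (primary), CORE, or TASK
            "Market": market,       # ON_DEMAND or SPOT
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return [
        group("Primary", "MASTER", "ON_DEMAND", 1),
        group("Core", "CORE", "ON_DEMAND", core_count),  # holds HDFS data
        group("Task", "TASK", "SPOT", task_count),       # no HDFS, safe to lose
    ]
```

The list would go under `Instances={"InstanceGroups": ...}` in a `run_job_flow` request. Keeping primary and core groups On-Demand means a Spot interruption can only remove compute capacity, never HDFS blocks.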
EMR vs Glue
| | EMR | Glue |
|---|---|---|
| Management | You manage clusters | Serverless |
| Frameworks | Spark, Hadoop, Hive, Presto, HBase, Flink | Spark (plus Python shell jobs) |
| Control | Full (install custom software) | Limited |
| Cost model | EC2 instances + EMR fee | Per DPU-hour |
| Best for | Complex/custom big-data pipelines | Simple ETL + catalog |
On the Exam: "Continuously load clickstream into S3 with minimal ops" -> Firehose. "Catalog schemas across S3 and RDS automatically" -> Glue crawlers + Data Catalog. "Run an existing Apache Hive/Presto pipeline at scale with full control" -> EMR.
Worked Scenario: Choosing Along the Pipeline
A retailer ingests millions of clickstream events per minute and wants both a real-time fraud check and a cheap path to analyze the same data later. The design splits by latency requirement. Send the raw stream into Kinesis Data Streams, where a Lambda consumer evaluates each event in roughly 200 ms for the real-time fraud signal. In parallel, attach Kinesis Data Firehose to deliver the same records into S3 in batches, where they land as compressed files. Then point a Glue crawler at the S3 prefix to populate the Data Catalog so analysts can query the lake with Athena.
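The real-time branch of this design could look like the following Lambda consumer. Lambda delivers Kinesis records under `event["Records"]`, each with a base64-encoded payload in `record["kinesis"]["data"]`. This is a minimal sketch: the `clicks_per_minute` field, the threshold, and the fraud rule are hypothetical stand-ins for whatever signal the retailer actually uses.

```python
import base64
import json

CLICK_RATE_THRESHOLD = 300  # hypothetical: clicks per minute considered suspicious


def is_suspicious(body: dict) -> bool:
    """Hypothetical fraud rule based on a click-rate field in the event."""
    return body.get("clicks_per_minute", 0) > CLICK_RATE_THRESHOLD


def lambda_handler(event, context):
    """Evaluate each Kinesis record and collect the ones that look fraudulent."""
    flagged = []
    for record in event["Records"]:
        body = json.loads(base64.b64decode(record["kinesis"]["data"]))
        if is_suspicious(body):
            # The sequence number identifies the record within its shard.
            flagged.append(record["kinesis"]["sequenceNumber"])
    return {"flagged": flagged}
```

In a real deployment the flagged events would be forwarded to an alerting or blocking system rather than returned, while Firehose independently delivers the same stream to S3.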
If later the team needs heavy, custom Spark or Presto processing with installed third-party libraries, lift that batch layer onto EMR and run task nodes on Spot to cut cost.
Common Traps to Avoid
- Firehose for sub-second processing. Firehose has a roughly 60-second minimum buffer; when the requirement is real-time (hundreds of milliseconds) custom processing, use Kinesis Data Streams.
- Building consumers when Firehose suffices. If the goal is simply to land streaming data in S3/Redshift/OpenSearch with minimal ops, Firehose needs no consumers - do not over-engineer with Data Streams + KCL.
- EMR when serverless ETL is enough. Standing up and managing EMR clusters is operational overhead; for straightforward Spark ETL and cataloging, Glue is serverless and simpler.
- Shard limits. Each Kinesis shard ingests only 1 MB/s or 1,000 records/s; high-volume streams must be sharded enough, and the partition key should distribute load evenly while preserving per-shard ordering.
- Core vs task nodes on Spot. Core nodes store HDFS data, so losing one risks data; run cost-saving Spot capacity on stateless task nodes instead.
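The shard-limit trap above reduces to simple arithmetic: a stream needs enough shards to satisfy the write-bandwidth, record-rate, and read-bandwidth limits at the same time. The function below is a sketch of that sizing rule using the per-shard figures from this section; the function name and parameters are illustrative.

```python
import math

SHARD_INGEST_MB_PER_S = 1.0
SHARD_INGEST_RECORDS_PER_S = 1000
SHARD_EGRESS_MB_PER_S = 2.0


def required_shards(ingest_mb_per_s, records_per_s, egress_mb_per_s=0.0):
    """Return the minimum shard count that satisfies every per-shard limit."""
    by_bytes = math.ceil(ingest_mb_per_s / SHARD_INGEST_MB_PER_S)
    by_records = math.ceil(records_per_s / SHARD_INGEST_RECORDS_PER_S)
    by_egress = math.ceil(egress_mb_per_s / SHARD_EGRESS_MB_PER_S)
    # The tightest limit decides; a stream always has at least one shard.
    return max(1, by_bytes, by_records, by_egress)
```

For a stream of 3 million small events per minute (50,000 records/s) at 25 MB/s, `required_shards(25, 50_000)` gives 50: the record-rate limit, not bandwidth, is the bottleneck. This estimate assumes the partition key spreads load evenly; a hot key can throttle a single shard even when the total count looks sufficient.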
A web app emits clickstream events that must be continuously delivered into Amazon S3 with the least operational overhead, including automatic batching and compression. Which service should be used?
A data team needs to automatically discover and catalog table schemas and partitions across many S3 buckets and several RDS databases so that Athena and Redshift Spectrum can query them. Which service is purpose-built for this?
An analytics platform must process streaming sensor data within a few hundred milliseconds using custom application logic and be able to replay the last several days of events on demand. Which service best meets these needs?