3.7 Data Ingestion and Transformation - Kinesis, Glue, and EMR

Key Takeaways

  • Kinesis Data Streams ingests real-time data with ~200 ms latency; each shard accepts 1 MB/s or 1,000 records/s in and serves 2 MB/s out. Retention defaults to 24 hours and can be extended up to 365 days.
  • Kinesis Data Firehose is fully managed near-real-time delivery to S3, Redshift, OpenSearch, Splunk, or HTTP, buffering 1-128 MB or 60-900 seconds with optional Lambda transform.
  • AWS Glue is serverless Spark ETL plus a Data Catalog and crawlers that auto-discover schemas - the low-ops choice for ETL and metadata.
  • Amazon EMR runs managed clusters of open-source big-data frameworks (Spark, Hadoop, Hive, Presto, HBase, Flink) on primary/core/task nodes; task nodes can run on Spot to cut cost.
  • Decision rule: real-time custom processing -> Kinesis Data Streams; managed delivery to a store -> Firehose; serverless ETL/catalog -> Glue; large custom big-data frameworks -> EMR.

Last updated: June 2026

Quick Answer: Kinesis Data Streams = real-time ingestion where you build/operate consumers. Kinesis Data Firehose = fully managed near-real-time delivery to S3/Redshift/OpenSearch. AWS Glue = serverless Spark ETL plus Data Catalog. Amazon EMR = managed Spark/Hadoop clusters for heavy, custom big-data work. Choose on latency and how much you want to operate.
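
The quick answer above boils down to two questions: is the data streaming, and how much do you want to operate? A minimal sketch of that rule follows. The function name and its boolean parameters are hypothetical, chosen only to mirror the wording of this section; real designs weigh more factors.

```python
def choose_service(streaming: bool,
                   sub_second_custom_logic: bool = False,
                   custom_frameworks: bool = False) -> str:
    """Map coarse requirements to the service this section recommends."""
    if streaming:
        # Real-time custom processing means building your own consumers.
        if sub_second_custom_logic:
            return "Kinesis Data Streams"
        # Otherwise managed delivery to a store is enough.
        return "Kinesis Data Firehose"
    # Batch side: run clusters only when serverless Spark ETL is not enough.
    if custom_frameworks:
        return "Amazon EMR"
    return "AWS Glue"


if __name__ == "__main__":
    print(choose_service(streaming=True, sub_second_custom_logic=True))
    print(choose_service(streaming=False))
```

The order of the checks reflects the section's advice: settle the latency question first, then pick the lowest-ops option that still meets it.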

Amazon Kinesis

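Kinesis is the AWS family for collecting, processing, and analyzing streaming data. AWS has renamed two members: Kinesis Data Firehose is now Amazon Data Firehose, and Kinesis Data Analytics is now Amazon Managed Service for Apache Flink. Expect to see either name in exam material.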

Service | Purpose | Latency | Who builds consumers?
Kinesis Data Streams | Real-time ingestion + custom processing | ~200 ms | You (Lambda, KCL, etc.)
Kinesis Data Firehose | Managed delivery to destinations | ~60 s minimum | Nobody - fully managed
Kinesis Data Analytics | Real-time SQL / Apache Flink | Seconds | Managed engine
Kinesis Video Streams | Ingest/process video | Seconds | Managed video path

Kinesis Data Streams details

Feature | Detail
Shard capacity | 1 MB/s or 1,000 records/s in; 2 MB/s out per shard
Retention | 24 hours default, up to 365 days
Ordering | Guaranteed per shard via partition key
Consumers | Lambda, KCL apps, Firehose, Kinesis Data Analytics
Replay | Re-read any data still within the retention window

Scale a stream by adding shards (resharding). Because ordering is per shard, choose a partition key that distributes load evenly while keeping related records together.
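
To see why the partition key matters, it helps to model how a key lands on a shard. Kinesis hashes the partition key with MD5 into a 128-bit hash key and routes the record to the shard whose hash-key range contains it. The sketch below assumes the shards split the hash space evenly, which a resharded stream may not do; the function names are illustrative, not part of any AWS SDK.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: shard_count shards divide the hash space into equal ranges.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    return hash_key * shard_count // HASH_SPACE


def shard_load(partition_keys, shard_count: int) -> Counter:
    """Count how many records each shard would receive."""
    return Counter(shard_for_key(key, shard_count) for key in partition_keys)


if __name__ == "__main__":
    # One hot key sends every record to the same shard.
    print(shard_load(["all-traffic"] * 1000, 4))
    # Per-user keys spread load while keeping each user's events in order.
    print(shard_load([f"user-{i}" for i in range(1000)], 4))
```

A constant key preserves global order but caps the stream at one shard's throughput; a per-entity key such as a user ID keeps each entity's records ordered on one shard while spreading the total load.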

Kinesis Data Firehose details

Feature | Detail
Destinations | S3, Redshift, OpenSearch, Splunk, HTTP endpoints
Buffering | 1-128 MB size or 60-900 second interval
Transformation | Optional Lambda to transform records in flight
Compression | GZIP, ZIP, Snappy on the way to S3
Management | Fully managed - no consumers to build or scale
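
The optional Lambda transform receives a batch of base64-encoded records and must return one result per recordId, marked Ok, Dropped, or ProcessingFailed. A minimal sketch of such a handler is below. The handler name, the JSON payload shape, and the "event_type" field with its "heartbeat" value are assumptions made for illustration.

```python
import base64
import json


def transform_handler(event, context):
    """Sketch of a Firehose transformation Lambda.

    Assumes each record carries a JSON object; drops hypothetical
    'heartbeat' events and newline-terminates everything else.
    """
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
        except ValueError:
            payload = None  # bad base64 or bad JSON
        if not isinstance(payload, dict):
            # Hand the original data back and mark the record as failed.
            output.append({"recordId": record_id,
                           "result": "ProcessingFailed",
                           "data": record["data"]})
            continue
        if payload.get("event_type") == "heartbeat":  # hypothetical field
            output.append({"recordId": record_id,
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        # Newline-delimited JSON keeps the delivered S3 objects easy to query.
        line = json.dumps(payload, separators=(",", ":")) + "\n"
        output.append({"recordId": record_id,
                       "result": "Ok",
                       "data": base64.b64encode(line.encode("utf-8")).decode("ascii")})
    return {"records": output}
```

Every incoming recordId appears exactly once in the response, which is what lets the service match results back to the records it buffered.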

On the Exam: "Process streaming data in real time with custom logic" -> Data Streams + Lambda. "Load streaming data into S3/Redshift with minimal management" -> Firehose. The 60-second floor on Firehose is the cue when sub-second delivery is required (then use Data Streams).
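
The buffering rule explains the latency floor: a batch is delivered when either the size threshold or the interval threshold is reached, whichever comes first. The toy model below illustrates that rule only. The class is hypothetical, its defaults are simply the minimums quoted in this section, and unlike the real service it checks the interval only when a record arrives.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BufferModel:
    """Toy model of size-or-interval buffering (not the real service)."""
    max_bytes: int = 1 * 1024 * 1024   # 1 MB, the smallest size in this section
    max_seconds: float = 60.0          # 60 s, the smallest interval in this section
    _records: List[bytes] = field(default_factory=list)
    _bytes: int = 0
    _opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record; return the batch if a threshold was reached."""
        if self._opened_at is None:
            self._opened_at = now  # the interval starts with the first record
        self._records.append(record)
        self._bytes += len(record)
        size_hit = self._bytes >= self.max_bytes
        time_hit = now - self._opened_at >= self.max_seconds
        return self.flush() if size_hit or time_hit else None

    def flush(self) -> List[bytes]:
        """Hand back the buffered batch and reset the buffer."""
        batch, self._records = self._records, []
        self._bytes = 0
        self._opened_at = None
        return batch
```

On a low-volume stream the size threshold is rarely hit, so records wait for the interval to elapse. That wait is why Firehose is described as near-real-time rather than real-time.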

AWS Glue

Glue is serverless data integration for ETL (extract, transform, load).

Component | Description
Data Catalog | Central metadata store (databases, tables, schemas) shared by Athena, Redshift Spectrum, EMR
Crawlers | Auto-discover schemas/partitions from S3, RDS, DynamoDB, etc.
ETL jobs | Spark (PySpark/Scala) transformations, no servers to manage
Triggers | Schedule- or event-driven job orchestration
Glue Studio / DataBrew | Visual ETL authoring and no-code data prep

Glue is the answer for serverless ETL and a central schema catalog for a data lake when you do not want to run clusters.
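
As a rough illustration of how little there is to configure, the sketch below builds the request for a crawler that covers both S3 prefixes and JDBC sources such as RDS. The helper function is hypothetical, as are the crawler name, role ARN, database, bucket, and connection names in the usage example. The dictionary keys follow the parameters of the boto3 Glue `create_crawler` call, which the sketch itself never invokes.

```python
def crawler_request(name, role_arn, database, s3_paths,
                    jdbc_targets=(), schedule=None):
    """Build keyword arguments for a Glue crawler (illustrative helper).

    jdbc_targets is a sequence of (connection_name, path) pairs, where the
    connection is assumed to exist already in Glue.
    """
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # Data Catalog database to populate
        "Targets": {
            "S3Targets": [{"Path": path} for path in s3_paths],
            "JdbcTargets": [
                {"ConnectionName": conn, "Path": path}
                for conn, path in jdbc_targets
            ],
        },
    }
    if schedule:
        request["Schedule"] = schedule  # e.g. "cron(0 2 * * ? *)"
    return request


if __name__ == "__main__":
    # All names below are placeholders.
    request = crawler_request(
        name="clickstream-crawler",
        role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
        database="data_lake",
        s3_paths=["s3://example-bucket/clickstream/"],
        jdbc_targets=[("example-rds-connection", "salesdb/%")],
        schedule="cron(0 2 * * ? *)",
    )
    print(request)
    # With credentials configured, this could be passed along as:
    #   boto3.client("glue").create_crawler(**request)
```

Once the crawler has run, the tables it writes to the Data Catalog are visible to Athena, Redshift Spectrum, and EMR without further setup.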

Amazon EMR (Elastic MapReduce)

EMR runs managed clusters of open-source big-data frameworks - Apache Spark, Hadoop, Hive, Presto, HBase, and Flink - at petabyte scale.

Node type | Role
Primary | Coordinates the cluster and assigns work
Core | Runs tasks and stores data in HDFS
Task | Runs tasks only, no HDFS - safe to run on Spot

Running task nodes on Spot Instances is a common cost-optimization pattern because losing a stateless task node does not lose HDFS data.
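
The Spot pattern can be sketched as an instance-group layout: primary and core groups on On-Demand capacity, the task group on Spot. The helper below is hypothetical, and the instance type and counts are placeholders. The field names follow the `InstanceGroups` structure accepted by the boto3 EMR `run_job_flow` call, where the primary node's role is still spelled `MASTER`.

```python
def instance_groups(core_count: int, task_count: int,
                    instance_type: str = "m5.xlarge"):
    """Return an instance-group layout with only task nodes on Spot."""
    def group(name, role, market, count):
        return {
            "Name": name,
            "InstanceRole": role,
            "Market": market,
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return [
        # Primary coordinates the cluster, so keep it on stable capacity.
        group("Primary", "MASTER", "ON_DEMAND", 1),
        # Core nodes hold HDFS blocks; an interruption here risks data.
        group("Core", "CORE", "ON_DEMAND", core_count),
        # Task nodes are stateless compute, so Spot interruptions are tolerable.
        group("Task", "TASK", "SPOT", task_count),
    ]


if __name__ == "__main__":
    for g in instance_groups(core_count=2, task_count=6):
        print(g["InstanceRole"], g["Market"], g["InstanceCount"])
    # The list would go under Instances={"InstanceGroups": ...} in run_job_flow.
```

Scaling the task group up or down then changes compute capacity without touching where HDFS data lives.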

EMR vs Glue

Aspect | EMR | Glue
Management | You manage clusters | Serverless
Frameworks | Spark, Hadoop, Hive, Presto, HBase, Flink | Primarily Spark (PySpark/Scala)
Control | Full (install custom software) | Limited
Cost model | EC2 instances + EMR fee | Per DPU-hour
Best for | Complex/custom big-data pipelines | Simple ETL + catalog

On the Exam: "Continuously load clickstream into S3 with minimal ops" -> Firehose. "Catalog schemas across S3 and RDS automatically" -> Glue crawlers + Data Catalog. "Run an existing Apache Hive/Presto pipeline at scale with full control" -> EMR.

Worked Scenario: Choosing Along the Pipeline

A retailer ingests millions of clickstream events per minute and wants both a real-time fraud check and a cheap path to analyze the same data later. The design splits by latency requirement. Send the raw stream into Kinesis Data Streams, where a Lambda consumer evaluates each event in roughly 200 ms for the real-time fraud signal. In parallel, attach Kinesis Data Firehose to deliver the same records into S3 in batches, where they land as compressed files. Then point a Glue crawler at the S3 prefix to populate the Data Catalog so analysts can query the lake with Athena.

If the team later needs heavy, custom Spark or Presto processing with third-party libraries installed, move that batch layer onto EMR and run the task nodes on Spot to cut cost.
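
For the real-time branch of this scenario, a Lambda consumer receives batches of Kinesis records whose payloads arrive base64-encoded under `Records[].kinesis.data`. The sketch below shows the shape of such a consumer. The handler name, the "clicks_last_minute" field, and the threshold are all hypothetical stand-ins for a real fraud rule.

```python
import base64
import json

CLICK_RATE_THRESHOLD = 300  # hypothetical rule: clicks per minute


def is_suspicious(click_event: dict) -> bool:
    """Placeholder fraud rule; a real check would be far richer."""
    return click_event.get("clicks_last_minute", 0) > CLICK_RATE_THRESHOLD


def fraud_handler(event, context):
    """Sketch of a Lambda consumer for a Kinesis data stream."""
    flagged = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        # Kinesis delivers the payload base64-encoded.
        click_event = json.loads(base64.b64decode(kinesis["data"]))
        if is_suspicious(click_event):
            flagged.append({
                "partitionKey": kinesis["partitionKey"],
                "sequenceNumber": kinesis["sequenceNumber"],
            })
    return {"flagged": flagged}
```

Because the same stream can feed Firehose in parallel, this consumer only has to produce the fraud signal; landing the raw data in S3 is handled by the managed path.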

Common Traps to Avoid

  • Firehose for sub-second processing. Firehose has a roughly 60-second minimum buffer; when the requirement is real-time (hundreds of milliseconds) custom processing, use Kinesis Data Streams.
  • Building consumers when Firehose suffices. If the goal is simply to land streaming data in S3/Redshift/OpenSearch with minimal ops, Firehose needs no consumers - do not over-engineer with Data Streams + KCL.
  • EMR when serverless ETL is enough. Standing up and managing EMR clusters is operational overhead; for straightforward Spark ETL and cataloging, Glue is serverless and simpler.
  • Shard limits. Each Kinesis shard ingests only 1 MB/s or 1,000 records/s; high-volume streams must be sharded enough, and the partition key should distribute load evenly while preserving per-shard ordering.
  • Core vs task nodes on Spot. Core nodes store HDFS data, so losing one risks data; run cost-saving Spot capacity on stateless task nodes instead.
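
The shard-limits trap above comes down to simple arithmetic: a stream needs enough shards to satisfy whichever per-shard limit is tightest. A minimal sketch follows, using the per-shard figures from this section. The function name is illustrative, and the example workload (3 million events per minute, 250 bytes each, read twice) is hypothetical.

```python
import math

INGEST_MB_PER_SHARD = 1.0       # MB/s in, per shard
INGEST_RECORDS_PER_SHARD = 1000  # records/s in, per shard
EGRESS_MB_PER_SHARD = 2.0       # MB/s out, per shard


def shards_needed(ingest_mb_per_s: float, records_per_s: float,
                  egress_mb_per_s: float) -> int:
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        math.ceil(ingest_mb_per_s / INGEST_MB_PER_SHARD),
        math.ceil(records_per_s / INGEST_RECORDS_PER_SHARD),
        math.ceil(egress_mb_per_s / EGRESS_MB_PER_SHARD),
        1,  # a stream always has at least one shard
    )


if __name__ == "__main__":
    records_per_s = 3_000_000 / 60              # 50,000 records/s
    ingest_mb_per_s = records_per_s * 250 / 1e6  # 12.5 MB/s
    egress_mb_per_s = ingest_mb_per_s * 2        # two readers of the full stream
    print(shards_needed(ingest_mb_per_s, records_per_s, egress_mb_per_s))
```

In this example the record-count limit dominates: 50,000 records/s needs 50 shards, even though the byte volume alone would fit in 13. Small, frequent records are a common reason a stream needs more shards than its bandwidth suggests.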

Test Your Knowledge

A web app emits clickstream events that must be continuously delivered into Amazon S3 with the least operational overhead, including automatic batching and compression. Which service should be used?

Test Your Knowledge

A data team needs to automatically discover and catalog table schemas and partitions across many S3 buckets and several RDS databases so that Athena and Redshift Spectrum can query them. Which service is purpose-built for this?

Test Your Knowledge

An analytics platform must process streaming sensor data within a few hundred milliseconds using custom application logic and be able to replay the last several days of events on demand. Which service best meets these needs?
