3.7 Data Ingestion and Transformation — Kinesis, Glue, and EMR

Key Takeaways

  • Kinesis Data Streams ingests real-time streaming data with sub-second latency; consumers process data in real time (Lambda, KCL applications).
  • Kinesis Data Firehose delivers streaming data to destinations (S3, Redshift, OpenSearch, HTTP) with automatic batching and optional transformation via Lambda.
  • AWS Glue is a serverless ETL service with a Data Catalog for metadata management, crawlers for schema discovery, and Spark-based jobs for data transformation.
  • Amazon EMR runs Apache Spark, Hadoop, Hive, and Presto on managed clusters for big data processing at petabyte scale.
  • Use Kinesis for real-time streaming; Glue for serverless ETL; EMR for complex big data processing with Apache frameworks.

Last updated: March 2026

Quick Answer: Kinesis Data Streams = real-time ingestion (you manage consumers). Kinesis Data Firehose = near-real-time delivery to S3/Redshift/OpenSearch (fully managed). Glue = serverless ETL + Data Catalog. EMR = managed Spark/Hadoop clusters for big data. Choose based on latency requirements and operational overhead tolerance.

Amazon Kinesis

Kinesis is a platform for streaming data on AWS, making it easy to collect, process, and analyze real-time data.

Kinesis Services Comparison

Service | Purpose | Latency | Management
Kinesis Data Streams | Real-time data ingestion and processing | ~200 ms | You manage consumers
Kinesis Data Firehose (now Amazon Data Firehose) | Delivery to destinations (S3, Redshift, etc.) | Near real-time (buffered, typically 60 seconds or more) | Fully managed, no consumers to build
Kinesis Data Analytics (now Amazon Managed Service for Apache Flink) | Real-time analytics with SQL or Apache Flink | Seconds | Managed analytics engine
Kinesis Video Streams | Ingest and process video streams | Seconds | Managed video ingestion

Kinesis Data Streams

Feature | Detail
Shards | Each shard: 1 MB/s in, 2 MB/s out, 1,000 records/s in
Retention | 24 hours (default) to 365 days
Consumers | Lambda, KCL applications, Kinesis Data Firehose, Kinesis Data Analytics
Ordering | Per-shard ordering using partition key
Replay | Consumers can replay data within retention period
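The per-shard limits and partition-key ordering above can be illustrated with a short sketch. The function names are hypothetical, and the key-to-shard mapping assumes the stream's hash-key space is split evenly across shards, which may not hold after resharding.

```python
import hashlib
import math

# Per-shard limits quoted in the table above.
SHARD_IN_MB_PER_S = 1.0
SHARD_IN_RECORDS_PER_S = 1000
SHARD_OUT_MB_PER_S = 2.0


def shards_needed(in_mb_per_s, records_per_s, out_mb_per_s):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        1,
        math.ceil(in_mb_per_s / SHARD_IN_MB_PER_S),
        math.ceil(records_per_s / SHARD_IN_RECORDS_PER_S),
        math.ceil(out_mb_per_s / SHARD_OUT_MB_PER_S),
    )


def shard_index_for_key(partition_key, shard_count):
    """Map a partition key to a shard index (assumes evenly split ranges).

    Kinesis hashes the partition key with MD5 into a 128-bit integer and
    routes the record to the shard whose hash-key range contains it, so
    records with the same key always land on the same shard, in order.
    """
    hashed = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2**128 // shard_count
    return min(hashed // range_size, shard_count - 1)
```

For example, a workload of 5 MB/s in, 3,000 records/s in, and 4 MB/s out needs 5 shards, because the ingest bandwidth is the binding limit.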

Kinesis Data Firehose

Feature | Detail
Destinations | S3, Redshift, OpenSearch, Splunk, HTTP endpoints
Buffer | Configurable buffer size (1-128 MB for S3) and interval (0-900 seconds); data is flushed when either limit is reached first
Transformation | Optional Lambda function for data transformation
Compression | Optional (GZIP, ZIP, Snappy)
No consumers | Fully managed; just configure source and destination
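The optional Lambda transformation can be sketched as follows. Firehose passes a batch of base64-encoded records and expects each one back with the same recordId and a result of Ok, Dropped, or ProcessingFailed. The handler name and the `page` field are hypothetical, chosen only for illustration.

```python
import base64
import json


def transform_handler(event, context):
    """Firehose transformation sketch: uppercase a hypothetical 'page' field."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["page"] = str(payload.get("page", "")).upper()
            # Newline-delimit records so objects written to S3 are line-oriented.
            data = (json.dumps(payload) + "\n").encode("utf-8")
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(data).decode("utf-8"),
            })
        except (ValueError, AttributeError):
            # Malformed input: hand the original bytes back marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every incoming recordId must appear in the response; records that should be filtered out are returned with the result Dropped rather than omitted.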

On the Exam: "Real-time processing of streaming data" → Kinesis Data Streams + Lambda consumer. "Load streaming data into S3 with minimal management" → Kinesis Data Firehose.
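A minimal sketch of the "Kinesis Data Streams + Lambda consumer" pattern: Lambda delivers a batch of records whose payloads are base64-encoded under `kinesis.data`. The handler name and the clickstream `page` field are assumptions for illustration, not part of any AWS contract.

```python
import base64
import json


def consumer_handler(event, context):
    """Decode each Kinesis record in the batch and count events per page."""
    counts = {}
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        click = json.loads(raw)
        page = click.get("page", "unknown")
        counts[page] = counts.get(page, 0) + 1
    return counts
```

A real consumer would typically write the result to a store such as DynamoDB or S3 instead of returning it; the return value here only keeps the sketch easy to check.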

AWS Glue

Glue is a serverless data integration service for ETL (Extract, Transform, Load).

Glue Components

Component | Description
Data Catalog | Central metadata repository (databases, tables, schemas)
Crawlers | Automatically discover data schemas from S3, RDS, DynamoDB, etc.
ETL Jobs | Spark-based data transformation jobs (Python/Scala)
Triggers | Schedule or event-based job execution
Glue Studio | Visual ETL job designer
DataBrew | Visual data preparation (no code)
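As a rough sketch of how a crawler feeds the Data Catalog, the helpers below build the arguments for the boto3 Glue `create_crawler` call and then start the crawler. The helper names are hypothetical, and the client is passed in as a parameter (in practice `boto3.client("glue")`) so the sketch does not assume credentials or a region.

```python
def build_crawler_config(name, role_arn, database, s3_paths, schedule=None):
    """Build keyword arguments for the Glue create_crawler API."""
    config = {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes
        "DatabaseName": database,    # Data Catalog database for discovered tables
        "Targets": {"S3Targets": [{"Path": path} for path in s3_paths]},
    }
    if schedule:
        # Cron expression, e.g. "cron(0 2 * * ? *)" for 02:00 UTC daily.
        config["Schedule"] = schedule
    return config


def create_and_start_crawler(glue_client, config):
    """Register the crawler, then run it once immediately."""
    glue_client.create_crawler(**config)
    glue_client.start_crawler(Name=config["Name"])
```

Once the crawler finishes, the tables it discovers appear in the named Data Catalog database and can be read by Glue ETL jobs.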

When to Use Glue

  • Serverless ETL — No infrastructure to manage
  • Data Catalog — Central schema registry for your data lake
  • Schema discovery — Crawlers auto-detect data formats and schemas
  • Spark-based processing — Complex transformations at scale

Amazon EMR (Elastic MapReduce)

EMR provides managed clusters running Apache big data frameworks.

Supported Frameworks

Framework | Use Case
Apache Spark | Large-scale data processing, ML, streaming
Apache Hadoop | MapReduce batch processing
Apache Hive | SQL-like queries on large datasets
Presto | Interactive SQL queries across multiple data sources
Apache HBase | NoSQL database on top of HDFS
Apache Flink | Real-time stream processing
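To show how one of these frameworks is typically invoked, here is a sketch of submitting a Spark job as an EMR step through the boto3 `add_job_flow_steps` call. The helper names and the script location are hypothetical; the EMR client is passed in (in practice `boto3.client("emr")`).

```python
def spark_step(name, script_s3_uri, script_args=()):
    """Describe a spark-submit invocation as an EMR step definition."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster running if the step fails
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_s3_uri,
                *script_args,
            ],
        },
    }


def submit_step(emr_client, cluster_id, step):
    """Add the step to a running cluster and return its step ID."""
    response = emr_client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]
```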

EMR Node Types

Node | Description
Primary | Manages the cluster, coordinates tasks
Core | Runs tasks AND stores data (HDFS)
Task | Runs tasks only (no data storage, can use Spot Instances)
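The three node types map onto instance groups in a cluster definition. The sketch below builds the `InstanceGroups` list that the boto3 `run_job_flow` call accepts under `Instances`; the function name and the default instance type are assumptions for illustration. Note that the API still names the primary node's role `MASTER`.

```python
def instance_groups(core_count, task_count, instance_type="m5.xlarge"):
    """Build EMR instance groups: one primary, N core, optional Spot task nodes."""
    groups = [
        {
            "Name": "Primary",
            "InstanceRole": "MASTER",   # API value for the primary node
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "Core",
            "InstanceRole": "CORE",     # stores HDFS data, so keep On-Demand
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        groups.append({
            "Name": "Task",
            "InstanceRole": "TASK",     # compute only, so Spot interruption loses no data
            "Market": "SPOT",
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups
```

Keeping core nodes On-Demand and putting only task nodes on Spot reflects the table above: losing a task node costs some recomputation, while losing a core node can cost HDFS blocks.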

EMR vs. Glue

Feature | EMR | Glue
Management | You manage clusters | Serverless
Frameworks | Spark, Hadoop, Hive, Presto, HBase, Flink | Spark (Python or Scala)
Control | Full (install custom software) | Limited
Cost | EC2 instances + EMR fee | Per DPU-hour
Best for | Complex big data pipelines | Simple ETL, data catalog

Test Your Knowledge

A company needs to continuously load clickstream data from their website into Amazon S3 with minimal operational overhead. Which service should they use?

A data engineering team needs to discover and catalog metadata (schemas, tables, partitions) from data stored across multiple S3 buckets and RDS databases. Which service should they use?
