A web app emits clickstream events that must be continuously delivered into Amazon S3 with the least operational overhead, including automatic batching and compression. Which service should be used?

Kinesis Data Firehose with S3 as the destination. Firehose is fully managed delivery to S3 that batches and compresses automatically with no consumers to build, minimizing operations. Kinesis Data Streams would require building and scaling a KCL consumer, EMR adds cluster management, and a Glue crawler discovers schemas rather than continuously delivering streaming data.

A data team needs to automatically discover and catalog table schemas and partitions across many S3 buckets and several RDS databases so that Athena and Redshift Spectrum can query them. Which service is purpose-built for this?

AWS Glue Data Catalog with crawlers. Glue crawlers auto-discover schemas and partitions from S3, RDS, and other sources and populate the central Glue Data Catalog, which Athena and Redshift Spectrum consume directly. Athena and Redshift Spectrum query data but do not discover and centrally catalog it, and a per-cluster Hive metastore is not the shared serverless catalog described.

An analytics platform must process streaming sensor data within a few hundred milliseconds using custom application logic and be able to replay the last several days of events on demand. Which service best meets these needs?

Kinesis Data Streams with consumer applications. Kinesis Data Streams offers roughly 200 ms ingestion latency, lets you run custom consumers for bespoke processing, and retains data up to 365 days so consumers can replay events. Firehose has a ~60-second delivery floor and no replay, Glue is batch ETL, and Athena queries data at rest rather than processing a live stream.

Data Ingestion and Transformation - Kinesis, Glue, and EMR

Quick Answer: Kinesis Data Streams = real-time ingestion where you build/operate consumers. Kinesis Data Firehose = fully managed near-real-time delivery to S3/Redshift/OpenSearch. AWS Glue = serverless Spark ETL plus Data Catalog. Amazon EMR = managed Spark/Hadoop clusters for heavy, custom big-data work. Choose on latency and how much you want to operate.

Amazon Kinesis

Kinesis is the AWS family for collecting, processing, and analyzing streaming data.

Service	Purpose	Latency	Who builds consumers?
Kinesis Data Streams	Real-time ingestion + custom processing	~200 ms	You (Lambda, KCL, etc.)
Kinesis Data Firehose	Managed delivery to destinations	~60 s minimum	Nobody - fully managed
Kinesis Data Analytics	Real-time SQL / Apache Flink	Seconds	Managed engine
Kinesis Video Streams	Ingest/process video	Seconds	Managed video path

Kinesis Data Streams details

Feature	Detail
Shard capacity	Write: 1 MB/s or 1,000 records/s per shard, whichever limit is reached first; read: 2 MB/s per shard
Retention	24 hours default, up to 365 days
Ordering	Guaranteed within a shard; records with the same partition key go to the same shard
Consumers	Lambda, KCL apps, Firehose, Kinesis Data Analytics
Replay	Re-read any data still within the retention window

Scale a stream by adding shards (resharding). Because ordering is per shard, choose a partition key that distributes load evenly while keeping related records together.
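To make the partition-key point concrete, here is a minimal sketch of how a key could be mapped to a shard. Kinesis hashes the partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash-key range contains it. The sketch assumes the shards split the hash space into equal, contiguous ranges, which is not guaranteed after resharding; the function names `shard_for_key` and `shard_load` are illustrative, not part of any AWS SDK.

```python
import hashlib

HASH_SPACE = 2 ** 128  # an MD5 digest read as an integer fits in 128 bits


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumption: shard_count shards with equal, contiguous hash-key ranges.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_key = int.from_bytes(digest, "big")
    range_size = HASH_SPACE // shard_count
    # The last shard absorbs any remainder left by integer division.
    return min(hash_key // range_size, shard_count - 1)


def shard_load(partition_keys, shard_count):
    """Count how many records each shard would receive."""
    counts = [0] * shard_count
    for key in partition_keys:
        counts[shard_for_key(key, shard_count)] += 1
    return counts
```

Calling `shard_load` with many distinct keys (for example one key per user) spreads records across shards, while a single hot key puts every record on one shard no matter how many shards exist. That is why the choice of key matters as much as the shard count.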

Kinesis Data Firehose details

Feature	Detail
Destinations	S3, Redshift, OpenSearch, Splunk, HTTP endpoints
Buffering	1-128 MB size or 60-900 second interval
Transformation	Optional Lambda to transform records in flight
Compression	GZIP, ZIP, Snappy on the way to S3
Management	Fully managed - no consumers to build or scale
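The buffering row is easier to reason about with a toy model. The sketch below is not how the service is implemented; it only illustrates the "flush when either the size or the interval threshold is reached" rule, followed by GZIP compression of the batch. `BufferModel`, `compress_batch`, and the default thresholds are hypothetical names and values chosen for illustration, and the model checks elapsed time only when a record arrives, whereas a real timer would fire on its own.

```python
import gzip
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BufferModel:
    """Toy size-or-interval buffer (illustrative only)."""
    max_bytes: int = 5 * 1024 * 1024  # assumed size threshold
    max_seconds: float = 60.0         # assumed interval threshold
    _records: List[bytes] = field(default_factory=list)
    _bytes: int = 0
    _opened_at: Optional[float] = None

    def add(self, record: bytes, now: float) -> Optional[List[bytes]]:
        """Buffer a record; return the batch if a threshold was reached."""
        if self._opened_at is None:
            self._opened_at = now  # the interval starts with the first record
        self._records.append(record)
        self._bytes += len(record)
        too_big = self._bytes >= self.max_bytes
        too_old = now - self._opened_at >= self.max_seconds
        return self.flush() if (too_big or too_old) else None

    def flush(self) -> List[bytes]:
        """Hand back the buffered records and reset the buffer."""
        batch, self._records = self._records, []
        self._bytes = 0
        self._opened_at = None
        return batch


def compress_batch(batch: List[bytes]) -> bytes:
    """Join records with newlines and GZIP them, as one object body."""
    return gzip.compress(b"\n".join(batch))
```

With a small size threshold the batch flushes on size; with a quiet stream it flushes on the interval instead, which is the source of the delivery delay that separates Firehose from Data Streams.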

On the Exam: "Process streaming data in real time with custom logic" -> Data Streams + Lambda. "Load streaming data into S3/Redshift with minimal management" -> Firehose. When a question requires sub-second delivery, the 60-second buffering floor rules Firehose out; use Data Streams instead.

AWS Glue

Glue is serverless data integration for ETL (extract, transform, load).

Component	Description
Data Catalog	Central metadata store (databases, tables, schemas) shared by Athena, Redshift Spectrum, EMR
Crawlers	Auto-discover schemas/partitions from S3, RDS, DynamoDB, etc.
ETL jobs	Spark (PySpark/Scala) transformations, no servers to manage
Triggers	Schedule- or event-driven job orchestration
Glue Studio / DataBrew	Visual ETL authoring and no-code data prep

Glue is the answer for serverless ETL and a central schema catalog for a data lake when you do not want to run clusters.
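As an illustration of what "discovering partitions" means, the sketch below extracts Hive-style `name=value` segments from S3 object keys, a layout that crawlers can turn into partition columns. This is not the crawler's algorithm, only a simplified picture of the idea; the functions `partitions_from_key` and `catalog_partitions` and the example keys are hypothetical.

```python
def partitions_from_key(key: str) -> dict:
    """Extract Hive-style name=value path segments from an object key."""
    partitions = {}
    for segment in key.split("/")[:-1]:  # the last segment is the file name
        name, sep, value = segment.partition("=")
        if sep and name and value:
            partitions[name] = value
    return partitions


def catalog_partitions(keys):
    """Return the distinct partitions found across keys, sorted."""
    seen = {tuple(partitions_from_key(k).items()) for k in keys}
    return sorted(p for p in seen if p)  # drop keys with no partitions


# Hypothetical object keys under one table prefix.
example_keys = [
    "clicks/year=2024/month=01/part-0000.gz",
    "clicks/year=2024/month=01/part-0001.gz",
    "clicks/year=2024/month=02/part-0000.gz",
]
```

Here `catalog_partitions(example_keys)` yields two partitions, one per month, even though there are three files. A query engine reading the catalog can then skip whole prefixes that a query's filter excludes.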

Amazon EMR (Elastic MapReduce)

EMR runs managed clusters of open-source big-data frameworks - Apache Spark, Hadoop, Hive, Presto, HBase, and Flink - at petabyte scale.

Node type	Role
Primary	Coordinates the cluster and assigns work
Core	Runs tasks and stores data in HDFS
Task	Runs tasks only, no HDFS - safe to run on Spot

Running task nodes on Spot Instances is a common cost-optimization pattern because losing a stateless task node does not lose HDFS data.
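A rough cost comparison shows why the pattern is popular. The sketch assumes one primary node, a single instance type for every node, and an EMR fee charged per node-hour on top of the EC2 price; all prices in the usage note are made-up placeholders, not real AWS rates, and `cluster_hourly_cost` is an illustrative helper.

```python
from typing import Optional


def cluster_hourly_cost(
    core_nodes: int,
    task_nodes: int,
    on_demand_price: float,
    emr_fee: float,
    spot_price: Optional[float] = None,
) -> float:
    """Estimate hourly cluster cost.

    The primary and core nodes always run On-Demand. Task nodes run on
    Spot when spot_price is given, otherwise On-Demand.
    """
    stateful_nodes = 1 + core_nodes  # one primary plus the core nodes
    task_ec2_price = on_demand_price if spot_price is None else spot_price
    return (
        stateful_nodes * (on_demand_price + emr_fee)
        + task_nodes * (task_ec2_price + emr_fee)
    )
```

With placeholder prices of 0.20 On-Demand, 0.06 Spot, and a 0.05 EMR fee, a cluster with 2 core and 10 task nodes costs 3.25 per hour entirely On-Demand and 1.85 per hour with the task nodes on Spot. The HDFS-bearing nodes are untouched in both cases.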

EMR vs Glue

	EMR	Glue
Management	You manage clusters	Serverless
Frameworks	Spark, Hadoop, Hive, Presto, HBase, Flink	Primarily Spark (Python shell jobs also available)
Control	Full (install custom software)	Limited
Cost model	EC2 instances + EMR fee	Per DPU-hour
Best for	Complex/custom big-data pipelines	Simple ETL + catalog

On the Exam: "Continuously load clickstream into S3 with minimal ops" -> Firehose. "Catalog schemas across S3 and RDS automatically" -> Glue crawlers + Data Catalog. "Run an existing Apache Hive/Presto pipeline at scale with full control" -> EMR.

Worked Scenario: Choosing Along the Pipeline

A retailer ingests millions of clickstream events per minute and wants both a real-time fraud check and a cheap path to analyze the same data later. The design splits by latency requirement. Send the raw stream into Kinesis Data Streams, where a Lambda consumer evaluates each event in roughly 200 ms for the real-time fraud signal. In parallel, attach Kinesis Data Firehose to deliver the same records into S3 in batches, where they land as compressed files. Then point a Glue crawler at the S3 prefix to populate the Data Catalog so analysts can query the lake with Athena.

If later the team needs heavy, custom Spark or Presto processing with installed third-party libraries, lift that batch layer onto EMR and run task nodes on Spot to cut cost.
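The selection logic in this scenario can be written down as a small decision helper. This is only a sketch of the guide's rules of thumb, not an official AWS decision procedure; the `Requirement` fields and the 60-second cutoff mirror the statements above, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    streaming: bool = False               # is the input a live stream?
    max_latency_seconds: float = 3600.0   # how fresh results must be
    custom_consumer_logic: bool = False   # bespoke per-record processing
    needs_replay: bool = False            # re-read past events on demand
    needs_cluster_control: bool = False   # custom software, Hive/Presto, etc.


def choose_service(req: Requirement) -> str:
    """Apply the guide's heuristics to pick a service."""
    if req.streaming:
        if (
            req.max_latency_seconds < 60
            or req.custom_consumer_logic
            or req.needs_replay
        ):
            return "Kinesis Data Streams"
        return "Kinesis Data Firehose"
    if req.needs_cluster_control:
        return "Amazon EMR"
    return "AWS Glue"
```

For the retailer, the fraud check (`streaming=True`, `max_latency_seconds=0.5`, `custom_consumer_logic=True`) maps to Data Streams, the archive path (`streaming=True` with relaxed latency) maps to Firehose, and the later batch layer with third-party libraries (`needs_cluster_control=True`) maps to EMR.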

Common Traps to Avoid

Firehose for sub-second processing. Firehose has a roughly 60-second minimum buffer; when the requirement is real-time (hundreds of milliseconds) custom processing, use Kinesis Data Streams.
Building consumers when Firehose suffices. If the goal is simply to land streaming data in S3/Redshift/OpenSearch with minimal ops, Firehose needs no consumers - do not over-engineer with Data Streams + KCL.
EMR when serverless ETL is enough. Standing up and managing EMR clusters is operational overhead; for straightforward Spark ETL and cataloging, Glue is serverless and simpler.
Shard limits. Each Kinesis shard ingests at most 1 MB/s or 1,000 records/s, so a high-volume stream needs enough shards to cover its peak rate, and the partition key should distribute load evenly while preserving per-shard ordering.
Core vs task nodes on Spot. Core nodes store HDFS data, so losing one risks data; run cost-saving Spot capacity on stateless task nodes instead.
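For the shard-limits trap, a back-of-the-envelope sizing function helps. The sketch applies the per-shard limits quoted in this section (1 MB/s or 1,000 records/s in, 2 MB/s out) and assumes load is spread evenly across shards, which a skewed partition key would break. `shards_needed` is an illustrative helper, not an AWS API.

```python
import math

WRITE_MB_PER_SHARD = 1.0        # ingest limit per shard, MB/s
WRITE_RECORDS_PER_SHARD = 1000  # ingest limit per shard, records/s
READ_MB_PER_SHARD = 2.0         # read limit per shard, MB/s


def shards_needed(
    ingest_mb_per_s: float,
    records_per_s: float,
    egress_mb_per_s: float = 0.0,
) -> int:
    """Return the smallest shard count that satisfies every limit.

    Assumption: traffic is evenly distributed across shards.
    """
    return max(
        1,
        math.ceil(ingest_mb_per_s / WRITE_MB_PER_SHARD),
        math.ceil(records_per_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(egress_mb_per_s / READ_MB_PER_SHARD),
    )
```

As a hypothetical example, 3 million events per minute is 50,000 records/s, so `shards_needed(ingest_mb_per_s=48.8, records_per_s=50000)` gives 50: the record-count limit, not the byte limit, is the binding one for small events.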

AWS Solutions Architect Associate
