2.5 Batch vs Stream Processing
Key Takeaways
- Batch processing handles large volumes of data on a schedule and tolerates minutes-to-hours of latency.
- Stream processing handles events one at a time or in micro-batches with sub-second to seconds latency.
- Micro-batch processing groups events into small batches (often a few seconds wide) to balance throughput and latency.
- Stream analytics queries aggregate events over time windows: tumbling (fixed, non-overlapping), hopping (fixed, overlapping), sliding (fixed length, evaluated only when events enter or leave the window), and session (gap-defined).
- Azure Stream Analytics, Event Hubs, IoT Hub, and Microsoft Fabric Real-Time Intelligence are the core streaming services; Azure Synapse Analytics, Azure Data Factory, and Fabric Data Factory dominate batch.
Every analytics platform has to decide when to process data: in scheduled bulk runs (batch) or as each event arrives (stream). Microsoft Learn treats this as one of the central DP-900 distinctions because it drives service choice.
Batch Processing
Batch processing collects data into a group and processes it as a single job. The classic example is a nightly ETL that loads yesterday's sales into the warehouse at 2 AM.
Batch Characteristics
- Large volume per run — gigabytes to terabytes.
- Scheduled or triggered, not continuous.
- Latency-tolerant — minutes to hours between when data arrives and when it is processed.
- Throughput-optimized — the goal is to move a lot of rows efficiently, not to react to any single row quickly.
- Idempotent reruns — failed jobs are rerun on the same input.
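To make the idempotent-rerun point concrete, here is a minimal Python sketch. It assumes a hypothetical in-memory `warehouse` dictionary keyed by run date and a made-up `run_batch` function; neither is part of any Azure service. The job overwrites the partition for its run date rather than appending to it, so rerunning it on the same input leaves the same result.

```python
from collections import defaultdict

def run_batch(rows, warehouse, run_date):
    """Aggregate one day's sales per store and overwrite that day's partition."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["store"]] += row["amount"]
    # Overwrite (never append to) the partition for run_date,
    # so a rerun replaces the earlier output instead of doubling it.
    warehouse[run_date] = dict(totals)
    return warehouse

# Hypothetical input for a single nightly run.
sales = [
    {"store": "A", "amount": 10.0},
    {"store": "B", "amount": 5.0},
    {"store": "A", "amount": 2.5},
]

warehouse = {}
run_batch(sales, warehouse, "2024-01-01")
run_batch(sales, warehouse, "2024-01-01")  # rerun after a simulated failure
print(warehouse)
```

Had the job appended rows instead of overwriting the partition, the second run would have doubled every total.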
Azure Services for Batch
| Service | Role |
|---|---|
| Azure Data Factory / Fabric Data Factory | Orchestrates pipelines, copy activities, mapping data flows |
| Azure Synapse Analytics | Runs T-SQL or Spark batch jobs against warehouse and lake data |
| Microsoft Fabric (Warehouse, Lakehouse, Notebooks) | Unified storage and compute for batch ELT |
| Azure Databricks | Spark-based batch processing for large lake transformations |
| Azure HDInsight | Managed Hadoop/Spark for legacy batch workloads |
Stream Processing
Stream processing treats each event as it arrives and produces results continuously. The classic example is a fraud detection system that scores every credit-card swipe in under a second.
Stream Characteristics
- Unbounded — data has no defined end.
- Event-driven — work is triggered by the arrival of new data, not by a clock.
- Low latency — milliseconds to seconds end to end.
- Per-event or windowed — operations either apply to single events or to events grouped by a time window.
- State management — engines maintain running aggregates across windows.
Azure Services for Stream
| Service | Role |
|---|---|
| Azure Event Hubs | Big-data event ingestion broker; the typical front door for streams |
| Azure IoT Hub | Bi-directional event ingestion specifically for IoT devices |
| Azure Stream Analytics | SQL-style streaming query engine over Event Hubs / IoT Hub |
| Microsoft Fabric Real-Time Intelligence (Eventstream / Eventhouse / KQL) | Unified streaming ingest, processing, and analytics inside Fabric |
| Azure Databricks Structured Streaming | Spark-based micro-batch and continuous streams |
Micro-Batch Processing
Micro-batch processing sits between batch and true streaming. The engine collects events for a short interval (often a few seconds) and processes that small batch as a unit. Spark Structured Streaming, including the versions in Azure Databricks and Microsoft Fabric, runs as micro-batches by default. The approach gets most of the low-latency benefit of streaming while keeping much of the throughput and the simpler exactly-once semantics of batch.
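The grouping step can be sketched in a few lines of Python. This is an illustration of the idea only, not how Spark schedules its micro-batches; the `micro_batches` function, the arrival timestamps in milliseconds, and the 2-second interval are all assumptions made for the example.

```python
def micro_batches(events, interval_ms):
    """Group (arrival_ms, payload) events into consecutive fixed-interval batches."""
    batches = {}
    for arrival_ms, payload in events:
        # Every event that arrives in the same interval joins the same batch.
        batch_start = (arrival_ms // interval_ms) * interval_ms
        batches.setdefault(batch_start, []).append(payload)
    # Return the batches in time order; each one is processed as a unit.
    return [batches[start] for start in sorted(batches)]

# Hypothetical events: (arrival time in ms, payload).
events = [(200, "a"), (1700, "b"), (2100, "c"), (4900, "d"), (5000, "e")]

for batch in micro_batches(events, interval_ms=2000):
    print(len(batch), batch)
```

A shorter interval lowers latency but produces more, smaller batches; a longer one does the opposite.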
Time Windows in Stream Analytics
Streams are unbounded, so almost every useful aggregation is computed per window. Azure Stream Analytics and Fabric Real-Time Intelligence support four window types you should recognize for DP-900.
| Window | Shape | Example use |
|---|---|---|
| Tumbling | Fixed length, non-overlapping, contiguous | Count purchases per 5-minute bucket |
| Hopping | Fixed length, overlapping by a hop interval | 5-minute average that updates every 1 minute |
| Sliding | Fixed length, produces output only when an event enters or leaves the window | Trigger only when ≥3 alerts arrive in any 30-second period |
| Session | Variable length, defined by gaps of inactivity | Group clicks by a user until they are idle for 10 minutes |
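To show how tumbling and hopping windows differ, the following Python sketch assigns a single event timestamp to its window or windows. It is a simplified model, not the Azure Stream Analytics implementation: the function names are hypothetical, timestamps are plain seconds starting at 0, and windows are treated as half-open `[start, end)` intervals, which may not match a given engine's boundary rules.

```python
def tumbling_window(ts, size):
    """Return the single [start, end) window of length `size` that contains ts."""
    start = (ts // size) * size
    return (start, start + size)

def hopping_windows(ts, size, hop):
    """Return every [start, end) window of length `size` that contains ts.

    Windows start at multiples of `hop`; this sketch ignores windows
    that would start before time 0.
    """
    windows = []
    start = (ts // hop) * hop  # latest window start that is <= ts
    while start > ts - size and start >= 0:
        windows.append((start, start + size))
        start -= hop
    return sorted(windows)

# An event at t = 130 s with 1-minute tumbling windows lands in one bucket.
print(tumbling_window(130, 60))

# The same event with 5-minute windows hopping every minute lands in several.
print(hopping_windows(130, 300, 60))
```

Once the stream has been running for at least one full window length, each event falls into `size / hop` hopping windows (five in the 5-minute, 1-minute-hop example), whereas a tumbling window always yields exactly one.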
Quick Mental Picture
- Tumbling is a row of train cars — no gaps, no overlap.
- Hopping is overlapping lanes on a highway — every event lands in several windows at once.
- Sliding is a tripwire — the window only fires when something crosses it.
- Session is a movie theater — it lasts as long as the audience keeps clapping; once they stop for long enough, the window closes.
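The tripwire and theater pictures can be sketched in Python as well. Both functions below are hypothetical simplifications over timestamps in seconds: the sliding check is evaluated only when an event arrives, and the session grouping ignores any maximum session duration that a real engine might enforce.

```python
from collections import deque

def sliding_trigger(timestamps, size, threshold):
    """Return arrival times at which >= threshold events fall in the trailing `size` seconds."""
    recent = deque()
    fired = []
    for ts in sorted(timestamps):
        recent.append(ts)
        # Drop events that have aged out of the trailing window.
        while recent[0] <= ts - size:
            recent.popleft()
        if len(recent) >= threshold:
            fired.append(ts)
    return fired

def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by more than `gap` seconds of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)   # activity continues the current session
        else:
            sessions.append([ts])     # idle too long: start a new session
    return sessions

# Three alerts within 30 seconds trip the wire only at t = 25.
print(sliding_trigger([0, 10, 25, 70, 75], size=30, threshold=3))

# Clicks separated by more than 10 minutes (600 s) start a new session.
print(session_windows([0, 120, 500, 1300, 1350], gap=600))
```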
When to Combine Batch and Stream
Many real systems use both:
- Stream path powers live dashboards, alerting, and anomaly detection.
- Batch path rebuilds the historical record nightly, often with corrections, late-arriving data, and richer joins.
Microsoft Fabric Real-Time Intelligence is designed for exactly this pattern: events flow through Eventstreams, land in an Eventhouse for low-latency KQL queries, and can be persisted to OneLake for downstream batch and Power BI consumption.
For DP-900, the test is usually simpler: read the scenario, find the latency requirement ("within seconds" vs "by tomorrow morning"), and pick a service from the matching category (batch or stream).
Latency, Throughput, and the Core Trade-Off
Batch and stream sit at opposite ends of a latency-versus-completeness trade-off. Batch waits until it has a complete set of data, then processes it efficiently in one pass — high throughput, high latency, easy correctness. Streaming acts the instant an event arrives — low latency, but it must cope with out-of-order and late-arriving events because the network does not deliver events in perfect order.
This is why streaming engines use watermarks (markers that say "I have probably seen all events up to time T") to decide when a time window is safe to close. You will not be asked to configure watermarks, but recognizing that streaming trades completeness for immediacy is squarely on the exam.
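A toy watermark can be sketched as follows. This is one common heuristic (the largest event time seen so far minus an allowed lateness), chosen here as an assumption for illustration; real engines differ in how they compute watermarks and in what they do with late events. The function name and its parameters are hypothetical.

```python
def run_with_watermark(event_times, window_size, allowed_lateness):
    """Count events per tumbling window, closing windows as the watermark passes them.

    event_times are event-time stamps listed in ARRIVAL order.
    Returns (closed window counts, still-open window counts, dropped late events).
    """
    open_counts, closed, late = {}, {}, []
    watermark = float("-inf")
    for et in event_times:
        start = (et // window_size) * window_size
        if start + window_size <= watermark:
            late.append(et)  # its window is already considered complete
            continue
        open_counts[start] = open_counts.get(start, 0) + 1
        # Assumed heuristic: everything older than (max event time - lateness) has been seen.
        watermark = max(watermark, et - allowed_lateness)
        for s in sorted(open_counts):
            if s + window_size <= watermark:
                closed[s] = open_counts.pop(s)
    return closed, open_counts, late

# Event 8 arrives out of order but in time; event 9 arrives after its window closed.
closed, still_open, late = run_with_watermark(
    [5, 12, 8, 27, 9, 31], window_size=10, allowed_lateness=5
)
print(closed, still_open, late)
```

Raising `allowed_lateness` keeps windows open longer, which admits more stragglers at the cost of delayed results. That is the completeness-versus-immediacy trade-off in miniature.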
Event Time vs Processing Time
A subtle but testable streaming concept: event time is when the event actually happened (stamped at the source), while processing time is when the engine handled it. A sensor reading generated at 12:00:00 might not reach the cloud until 12:00:07 because of a network hiccup. Windowed aggregations should use event time so a late reading still counts in the correct minute, not the minute it happened to arrive. This distinction explains why windowing functions exist and why late data must be handled.
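The effect is easy to see by bucketing the same readings twice, once by each timestamp. In this Python sketch the field names `event_time` and `processing_time` and the sample values (seconds since an arbitrary start) are assumptions for illustration.

```python
from collections import Counter

def count_per_minute(readings, field):
    """Count readings per 1-minute bucket using the chosen timestamp field."""
    return dict(Counter(r[field] // 60 for r in readings))

# Hypothetical sensor readings; the second one is delayed 7 seconds in transit.
readings = [
    {"event_time": 50, "processing_time": 52},
    {"event_time": 58, "processing_time": 65},
    {"event_time": 70, "processing_time": 71},
]

by_event = count_per_minute(readings, "event_time")
by_processing = count_per_minute(readings, "processing_time")
print("by event time:     ", by_event)
print("by processing time:", by_processing)
```

Grouping by event time puts the delayed reading in minute 0, where it occurred; grouping by processing time shifts it into minute 1.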
Choosing a Service From the Scenario
| Scenario clue | Pattern | Azure service |
|---|---|---|
| "Every night," "once a day," "scheduled load" | Batch | Data Factory / Fabric Data Factory, Synapse |
| "Within seconds," "real time," "as it happens" | Stream | Event Hubs + Stream Analytics, Fabric RTI |
| "Per device," "telemetry," "sensors" | Stream ingest | IoT Hub |
| "Group events into a few-second batch" | Micro-batch | Spark Structured Streaming (Databricks/Fabric) |
Why Many Systems Use Both (Lambda/Kappa)
The lambda architecture runs a fast speed (hot) layer for immediate, approximate results and a slower batch (cold) layer for complete, corrected history, then serves a merged view. The kappa architecture simplifies this by treating everything as a stream and replaying the event log for reprocessing. For DP-900 the key takeaway is recognizing the dual-path idea: the same Event Hubs ingestion endpoint can feed a real-time dashboard and land raw events in the lake for a nightly, full-fidelity rebuild — which is exactly what Microsoft Fabric Real-Time Intelligence is designed to support.
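The merged view of a lambda architecture can be sketched as a small Python function. Everything here is hypothetical: the function name, the per-day totals, and the rule that batch results win for every period the batch layer has already rebuilt are assumptions chosen to illustrate the dual-path idea, not a description of any Fabric or Azure feature.

```python
def merged_view(batch_totals, speed_totals, last_batch_period):
    """Serve batch results for rebuilt periods and speed-layer results for newer ones."""
    view = dict(batch_totals)  # complete, corrected history
    for period, value in speed_totals.items():
        if period > last_batch_period:  # ISO dates compare correctly as strings
            view[period] = value        # fresh but approximate
    return view

# The nightly batch job has rebuilt everything through 2024-01-02.
batch_totals = {"2024-01-01": 100, "2024-01-02": 120}
# The speed layer has approximate counts, including a stale one for 2024-01-02.
speed_totals = {"2024-01-02": 118, "2024-01-03": 40}

view = merged_view(batch_totals, speed_totals, last_batch_period="2024-01-02")
print(view)
```

The corrected batch figure replaces the approximate streaming one for 2024-01-02, while 2024-01-03 is served from the speed layer until the next nightly rebuild.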
A logistics company needs to calculate the number of package scans per warehouse in non-overlapping 1-minute buckets, with each scan event belonging to exactly one bucket. Which Azure Stream Analytics window type fits this requirement?
A wind farm needs to score turbine telemetry within two seconds of each reading to detect anomalies, while a separate nightly job rebuilds historical turbine performance tables in a Fabric lakehouse. Which combination of Azure services best fits this scenario?