2.7 Structured Streaming Fundamentals
Key Takeaways
- Structured Streaming treats a stream as an unbounded table; the same DataFrame API serves batch and streaming via spark.readStream / writeStream.
- The checkpointLocation persists offsets and state so a restarted query resumes where it left off; combined with a replayable source and an idempotent sink, this yields exactly-once processing.
- Triggers control batch cadence: default (ASAP), processingTime (fixed interval), and availableNow (process all then stop); Trigger.Once is deprecated in favor of availableNow.
- Output modes are append (new rows only), complete (full result every batch, for aggregations), and update (changed rows only).
- Watermarks (withWatermark) bound state for event-time aggregations by defining how late data may arrive before being dropped.
The Unbounded Table Model
Structured Streaming is Spark's scalable, fault-tolerant stream-processing engine. Its central abstraction: a data stream is an unbounded table that grows as new records arrive. You express computation with the same DataFrame/SQL API as batch, and Spark incrementally runs it as a series of small batches (micro-batches). You read with spark.readStream and write with df.writeStream:
stream = (spark.readStream.table("bronze.events")
          .groupBy("type").count())
(stream.writeStream
    .option("checkpointLocation", "/chk/agg")
    .outputMode("complete")
    .trigger(processingTime="30 seconds")
    .toTable("gold.event_counts"))
The checkpointLocation is mandatory for fault tolerance: it stores the stream's progress (source offsets) and any operator state. If the job fails or restarts, it resumes from the last committed offset, delivering exactly-once end-to-end guarantees when paired with replayable sources and idempotent sinks like Delta. Each independent stream needs its own checkpoint directory.
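The recovery behavior described above can be illustrated with a toy model in plain Python. This is not Spark's implementation: the names `Checkpoint`, `IdempotentSink`, and `run_stream` are hypothetical, and the model assumes only what the text states, namely a replayable source, a checkpoint that records committed offsets, and a sink that overwrites rather than duplicates when a batch is replayed.

```python
# Toy model of checkpoint-based recovery. All names are hypothetical and
# this is NOT how Spark is implemented internally.

class Checkpoint:
    """Remembers the next source offset to read and the next batch id."""
    def __init__(self):
        self.committed_offset = 0
        self.next_batch_id = 0


class IdempotentSink:
    """Stores output keyed by batch id, so replaying a batch overwrites it."""
    def __init__(self):
        self.batches = {}

    def write(self, batch_id, rows):
        self.batches[batch_id] = list(rows)

    def all_rows(self):
        return [row for bid in sorted(self.batches) for row in self.batches[bid]]


def run_stream(source, checkpoint, sink, batch_size, crash_before_commit_of=None):
    """Process the source in micro-batches, resuming from the checkpoint.

    If crash_before_commit_of equals a batch id, a failure is simulated
    after the sink write but before the checkpoint commit.
    """
    while checkpoint.committed_offset < len(source):
        start = checkpoint.committed_offset
        end = min(start + batch_size, len(source))
        batch_id = checkpoint.next_batch_id
        # Replayable source: the same offsets can be read again after a crash.
        sink.write(batch_id, source[start:end])
        if crash_before_commit_of == batch_id:
            raise RuntimeError(f"simulated crash in batch {batch_id}")
        # Progress is committed only after the write succeeded.
        checkpoint.committed_offset = end
        checkpoint.next_batch_id = batch_id + 1


source = ["e0", "e1", "e2", "e3", "e4"]
checkpoint, sink = Checkpoint(), IdempotentSink()

try:
    run_stream(source, checkpoint, sink, batch_size=2, crash_before_commit_of=1)
except RuntimeError:
    pass  # the "job" died; the checkpoint still points at offset 2, batch 1

# Restart with the same checkpoint: batch 1 is replayed and overwritten.
run_stream(source, checkpoint, sink, batch_size=2)
result = sink.all_rows()
print(result)
```

In this sketch the crashed batch is written twice, yet every record appears in the sink exactly once because the second write replaces the first. With a sink that blindly appended, the same replay would produce duplicates, which is why the guarantee depends on the source and sink as well as the checkpoint.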
Triggers
The trigger controls how often a streaming query executes a micro-batch:
| Trigger | Behavior |
|---|---|
| Default (unspecified) | Process as soon as the previous batch finishes |
| processingTime="10 seconds" | Fire at a fixed wall-clock interval |
| availableNow=True | Process all available data, then stop |
| once=True (deprecated) | Single batch then stop |
Trigger.AvailableNow is the recommended choice for incremental batch jobs: it consumes every unprocessed record across multiple internal batches and then shuts the query down, which is ideal for a scheduled workflow that wakes, ingests new files, and exits. Trigger.Once is deprecated in Databricks Runtime 11.3 LTS and above; use availableNow instead, because Once processes the whole backlog in a single batch and ignores rate limits such as maxFilesPerTrigger.
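The difference between the two run-then-stop triggers can be sketched with a toy model in plain Python. `plan_batches` is a hypothetical helper, not a Spark API; it assumes that availableNow drains the backlog in several batches that respect maxFilesPerTrigger, whereas once puts the entire backlog into one batch.

```python
# Toy model of how the two "run then stop" triggers split a backlog.
# plan_batches is a hypothetical helper, not part of Spark.

def plan_batches(pending_files, max_files_per_trigger, trigger):
    """Return the micro-batches (each a list of files) for one run."""
    if trigger == "once":
        # One batch holding the whole backlog; the rate limit is not applied.
        return [list(pending_files)] if pending_files else []
    if trigger == "availableNow":
        # The backlog is drained in several batches, each within the limit.
        return [
            pending_files[i:i + max_files_per_trigger]
            for i in range(0, len(pending_files), max_files_per_trigger)
        ]
    raise ValueError(f"unknown trigger: {trigger}")


backlog = [f"file_{n}.json" for n in range(5)]
once_batches = plan_batches(backlog, max_files_per_trigger=2, trigger="once")
available_now_batches = plan_batches(backlog, max_files_per_trigger=2, trigger="availableNow")

print([len(b) for b in once_batches])           # sizes of batches under once
print([len(b) for b in available_now_batches])  # sizes of batches under availableNow
```

Both runs process the same five files and then stop. The practical difference is batch size: a large backlog handled as one oversized batch can strain cluster memory, while bounded batches keep each step predictable.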
Output Modes and Watermarks
The output mode declares which rows are written to the sink each trigger:
| Mode | Writes | Typical use |
|---|---|---|
| append | Only new rows that will not change | Non-aggregated streams; default |
| complete | The entire result table every batch | Aggregations (small result sets) |
| update | Only rows changed since last batch | Aggregations with incremental sinks |
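To make the table concrete, here is a toy model in plain Python of what a running count per key would emit under complete and update modes. `run_counts` and the sample batches are hypothetical and only mimic the semantics in the table; append is left out because, for aggregations, it needs a watermark to decide when a row is final.

```python
# Toy model of output modes for a running count per key.
# run_counts is a hypothetical helper, not a Spark API.
from collections import Counter


def run_counts(batches, mode):
    """Return what the query would hand to the sink after each micro-batch."""
    totals = Counter()   # the full result table, kept across batches
    emitted = []
    for batch in batches:
        changed = Counter(batch)   # keys touched by this batch
        totals.update(changed)
        if mode == "complete":
            emitted.append(dict(totals))                      # entire result table
        elif mode == "update":
            emitted.append({k: totals[k] for k in changed})   # changed rows only
        else:
            raise ValueError(f"unsupported mode in this sketch: {mode}")
    return emitted


batches = [["click", "view"], ["click"]]
complete_out = run_counts(batches, "complete")
update_out = run_counts(batches, "update")

print(complete_out)
print(update_out)
```

After the second batch, complete mode rewrites both rows even though only the click count changed, while update mode emits just the click row. That is why complete suits small result sets and update suits sinks that can apply incremental changes.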
For event-time aggregations, state can grow unbounded because late records could in theory arrive at any time. A watermark (withWatermark("eventTime", "10 minutes")) tells Spark the maximum lateness to tolerate: records later than the watermark are dropped, and the state for closed windows is cleaned up. This bounds memory and lets append mode emit finalized aggregation results once their window can no longer change.
from pyspark.sql.functions import window
(events.withWatermark("ts", "10 minutes")
    .groupBy(window("ts", "5 minutes")).count())
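The interaction between lateness, dropped records, and state cleanup can be sketched with a toy model in plain Python. This is a deliberate simplification, not Spark's state store logic: `run_with_watermark` is hypothetical, event times are plain integers in minutes, and the watermark is recomputed once per micro-batch from the largest event time seen in earlier batches.

```python
# Toy model of watermarking with tumbling windows. Simplified and
# hypothetical; times are integers in minutes.

def run_with_watermark(batches, window_minutes, delay_minutes):
    """Count events per tumbling window and finalize windows behind the watermark.

    Returns (finalized, dropped, open_state): finalized maps window start to
    its final count, dropped lists event times rejected as too late, and
    open_state holds counts for windows that can still change.
    """
    state = {}       # window start -> running count
    finalized = {}   # window start -> final count (what append mode would emit)
    dropped = []
    max_event_time = None

    for batch in batches:
        # The watermark trails the largest event time seen in earlier batches.
        watermark = None if max_event_time is None else max_event_time - delay_minutes

        if watermark is not None:
            # Windows ending at or before the watermark can no longer change.
            for start in sorted(s for s in state if s + window_minutes <= watermark):
                finalized[start] = state.pop(start)

        for event_time in batch:
            start = event_time - event_time % window_minutes
            if watermark is not None and start + window_minutes <= watermark:
                dropped.append(event_time)   # too late: its window is closed
            else:
                state[start] = state.get(start, 0) + 1

        if batch:
            batch_max = max(batch)
            max_event_time = batch_max if max_event_time is None else max(max_event_time, batch_max)

    return finalized, dropped, state


# 5-minute windows, 10 minutes of tolerated lateness.
batches = [[1, 3, 7], [22, 4], [2, 23]]
finalized, dropped, open_state = run_with_watermark(batches, window_minutes=5, delay_minutes=10)

print(finalized)
print(dropped)
print(open_state)
```

In this run the event at minute 4 arrives out of order but is still counted, because the watermark has not yet passed its window. The event at minute 2 arrives one batch later, after the watermark has advanced to minute 12, so it is dropped and the two oldest windows leave the state.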
Sources, Sinks, and Streaming Limitations
Structured Streaming reads from and writes to a range of systems. On Databricks the most common pairing is a Delta table as both source and sink:
| Sources | Sinks |
|---|---|
| Delta tables, Auto Loader (cloudFiles) | Delta tables (toTable) |
| Apache Kafka, Event Hubs | Kafka |
| Rate source (testing) | Console / memory (testing) |
| Files (json/csv/parquet) | foreachBatch (arbitrary sink) |
Reading a Delta table with readStream turns every new commit into streaming input — the foundation of the medallion bronze→silver→gold pattern, where each layer streams from the one below. The foreachBatch sink is the escape hatch for operations the streaming engine does not natively support: each micro-batch is handed to you as a normal DataFrame, so you can MERGE, write to multiple tables, or call external APIs.
Not every batch operation is legal in a stream. Multiple aggregations, certain non-time-based joins, and count()/show() actions are restricted; you also cannot sort a streaming DataFrame except after an aggregation in complete mode. Append output mode cannot be used with aggregations unless a watermark lets Spark finalize windows. Understanding these constraints — and that a checkpoint ties a query to its exact logical plan (changing the plan can invalidate the checkpoint) — is core to operating streams reliably.
For most certification scenarios, the pattern to remember is: readStream a Delta/Auto Loader source, transform with the standard API, set a checkpointLocation, pick availableNow or processingTime, choose the matching output mode, and write with toTable.
Managing and Monitoring a Streaming Query
A writeStream call returns a StreamingQuery handle. With it you control the lifecycle: query.awaitTermination() blocks until the stream stops, query.stop() shuts it down gracefully, and query.status / query.recentProgress expose throughput and latency metrics. In a notebook, a streaming cell stays 'running' and renders a live dashboard of input rows per second and batch duration — useful for spotting backpressure.
Naming a stream with .queryName("events") makes it identifiable in the Spark UI's Structured Streaming tab. The key health signal is input rate versus processing rate: if input consistently exceeds processing, the stream is falling behind and you should scale the cluster or tune per-batch limits such as maxFilesPerTrigger.
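The input-versus-processing comparison can be turned into a simple check. The sketch below is a hypothetical helper in plain Python; it assumes each progress entry is dict-like and carries `inputRowsPerSecond` and `processedRowsPerSecond`, as entries from query.recentProgress typically do in PySpark, and the sample numbers are made up for illustration.

```python
# Hypothetical helper: flags a stream as falling behind when the input rate
# exceeds the processing rate in every one of the most recent entries.

def is_falling_behind(progress_entries, min_batches=3):
    recent = progress_entries[-min_batches:]
    if len(recent) < min_batches:
        return False  # not enough evidence yet
    return all(
        p.get("inputRowsPerSecond", 0.0) > p.get("processedRowsPerSecond", 0.0)
        for p in recent
    )


# Made-up samples standing in for query.recentProgress.
healthy = [
    {"inputRowsPerSecond": 100.0, "processedRowsPerSecond": 250.0},
    {"inputRowsPerSecond": 120.0, "processedRowsPerSecond": 240.0},
    {"inputRowsPerSecond": 300.0, "processedRowsPerSecond": 260.0},  # one spike
]
lagging = [
    {"inputRowsPerSecond": 500.0, "processedRowsPerSecond": 300.0},
    {"inputRowsPerSecond": 520.0, "processedRowsPerSecond": 310.0},
    {"inputRowsPerSecond": 510.0, "processedRowsPerSecond": 305.0},
]

healthy_flag = is_falling_behind(healthy)
lagging_flag = is_falling_behind(lagging)
print(healthy_flag, lagging_flag)
```

Requiring several consecutive slow batches is a design choice in this sketch: a single spike in input is normal, whereas a sustained gap means the backlog is growing.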
Reliability rests on a few rules the exam reinforces:
- One dedicated checkpoint per query — never share a checkpoint between two streams.
- Pair a replayable source (Auto Loader, Kafka, or a Delta table) with an idempotent sink (Delta) for true exactly-once.
- Changing the query's logical plan can invalidate the checkpoint; plan schema changes carefully.
- Prefer Trigger.AvailableNow for scheduled incremental jobs and processingTime for always-on low-latency streams.
Following these turns a streaming query into a dependable, restartable pipeline rather than a fragile long-running process. Finally, remember that the exact same transformation code can run as a batch job by swapping readStream/writeStream for read/write — the unified API is a core selling point, and the certification expects you to recognize that streaming on Databricks is just batch executed incrementally over an unbounded table with checkpoint-backed state.
Which trigger processes all currently available data across multiple internal batches and then stops, and is the recommended replacement for the deprecated Trigger.Once?
A streaming query computes a running count per category and writes the entire aggregated result table every batch. Which output mode is this?
What is the purpose of withWatermark on an event-time streaming aggregation?