2.7 Structured Streaming Fundamentals

Key Takeaways

  • Structured Streaming treats a stream as an unbounded table; the same DataFrame API serves batch and streaming via spark.readStream / writeStream.
  • The checkpointLocation persists offsets and state so a restarted query resumes where it left off; paired with a replayable source and an idempotent sink, this gives exactly-once processing.
  • Triggers control batch cadence: default (ASAP), processingTime (fixed interval), and availableNow (process all then stop); Trigger.Once is deprecated in favor of availableNow.
  • Output modes are append (new rows only), complete (full result every batch, for aggregations), and update (changed rows only).
  • Watermarks (withWatermark) bound state for event-time aggregations by defining how late data may arrive before being dropped.
Last updated: June 2026

The Unbounded Table Model

Structured Streaming is Spark's scalable, fault-tolerant stream-processing engine. Its central abstraction: a data stream is an unbounded table that grows as new records arrive. You express computation with the same DataFrame/SQL API as batch, and Spark incrementally runs it as a series of small batches (micro-batches). You read with spark.readStream and write with df.writeStream:

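# Assumes an active SparkSession named spark (predefined in Databricks notebooks).
stream = (spark.readStream.table("bronze.events")   # streaming read of the table
  .groupBy("type").count())                         # running count per event type

(stream.writeStream
  .option("checkpointLocation", "/chk/agg")         # stores offsets and aggregation state
  .outputMode("complete")                           # rewrite the full result every batch
  .trigger(processingTime="30 seconds")             # one micro-batch every 30 seconds
  .toTable("gold.event_counts"))                    # starts the query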

The checkpointLocation is mandatory for fault tolerance: it stores the stream's progress (source offsets) and any operator state. If the job fails or restarts, it resumes from the last committed offset, delivering exactly-once end-to-end guarantees when paired with replayable sources and idempotent sinks like Delta. Each independent stream needs its own checkpoint directory.
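The interplay of replayable source, checkpoint, and idempotent sink can be sketched in plain Python. This is a toy model, not Spark's implementation: the Checkpoint class, run_batch function, and the dict-based sink are hypothetical stand-ins. The assumption is that offsets for a batch are recorded before processing and the batch is marked committed only after the sink write succeeds, so a crash in between causes a replay of the same offset range.

```python
class Checkpoint:
    """Toy stand-in for a checkpoint directory (hypothetical, not Spark's format)."""

    def __init__(self):
        self.planned = {}    # batch_id -> (start, end) offsets, recorded before processing
        self.committed = -1  # highest batch_id whose sink write is known to be complete


def run_batch(source, checkpoint, sink, batch_size, crash_before_commit=False):
    """Process the next micro-batch; replay the same offsets if it was planned before."""
    batch_id = checkpoint.committed + 1
    if batch_id not in checkpoint.planned:
        start = checkpoint.planned[batch_id - 1][1] if batch_id > 0 else 0
        end = min(start + batch_size, len(source))
        checkpoint.planned[batch_id] = (start, end)
    start, end = checkpoint.planned[batch_id]
    # Idempotent sink: output is keyed by batch_id, so a replay overwrites, never duplicates.
    sink[batch_id] = source[start:end]
    if crash_before_commit:
        raise RuntimeError("simulated crash after sink write, before commit")
    checkpoint.committed = batch_id


source = ["e0", "e1", "e2", "e3", "e4"]  # replayable: any offset range can be re-read
checkpoint, sink = Checkpoint(), {}

run_batch(source, checkpoint, sink, batch_size=2)          # batch 0
try:
    run_batch(source, checkpoint, sink, batch_size=2, crash_before_commit=True)
except RuntimeError:
    pass                                                   # batch 1 written but not committed
run_batch(source, checkpoint, sink, batch_size=2)          # restart: batch 1 is replayed
run_batch(source, checkpoint, sink, batch_size=2)          # batch 2

delivered = [row for batch_id in sorted(sink) for row in sink[batch_id]]
print(delivered)
```

Even though batch 1 was written twice, every record appears in the sink exactly once, because the replay used the same offsets and the same batch key.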

Triggers

The trigger controls how often a streaming query executes a micro-batch:

Trigger                        Behavior
Default (unspecified)          Process as soon as the previous batch finishes
processingTime="10 seconds"    Fire at a fixed wall-clock interval
availableNow=True              Process all available data, then stop
once=True (deprecated)         Single batch, then stop

Trigger.AvailableNow is the recommended choice for incremental batch jobs: it consumes every record that was unprocessed when the run started, split across multiple internal batches if needed, and then shuts the query down. That is ideal for a scheduled workflow that wakes, ingests new files, and exits. Trigger.Once is deprecated in Databricks Runtime 11.3 LTS and above. Use availableNow instead: Once processes everything in a single batch and ignores rate limits such as maxFilesPerTrigger.
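The difference in batching between the two run-then-stop triggers can be illustrated with a small pure-Python model. The function plan_micro_batches and the file names are hypothetical; the sketch only mimics how a per-batch file limit would or would not be applied, and does not call Spark.

```python
def plan_micro_batches(pending_files, trigger, max_files_per_trigger):
    """Return the micro-batches one run would execute, as lists of file names."""
    if trigger == "once":
        # Single batch containing everything; the rate limit is not applied.
        return [list(pending_files)] if pending_files else []
    if trigger == "availableNow":
        # Same data, but chunked so each batch respects the rate limit.
        return [
            pending_files[i:i + max_files_per_trigger]
            for i in range(0, len(pending_files), max_files_per_trigger)
        ]
    raise ValueError(f"unsupported trigger in this sketch: {trigger}")


files = ["f0.json", "f1.json", "f2.json", "f3.json", "f4.json"]

available_now_batches = plan_micro_batches(files, "availableNow", max_files_per_trigger=2)
once_batches = plan_micro_batches(files, "once", max_files_per_trigger=2)

print(len(available_now_batches), "batches with availableNow")
print(len(once_batches), "batch with once")
```

Both runs process the same five files and then stop; only availableNow keeps each batch within the configured limit, which keeps memory use and batch duration predictable on a large backlog.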

Output Modes and Watermarks

The output mode declares which rows are written to the sink each trigger:

Mode       Writes                                Typical use
append     Only new rows that will not change    Non-aggregated streams; default
complete   The entire result table every batch   Aggregations (small result sets)
update     Only rows changed since last batch    Aggregations with incremental sinks
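To make the difference between complete and update concrete, here is a minimal pure-Python sketch of a running count per key. The emit function and the sample batches are illustrative assumptions, not Spark code. Append is left out on purpose, because an aggregation can only use it once a watermark lets results become final.

```python
from collections import Counter


def emit(batches, mode):
    """Return what a running count would write to the sink after each micro-batch."""
    totals = Counter()
    outputs = []
    for batch in batches:
        changed_keys = set(batch)   # keys whose count changes in this batch
        totals.update(batch)
        if mode == "complete":
            outputs.append(dict(totals))                                  # whole result table
        elif mode == "update":
            outputs.append({k: totals[k] for k in sorted(changed_keys)})  # changed rows only
        else:
            raise ValueError(f"mode not modelled in this sketch: {mode}")
    return outputs


batches = [["click", "view"], ["click"]]

complete_out = emit(batches, "complete")
update_out = emit(batches, "update")

print(complete_out)
print(update_out)
```

After the second batch, complete rewrites both rows while update writes only the row for "click", which is why update suits sinks that can upsert and complete suits small result tables.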

For event-time aggregations, state can grow unbounded because late records could in theory arrive at any time. A watermark (withWatermark("eventTime", "10 minutes")) tells Spark the maximum lateness to tolerate. The watermark trails the latest event time seen so far by that threshold: records that fall behind it can be dropped, and the state for closed windows is cleaned up. This bounds memory and lets append mode emit finalized aggregation results once their window can no longer change.

from pyspark.sql.functions import window

(events.withWatermark("ts", "10 minutes")   # tolerate up to 10 minutes of lateness
  .groupBy(window("ts", "5 minutes")).count())
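The bookkeeping behind a watermark can be sketched in plain Python. This is a simplified model under stated assumptions, not Spark's engine: event times are integer minutes, windows are tumbling, the watermark advances at the end of each micro-batch to the maximum event time seen minus the delay, and a window is treated as closed once its end is at or behind the watermark. The function name run_with_watermark is hypothetical.

```python
def run_with_watermark(batches, window_min=5, delay_min=10):
    """Count events per tumbling window, dropping late data and finalizing closed windows."""
    open_windows = {}                # window start -> count (state kept in memory)
    finalized = {}                   # window start -> final count (safe for append mode)
    dropped = []                     # events that arrived after their window closed
    watermark = float("-inf")
    max_event_time = float("-inf")

    for batch in batches:
        for t in batch:
            max_event_time = max(max_event_time, t)
            start = (t // window_min) * window_min
            if start + window_min <= watermark:
                dropped.append(t)    # too late: the window's state is already gone
                continue
            open_windows[start] = open_windows.get(start, 0) + 1
        # Advance the watermark after the batch, then evict windows it has passed.
        watermark = max(watermark, max_event_time - delay_min)
        for start in sorted(open_windows):
            if start + window_min <= watermark:
                finalized[start] = open_windows.pop(start)
    return finalized, open_windows, dropped


# Event times in minutes; 2 is late but tolerated, 4 arrives after its window closed.
finalized, still_open, dropped = run_with_watermark([[1, 3, 7], [16, 2], [4, 26]])

print("finalized:", finalized)
print("still open:", still_open)
print("dropped:", dropped)
```

The event at minute 2 is counted because the watermark had not yet passed its window when it arrived, while the event at minute 4 is discarded. Only the two windows still open remain in state, which is how the watermark bounds memory.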

Sources, Sinks, and Streaming Limitations

Structured Streaming reads from and writes to a range of systems. On Databricks the most common pairing is a Delta table as both source and sink:

Sources                                   Sinks
Delta tables, Auto Loader (cloudFiles)    Delta tables (toTable)
Apache Kafka, Event Hubs                  Kafka
Rate source (testing)                     Console / memory (testing)
Files (json/csv/parquet)                  foreachBatch (arbitrary sink)

Reading a Delta table with readStream turns every new commit into streaming input — the foundation of the medallion bronze→silver→gold pattern, where each layer streams from the one below. The foreachBatch sink is the escape hatch for operations the streaming engine does not natively support: each micro-batch is handed to you as a normal DataFrame, so you can MERGE, write to multiple tables, or call external APIs.
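A foreachBatch handler for an upsert might look like the sketch below. The table silver.events, the key column event_id, the temp view name updates, and the checkpoint path are all hypothetical. It also assumes a runtime where the micro-batch DataFrame exposes a sparkSession attribute (recent PySpark versions) and where Delta's MERGE with UPDATE SET * and INSERT * is available.

```python
# Hypothetical target table and key column; adjust to your own schema.
MERGE_SQL = """
MERGE INTO silver.events AS t
USING updates AS s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""


def upsert_to_silver(batch_df, batch_id):
    """foreachBatch handler: batch_df is an ordinary, non-streaming DataFrame."""
    # Keep one row per key so the MERGE never matches a target row twice.
    deduped = batch_df.dropDuplicates(["event_id"])
    deduped.createOrReplaceTempView("updates")
    # Run the MERGE on the session that owns the micro-batch, where the view is registered.
    deduped.sparkSession.sql(MERGE_SQL)


# Wiring, shown as comments because it needs a live Spark session with Delta:
# (spark.readStream.table("bronze.events")
#    .writeStream
#    .foreachBatch(upsert_to_silver)
#    .option("checkpointLocation", "/chk/silver_events")
#    .trigger(availableNow=True)
#    .start())
```

Because MERGE is keyed on event_id, rerunning the same micro-batch after a failure leaves the target unchanged, which keeps the handler idempotent.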

Not every batch operation is legal in a stream. Multiple aggregations, certain non-time-based joins, and count()/show() actions are restricted; you also cannot sort a streaming DataFrame except after an aggregation in complete mode. Append output mode cannot be used with aggregations unless a watermark lets Spark finalize windows. Understanding these constraints — and that a checkpoint ties a query to its exact logical plan (changing the plan can invalidate the checkpoint) — is core to operating streams reliably.

For most certification scenarios, the pattern to remember is: readStream a Delta/Auto Loader source, transform with the standard API, set a checkpointLocation, pick availableNow or processingTime, choose the matching output mode, and write with toTable.

Managing and Monitoring a Streaming Query

Starting a stream (with start() or toTable()) returns a StreamingQuery handle. With it you control the lifecycle: query.awaitTermination() blocks until the stream stops, query.stop() shuts it down, query.status reports what the query is doing right now, and query.lastProgress / query.recentProgress expose per-batch throughput and latency metrics. In a notebook, a streaming cell stays 'running' and renders a live dashboard of input rows per second and batch duration, which is useful for spotting backpressure.

Naming a stream with .queryName("events") makes it identifiable in the Spark UI's Structured Streaming tab. The key health signal is input rate versus processing rate: if input consistently exceeds processing, the stream is falling behind and you should scale the cluster or adjust the per-batch rate limit (maxFilesPerTrigger).
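The input-versus-processing comparison can be automated with a small helper. The sketch below assumes each progress entry is dict-like with inputRowsPerSecond and processedRowsPerSecond keys, as in the progress reports PySpark exposes. The function name, the three-batch threshold, and the sample numbers are illustrative choices.

```python
def is_falling_behind(progress_entries, min_batches=3):
    """True if the last min_batches entries all show input rate above processing rate."""
    recent = list(progress_entries)[-min_batches:]
    if len(recent) < min_batches:
        return False  # not enough history to call it a trend
    return all(
        p.get("inputRowsPerSecond", 0.0) > p.get("processedRowsPerSecond", 0.0)
        for p in recent
    )


# Made-up progress entries for illustration only.
lagging = [
    {"batchId": 7, "inputRowsPerSecond": 1200.0, "processedRowsPerSecond": 800.0},
    {"batchId": 8, "inputRowsPerSecond": 1300.0, "processedRowsPerSecond": 820.0},
    {"batchId": 9, "inputRowsPerSecond": 1250.0, "processedRowsPerSecond": 790.0},
]
healthy = lagging[:2] + [
    {"batchId": 9, "inputRowsPerSecond": 500.0, "processedRowsPerSecond": 900.0},
]

print(is_falling_behind(lagging))
print(is_falling_behind(healthy))
```

Against a live query the call would be is_falling_behind(query.recentProgress). Requiring several consecutive batches avoids alerting on a single slow batch.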

Reliability rests on a few rules the exam reinforces:

  • One dedicated checkpoint per query — never share a checkpoint between two streams.
  • Pair a replayable source (Auto Loader, Kafka, Delta) with an idempotent sink (Delta) for true exactly-once.
  • Changing the query's logical plan can invalidate the checkpoint; plan schema changes carefully.
  • Prefer Trigger.AvailableNow for scheduled incremental jobs and processingTime for always-on low-latency streams.

Following these turns a streaming query into a dependable, restartable pipeline rather than a fragile long-running process. Finally, remember that the exact same transformation code can run as a batch job by swapping readStream/writeStream for read/write — the unified API is a core selling point, and the certification expects you to recognize that streaming on Databricks is just batch executed incrementally over an unbounded table with checkpoint-backed state.

Test Your Knowledge

Which trigger processes all currently available data across multiple internal batches and then stops, and is the recommended replacement for the deprecated Trigger.Once?

A streaming query computes a running count per category and writes the entire aggregated result table every batch. Which output mode is this?

What is the purpose of withWatermark on an event-time streaming aggregation?
