2.7 Structured Streaming Fundamentals

Key Takeaways

  • Structured Streaming treats a data stream as an unbounded table that is continuously appended with new data.
  • Trigger modes control processing frequency: default (next micro-batch as soon as the previous completes), fixed interval (micro-batch every N seconds), availableNow (process all pending data, then stop), and once (deprecated).
  • Checkpointing stores streaming progress (offsets, state) to enable fault-tolerant exactly-once processing.
  • Output modes determine what data is written: append (new rows only), complete (full result), and update (changed rows only).
  • Trigger.AvailableNow replaces the deprecated Trigger.Once and processes all available data as an incremental batch.
Last updated: March 2026

Structured Streaming Fundamentals

Quick Answer: Structured Streaming models data streams as continuously appending tables. Configure trigger modes for processing frequency, use checkpoints for fault tolerance, and choose output modes (append, complete, update) based on your use case. Trigger.AvailableNow is the recommended mode for incremental batch processing.

The Streaming-as-a-Table Model

Structured Streaming treats a live data feed as an unbounded table where new data is continuously appended:

  • Input Table: New data from the source is appended as new rows
  • Query: Your transformation logic runs on the input table
  • Result Table: The output of the query after processing
  • Output: Written to a sink (Delta table, console, etc.)
# Basic streaming read and write
stream_df = (spark.readStream
    .format("delta")
    .table("my_catalog.my_schema.source_table")
)

query = (stream_df
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/my_stream")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.target_table")
)
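The input-table → query → result-table cycle above can be sketched in plain Python, with no Spark required. All names here (`input_table`, `run_query`, `process_new_data`) are illustrative, not Spark APIs:

```python
# Plain-Python sketch of the streaming-as-a-table model (illustrative only).
input_table = []  # the unbounded input table
sink = []         # the output sink

def run_query(rows):
    """The 'query': keep only orders over 100."""
    return [r for r in rows if r["amount"] > 100]

def process_new_data(new_rows):
    """One trigger: append new rows, run the query, write new results."""
    input_table.extend(new_rows)   # new data arrives as appended rows
    results = run_query(new_rows)  # transformation runs over the new rows
    sink.extend(results)           # append-mode write to the sink
    return results

process_new_data([{"id": 1, "amount": 50}, {"id": 2, "amount": 150}])
process_new_data([{"id": 3, "amount": 200}])
print(sink)  # rows 2 and 3 only
```

Each call to `process_new_data` plays the role of one trigger: the input table only grows, and the sink receives the incremental results.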

Trigger Modes

Trigger               | Behavior                                               | Use Case
----------------------|--------------------------------------------------------|--------------------------------------
Default (unspecified) | Process next micro-batch as soon as previous completes | Low-latency continuous processing
Fixed interval        | Process a micro-batch every N seconds                  | Controlled processing rate
AvailableNow          | Process all available data, then stop                  | Scheduled batch processing
Once                  | Process one micro-batch, then stop                     | Deprecated; use AvailableNow instead
# Default: omit the trigger setting entirely; each micro-batch starts
# as soon as the previous one completes

# Fixed interval: every 30 seconds
.trigger(processingTime="30 seconds")

# AvailableNow: process all pending data, then stop
.trigger(availableNow=True)

# Once (deprecated): process one batch, then stop
.trigger(once=True)  # Deprecated — use availableNow instead

Why AvailableNow Over Once?

  • Trigger.Once processes all pending data in a single micro-batch, ignoring rate-limit options such as maxFilesPerTrigger; with a large backlog this can overwhelm the cluster
  • Trigger.AvailableNow processes ALL available data in multiple rate-limited micro-batches, ensuring everything is consumed before stopping
  • Databricks recommends AvailableNow for all incremental batch workloads
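The batching difference can be sketched in plain Python. `backlog` and `max_per_batch` are made-up names for illustration, not Spark options:

```python
# Sketch of Trigger.Once vs Trigger.AvailableNow batching (illustrative only).
def run_once(backlog, max_per_batch):
    """Trigger.Once: everything in a single micro-batch, ignoring rate limits."""
    return [backlog[:]]  # one batch containing the whole backlog

def run_available_now(backlog, max_per_batch):
    """Trigger.AvailableNow: rate-limited micro-batches until the backlog is empty."""
    batches, remaining = [], backlog[:]
    while remaining:
        batches.append(remaining[:max_per_batch])
        remaining = remaining[max_per_batch:]
    return batches

backlog = list(range(10))
print(len(run_once(backlog, 3)))           # 1 oversized batch
print(len(run_available_now(backlog, 3)))  # 4 batches of at most 3 items
```

Both consume the entire backlog before stopping; the difference is that AvailableNow splits the work into batches that respect the configured rate limits.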

Checkpointing

Checkpoints store the state of a streaming query for fault tolerance:

# Checkpoint location is required for production streams
(stream_df.writeStream
    .option("checkpointLocation", "/checkpoints/orders_stream")
    .toTable("my_catalog.my_schema.orders")
)

What Checkpoints Store

  • Offsets: Which data has been processed (file positions, Kafka offsets)
  • State: Aggregation state for stateful operations (e.g., running counts)
  • Commit log: Which micro-batches have been committed to the sink
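How the stored offsets enable restart-without-reprocessing can be sketched in plain Python. `Checkpoint` and `run_batch` are illustrative names, not Spark APIs:

```python
# Sketch of checkpoint offset tracking for exactly-once delivery (illustrative only).
class Checkpoint:
    def __init__(self):
        self.committed_offset = 0  # last offset durably committed to the sink

source = ["r0", "r1", "r2", "r3", "r4"]
sink, ckpt = [], Checkpoint()

def run_batch(batch_size):
    start = ckpt.committed_offset             # resume from the checkpoint
    batch = source[start:start + batch_size]
    sink.extend(batch)                        # write the micro-batch to the sink
    ckpt.committed_offset = start + len(batch)  # then record the commit

run_batch(2)
run_batch(2)  # a restart here would resume from offset 2, not from 0
run_batch(2)
print(sink)   # every record written exactly once
```

Because each run resumes from the last committed offset, a restarted query neither skips nor duplicates records, which is the essence of the exactly-once guarantee.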

Checkpoint Rules

  • Each streaming query must have a unique checkpoint location
  • Moving or deleting checkpoints causes data to be reprocessed from the beginning
  • Checkpoints enable exactly-once processing guarantees
  • Store checkpoints in reliable cloud storage (same bucket/container as your data)

Output Modes

Mode     | What Is Written                               | Requirements
---------|-----------------------------------------------|---------------------------------------------------
Append   | Only new rows added since the last trigger    | No aggregations, or aggregations with a watermark
Complete | The entire result table after each trigger    | Requires aggregation
Update   | Only rows that changed since the last trigger | Requires aggregation
# Append mode (default): write only new rows
.outputMode("append")

# Complete mode: write the full result table
.outputMode("complete")

# Update mode: write only changed rows
.outputMode("update")
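For a running-count aggregation, the difference between complete and update can be sketched in plain Python (`process_batch` is an illustrative name, not a Spark API):

```python
from collections import Counter

# Sketch of output modes for a running-count aggregation (illustrative only).
counts = Counter()  # the aggregation state (a running count per key)

def process_batch(keys, mode):
    if mode not in ("complete", "update"):
        # append mode is not allowed for aggregations without a watermark
        raise ValueError("append requires a watermark with aggregations")
    counts.update(keys)
    if mode == "complete":
        return dict(counts)                       # the entire result table
    return {k: counts[k] for k in set(keys)}      # update: only changed rows

process_batch(["a", "b"], "complete")
print(process_batch(["a"], "update"))    # {'a': 2}
print(process_batch(["c"], "complete"))  # {'a': 2, 'b': 1, 'c': 1}
```

Update mode emits only the keys touched by the current batch, while complete mode re-emits every key's current count after each trigger.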

On the Exam: Know that append mode is the default and most common. Complete mode is used with aggregations when you need the full result. AvailableNow is the recommended trigger for scheduled batch-style streaming workloads.

Streaming from Delta Tables

Delta tables can be used as both streaming sources and sinks:

# Read a Delta table as a stream
stream_df = spark.readStream.table("my_catalog.my_schema.source")

# Write stream output to a Delta table
(stream_df.writeStream
    .option("checkpointLocation", "/checkpoints/stream1")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.target")
)

Streaming Source Options for Delta

# Limit the rate of data read
.option("maxFilesPerTrigger", "100")
.option("maxBytesPerTrigger", "10g")

# Start from a specific version
.option("startingVersion", "5")

# Ignore deletes and updates in the source (append-only read)
.option("ignoreDeletes", "true")
.option("ignoreChanges", "true")

On the Exam: A Delta table used as a streaming source processes only appended data by default. If the source table has DELETE or UPDATE operations, the stream will fail unless you set ignoreDeletes or ignoreChanges to true.

Test Your Knowledge

Why does Databricks recommend Trigger.AvailableNow over the deprecated Trigger.Once for incremental batch processing?

Test Your Knowledge

What happens if a streaming query's checkpoint location is accidentally deleted?

Test Your Knowledge

A streaming query reads from a Delta table that occasionally has UPDATE operations. The stream fails with an error. What should the engineer do?
