2.7 Structured Streaming Fundamentals

Key Takeaways

  • Structured Streaming treats a data stream as an unbounded table that is continuously appended with new data.
  • Trigger modes control processing frequency: default (next micro-batch as soon as the previous completes), fixed interval (micro-batch every N seconds), availableNow (process all pending data, then stop), and once (deprecated).
  • Checkpointing stores streaming progress (offsets, state) to enable fault-tolerant exactly-once processing.
  • Output modes determine what data is written: append (new rows only), complete (full result), and update (changed rows only).
  • Trigger.AvailableNow replaces the deprecated Trigger.Once and processes all available data as an incremental batch.
Last updated: March 2026

Structured Streaming Fundamentals

Quick Answer: Structured Streaming models data streams as continuously appending tables. Configure trigger modes for processing frequency, use checkpoints for fault tolerance, and choose output modes (append, complete, update) based on your use case. Trigger.AvailableNow is the recommended mode for incremental batch processing.

The Streaming-as-a-Table Model

Structured Streaming treats a live data feed as an unbounded table where new data is continuously appended:

  • Input Table: New data from the source is appended as new rows
  • Query: Your transformation logic runs on the input table
  • Result Table: The output of the query after processing
  • Output: Written to a sink (Delta table, console, etc.)
# Basic streaming read and write
stream_df = (spark.readStream
    .format("delta")
    .table("my_catalog.my_schema.source_table")
)

query = (stream_df
    .writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/my_stream")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.target_table")
)
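The input-table → query → result-table cycle above can be sketched in plain Python, with no Spark required. All names here (`input_table`, `run_query`, `process_new_data`) are illustrative, not Spark APIs:

```python
# Plain-Python sketch of the streaming-as-a-table model (illustrative only).
input_table = []  # the unbounded input table
sink = []         # the output sink

def run_query(rows):
    """The 'query': keep only orders over 100."""
    return [r for r in rows if r["amount"] > 100]

def process_new_data(new_rows):
    """One trigger: append new rows, run the query, write new results."""
    input_table.extend(new_rows)   # new data arrives as appended rows
    results = run_query(new_rows)  # transformation runs over the new rows
    sink.extend(results)           # append-mode write to the sink
    return results

process_new_data([{"id": 1, "amount": 50}, {"id": 2, "amount": 150}])
process_new_data([{"id": 3, "amount": 200}])
print(sink)  # rows 2 and 3 only
```

Each call to `process_new_data` plays the role of one trigger: the input table only grows, and the sink receives the incremental results.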

Trigger Modes

Trigger               | Behavior                                               | Use Case
----------------------|--------------------------------------------------------|--------------------------------------
Default (unspecified) | Process next micro-batch as soon as previous completes | Low-latency continuous processing
Fixed interval        | Process a micro-batch every N seconds                  | Controlled processing rate
AvailableNow          | Process all available data, then stop                  | Scheduled batch processing
Once                  | Process one micro-batch, then stop                     | Deprecated; use AvailableNow instead
# Default: omit the trigger setting entirely; each micro-batch starts
# as soon as the previous one completes

# Fixed interval: every 30 seconds
.trigger(processingTime="30 seconds")

# AvailableNow: process all pending data, then stop
.trigger(availableNow=True)

# Once (deprecated): process one batch, then stop
.trigger(once=True)  # Deprecated — use availableNow instead

Why AvailableNow Over Once?

  • Trigger.Once processes all pending data in a single micro-batch, ignoring rate-limit options such as maxFilesPerTrigger; with a large backlog this can overwhelm the cluster
  • Trigger.AvailableNow processes ALL available data in multiple rate-limited micro-batches, ensuring everything is consumed before stopping
  • Databricks recommends AvailableNow for all incremental batch workloads
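The batching difference can be sketched in plain Python. `backlog` and `max_per_batch` are made-up names for illustration, not Spark options:

```python
# Sketch of Trigger.Once vs Trigger.AvailableNow batching (illustrative only).
def run_once(backlog, max_per_batch):
    """Trigger.Once: everything in a single micro-batch, ignoring rate limits."""
    return [backlog[:]]  # one batch containing the whole backlog

def run_available_now(backlog, max_per_batch):
    """Trigger.AvailableNow: rate-limited micro-batches until the backlog is empty."""
    batches, remaining = [], backlog[:]
    while remaining:
        batches.append(remaining[:max_per_batch])
        remaining = remaining[max_per_batch:]
    return batches

backlog = list(range(10))
print(len(run_once(backlog, 3)))           # 1 oversized batch
print(len(run_available_now(backlog, 3)))  # 4 batches of at most 3 items
```

Both consume the entire backlog before stopping; the difference is that AvailableNow splits the work into batches that respect the configured rate limits.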

Checkpointing

Checkpoints store the state of a streaming query for fault tolerance:

# Checkpoint location is required for production streams
(stream_df.writeStream
    .option("checkpointLocation", "/checkpoints/orders_stream")
    .toTable("my_catalog.my_schema.orders")
)

What Checkpoints Store

  • Offsets: Which data has been processed (file positions, Kafka offsets)
  • State: Aggregation state for stateful operations (e.g., running counts)
  • Commit log: Which micro-batches have been committed to the sink
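How the stored offsets enable restart-without-reprocessing can be sketched in plain Python. `Checkpoint` and `run_batch` are illustrative names, not Spark APIs:

```python
# Sketch of checkpoint offset tracking for exactly-once delivery (illustrative only).
class Checkpoint:
    def __init__(self):
        self.committed_offset = 0  # last offset durably committed to the sink

source = ["r0", "r1", "r2", "r3", "r4"]
sink, ckpt = [], Checkpoint()

def run_batch(batch_size):
    start = ckpt.committed_offset             # resume from the checkpoint
    batch = source[start:start + batch_size]
    sink.extend(batch)                        # write the micro-batch to the sink
    ckpt.committed_offset = start + len(batch)  # then record the commit

run_batch(2)
run_batch(2)  # a restart here would resume from offset 2, not from 0
run_batch(2)
print(sink)   # every record written exactly once
```

Because each run resumes from the last committed offset, a restarted query neither skips nor duplicates records, which is the essence of the exactly-once guarantee.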

Checkpoint Rules

  • Each streaming query must have a unique checkpoint location
  • Moving or deleting checkpoints causes data to be reprocessed from the beginning
  • Checkpoints enable exactly-once processing guarantees
  • Store checkpoints in reliable cloud storage (same bucket/container as your data)

Output Modes

Mode     | What Is Written                               | Requirements
---------|-----------------------------------------------|---------------------------------------------------
Append   | Only new rows added since the last trigger    | No aggregations, or aggregations with a watermark
Complete | The entire result table after each trigger    | Requires aggregation
Update   | Only rows that changed since the last trigger | Requires aggregation
# Append mode (default): write only new rows
.outputMode("append")

# Complete mode: write the full result table
.outputMode("complete")

# Update mode: write only changed rows
.outputMode("update")
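For a running-count aggregation, the difference between complete and update can be sketched in plain Python (`process_batch` is an illustrative name, not a Spark API):

```python
from collections import Counter

# Sketch of output modes for a running-count aggregation (illustrative only).
counts = Counter()  # the aggregation state (a running count per key)

def process_batch(keys, mode):
    if mode not in ("complete", "update"):
        # append mode is not allowed for aggregations without a watermark
        raise ValueError("append requires a watermark with aggregations")
    counts.update(keys)
    if mode == "complete":
        return dict(counts)                       # the entire result table
    return {k: counts[k] for k in set(keys)}      # update: only changed rows

process_batch(["a", "b"], "complete")
print(process_batch(["a"], "update"))    # {'a': 2}
print(process_batch(["c"], "complete"))  # {'a': 2, 'b': 1, 'c': 1}
```

Update mode emits only the keys touched by the current batch, while complete mode re-emits every key's current count after each trigger.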

On the Exam: Know that append mode is the default and most common. Complete mode is used with aggregations when you need the full result. AvailableNow is the recommended trigger for scheduled batch-style streaming workloads.

Streaming from Delta Tables

Delta tables can be used as both streaming sources and sinks:

# Read a Delta table as a stream
stream_df = spark.readStream.table("my_catalog.my_schema.source")

# Write stream output to a Delta table
(stream_df.writeStream
    .option("checkpointLocation", "/checkpoints/stream1")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.target")
)

Streaming Source Options for Delta

# Limit the rate of data read
.option("maxFilesPerTrigger", "100")
.option("maxBytesPerTrigger", "10g")

# Start from a specific version
.option("startingVersion", "5")

# Ignore deletes and updates in the source (append-only read)
.option("ignoreDeletes", "true")
.option("ignoreChanges", "true")

On the Exam: A Delta table used as a streaming source processes only appended data by default. If the source table has DELETE or UPDATE operations, the stream will fail unless you set ignoreDeletes or ignoreChanges to true.

Test Your Knowledge

Why does Databricks recommend Trigger.AvailableNow over the deprecated Trigger.Once for incremental batch processing?

Test Your Knowledge

What happens if a streaming query's checkpoint location is accidentally deleted?

Test Your Knowledge

A streaming query reads from a Delta table that occasionally has UPDATE operations. The stream fails with an error. What should the engineer do?
