3.4 Incremental Data Processing Patterns

Key Takeaways

  • Incremental processing handles only new or changed data since the last run, avoiding costly full-table reprocessing.
  • Structured Streaming on Delta tracks progress in a checkpoint, so each run processes only newly appended data exactly once.
  • Auto Loader (cloudFiles) incrementally ingests new files from cloud storage, tracking discovered files in RocksDB for exactly-once guarantees.
  • Trigger.AvailableNow processes all available data in micro-batches and then stops, giving incremental batch semantics on a streaming source.
  • Idempotent MERGE inside foreachBatch lets streaming pipelines upsert without creating duplicates if a batch is retried.
Last updated: June 2026

Why Incremental Processing

Incremental processing means each pipeline run touches only the data that is new or changed since the previous run, rather than rescanning the entire source. On large tables this is the difference between a job that runs in seconds and one that reprocesses terabytes every cycle. Databricks delivers incremental semantics primarily through Structured Streaming and Auto Loader.

Structured Streaming over Delta

A Delta table can be read as a streaming source. Structured Streaming records how far it has consumed in a checkpoint location; on the next trigger it resumes from there and processes only the newly appended rows.

(spark.readStream.table("bronze_events")
  .writeStream
  .option("checkpointLocation", "/chk/silver_events")
  .trigger(availableNow=True)
  .table("silver_events"))

The checkpoint is critical: it stores the stream's offsets and state. Pointing a stream at a new checkpoint location effectively starts a brand-new stream that reprocesses the source from the beginning, so each stream needs its own checkpoint location, and that location must not change between runs. A streaming read from Delta works cleanly only when the source is append-only; if the source receives updates or deletes, set an option such as ignoreChanges (or skipChangeCommits, which supersedes it on newer runtimes) or read the Change Data Feed instead.
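
The resume-from-checkpoint behavior can be modeled in a few lines of plain Python. This is a toy sketch, not Spark code: `run_once`, `source`, and `checkpoints` are hypothetical names, and a real checkpoint holds offset logs, commit logs, and operator state rather than a single integer.

```python
# Toy model of checkpointed incremental reads (not the Spark API).
# `checkpoints` maps a checkpoint path to the next unread offset in the source.

def run_once(source, checkpoints, checkpoint_path):
    """Process only the rows appended since the offset stored at checkpoint_path."""
    start = checkpoints.get(checkpoint_path, 0)   # unknown path: start from offset 0
    new_rows = source[start:]
    checkpoints[checkpoint_path] = len(source)    # record progress after the run
    return new_rows


source = ["e1", "e2", "e3"]    # stands in for an append-only Delta table
checkpoints = {}

first = run_once(source, checkpoints, "/chk/silver_events")    # all three rows
source.append("e4")
second = run_once(source, checkpoints, "/chk/silver_events")   # only the new row
fresh = run_once(source, checkpoints, "/chk/other")            # new path: everything again
```

The last call shows why a changed checkpoint path is dangerous: with no stored offset, the whole source is read again and written downstream a second time.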

Auto Loader (cloudFiles)

Auto Loader incrementally and efficiently ingests new files as they land in cloud object storage, exposing a Structured Streaming source called cloudFiles. It tracks which files it has already seen in the checkpoint (backed by RocksDB), giving exactly-once ingestion without you maintaining a manifest of processed files.

(spark.readStream.format("cloudFiles")
  .option("cloudFiles.format", "json")
  .option("cloudFiles.schemaLocation", "/schema/orders")
  .load("/landing/orders")
  .writeStream.option("checkpointLocation", "/chk/orders")
  .trigger(availableNow=True).table("bronze_orders"))

Key behaviors:

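  • Schema inference and evolution via cloudFiles.schemaLocation. With cloudFiles.schemaEvolutionMode set to rescue, unexpected columns are captured in _rescued_data instead of failing the stream; the default mode (addNewColumns) stops the stream when a new column appears and adds it to the schema on restart.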
  • By default it processes up to 1000 files per micro-batch; tune with cloudFiles.maxFilesPerTrigger and cloudFiles.maxBytesPerTrigger.
  • In SQL, the equivalent ingestion function is read_files (used heavily inside declarative pipelines).
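
How file tracking and the per-batch file limit interact can be sketched in plain Python. This is a toy model under stated assumptions, not the cloudFiles implementation: `ingest_available` and `seen` are hypothetical names, `seen` stands in for the file state Auto Loader keeps in the checkpoint, and the function drains everything pending and then returns, similar in spirit to an availableNow run.

```python
# Toy model of Auto Loader-style file tracking (not the cloudFiles API).
# `seen` plays the role of the file-tracking state kept in the checkpoint.

def ingest_available(listing, seen, max_files_per_batch=1000):
    """Return micro-batches of files not ingested before, then stop."""
    pending = sorted(f for f in listing if f not in seen)
    batches = []
    for i in range(0, len(pending), max_files_per_batch):
        batch = pending[i:i + max_files_per_batch]
        seen.update(batch)        # mark files as ingested once the batch commits
        batches.append(batch)
    return batches


seen = set()
landing = ["a.json", "b.json", "c.json"]

run1 = ingest_available(landing, seen, max_files_per_batch=2)   # two micro-batches
landing.append("d.json")
run2 = ingest_available(landing, seen, max_files_per_batch=2)   # only the new file
```

The second run skips the three files already recorded in `seen`, which is the property that removes the need for a hand-maintained manifest.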

Triggers

  • Trigger.AvailableNow — consume all currently available data in one or more micro-batches, then stop. This gives "incremental batch" behavior: run on a schedule, process only new data, shut down.
  • Default (micro-batch) / processingTime — keep running continuously, processing as data arrives.

Watermarking and Idempotent MERGE

When aggregating or joining streams, watermarking tells the engine how long to wait for late-arriving records before finalizing a window and dropping older state. withWatermark("event_time", "10 minutes") sets the watermark 10 minutes behind the latest event time the stream has seen; events older than the watermark may be ignored, which bounds the state the stream must keep.
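
The watermark rule can be illustrated with a small plain-Python model. This is a simplified sketch, not Spark's implementation: `apply_watermark` is a hypothetical helper, and it assumes the watermark advances only between micro-batches and that every event older than the watermark is dropped (in Spark, whether late rows are actually dropped depends on the stateful operator and output mode).

```python
# Toy model of watermark-based late-data filtering (not the Spark API).
from datetime import datetime, timedelta

def apply_watermark(batches, delay):
    """Return the events accepted from each batch under a watermark of `delay`."""
    max_event_time = None
    watermark = None                  # no watermark until some data has been seen
    accepted_per_batch = []
    for batch in batches:
        accepted = [t for t in batch if watermark is None or t >= watermark]
        accepted_per_batch.append(accepted)
        if batch:
            newest = max(batch)
            if max_event_time is None or newest > max_event_time:
                max_event_time = newest
            watermark = max_event_time - delay   # advances only between batches
    return accepted_per_batch


def at(minute):
    """Event time at 12:<minute> on an arbitrary day."""
    return datetime(2026, 1, 1, 12, minute)

batches = [[at(0), at(30)], [at(25), at(15), at(31)]]
accepted = apply_watermark(batches, timedelta(minutes=10))
```

After the first batch the watermark sits at 12:20, so in the second batch the 12:25 event is still accepted while the 12:15 event is treated as too late.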

For upserts in a stream, the pattern is idempotent MERGE inside foreachBatch. A streaming write delivers each micro-batch as a small DataFrame; you MERGE it into the target on a key:

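def upsert(batch_df, batch_id):
    # Expose the micro-batch to SQL; run the MERGE on the batch's own session
    # so the temp view resolves.
    batch_df.createOrReplaceTempView("updates")
    batch_df.sparkSession.sql("""
      MERGE INTO silver t USING updates s ON t.id = s.id
      WHEN MATCHED THEN UPDATE SET *
      WHEN NOT MATCHED THEN INSERT *""")

# `stream` is any streaming DataFrame, e.g. spark.readStream.table("bronze_events").
(stream.writeStream
  .foreachBatch(upsert)
  .option("checkpointLocation", "/chk")
  .start())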

Because MERGE matches on the key, re-running a retried batch produces the same final state — it is idempotent, so failures and replays do not create duplicates.
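
The difference between a keyed upsert and a blind append under retry can be shown with a plain-Python model. This is a toy sketch, not Delta's MERGE: `merge_upsert` and `blind_append` are hypothetical helpers, and the target table is modeled as a dict keyed by id.

```python
# Toy model of key-based upsert vs blind append (not Delta's MERGE implementation).

def merge_upsert(target, batch, id_col="id"):
    """Upsert each row into `target` (a dict keyed by id), like MERGE on a key."""
    for row in batch:
        target[row[id_col]] = dict(row)   # matched: update; not matched: insert
    return target

def blind_append(target_rows, batch):
    """Append rows with no key check, for contrast."""
    target_rows.extend(dict(row) for row in batch)
    return target_rows


batch = [{"id": 1, "status": "paid"}, {"id": 2, "status": "new"}]

merged = merge_upsert({}, batch)
merged_after_retry = merge_upsert(dict(merged), batch)        # same batch replayed

appended = blind_append([], batch)
appended_after_retry = blind_append(list(appended), batch)    # same batch replayed
```

Replaying the batch leaves the upserted target unchanged, while the appended target doubles in size. One caveat the toy hides: a real MERGE fails when several source rows match the same target row, so a micro-batch that can contain repeated keys should be deduplicated before merging.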

Concept                Purpose
Checkpoint             Tracks stream progress for exactly-once resume
Auto Loader            Incremental new-file ingestion from cloud storage
Trigger.AvailableNow   Process all available data, then stop
Watermark              Bound state and handle late data
Idempotent MERGE       Upsert without duplicates on retry

Change Data Feed and CDC Patterns

When the source itself emits inserts, updates, and deletes, you process them with Change Data Capture (CDC). Databricks Delta supports Change Data Feed (CDF), which records row-level changes so downstream consumers read exactly what changed rather than re-scanning the whole table. Enable it per table:

ALTER TABLE silver SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

SELECT * FROM table_changes('silver', 5);  -- changes from version 5 onward (start version is inclusive)

CDF adds metadata columns: _change_type (insert, update_preimage, update_postimage, delete), _commit_version, and _commit_timestamp. A downstream MERGE can then apply only those changed rows to the next layer, which is far cheaper than reprocessing everything.
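
How a downstream layer might apply those change rows can be sketched in plain Python. This is a toy model, not Delta code: `apply_changes` is a hypothetical helper, the target is a dict keyed by id, and the `id` and `qty` columns are invented for illustration. Only the `_change_type` values and `_commit_version` column come from the description above.

```python
# Toy model of applying Change Data Feed rows to a downstream copy (not Delta itself).

def apply_changes(target, changes, id_col="id"):
    """Apply CDF-style rows, in commit order, to `target` (a dict keyed by id)."""
    ordered = sorted(changes, key=lambda c: c["_commit_version"])
    for change in ordered:
        kind = change["_change_type"]
        # Strip the CDF metadata columns to recover the data row.
        row = {k: v for k, v in change.items() if not k.startswith("_")}
        if kind in ("insert", "update_postimage"):
            target[row[id_col]] = row
        elif kind == "delete":
            target.pop(row[id_col], None)
        # update_preimage carries the old values; there is nothing to apply.
    return target


gold = {1: {"id": 1, "qty": 5}, 2: {"id": 2, "qty": 7}}
changes = [
    {"id": 1, "qty": 5, "_change_type": "update_preimage", "_commit_version": 6},
    {"id": 1, "qty": 9, "_change_type": "update_postimage", "_commit_version": 6},
    {"id": 2, "qty": 7, "_change_type": "delete", "_commit_version": 7},
    {"id": 3, "qty": 1, "_change_type": "insert", "_commit_version": 7},
]
result = apply_changes(gold, changes)
```

Four change rows are enough to bring the downstream copy up to date; the untouched rows of the source are never read.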

Full reprocess vs incremental — when each applies

  • Incremental (streaming/CDF/Auto Loader) is the default for large, append-heavy tables — process only new data.
  • Full refresh is still correct when business logic changes (a new cleansing rule must be applied to all history) or when a small dimension is cheap to rebuild.
  • A streaming read with Trigger.AvailableNow gives you the cost profile of batch with the bookkeeping of streaming — the most common production choice for scheduled incremental jobs. Recognizing which scenario calls for incremental versus full reprocessing is a recurring exam theme.

Streaming Semantics You Must Know

Structured Streaming treats a stream as an unbounded table that grows over time, and several behaviors are tested directly:

  • Output modes govern what is written each trigger: append (only new rows; the default for streaming tables), complete (rewrite the whole result, required for some aggregations), and update (only changed rows).
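  • Exactly-once processing is guaranteed by the combination of replayable sources (Delta, Auto Loader), the checkpoint, and an idempotent sink such as a Delta table, so a failed micro-batch is reprocessed without duplicating output. Inside foreachBatch a batch can be delivered more than once, so the write itself must be idempotent.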
  • Stateful operations (aggregations, stream-stream joins, deduplication) accumulate state across batches; watermarks bound that state so it does not grow forever.
  • maxFilesPerTrigger / maxBytesPerTrigger rate-limit how much each micro-batch ingests, smoothing load spikes.
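
The difference between complete and update output modes is easiest to see on a running aggregation. The following is a toy plain-Python model, not the Spark API: `run_counts` is a hypothetical helper that keeps a running count per key and returns what each trigger would emit. Append mode is left out of the sketch because, for an aggregation, it needs a watermark to decide when a result is final.

```python
# Toy model of output modes for a running count per key (not the Spark API).
from collections import Counter

def run_counts(batches, mode):
    """Return what each trigger would emit for a running count in `mode`."""
    totals = Counter()
    emitted = []
    for batch in batches:
        batch_counts = Counter(batch)
        totals.update(batch_counts)
        if mode == "complete":
            emitted.append(dict(totals))                            # whole result table
        elif mode == "update":
            emitted.append({k: totals[k] for k in batch_counts})    # changed keys only
        else:
            raise ValueError(f"unsupported mode in this sketch: {mode}")
    return emitted


batches = [["a", "b", "a"], ["b", "c"]]
complete_out = run_counts(batches, "complete")
update_out = run_counts(batches, "update")
```

On the second trigger, complete mode rewrites all three keys, while update mode emits only `b` and `c`, the keys whose counts changed.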

Append-only constraint and its workarounds

Streaming reads from a Delta table expect the source to be append-only. If the upstream table receives updates or deletes, a plain streaming read fails. The workarounds are to read the Change Data Feed (which expresses updates and deletes as explicit change rows) or to set ignoreDeletes / ignoreChanges and accept their caveats: ignoreDeletes only tolerates deletes that drop whole partitions, and ignoreChanges re-emits every row of a rewritten file, so downstream consumers must handle duplicates. Picking the correct option for a source that mutates, rather than assuming a simple stream will work, is a frequent and high-value exam distinction.

Test Your Knowledge

What does a Structured Streaming checkpoint location store, and what happens if you point a stream at a brand-new checkpoint?

Which trigger setting processes all currently available data in micro-batches and then stops the stream, giving incremental batch behavior?

Why is running MERGE INTO on a key inside foreachBatch considered idempotent for streaming upserts?
