3.4 Incremental Data Processing Patterns

Key Takeaways

  • Incremental processing only handles new or changed data since the last run, avoiding expensive full-table reprocessing.
  • Change Data Capture (CDC) tracks inserts, updates, and deletes at the source and replays them on the target using MERGE INTO.
  • Streaming reads from Delta tables automatically track progress using checkpoints, processing only new appends.
  • Change Data Feed (CDF) exposes row-level changes from a Delta table, including the operation type, through the table_changes function or a streaming read.
  • Watermarking in Structured Streaming handles late-arriving data by defining how long to wait for delayed records.
Last updated: March 2026

Incremental Data Processing Patterns

Quick Answer: Incremental processing avoids reprocessing the entire dataset by tracking what has changed. Use streaming reads from Delta for append-only sources, Change Data Feed (CDF) for tracking row-level changes, and MERGE INTO for applying CDC events to target tables.

Why Incremental Processing?

| Approach | Processing Volume | Cost | Latency |
|---|---|---|---|
| Full refresh | Entire dataset every run | High | High |
| Incremental | Only new/changed data | Low | Low |

Incremental processing is essential for:

  • Large tables where full reprocessing is prohibitively expensive
  • Near-real-time data freshness requirements
  • Cost optimization (less compute = lower cost)
  • Streaming pipelines that must process data continuously
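
The core bookkeeping behind most incremental patterns is a high-water mark: persist the largest timestamp (or version) processed so far, and on each run read only rows above it. A minimal pure-Python sketch of the idea (all names here are illustrative, not a Databricks API):

```python
# Minimal high-water-mark sketch (illustrative only). Each record carries an
# event timestamp; each run processes only records newer than the stored
# watermark, then advances the watermark.

def incremental_run(records, watermark):
    """Return (newly_processed, new_watermark) for one incremental pass."""
    new_rows = [r for r in records if r["ts"] > watermark]
    new_watermark = max((r["ts"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

source = [{"id": 1, "ts": 10}, {"id": 2, "ts": 20}, {"id": 3, "ts": 30}]

# First run: nothing processed yet, so the watermark starts below all data.
batch1, wm = incremental_run(source, watermark=0)   # processes all 3 rows

# New data arrives; the next run sees only the new row.
source.append({"id": 4, "ts": 40})
batch2, wm = incremental_run(source, wm)            # processes only id=4
```

Full refresh would rescan all four rows on the second run; the incremental pass touches one.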

Change Data Capture (CDC)

CDC captures row-level changes (INSERT, UPDATE, DELETE) from source systems:

Applying CDC with MERGE INTO

-- CDC events have an operation column: INSERT, UPDATE, DELETE
MERGE INTO silver.customers AS target
USING bronze.customer_cdc AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
WHEN MATCHED AND source.operation = 'UPDATE' THEN UPDATE SET *
WHEN NOT MATCHED AND source.operation = 'INSERT' THEN INSERT *;
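
The MERGE above does in set-based SQL what the following pure-Python sketch does record by record (the dict-based "table" and field names are illustrative):

```python
# Apply CDC events (INSERT/UPDATE/DELETE) to a target keyed by customer_id.
# Mirrors the MERGE branches: matched + DELETE removes, matched + UPDATE
# overwrites, not-matched + INSERT adds.

def apply_cdc(target, events):
    for e in events:
        key, op = e["customer_id"], e["operation"]
        if op == "DELETE":
            target.pop(key, None)          # WHEN MATCHED AND op = 'DELETE'
        elif op == "UPDATE" and key in target:
            target[key] = e["data"]        # WHEN MATCHED AND op = 'UPDATE'
        elif op == "INSERT" and key not in target:
            target[key] = e["data"]        # WHEN NOT MATCHED AND op = 'INSERT'
    return target

customers = {1: {"name": "Ada"}}
events = [
    {"customer_id": 2, "operation": "INSERT", "data": {"name": "Grace"}},
    {"customer_id": 1, "operation": "UPDATE", "data": {"name": "Ada L."}},
    {"customer_id": 2, "operation": "DELETE", "data": None},
]
customers = apply_cdc(customers, events)
```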

Handling Out-of-Order CDC Events

-- Deduplicate and keep only the latest change per key
MERGE INTO silver.customers AS target
USING (
    SELECT * EXCEPT (rn) FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY event_timestamp DESC
        ) AS rn
        FROM bronze.customer_cdc
        WHERE event_timestamp > (SELECT MAX(last_processed) FROM watermarks)
    ) WHERE rn = 1
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED AND source.operation != 'DELETE' THEN INSERT *;
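
The ROW_NUMBER subquery keeps only the most recent event per key before merging. The same deduplication logic in plain Python (field names are illustrative):

```python
# Keep only the latest event per customer_id, equivalent to
# ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY event_timestamp DESC) = 1

def latest_per_key(events):
    latest = {}
    for e in events:
        k = e["customer_id"]
        if k not in latest or e["event_timestamp"] > latest[k]["event_timestamp"]:
            latest[k] = e
    return sorted(latest.values(), key=lambda e: e["customer_id"])

events = [
    {"customer_id": 1, "event_timestamp": 100, "status": "bronze"},
    {"customer_id": 1, "event_timestamp": 300, "status": "gold"},
    {"customer_id": 2, "event_timestamp": 200, "status": "silver"},
    {"customer_id": 1, "event_timestamp": 200, "status": "silver"},
]
deduped = latest_per_key(events)
```

Without this step, a MERGE source containing two changes for the same key would fail with a duplicate-match error, since each target row may match at most one source row.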

Change Data Feed (CDF)

Change Data Feed enables reading row-level changes from a Delta table:

Enable CDF on a Table

-- Enable CDF on an existing table
ALTER TABLE my_catalog.my_schema.orders
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);

-- Or create with CDF enabled
CREATE TABLE my_catalog.my_schema.orders (
    order_id BIGINT,
    amount DECIMAL(10,2)
) TBLPROPERTIES (delta.enableChangeDataFeed = true);

Read Changes from CDF

-- Read changes between versions
SELECT * FROM table_changes('my_catalog.my_schema.orders', 5, 10);

-- Read changes since a timestamp
SELECT * FROM table_changes('my_catalog.my_schema.orders', '2026-03-01');
# Read CDF as a streaming source
changes_df = (spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("my_catalog.my_schema.orders")
)
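
A downstream consumer typically applies CDF records by change type, skipping update_preimage rows. A pure-Python sketch of that consumption logic (the record shapes are illustrative; only the _change_type values come from CDF):

```python
# Apply CDF change records to a keyed target. update_preimage rows carry the
# old values and are skipped; insert / update_postimage upsert; delete removes.

def apply_cdf(target, changes):
    for c in changes:
        op = c["_change_type"]
        if op == "update_preimage":
            continue                          # old value; nothing to apply
        if op == "delete":
            target.pop(c["order_id"], None)
        else:                                 # insert or update_postimage
            target[c["order_id"]] = c["amount"]
    return target

orders = {}
changes = [
    {"_change_type": "insert",           "order_id": 1, "amount": 10.0},
    {"_change_type": "update_preimage",  "order_id": 1, "amount": 10.0},
    {"_change_type": "update_postimage", "order_id": 1, "amount": 12.5},
    {"_change_type": "insert",           "order_id": 2, "amount": 5.0},
    {"_change_type": "delete",           "order_id": 2, "amount": 5.0},
]
orders = apply_cdf(orders, changes)
```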

CDF Metadata Columns

| Column | Description | Values |
|---|---|---|
| _change_type | Type of change | insert, update_preimage, update_postimage, delete |
| _commit_version | Delta version of the change | Integer |
| _commit_timestamp | Timestamp of the change | Timestamp |

On the Exam: CDF adds three metadata columns to each change record. The key values for _change_type are: insert (new row), update_preimage (old value before update), update_postimage (new value after update), and delete (removed row).

Streaming from Delta Tables

# Stream new appends from a Delta table
stream_df = (spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 100)
    .table("my_catalog.my_schema.bronze_events")
)

# Process and write to silver
(stream_df
    .filter("event_type IS NOT NULL")
    .writeStream
    .option("checkpointLocation", "/checkpoints/silver/events")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.silver_events")
)
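
The checkpoint is what makes this read incremental: Spark records which source files and offsets it has already consumed, so a restart resumes where it left off instead of reprocessing. A simplified pure-Python sketch of that bookkeeping (not Spark's actual checkpoint format):

```python
# Simplified sketch of streaming-checkpoint bookkeeping: remember which source
# files were already processed; each trigger handles only the new ones.

def run_trigger(source_files, checkpoint):
    new_files = [f for f in source_files if f not in checkpoint["processed"]]
    checkpoint["processed"].update(new_files)
    return new_files

checkpoint = {"processed": set()}
files = ["part-000.parquet", "part-001.parquet"]

first = run_trigger(files, checkpoint)       # both files are new

files.append("part-002.parquet")
second = run_trigger(files, checkpoint)      # only the appended file
```

Deleting the checkpoint directory is the real-world equivalent of resetting this state: the next run would start over from the beginning of the source.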

Watermarking for Late Data

Watermarking tells Structured Streaming how long to wait for late-arriving data:

# Accept events up to 10 minutes late
from pyspark.sql.functions import window, count

(spark.readStream
    .table("events")
    .withWatermark("event_time", "10 minutes")
    .groupBy(
        window("event_time", "5 minutes"),
        "device_id"
    )
    .agg(count("*").alias("event_count"))
    .writeStream
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/windowed")
    .toTable("event_counts")
)
  • Events arriving within the watermark threshold are included in their window's aggregation
  • Events whose event time has fallen behind the watermark can be dropped as too late
  • The watermark is computed as the maximum event time seen so far minus the threshold
  • In append output mode, a window's result is emitted only once the watermark passes the end of that window
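
The rule above can be sketched numerically: the watermark trails the maximum event time seen by the threshold, and an event is kept only while its event time is at or above the watermark. A pure-Python illustration (simplified relative to Spark, which drops a row only once its whole window is below the watermark):

```python
from datetime import datetime, timedelta

# Watermark = max event time seen so far - threshold. Events with an event
# time older than the watermark are candidates for dropping.

threshold = timedelta(minutes=10)
max_event_time = datetime(2026, 3, 1, 9, 20)   # latest event time seen: 9:20
watermark = max_event_time - threshold          # watermark is now 9:10

late_event = datetime(2026, 3, 1, 9, 5)         # arrives with event time 9:05
is_kept = late_event >= watermark               # 9:05 < 9:10 -> dropped
```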
Test Your Knowledge

Which Change Data Feed (CDF) _change_type value represents the old value of a row before an UPDATE operation?

Test Your Knowledge

A data engineer sets a watermark of "10 minutes" on an event_time column. An event with event_time 9:05 AM arrives when the maximum event_time seen so far is 9:20 AM. What happens to this event?

Test Your Knowledge

What must be enabled on a Delta table before you can read row-level changes using the Change Data Feed?

D