3.4 Incremental Data Processing Patterns
Key Takeaways
- Incremental processing only handles new or changed data since the last run, avoiding expensive full-table reprocessing.
- Change Data Capture (CDC) tracks inserts, updates, and deletes at the source and replays them on the target using MERGE INTO.
- Streaming reads from Delta tables automatically track progress using checkpoints, processing only new appends.
- The table_changes() function (part of Change Data Feed, CDF) reads row-level changes from a Delta table, including the operation type.
- Watermarking in Structured Streaming handles late-arriving data by defining how long to wait for delayed records.
Incremental Data Processing Patterns
Quick Answer: Incremental processing avoids reprocessing the entire dataset by tracking what has changed. Use streaming reads from Delta for append-only sources, Change Data Feed (CDF) for tracking row-level changes, and MERGE INTO for applying CDC events to target tables.
Why Incremental Processing?
| Approach | Processing Volume | Cost | Latency |
|---|---|---|---|
| Full refresh | Entire dataset every run | High | High |
| Incremental | Only new/changed data | Low | Low |
Incremental processing is essential for:
- Large tables where full reprocessing is prohibitively expensive
- Near-real-time data freshness requirements
- Cost optimization (less compute = lower cost)
- Streaming pipelines that must process data continuously
Change Data Capture (CDC)
CDC captures row-level changes (INSERT, UPDATE, DELETE) from source systems:
Applying CDC with MERGE INTO
-- CDC events have an operation column: INSERT, UPDATE, DELETE
MERGE INTO silver.customers AS target
USING bronze.customer_cdc AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
WHEN MATCHED AND source.operation = 'UPDATE' THEN UPDATE SET *
WHEN NOT MATCHED AND source.operation = 'INSERT' THEN INSERT *;
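The merge semantics above can be sketched in plain Python (illustrative only; the `operation` column and `customer_id` key mirror the SQL example, and the dict stands in for the target table):

```python
# Plain-Python sketch of applying CDC events to a keyed target table.
def apply_cdc(target: dict, events: list) -> dict:
    """Apply CDC events (INSERT/UPDATE/DELETE) keyed by customer_id."""
    for e in events:
        key, op = e["customer_id"], e["operation"]
        if key in target and op == "DELETE":
            del target[key]                        # WHEN MATCHED AND DELETE
        elif key in target and op == "UPDATE":     # WHEN MATCHED AND UPDATE
            target[key] = {k: v for k, v in e.items() if k != "operation"}
        elif key not in target and op == "INSERT": # WHEN NOT MATCHED AND INSERT
            target[key] = {k: v for k, v in e.items() if k != "operation"}
    return target

target = {1: {"customer_id": 1, "name": "Ada"}}
events = [
    {"customer_id": 1, "name": "Ada L.", "operation": "UPDATE"},
    {"customer_id": 2, "name": "Grace", "operation": "INSERT"},
    {"customer_id": 1, "name": None, "operation": "DELETE"},
]
result = apply_cdc(target, events)
# → {2: {'customer_id': 2, 'name': 'Grace'}}
```

Note that unmatched events with the "wrong" operation (e.g., an UPDATE for a key that does not exist) fall through silently, exactly as the MERGE's conditional clauses do.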
Handling Out-of-Order CDC Events
-- Deduplicate and keep only the latest change per key
MERGE INTO silver.customers AS target
USING (
-- EXCEPT (rn) drops the helper column so UPDATE SET * / INSERT * do not write it
SELECT * EXCEPT (rn) FROM (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY event_timestamp DESC
) AS rn
FROM bronze.customer_cdc
WHERE event_timestamp > (SELECT MAX(last_processed) FROM watermarks)
) WHERE rn = 1
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
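The ROW_NUMBER() deduplication step can be sketched in plain Python (illustrative only; field names match the SQL example) — keep only the event with the greatest `event_timestamp` per key:

```python
# Plain-Python sketch of "ORDER BY event_timestamp DESC ... WHERE rn = 1":
# keep only the latest change event per customer_id before merging.
def latest_per_key(events: list) -> list:
    latest = {}
    for e in events:
        key = e["customer_id"]
        if key not in latest or e["event_timestamp"] > latest[key]["event_timestamp"]:
            latest[key] = e
    return list(latest.values())

events = [
    {"customer_id": 1, "event_timestamp": "2026-03-01T10:00", "name": "Ada"},
    {"customer_id": 1, "event_timestamp": "2026-03-01T10:05", "name": "Ada L."},
    {"customer_id": 2, "event_timestamp": "2026-03-01T09:00", "name": "Grace"},
]
deduped = latest_per_key(events)
# Only the 10:05 event survives for customer 1
```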
Change Data Feed (CDF)
Change Data Feed enables reading row-level changes from a Delta table:
Enable CDF on a Table
-- Enable CDF on an existing table
ALTER TABLE my_catalog.my_schema.orders
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
-- Or create with CDF enabled
CREATE TABLE my_catalog.my_schema.orders (
order_id BIGINT,
amount DECIMAL(10,2)
) TBLPROPERTIES (delta.enableChangeDataFeed = true);
Read Changes from CDF
-- Read changes between versions
SELECT * FROM table_changes('my_catalog.my_schema.orders', 5, 10);
-- Read changes since a timestamp
SELECT * FROM table_changes('my_catalog.my_schema.orders', '2026-03-01');
# Read CDF as a streaming source
changes_df = (spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("my_catalog.my_schema.orders")
)
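A consumer replaying CDF records uses _change_type to decide how to apply each row. Here is a plain-Python sketch (illustrative only; `order_id` and `amount` mirror the table created above):

```python
# Plain-Python sketch of replaying CDF records onto a keyed target.
def apply_cdf(target: dict, changes: list) -> dict:
    for row in changes:
        key, ctype = row["order_id"], row["_change_type"]
        if ctype in ("insert", "update_postimage"):
            target[key] = row["amount"]   # upsert the new value
        elif ctype == "delete":
            target.pop(key, None)         # remove deleted rows
        # update_preimage rows carry the OLD value; skip them when replaying
    return target

target = {}
changes = [
    {"order_id": 1, "amount": 10.0, "_change_type": "insert"},
    {"order_id": 1, "amount": 10.0, "_change_type": "update_preimage"},
    {"order_id": 1, "amount": 12.5, "_change_type": "update_postimage"},
]
apply_cdf(target, changes)
# → {1: 12.5}
```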
CDF Metadata Columns
| Column | Description | Values |
|---|---|---|
| _change_type | Type of change | insert, update_preimage, update_postimage, delete |
| _commit_version | Delta version of the change | Integer |
| _commit_timestamp | Timestamp of the change | Timestamp |
On the Exam: CDF adds three metadata columns to each change record. The key values for _change_type are: insert (new row), update_preimage (old value before update), update_postimage (new value after update), and delete (removed row).
Streaming from Delta Tables
# Stream new appends from a Delta table
stream_df = (spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 100)
    .table("my_catalog.my_schema.bronze_events")
)

# Process and write to silver
(stream_df
    .filter("event_type IS NOT NULL")
    .writeStream
    .option("checkpointLocation", "/checkpoints/silver/events")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.silver_events")
)
Watermarking for Late Data
Watermarking tells Structured Streaming how long to wait for late-arriving data:
# Accept events up to 10 minutes late
from pyspark.sql.functions import window, count

(spark.readStream
    .table("events")
    .withWatermark("event_time", "10 minutes")
    .groupBy(
        window("event_time", "5 minutes"),
        "device_id"
    )
    .agg(count("*").alias("event_count"))
    .writeStream
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/windowed")
    .toTable("event_counts")
)
- Events whose event time is newer than the watermark are included in the aggregation
- Events whose event time is older than the watermark are considered too late and dropped, because the state for their window has been purged
- The watermark is computed as the maximum event time seen so far minus the threshold
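The last bullet is easy to compute by hand; a plain-Python sketch of the rule (the 9:05/9:20 times below are assumed sample values, not Spark internals):

```python
# Plain-Python sketch of the watermark rule:
# watermark = max event time seen so far minus the threshold;
# events older than the watermark are dropped.
from datetime import datetime, timedelta

def is_late(event_time: datetime, max_event_time: datetime,
            threshold: timedelta) -> bool:
    watermark = max_event_time - threshold
    return event_time < watermark

max_seen = datetime(2026, 3, 1, 9, 20)   # latest event_time observed
threshold = timedelta(minutes=10)        # withWatermark("event_time", "10 minutes")

# Watermark is 9:10, so a 9:05 event is dropped but a 9:12 event is kept
late = is_late(datetime(2026, 3, 1, 9, 5), max_seen, threshold)   # → True
```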
Review Questions
1. Which Change Data Feed (CDF) _change_type value represents the old value of a row before an UPDATE operation?
2. A data engineer sets a watermark of "10 minutes" on an event_time column. An event with event_time 9:05 AM arrives when the maximum event_time seen so far is 9:20 AM. What happens to this event?
3. What must be enabled on a Delta table before you can read row-level changes using the Change Data Feed?