3.4 Incremental Data Processing Patterns
Key Takeaways
- Incremental processing only handles new or changed data since the last run, avoiding expensive full-table reprocessing.
- Change Data Capture (CDC) tracks inserts, updates, and deletes at the source and replays them on the target using MERGE INTO.
- Streaming reads from Delta tables automatically track progress using checkpoints, processing only new appends.
- The table_changes() function (part of Change Data Feed, CDF) reads row-level changes from a Delta table, including the operation type.
- Watermarking in Structured Streaming handles late-arriving data by defining how long to wait for delayed records.
Incremental Data Processing Patterns
Quick Answer: Incremental processing avoids reprocessing the entire dataset by tracking what has changed. Use streaming reads from Delta for append-only sources, Change Data Feed (CDF) for tracking row-level changes, and MERGE INTO for applying CDC events to target tables.
Why Incremental Processing?
| Approach | Processing Volume | Cost | Latency |
|---|---|---|---|
| Full refresh | Entire dataset every run | High | High |
| Incremental | Only new/changed data | Low | Low |
Incremental processing is essential for:
- Large tables where full reprocessing is prohibitively expensive
- Near-real-time data freshness requirements
- Cost optimization (less compute = lower cost)
- Streaming pipelines that must process data continuously
Change Data Capture (CDC)
CDC captures row-level changes (INSERT, UPDATE, DELETE) from source systems:
Applying CDC with MERGE INTO
-- CDC events have an operation column: INSERT, UPDATE, DELETE
MERGE INTO silver.customers AS target
USING bronze.customer_cdc AS source
ON target.customer_id = source.customer_id
WHEN MATCHED AND source.operation = 'DELETE' THEN DELETE
WHEN MATCHED AND source.operation = 'UPDATE' THEN UPDATE SET *
WHEN NOT MATCHED AND source.operation = 'INSERT' THEN INSERT *;
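The merge semantics above can be sketched in plain Python (illustrative only; the `operation` column and `customer_id` key mirror the SQL example, and the dict stands in for the target table):

```python
# Plain-Python sketch of applying CDC events to a keyed target table.
def apply_cdc(target: dict, events: list) -> dict:
    """Apply CDC events (INSERT/UPDATE/DELETE) keyed by customer_id."""
    for e in events:
        key, op = e["customer_id"], e["operation"]
        if key in target and op == "DELETE":
            del target[key]                        # WHEN MATCHED AND DELETE
        elif key in target and op == "UPDATE":     # WHEN MATCHED AND UPDATE
            target[key] = {k: v for k, v in e.items() if k != "operation"}
        elif key not in target and op == "INSERT": # WHEN NOT MATCHED AND INSERT
            target[key] = {k: v for k, v in e.items() if k != "operation"}
    return target

target = {1: {"customer_id": 1, "name": "Ada"}}
events = [
    {"customer_id": 1, "name": "Ada L.", "operation": "UPDATE"},
    {"customer_id": 2, "name": "Grace", "operation": "INSERT"},
    {"customer_id": 1, "name": None, "operation": "DELETE"},
]
result = apply_cdc(target, events)
# → {2: {'customer_id': 2, 'name': 'Grace'}}
```

Note that unmatched events with the "wrong" operation (e.g., an UPDATE for a key that does not exist) fall through silently, exactly as the MERGE's conditional clauses do.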
Handling Out-of-Order CDC Events
-- Deduplicate and keep only the latest change per key
MERGE INTO silver.customers AS target
USING (
-- EXCEPT (rn) drops the helper column so UPDATE SET * / INSERT * do not write it
SELECT * EXCEPT (rn) FROM (
SELECT *, ROW_NUMBER() OVER (
PARTITION BY customer_id
ORDER BY event_timestamp DESC
) AS rn
FROM bronze.customer_cdc
WHERE event_timestamp > (SELECT MAX(last_processed) FROM watermarks)
) WHERE rn = 1
) AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
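The ROW_NUMBER() deduplication step can be sketched in plain Python (illustrative only; field names match the SQL example) — keep only the event with the greatest `event_timestamp` per key:

```python
# Plain-Python sketch of "ORDER BY event_timestamp DESC ... WHERE rn = 1":
# keep only the latest change event per customer_id before merging.
def latest_per_key(events: list) -> list:
    latest = {}
    for e in events:
        key = e["customer_id"]
        if key not in latest or e["event_timestamp"] > latest[key]["event_timestamp"]:
            latest[key] = e
    return list(latest.values())

events = [
    {"customer_id": 1, "event_timestamp": "2026-03-01T10:00", "name": "Ada"},
    {"customer_id": 1, "event_timestamp": "2026-03-01T10:05", "name": "Ada L."},
    {"customer_id": 2, "event_timestamp": "2026-03-01T09:00", "name": "Grace"},
]
deduped = latest_per_key(events)
# Only the 10:05 event survives for customer 1
```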
Change Data Feed (CDF)
Change Data Feed enables reading row-level changes from a Delta table:
Enable CDF on a Table
-- Enable CDF on an existing table
ALTER TABLE my_catalog.my_schema.orders
SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
-- Or create with CDF enabled
CREATE TABLE my_catalog.my_schema.orders (
order_id BIGINT,
amount DECIMAL(10,2)
) TBLPROPERTIES (delta.enableChangeDataFeed = true);
Read Changes from CDF
-- Read changes between versions
SELECT * FROM table_changes('my_catalog.my_schema.orders', 5, 10);
-- Read changes since a timestamp
SELECT * FROM table_changes('my_catalog.my_schema.orders', '2026-03-01');
# Read CDF as a streaming source
changes_df = (spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 5)
    .table("my_catalog.my_schema.orders")
)
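A consumer replaying CDF records uses _change_type to decide how to apply each row. Here is a plain-Python sketch (illustrative only; `order_id` and `amount` mirror the table created above):

```python
# Plain-Python sketch of replaying CDF records onto a keyed target.
def apply_cdf(target: dict, changes: list) -> dict:
    for row in changes:
        key, ctype = row["order_id"], row["_change_type"]
        if ctype in ("insert", "update_postimage"):
            target[key] = row["amount"]   # upsert the new value
        elif ctype == "delete":
            target.pop(key, None)         # remove deleted rows
        # update_preimage rows carry the OLD value; skip them when replaying
    return target

target = {}
changes = [
    {"order_id": 1, "amount": 10.0, "_change_type": "insert"},
    {"order_id": 1, "amount": 10.0, "_change_type": "update_preimage"},
    {"order_id": 1, "amount": 12.5, "_change_type": "update_postimage"},
]
apply_cdf(target, changes)
# → {1: 12.5}
```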
CDF Metadata Columns
| Column | Description | Values |
|---|---|---|
| _change_type | Type of change | insert, update_preimage, update_postimage, delete |
| _commit_version | Delta version of the change | Integer |
| _commit_timestamp | Timestamp of the change | Timestamp |
On the Exam: CDF adds three metadata columns to each change record. The key values for _change_type are: insert (new row), update_preimage (old value before update), update_postimage (new value after update), and delete (removed row).
Streaming from Delta Tables
# Stream new appends from a Delta table
stream_df = (spark.readStream
    .format("delta")
    .option("maxFilesPerTrigger", 100)
    .table("my_catalog.my_schema.bronze_events")
)

# Process and write to silver
(stream_df
    .filter("event_type IS NOT NULL")
    .writeStream
    .option("checkpointLocation", "/checkpoints/silver/events")
    .trigger(availableNow=True)
    .toTable("my_catalog.my_schema.silver_events")
)
Watermarking for Late Data
Watermarking tells Structured Streaming how long to wait for late-arriving data:
# Accept events up to 10 minutes late
from pyspark.sql.functions import window, count

(spark.readStream
    .table("events")
    .withWatermark("event_time", "10 minutes")
    .groupBy(
        window("event_time", "5 minutes"),
        "device_id"
    )
    .agg(count("*").alias("event_count"))
    .writeStream
    .outputMode("append")
    .option("checkpointLocation", "/checkpoints/windowed")
    .toTable("event_counts")
)
- Events whose event time is newer than the watermark are included in the aggregation
- Events whose event time is older than the watermark are considered too late and dropped, because the state for their window has been purged
- The watermark is computed as the maximum event time seen so far minus the threshold
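The last bullet is easy to compute by hand; a plain-Python sketch of the rule (the 9:05/9:20 times below are assumed sample values, not Spark internals):

```python
# Plain-Python sketch of the watermark rule:
# watermark = max event time seen so far minus the threshold;
# events older than the watermark are dropped.
from datetime import datetime, timedelta

def is_late(event_time: datetime, max_event_time: datetime,
            threshold: timedelta) -> bool:
    watermark = max_event_time - threshold
    return event_time < watermark

max_seen = datetime(2026, 3, 1, 9, 20)   # latest event_time observed
threshold = timedelta(minutes=10)        # withWatermark("event_time", "10 minutes")

# Watermark is 9:10, so a 9:05 event is dropped but a 9:12 event is kept
late = is_late(datetime(2026, 3, 1, 9, 5), max_seen, threshold)   # → True
```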
Review Questions
1. Which Change Data Feed (CDF) _change_type value represents the old value of a row before an UPDATE operation?
2. A data engineer sets a watermark of "10 minutes" on an event_time column. An event with event_time 9:05 AM arrives when the maximum event_time seen so far is 9:20 AM. What happens to this event?
3. What must be enabled on a Delta table before you can read row-level changes using the Change Data Feed?