3.7 Slowly Changing Dimensions (SCD) Patterns

Key Takeaways

  • Slowly Changing Dimensions track how reference/dimension data (addresses, prices, statuses) changes over time.
  • SCD Type 1 overwrites the old value, keeping only the current state and losing history.
  • SCD Type 2 preserves history by adding a new row per change with __START_AT / __END_AT effective-date columns and a current flag.
  • AUTO CDC INTO (the renamed APPLY CHANGES INTO) computes SCD Type 1 or Type 2 automatically with STORED AS SCD TYPE 1/2 and SEQUENCE BY.
  • SEQUENCE BY handles out-of-order CDC events so changes are applied in the correct logical order.
Last updated: June 2026

What SCD Solves

Slowly Changing Dimensions (SCD) describe how to handle updates to dimension (reference) data that changes infrequently over time — a customer's address, a product's price, an employee's department. The question each SCD type answers is: when a dimension value changes, do we overwrite history or preserve it?

SCD Type 1: Overwrite

SCD Type 1 simply overwrites the existing value with the new one. There is exactly one row per business key, always showing the current value. History is lost — you cannot tell what the value used to be.

MERGE INTO dim_customer t USING updates s ON t.cust_id = s.cust_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Type 1 keeps the table small and simple and is appropriate when historical values do not matter (correcting a typo, storing a latest-known phone number).

SCD Type 2: Version History

SCD Type 2 preserves history by inserting a new row each time a tracked value changes. Each version carries effective-date columns and usually a current-flag:

cust_idcity__START_AT__END_ATis_current
42Reno2024-01-012025-03-01false
42Boise2025-03-01nulltrue

The open record has a null __END_AT and is_current = true. Type 2 lets you answer point-in-time questions ("which city did customer 42 live in last March?") at the cost of more rows and bigger tables.

Automating SCD with AUTO CDC INTO

Doing SCD Type 2 by hand with MERGE is error-prone (closing old rows, handling out-of-order events). Databricks automates it in declarative pipelines with AUTO CDC INTO, the renamed and recommended successor to APPLY CHANGES INTO (same syntax; APPLY CHANGES still works but AUTO CDC is preferred).

CREATE OR REFRESH STREAMING TABLE dim_customer;

CREATE FLOW cdc_flow AS AUTO CDC INTO dim_customer
FROM STREAM(cdc_source)
KEYS (cust_id)
APPLY AS DELETE WHEN operation = 'DELETE'
SEQUENCE BY event_ts
STORED AS SCD TYPE 2;

The clauses you must recognize:

  • KEYS — the business key identifying a record.
  • SEQUENCE BY — the ordering column; the engine uses it to apply changes in correct logical order even when CDC events arrive out of order.
  • APPLY AS DELETE WHEN — treats matching rows as deletes (tombstones).
  • STORED AS SCD TYPE 1 or SCD TYPE 2 — chooses overwrite vs full history; Type 2 auto-populates __START_AT and __END_AT.
  • TRACK HISTORY ON (optional) — version only when specific columns change.

This collapses what would be 100+ lines of hand-written MERGE logic into a few declarative lines, and it natively understands streaming and late-arriving data without you managing watermarks.

Type 1 vs Type 2: Choosing and Comparing

AspectSCD Type 1SCD Type 2
HistoryLost (overwrite)Preserved (new row per change)
Rows per keyOneMany (one per version)
Extra columnsNone__START_AT, __END_AT, current flag
Use whenOnly current value matters; correctionsPoint-in-time / audit history needed
AUTO CDC clauseSTORED AS SCD TYPE 1STORED AS SCD TYPE 2

Exam pointers

  • If a scenario says "we must report what a customer's status was at the time of each order," that requires Type 2 — Type 1 cannot answer it.
  • If it says "just keep the latest address; we don't care about old ones," Type 1 is correct and cheaper.
  • For SCD Type 2 target tables, the schema must include the __START_AT and __END_AT columns with the same data type as the SEQUENCE BY column.
  • SEQUENCE BY is what makes AUTO CDC robust to out-of-order events — a distractor answer often credits this to watermarking instead.

Manual SCD Type 2 with MERGE (for contrast)

Understanding why AUTO CDC is valuable means seeing the manual alternative. Implementing SCD Type 2 by hand with MERGE INTO requires two logical steps in one statement: close the currently-open row when a tracked value changes, and insert the new version. Because a single MERGE cannot both update the old row and insert a new one for the same key cleanly, the common pattern unions the source with a marker that forces a non-match for the insert:

MERGE INTO dim_customer t
USING staged_updates s
ON t.cust_id = s.cust_id AND t.is_current = true
WHEN MATCHED AND t.city <> s.city THEN
  UPDATE SET is_current = false, __END_AT = s.event_ts
WHEN NOT MATCHED THEN
  INSERT (cust_id, city, __START_AT, __END_AT, is_current)
  VALUES (s.cust_id, s.city, s.event_ts, null, true);

This hand-written approach is fragile: it does not natively handle out-of-order events, requires careful staging, and grows long fast. AUTO CDC INTO replaces this entire pattern, which is exactly why Databricks recommends it.

Other SCD types you may see referenced

TypeBehavior
Type 0Retain original, never change (e.g., date-of-birth)
Type 1Overwrite, no history
Type 2New row per change, full history (most common)
Type 3Add a 'previous value' column (limited history)

The Associate exam focuses overwhelmingly on Type 1 vs Type 2 and on driving them with AUTO CDC INTO; Types 0 and 3 appear mainly as distractors.

Test Your Knowledge

A dimension table must let analysts query what a customer's loyalty tier was on any past date. Which SCD type is required?

A
B
C
D
Test Your Knowledge

In AUTO CDC INTO, what is the purpose of the SEQUENCE BY clause?

A
B
C
D
Test Your Knowledge

Which statement about AUTO CDC INTO and APPLY CHANGES INTO is correct?

A
B
C
D
Test Your Knowledge

For an AUTO CDC INTO target table stored as SCD TYPE 2, what must the target schema include?

A
B
C
D