3.7 Slowly Changing Dimensions (SCD) Patterns
Key Takeaways
- Slowly Changing Dimensions (SCD) handle updates to reference/dimension data that changes over time (e.g., customer addresses, product prices).
- SCD Type 1 overwrites the existing record with the new value, losing historical data but keeping the table simple.
- SCD Type 2 preserves history by creating a new row for each change, with effective date columns to track when each version was valid.
- MERGE INTO is the primary SQL command for implementing both SCD Type 1 and SCD Type 2 patterns in Delta Lake.
- Change Data Feed (CDF) can track row-level changes for downstream SCD processing.
Slowly Changing Dimensions (SCD) Patterns
Quick Answer: SCD Type 1 overwrites records (no history). SCD Type 2 creates new rows for each change (full history with effective dates). Both are implemented using MERGE INTO in Delta Lake.
SCD Type 1: Overwrite
SCD Type 1 replaces the existing value with the new value. No history is maintained.
-- SCD Type 1: Update existing record with new value
MERGE INTO dim_customers AS target
USING staging_customers AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
UPDATE SET
target.name = source.name,
target.address = source.address,
target.phone = source.phone,
target.updated_at = current_timestamp()
WHEN NOT MATCHED THEN
INSERT (customer_id, name, address, phone, updated_at)
VALUES (source.customer_id, source.name, source.address, source.phone, current_timestamp());
Use SCD Type 1 when:
- Historical values are not needed
- Only the current state matters (e.g., correcting a typo)
- Simplicity is preferred over history tracking
SCD Type 2: Historical Tracking
SCD Type 2 preserves the full history of changes by creating a new row for each change:
-- SCD Type 2: Close existing record and insert new version
MERGE INTO dim_customers AS target
USING (
SELECT
customer_id,
name,
address,
phone,
current_timestamp() AS effective_start,
NULL AS effective_end,
TRUE AS is_current
FROM staging_customers
) AS source
ON target.customer_id = source.customer_id AND target.is_current = TRUE
WHEN MATCHED AND (
target.name != source.name OR
target.address != source.address OR
target.phone != source.phone
) THEN
UPDATE SET
target.effective_end = current_timestamp(),
target.is_current = FALSE
WHEN NOT MATCHED THEN
INSERT (customer_id, name, address, phone, effective_start, effective_end, is_current)
VALUES (source.customer_id, source.name, source.address, source.phone, source.effective_start, source.effective_end, source.is_current);
-- Then insert the new current version for changed records
INSERT INTO dim_customers
SELECT
s.customer_id, s.name, s.address, s.phone,
current_timestamp() AS effective_start,
NULL AS effective_end,
TRUE AS is_current
FROM staging_customers s
JOIN dim_customers d ON s.customer_id = d.customer_id
WHERE d.is_current = FALSE
AND d.effective_end = (SELECT MAX(effective_end) FROM dim_customers WHERE customer_id = s.customer_id);
SCD Type 2 Table Structure
| Column | Type | Description |
|---|---|---|
| surrogate_key | BIGINT | Unique row identifier (auto-generated) |
| customer_id | BIGINT | Natural/business key |
| name | STRING | Customer name (versioned) |
| address | STRING | Customer address (versioned) |
| effective_start | TIMESTAMP | When this version became active |
| effective_end | TIMESTAMP | When this version was superseded (NULL = current) |
| is_current | BOOLEAN | Whether this is the active version |
Use SCD Type 2 when:
- Full change history is required (regulatory compliance, auditing)
- Analytics need to report based on historical states
- Data warehouse dimensional modeling requirements
On the Exam: Know the difference between SCD Type 1 (overwrite, no history) and SCD Type 2 (new rows, full history). Both use MERGE INTO. SCD Type 2 requires effective date columns and a current flag.
A data engineer needs to track the full history of customer address changes for compliance reporting. Which SCD type should they implement?
In a SCD Type 2 table, how do you identify the current (active) version of a customer record?
Which SQL command is used to implement both SCD Type 1 and SCD Type 2 patterns in Delta Lake?