3.8 Data Deduplication Strategies
Key Takeaways
- Deduplication removes duplicate records from datasets, critical for maintaining data quality in the silver layer.
- ROW_NUMBER() with PARTITION BY is the most common deduplication pattern, keeping only the latest or highest-priority record per key.
- MERGE INTO with deduplication logic prevents duplicates from entering the target table during incremental processing.
- Streaming deduplication uses dropDuplicatesWithinWatermark() to deduplicate records within a time window.
- DISTINCT and GROUP BY are simple deduplication approaches for exact matches across all columns.
Data Deduplication Strategies
Quick Answer: Use ROW_NUMBER() OVER (PARTITION BY key ORDER BY timestamp DESC) to keep the latest record per key. Use MERGE INTO for deduplication during incremental loads. Use dropDuplicatesWithinWatermark() for streaming deduplication.
Why Deduplication Matters
Data arrives with duplicates for many reasons:
- At-least-once delivery from message queues (Kafka, Event Hubs)
- Retry logic in ETL pipelines that re-processes records
- Multiple source systems providing overlapping data
- Late-arriving data that triggers reprocessing
- Schema evolution causing records to be re-ingested
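The core "keep one record per key" idea can be illustrated outside Spark. A minimal Python sketch with made-up order records, applying last-update-wins per key:

```python
# Simulate duplicate deliveries of the same order_id and keep only
# the record with the latest updated_at per key (last update wins).
records = [
    {"order_id": 1, "status": "pending", "updated_at": "2024-01-01T10:00"},
    {"order_id": 1, "status": "shipped", "updated_at": "2024-01-02T09:00"},  # redelivery
    {"order_id": 2, "status": "pending", "updated_at": "2024-01-01T11:00"},
    {"order_id": 1, "status": "shipped", "updated_at": "2024-01-02T09:00"},  # exact duplicate
]

latest = {}
for rec in records:
    key = rec["order_id"]
    # Keep the record with the greatest updated_at for this key
    if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
        latest[key] = rec

deduped = list(latest.values())
# Two unique orders remain; order 1 carries its latest status
```

This is exactly what the SQL patterns below express declaratively.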
ROW_NUMBER() Deduplication (Most Common)
-- Keep only the latest record per order_id
SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC
        ) AS row_num
    FROM bronze.raw_orders
)
WHERE row_num = 1;
Variations
-- Keep the FIRST record per key (first arrival wins)
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingest_timestamp ASC)
-- Keep the LATEST record per key (last update wins)
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC)
-- Keep the HIGHEST priority record
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY priority DESC, updated_at DESC)
MERGE-Based Deduplication
Prevent duplicates during incremental loads. Note that an insert-only MERGE does not deduplicate the source itself: if the source contains several rows with the same new order_id, all of them are inserted, so deduplicate the source first (as in the upsert example below).
-- Only insert records that don't already exist in the target
MERGE INTO silver.orders AS target
USING bronze.raw_orders AS source
ON target.order_id = source.order_id
WHEN NOT MATCHED THEN INSERT *;
-- Upsert with deduplication: update existing, insert new
MERGE INTO silver.orders AS target
USING (
    -- First deduplicate the source
    SELECT * FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
        FROM bronze.new_orders
    ) WHERE rn = 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
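MERGE INTO is Delta Lake syntax, but the upsert-with-deduplicated-source idea is portable. A hypothetical sketch using SQLite's INSERT ... ON CONFLICT as the upsert step (table names and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INT PRIMARY KEY, status TEXT, updated_at TEXT);
    INSERT INTO orders VALUES (1, 'pending', '2024-01-01');
    CREATE TABLE new_orders (order_id INT, status TEXT, updated_at TEXT);
    INSERT INTO new_orders VALUES
        (1, 'shipped', '2024-01-03'),
        (1, 'pending', '2024-01-02'),  -- stale duplicate, should lose
        (3, 'pending', '2024-01-02');
""")

# Deduplicate the source first, then upsert; ON CONFLICT plays the
# role of WHEN MATCHED THEN UPDATE.
conn.execute("""
    INSERT INTO orders (order_id, status, updated_at)
    SELECT order_id, status, updated_at FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY order_id ORDER BY updated_at DESC) AS rn
        FROM new_orders
    ) WHERE rn = 1
    ON CONFLICT (order_id) DO UPDATE SET
        status = excluded.status,
        updated_at = excluded.updated_at
""")
rows = conn.execute(
    "SELECT order_id, status FROM orders ORDER BY order_id"
).fetchall()
# rows == [(1, 'shipped'), (3, 'pending')]
```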
Streaming Deduplication
# Deduplicate within a watermark window
(spark.readStream
    .table("bronze.events")
    .withWatermark("event_time", "10 minutes")
    .dropDuplicatesWithinWatermark(["event_id"])
    .writeStream
    .option("checkpointLocation", "/checkpoints/dedup")
    .toTable("silver.events")
)
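Conceptually, dropDuplicatesWithinWatermark remembers keys it has seen, drops repeats, and evicts that state once the watermark passes. A simplified pure-Python sketch of that semantics (Spark's actual micro-batch state management differs in the details):

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(minutes=10)

def dedup_stream(events):
    """Yield events whose event_id has not been seen within the watermark."""
    seen = {}  # event_id -> event_time when first seen
    max_event_time = datetime.min
    for event_id, event_time in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - WATERMARK
        # Evict state older than the watermark: duplicates this old are no
        # longer detectable, the same trade-off Spark makes to bound state
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if event_id not in seen:
            seen[event_id] = event_time
            yield (event_id, event_time)

t0 = datetime(2024, 1, 1, 12, 0)
events = [
    ("a", t0),
    ("a", t0 + timedelta(minutes=2)),   # duplicate within watermark: dropped
    ("b", t0 + timedelta(minutes=5)),
    ("a", t0 + timedelta(minutes=30)),  # arrives after eviction: kept again
]
out = [e for e, _ in dedup_stream(events)]
# out == ["a", "b", "a"]
```

The last event shows the trade-off: a duplicate arriving after its key has been evicted is treated as new, which is why the watermark must cover the expected duplicate window.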
Simple Deduplication
-- DISTINCT: remove exact duplicate rows (all columns must match)
SELECT DISTINCT * FROM raw_data;
-- GROUP BY: deduplicate by key, aggregate other columns
SELECT
    customer_id,
    MAX(name) AS name,
    MAX(email) AS email,
    MAX(updated_at) AS updated_at
FROM raw_customers
GROUP BY customer_id;
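One caveat worth demonstrating: because each column is aggregated independently, GROUP BY with per-column MAX() can stitch together values from different source rows for the same key. A sketch with sqlite3 and invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_customers (customer_id INT, name TEXT, email TEXT);
    INSERT INTO raw_customers VALUES
        (1, 'Zoe Adams', 'a@old.com'),
        (1, 'Ann Adams', 'z@new.com');
""")

row = conn.execute("""
    SELECT customer_id, MAX(name), MAX(email)
    FROM raw_customers GROUP BY customer_id
""").fetchone()
# The name comes from one source row and the email from the other
# row == (1, 'Zoe Adams', 'z@new.com')
```

When every column must come from the same record, prefer the ROW_NUMBER() pattern over per-column aggregation.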
On the Exam: ROW_NUMBER() with PARTITION BY is the most common and most testable deduplication pattern. Know how to combine it with MERGE INTO for incremental deduplication during silver-layer processing.
A bronze table has multiple records per order_id due to at-least-once delivery. A data engineer needs to keep only the most recent version of each order in the silver table. Which approach should they use?
How can a data engineer prevent duplicate records from being inserted into a silver table during incremental MERGE operations?