3.8 Data Deduplication Strategies
Key Takeaways
- Deduplication removes duplicate records from datasets, critical for maintaining data quality in the silver layer.
- ROW_NUMBER() with PARTITION BY is the most common deduplication pattern, keeping only the latest or highest-priority record per key.
- MERGE INTO with deduplication logic prevents duplicates from entering the target table during incremental processing.
- Streaming deduplication uses dropDuplicatesWithinWatermark() to deduplicate records within a time window.
- DISTINCT and GROUP BY are simple deduplication approaches for exact matches across all columns.
Data Deduplication Strategies
Quick Answer: Use ROW_NUMBER() OVER (PARTITION BY key ORDER BY timestamp DESC) to keep the latest record per key. Use MERGE INTO for deduplication during incremental loads. Use dropDuplicatesWithinWatermark() for streaming deduplication.
Why Deduplication Matters
Data arrives with duplicates for many reasons:
- At-least-once delivery from message queues (Kafka, Event Hubs)
- Retry logic in ETL pipelines that re-processes records
- Multiple source systems providing overlapping data
- Late-arriving data that triggers reprocessing
- Schema evolution causing records to be re-ingested
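The core "keep one record per key" idea can be illustrated outside Spark. A minimal Python sketch with made-up order records, applying last-update-wins per key:

```python
# Simulate duplicate deliveries of the same order_id and keep only
# the record with the latest updated_at per key (last update wins).
records = [
    {"order_id": 1, "status": "pending", "updated_at": "2024-01-01T10:00"},
    {"order_id": 1, "status": "shipped", "updated_at": "2024-01-02T09:00"},  # redelivery
    {"order_id": 2, "status": "pending", "updated_at": "2024-01-01T11:00"},
    {"order_id": 1, "status": "shipped", "updated_at": "2024-01-02T09:00"},  # exact duplicate
]

latest = {}
for rec in records:
    key = rec["order_id"]
    # Keep the record with the greatest updated_at for this key
    if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
        latest[key] = rec

deduped = list(latest.values())
# Two unique orders remain; order 1 carries its latest status
```

This is exactly what the SQL patterns below express declaratively.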
ROW_NUMBER() Deduplication (Most Common)
-- Keep only the latest record per order_id
SELECT * FROM (
    SELECT *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC
        ) AS row_num
    FROM bronze.raw_orders
)
WHERE row_num = 1;
Variations
-- Keep the FIRST record per key (first arrival wins)
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY _ingest_timestamp ASC)
-- Keep the LATEST record per key (last update wins)
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC)
-- Keep the HIGHEST priority record
ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY priority DESC, updated_at DESC)
MERGE-Based Deduplication
Prevent duplicates during incremental loads. Note that an insert-only MERGE does not deduplicate the source itself: if the source contains several rows with the same new order_id, all of them are inserted, so deduplicate the source first (as in the upsert example below).
-- Only insert records that don't already exist in the target
MERGE INTO silver.orders AS target
USING bronze.raw_orders AS source
ON target.order_id = source.order_id
WHEN NOT MATCHED THEN INSERT *;
-- Upsert with deduplication: update existing, insert new
MERGE INTO silver.orders AS target
USING (
    -- First deduplicate the source
    SELECT * FROM (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY updated_at DESC) AS rn
        FROM bronze.new_orders
    ) WHERE rn = 1
) AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
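MERGE INTO is Delta Lake syntax, but the upsert-with-deduplicated-source idea is portable. A hypothetical sketch using SQLite's INSERT ... ON CONFLICT as the upsert step (table names and data are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INT PRIMARY KEY, status TEXT, updated_at TEXT);
    INSERT INTO orders VALUES (1, 'pending', '2024-01-01');
    CREATE TABLE new_orders (order_id INT, status TEXT, updated_at TEXT);
    INSERT INTO new_orders VALUES
        (1, 'shipped', '2024-01-03'),
        (1, 'pending', '2024-01-02'),  -- stale duplicate, should lose
        (3, 'pending', '2024-01-02');
""")

# Deduplicate the source first, then upsert; ON CONFLICT plays the
# role of WHEN MATCHED THEN UPDATE.
conn.execute("""
    INSERT INTO orders (order_id, status, updated_at)
    SELECT order_id, status, updated_at FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY order_id ORDER BY updated_at DESC) AS rn
        FROM new_orders
    ) WHERE rn = 1
    ON CONFLICT (order_id) DO UPDATE SET
        status = excluded.status,
        updated_at = excluded.updated_at
""")
rows = conn.execute(
    "SELECT order_id, status FROM orders ORDER BY order_id"
).fetchall()
# rows == [(1, 'shipped'), (3, 'pending')]
```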
Streaming Deduplication
# Deduplicate within a watermark window
(spark.readStream
    .table("bronze.events")
    .withWatermark("event_time", "10 minutes")
    .dropDuplicatesWithinWatermark(["event_id"])
    .writeStream
    .option("checkpointLocation", "/checkpoints/dedup")
    .toTable("silver.events")
)
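Conceptually, dropDuplicatesWithinWatermark remembers keys it has seen, drops repeats, and evicts that state once the watermark passes. A simplified pure-Python sketch of that semantics (Spark's actual micro-batch state management differs in the details):

```python
from datetime import datetime, timedelta

WATERMARK = timedelta(minutes=10)

def dedup_stream(events):
    """Yield events whose event_id has not been seen within the watermark."""
    seen = {}  # event_id -> event_time when first seen
    max_event_time = datetime.min
    for event_id, event_time in events:
        max_event_time = max(max_event_time, event_time)
        watermark = max_event_time - WATERMARK
        # Evict state older than the watermark: duplicates this old are no
        # longer detectable, the same trade-off Spark makes to bound state
        seen = {k: t for k, t in seen.items() if t >= watermark}
        if event_id not in seen:
            seen[event_id] = event_time
            yield (event_id, event_time)

t0 = datetime(2024, 1, 1, 12, 0)
events = [
    ("a", t0),
    ("a", t0 + timedelta(minutes=2)),   # duplicate within watermark: dropped
    ("b", t0 + timedelta(minutes=5)),
    ("a", t0 + timedelta(minutes=30)),  # arrives after eviction: kept again
]
out = [e for e, _ in dedup_stream(events)]
# out == ["a", "b", "a"]
```

The last event shows the trade-off: a duplicate arriving after its key has been evicted is treated as new, which is why the watermark must cover the expected duplicate window.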
Simple Deduplication
-- DISTINCT: remove exact duplicate rows (all columns must match)
SELECT DISTINCT * FROM raw_data;
-- GROUP BY: deduplicate by key, aggregate other columns
SELECT
    customer_id,
    MAX(name) AS name,
    MAX(email) AS email,
    MAX(updated_at) AS updated_at
FROM raw_customers
GROUP BY customer_id;
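One caveat worth demonstrating: because each column is aggregated independently, GROUP BY with per-column MAX() can stitch together values from different source rows for the same key. A sketch with sqlite3 and invented data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_customers (customer_id INT, name TEXT, email TEXT);
    INSERT INTO raw_customers VALUES
        (1, 'Zoe Adams', 'a@old.com'),
        (1, 'Ann Adams', 'z@new.com');
""")

row = conn.execute("""
    SELECT customer_id, MAX(name), MAX(email)
    FROM raw_customers GROUP BY customer_id
""").fetchone()
# The name comes from one source row and the email from the other
# row == (1, 'Zoe Adams', 'z@new.com')
```

When every column must come from the same record, prefer the ROW_NUMBER() pattern over per-column aggregation.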
On the Exam: ROW_NUMBER() with PARTITION BY is the most common and most testable deduplication pattern. Know how to combine it with MERGE INTO for incremental deduplication during silver-layer processing.
A bronze table has multiple records per order_id due to at-least-once delivery. A data engineer needs to keep only the most recent version of each order in the silver table. Which approach should they use?
How can a data engineer prevent duplicate records from being inserted into a silver table during incremental MERGE operations?