3.8 Data Deduplication Strategies

Key Takeaways

  • Deduplication removes repeated records and is a core silver-layer responsibility for trustworthy data.
  • ROW_NUMBER() with PARTITION BY ... ORDER BY, keeping rn = 1, is the standard way to retain the latest record per key.
  • MERGE INTO on a key prevents duplicates from entering a target during incremental upserts.
  • Streaming deduplication uses dropDuplicates or dropDuplicatesWithinWatermark to bound how long state is kept for late duplicates.
  • DISTINCT and GROUP BY handle exact whole-row duplicates but cannot pick a 'best' row per key the way ROW_NUMBER can.
Last updated: June 2026

Why Deduplicate

Duplicate records sneak in constantly: at-least-once streaming delivery, retried batches, overlapping source extracts, and CDC replays. Deduplication removes these repeats and is one of the defining jobs of the silver layer. The right technique depends on whether duplicates are exact whole-row copies or multiple versions of the same key where you must keep one "best" row.

ROW_NUMBER(): Keep the Best Row per Key

The most flexible and most-tested pattern uses ROW_NUMBER() to rank rows within each key and keep only the top one:

WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY event_ts DESC
    ) AS rn
  FROM bronze_orders
)
SELECT * EXCEPT (rn) FROM ranked WHERE rn = 1;

Here PARTITION BY order_id groups duplicates of the same key, ORDER BY event_ts DESC ranks newest first, and WHERE rn = 1 keeps only the latest version of each order. Change the ORDER BY to keep the highest-priority or earliest record instead. This pattern wins on the exam because it lets you choose which duplicate survives — something DISTINCT cannot do.

DISTINCT, GROUP BY, MERGE

DISTINCT and GROUP BY remove exact duplicates — rows identical across all selected columns:

SELECT DISTINCT * FROM staging;

They are simple but blunt: they cannot decide which of several differing rows for the same key to keep, so use them only when duplicates are truly identical.

MERGE INTO prevents duplicates during incremental loads. By matching the incoming batch to the target on a key and updating instead of inserting on a match, MERGE guarantees one row per key no matter how many times a record is delivered:

MERGE INTO silver t USING updates s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

This is the idempotent upsert pattern from incremental processing, applied specifically to keep the target duplicate-free across runs.

TechniqueBest for
ROW_NUMBER() rn=1Keep one chosen row per key (latest/best)
DISTINCT / GROUP BYRemove exact whole-row duplicates
MERGE INTOPrevent duplicates on incremental upsert
dropDuplicatesWithinWatermarkStreaming dedupe with bounded state

Streaming Deduplication

In Structured Streaming, dropDuplicates removes duplicate rows, but naively it must remember every key forever to detect a future duplicate — unbounded, ever-growing state. The fix is dropDuplicatesWithinWatermark() (paired with a watermark), which only deduplicates within a bounded time window and then expires old state:

(stream
  .withWatermark("event_ts", "1 hour")
  .dropDuplicatesWithinWatermark(["order_id"]))

This keeps deduplication state for only the watermark interval (here one hour), so memory stays bounded while still catching duplicates that arrive close together. The trade-off: a duplicate arriving after the watermark window may slip through, so size the window to your real duplication latency.

Choosing a strategy on the exam

  • "Keep the most recent record per customer" -> ROW_NUMBER() ordering by timestamp DESC, rn = 1.
  • "Rows are exact copies" -> DISTINCT or GROUP BY.
  • "Avoid inserting duplicates during an incremental load" -> MERGE on the key.
  • "Deduplicate a stream without unbounded state" -> dropDuplicatesWithinWatermark with a watermark.

Matching the scenario to the correct tool — especially recognizing when bounded streaming state is the constraint — is the heart of this objective.

Deduplication in the Multi-Hop Flow

Deduplication is almost always a silver-layer task: bronze deliberately keeps every raw record (including duplicates) for replayability, and silver is where you collapse them to one trustworthy row per entity. A typical silver build combines two defenses:

  1. Within-batch dedupe using ROW_NUMBER to pick the best record from the incoming data.
  2. Cross-batch dedupe using MERGE so a record already in the target is updated, not re-inserted.
MERGE INTO silver_orders t
USING (
  SELECT * EXCEPT(rn) FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_ts DESC) rn
    FROM bronze_orders_batch)
  WHERE rn = 1) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

This pattern is robust: the inner ROW_NUMBER removes duplicates inside the batch, and the outer MERGE removes duplicates across batches.

Edge cases and gotchas

  • dropDuplicates without a watermark in a stream eventually exhausts memory because state grows unbounded — always pair streaming dedupe with a watermark for production.
  • DISTINCT shuffles all columns, so on wide tables it is expensive; prefer ROW_NUMBER on the key when you only need uniqueness per key.
  • MERGE with multiple source rows per key fails unless you pre-deduplicate the source — "cannot perform Merge as multiple source rows matched" is a real error and a common exam scenario. The fix is the inner ROW_NUMBER shown above.
  • collect_set can deduplicate values within a grouped aggregation when you are building arrays rather than rows.
  • Idempotency vs deduplication are related but distinct: an idempotent MERGE keeps the target correct on retries, while deduplication removes duplicates that are genuinely present in the source data. Production silver tables usually need both — a ROW_NUMBER dedupe of the incoming batch followed by a key-based MERGE — so that neither source duplicates nor pipeline retries can corrupt the one-row-per-key guarantee.
Test Your Knowledge

You must keep only the most recent record for each order_id from a table containing multiple versions. Which approach is most appropriate?

A
B
C
D
Test Your Knowledge

Why is dropDuplicatesWithinWatermark preferred over a plain dropDuplicates in a long-running stream?

A
B
C
D
Test Your Knowledge

During an incremental load, which technique best prevents duplicate rows from entering the target table when the same record may be delivered more than once?

A
B
C
D