You must keep only the most recent record for each order_id from a table containing multiple versions. Which approach is most appropriate?

ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_ts DESC) and filter rn = 1. Ranking rows per order_id by timestamp descending and keeping rn = 1 selects exactly the latest version of each key, which DISTINCT cannot do because the versions differ in their values. CROSS JOIN and VACUUM are unrelated to choosing a row per key.

Why is dropDuplicatesWithinWatermark preferred over a plain dropDuplicates in a long-running stream?

It bounds the deduplication state to the watermark window instead of remembering every key forever. A plain dropDuplicates must retain every seen key indefinitely to detect future duplicates, growing state without limit, whereas dropDuplicatesWithinWatermark expires state outside the watermark window to keep memory bounded. The trade-off is that very late duplicates past the window may slip through.

During an incremental load, which technique best prevents duplicate rows from entering the target table when the same record may be delivered more than once?

MERGE INTO matching on the key, updating on match and inserting only when not matched. MERGE matches incoming records to existing rows by key and updates rather than inserting on a match, ensuring one row per key even when a record is delivered repeatedly. A plain INSERT would accumulate duplicates, and VACUUM and CROSS JOIN do not deduplicate by key.

Data Deduplication Strategies — Free Study Guide 2026

Key Takeaways

Deduplication removes repeated records and is a core silver-layer responsibility for trustworthy data.
ROW_NUMBER() with PARTITION BY ... ORDER BY, keeping rn = 1, is the standard way to retain the latest record per key.
MERGE INTO on a key prevents duplicates from entering a target during incremental upserts.
Streaming deduplication uses dropDuplicates or dropDuplicatesWithinWatermark to bound how long state is kept for late duplicates.
DISTINCT and GROUP BY handle exact whole-row duplicates but cannot pick a 'best' row per key the way ROW_NUMBER can.

Why Deduplicate

Duplicate records sneak in constantly: at-least-once streaming delivery, retried batches, overlapping source extracts, and CDC replays. Deduplication removes these repeats and is one of the defining jobs of the silver layer. The right technique depends on whether duplicates are exact whole-row copies or multiple versions of the same key where you must keep one "best" row.

ROW_NUMBER(): Keep the Best Row per Key

The most flexible and most-tested pattern uses ROW_NUMBER() to rank rows within each key and keep only the top one:

WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY order_id
      ORDER BY event_ts DESC
    ) AS rn
  FROM bronze_orders
)
SELECT * EXCEPT (rn) FROM ranked WHERE rn = 1;

Here PARTITION BY order_id groups duplicates of the same key, ORDER BY event_ts DESC ranks newest first, and WHERE rn = 1 keeps only the latest version of each order. Change the ORDER BY to keep the highest-priority or earliest record instead. This pattern wins on the exam because it lets you choose which duplicate survives — something DISTINCT cannot do.

DISTINCT, GROUP BY, MERGE

DISTINCT and GROUP BY remove exact duplicates — rows identical across all selected columns:

SELECT DISTINCT * FROM staging;

They are simple but blunt: they cannot decide which of several differing rows for the same key to keep, so use them only when duplicates are truly identical.

MERGE INTO prevents duplicates during incremental loads. By matching the incoming batch to the target on a key and updating instead of inserting on a match, MERGE guarantees one row per key no matter how many times a record is delivered:

MERGE INTO silver t USING updates s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

This is the idempotent upsert pattern from incremental processing, applied specifically to keep the target duplicate-free across runs.

Technique	Best for
ROW_NUMBER() rn=1	Keep one chosen row per key (latest/best)
DISTINCT / GROUP BY	Remove exact whole-row duplicates
MERGE INTO	Prevent duplicates on incremental upsert
dropDuplicatesWithinWatermark	Streaming dedupe with bounded state

Streaming Deduplication

In Structured Streaming, dropDuplicates removes duplicate rows, but naively it must remember every key forever to detect a future duplicate — unbounded, ever-growing state. The fix is dropDuplicatesWithinWatermark() (paired with a watermark), which only deduplicates within a bounded time window and then expires old state:

(stream
  .withWatermark("event_ts", "1 hour")
  .dropDuplicatesWithinWatermark(["order_id"]))

This keeps deduplication state for only the watermark interval (here one hour), so memory stays bounded while still catching duplicates that arrive close together. The trade-off: a duplicate arriving after the watermark window may slip through, so size the window to your real duplication latency.

Choosing a strategy on the exam

"Keep the most recent record per customer" -> ROW_NUMBER() ordering by timestamp DESC, rn = 1.
"Rows are exact copies" -> DISTINCT or GROUP BY.
"Avoid inserting duplicates during an incremental load" -> MERGE on the key.
"Deduplicate a stream without unbounded state" -> dropDuplicatesWithinWatermark with a watermark.

Matching the scenario to the correct tool — especially recognizing when bounded streaming state is the constraint — is the heart of this objective.

Deduplication in the Multi-Hop Flow

Deduplication is almost always a silver-layer task: bronze deliberately keeps every raw record (including duplicates) for replayability, and silver is where you collapse them to one trustworthy row per entity. A typical silver build combines two defenses:

Within-batch dedupe using ROW_NUMBER to pick the best record from the incoming data.
Cross-batch dedupe using MERGE so a record already in the target is updated, not re-inserted.

MERGE INTO silver_orders t
USING (
  SELECT * EXCEPT(rn) FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY event_ts DESC) rn
    FROM bronze_orders_batch)
  WHERE rn = 1) s
ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

This pattern is robust: the inner ROW_NUMBER removes duplicates inside the batch, and the outer MERGE removes duplicates across batches.

Edge cases and gotchas

dropDuplicates without a watermark in a stream eventually exhausts memory because state grows unbounded — always pair streaming dedupe with a watermark for production.
DISTINCT shuffles all columns, so on wide tables it is expensive; prefer ROW_NUMBER on the key when you only need uniqueness per key.
MERGE with multiple source rows per key fails unless you pre-deduplicate the source — "cannot perform Merge as multiple source rows matched" is a real error and a common exam scenario. The fix is the inner ROW_NUMBER shown above.
collect_set can deduplicate values within a grouped aggregation when you are building arrays rather than rows.
Idempotency vs deduplication are related but distinct: an idempotent MERGE keeps the target correct on retries, while deduplication removes duplicates that are genuinely present in the source data. Production silver tables usually need both — a ROW_NUMBER dedupe of the incoming batch followed by a key-based MERGE — so that neither source duplicates nor pipeline retries can corrupt the one-row-per-key guarantee.

Databricks Certified Data Engineer Associate

Databricks Certified Data Engineer Associate

3.8 Data Deduplication Strategies

Key Takeaways

Why Deduplicate

ROW_NUMBER(): Keep the Best Row per Key

DISTINCT, GROUP BY, MERGE

Streaming Deduplication

Choosing a strategy on the exam

Deduplication in the Multi-Hop Flow

Edge cases and gotchas

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

Databricks Certified Data Engineer Associate

3.8 Data Deduplication Strategies

Key Takeaways

Why Deduplicate

ROW_NUMBER(): Keep the Best Row per Key

DISTINCT, GROUP BY, MERGE

Streaming Deduplication

Choosing a strategy on the exam

Deduplication in the Multi-Hop Flow

Edge cases and gotchas