4.6 Designing Idempotent and Recoverable Pipelines

Key Takeaways

An idempotent pipeline produces the same result whether it runs once or is retried, preventing duplicates after failures and reruns.
Structured Streaming uses a checkpoint location to record offsets, committed batches, and any operator state (optionally kept in a RocksDB state store), enabling exactly-once recovery after a restart when the sink supports it, as Delta does.
Auto Loader tracks discovered files in its checkpoint to guarantee exactly-once ingestion; never delete or share a checkpoint across streams.
foreachBatch provides only at-least-once delivery, so the batch function must be idempotent — use a Delta MERGE keyed on a business key to upsert without duplicates.
Setting txnAppId and txnVersion (idempotent writes) makes Delta writes inside foreachBatch deduplicate on retry, restoring exactly-once behavior.

Last updated: June 2026

Idempotency: Why It Matters

A pipeline is idempotent when running it twice produces the same final state as running it once. In production, jobs fail mid-run, retries fire, and operators rerun yesterday's load — without idempotency, each of those creates duplicate rows or double-counted metrics. Designing for idempotency is therefore the core of recoverable pipeline design.

The two failure realities you design around:

Retries and repair runs re-execute work that may have partially completed.
Reprocessing (a manual rerun or backfill) replays the same source data.

The goal is exactly-once effective results even when underlying delivery is at-least-once. You achieve it by making writes deduplicate themselves — via checkpoints, keyed MERGEs, and Delta's idempotent-write transaction markers.
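To make the definition concrete, here is a minimal sketch in plain Python. The "tables" are ordinary lists and dicts rather than Delta tables, and the names `append_load`, `upsert_load`, and `order_id` are illustrative assumptions. It replays the same batch twice, as a retry would, and compares an append with a keyed upsert.

```python
# Toy model: plain Python data stands in for a table.

def append_load(table: list, batch: list) -> None:
    """Non-idempotent write: every run appends the batch again."""
    table.extend(batch)


def upsert_load(table: dict, batch: list, key: str) -> None:
    """Idempotent write: rows are keyed, so a rerun overwrites in place."""
    for row in batch:
        table[row[key]] = row


batch = [
    {"order_id": 1, "amount": 20},
    {"order_id": 2, "amount": 35},
]

appended: list = []
upserted: dict = {}

# Simulate the original run plus one retry of the same batch.
for _ in range(2):
    append_load(appended, batch)
    upsert_load(upserted, batch, key="order_id")

appended_total = sum(r["amount"] for r in appended)
upserted_total = sum(r["amount"] for r in upserted.values())

print(len(appended), appended_total)  # 4 110 (rows and revenue doubled)
print(len(upserted), upserted_total)  # 2 55  (same as a single run)
```

The append path is the failure mode the section warns about; the keyed path reaches the same final state no matter how many times the batch is replayed.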

Checkpoints and Exactly-Once Recovery

Every Structured Streaming query needs a checkpoint location: a durable path that records the stream's offsets (what source data has been read) and which micro-batches have been committed. Stateful operators also keep their state there, optionally in a RocksDB state store. On restart, the query reads the checkpoint and resumes exactly where it stopped, giving exactly-once processing for supported sinks (like Delta).

Rules the exam tests:

Give each stream its own checkpoint; never share or copy one between queries.
Put the checkpoint on storage without a lifecycle/expiration policy so files are not deleted out from under the stream.
Deleting the checkpoint resets the stream to reprocess from the start (or your configured starting offset).
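The resume behavior and the effect of deleting a checkpoint can be modeled in a few lines of plain Python. This is a toy sketch, not how the engine is implemented: the checkpoint is a dict holding the next offset, the sink is a list, and writing a record and recording progress are treated as one step, which the real engine has to coordinate with the sink.

```python
class SimulatedCrash(Exception):
    """Stands in for a cluster failure partway through a run."""


def run_stream(source, checkpoint, sink, crash_at=None):
    """Process records starting from the checkpointed offset.

    `checkpoint` is a dict holding the next offset to read; in a real
    query this information lives in the checkpoint directory.
    """
    offset = checkpoint.get("next_offset", 0)
    while offset < len(source):
        if crash_at is not None and offset == crash_at:
            raise SimulatedCrash(f"failed before offset {offset}")
        sink.append(source[offset])         # write the record
        offset += 1
        checkpoint["next_offset"] = offset  # then record progress
    return sink


source = ["a", "b", "c", "d"]
checkpoint, sink = {}, []

# First run dies after two records.
try:
    run_stream(source, checkpoint, sink, crash_at=2)
except SimulatedCrash:
    pass
after_crash = list(sink)

# Restart with the same checkpoint: resumes at offset 2.
run_stream(source, checkpoint, sink)
after_resume = list(sink)

# Restart with an empty checkpoint (as if it had been deleted):
# the source is read from the start and an append sink duplicates.
run_stream(source, {}, sink)
after_reset = list(sink)

print(after_crash)   # ['a', 'b']
print(after_resume)  # ['a', 'b', 'c', 'd']
print(after_reset)   # ['a', 'b', 'c', 'd', 'a', 'b', 'c', 'd']
```

The last run shows why deleting a checkpoint is a deliberate reprocessing decision rather than routine cleanup.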

Auto Loader builds on this: it records every discovered file in its checkpoint (RocksDB), so a file is ingested exactly once even across restarts — you never reprocess a file already seen.
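As a rough sketch of what this looks like in PySpark, assuming a Databricks runtime (Auto Loader's `cloudFiles` source is not part of open-source Spark) and Spark 3.3 or later for `availableNow`. The function names, the JSON format, and the paths and table names in the usage comment are hypothetical; `spark` is assumed to be an active SparkSession passed in by the caller.

```python
def checkpoint_path(base: str, stream_name: str) -> str:
    """Derive one checkpoint directory per stream so none is shared."""
    return f"{base.rstrip('/')}/{stream_name}"


def start_bronze_ingest(spark, source_dir: str, target_table: str,
                        stream_name: str, checkpoint_base: str):
    """Start an Auto Loader stream that lands files in a Bronze table."""
    ckpt = checkpoint_path(checkpoint_base, stream_name)
    return (
        spark.readStream
        .format("cloudFiles")                       # Auto Loader source
        .option("cloudFiles.format", "json")        # assumed file format
        .option("cloudFiles.schemaLocation", ckpt)  # where the schema is tracked
        .load(source_dir)
        .writeStream
        .option("checkpointLocation", ckpt)         # offsets and discovered files
        .trigger(availableNow=True)                 # process the backlog, then stop
        .toTable(target_table)
    )


# Usage on a cluster (hypothetical names):
# start_bronze_ingest(spark, "/landing/orders", "bronze_orders",
#                     stream_name="orders_bronze", checkpoint_base="/chk")
```

Deriving the checkpoint path from a per-stream name is one simple way to follow the one-checkpoint-per-stream rule by construction.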

foreachBatch, MERGE, and Idempotent Upserts

Not every operation is exactly-once. foreachBatch — which lets you run arbitrary batch logic (including writes to multiple tables) on each micro-batch — provides only at-least-once delivery, because a batch can be reprocessed after a failure. Therefore your foreachBatch function must be idempotent.

The canonical idempotent write is a Delta MERGE keyed on a business/natural key:

MERGE INTO gold g
USING updates u ON g.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *

Reprocessing the same micro-batch simply re-updates existing rows instead of inserting duplicates. For Gold aggregation tables, update mode with foreachBatch + MERGE is usually cheaper than complete mode on large data, because update mode emits only the rows that changed in each micro-batch, while complete mode rewrites the entire result table every time. This MERGE-on-key pattern is also how you implement deduplicated upserts and Slowly Changing Dimensions in the medallion architecture.
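One way to wire the MERGE above into foreachBatch is sketched here in PySpark. It assumes Spark 3.3 or later (for the `DataFrame.sparkSession` property), an existing Delta table named `gold`, and a key column `id`, matching the SQL above. The helper names and the `updates_stream` variable in the wiring comment are hypothetical.

```python
def build_merge_sql(target: str, source_view: str, key: str) -> str:
    """Return the keyed MERGE statement to run for each micro-batch."""
    return (
        f"MERGE INTO {target} g "
        f"USING {source_view} u ON g.{key} = u.{key} "
        "WHEN MATCHED THEN UPDATE SET * "
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert_to_gold(micro_batch_df, batch_id: int) -> None:
    """foreachBatch function that is safe to run again for the same batch."""
    # MERGE fails when several source rows match one target row, so keep
    # a single row per key first. Which duplicate survives is a design
    # choice that this sketch does not address.
    deduped = micro_batch_df.dropDuplicates(["id"])
    deduped.createOrReplaceTempView("updates")
    # Run the SQL on the batch's own session so the temp view is visible.
    deduped.sparkSession.sql(build_merge_sql("gold", "updates", "id"))


# Wiring on a cluster (needs a live streaming DataFrame):
# (updates_stream.writeStream
#     .foreachBatch(upsert_to_gold)
#     .outputMode("update")
#     .option("checkpointLocation", "<checkpoint path>")
#     .start())
```

The `batch_id` argument goes unused here because the MERGE key alone makes the write repeatable.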

Delta Idempotent Writes (txnAppId / txnVersion)

When you write to Delta inside foreachBatch, a retried batch could write the same data twice. Delta solves this with idempotent writes: tag each write with txnAppId (a stable application identifier) and txnVersion (a monotonically increasing number, typically the batchId passed to foreachBatch). Delta records the latest committed txnVersion for each txnAppId; if a retry presents a txnVersion at or below that value, Delta skips the write, restoring exactly-once semantics. Because batchId starts again from 0 when a stream is restarted with a new checkpoint, use a new txnAppId in that case so that fresh batches are not mistaken for retries.
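The sketch below shows both halves of this idea under stated assumptions. `write_batch_idempotently` uses the `txnAppId` and `txnVersion` writer options on a Delta write; the application id and the table name `silver_orders` are hypothetical. `ToyDeltaTable` is a pure-Python stand-in that mimics the skip rule so you can see it run without a cluster. It is not Delta's actual implementation.

```python
APP_ID = "orders_silver_stream"  # hypothetical; keep it stable across retries


def write_batch_idempotently(batch_df, batch_id: int) -> None:
    """foreachBatch function: tag the Delta write so a retry is skipped."""
    (
        batch_df.write.format("delta")
        .option("txnAppId", APP_ID)      # identifies the writing application
        .option("txnVersion", batch_id)  # increases with every micro-batch
        .mode("append")
        .saveAsTable("silver_orders")    # hypothetical table name
    )


class ToyDeltaTable:
    """Pure-Python model of the dedup rule, for illustration only."""

    def __init__(self):
        self.rows = []
        self.last_version = {}  # txnAppId -> highest committed txnVersion

    def write(self, rows, app_id: str, version: int) -> bool:
        if version <= self.last_version.get(app_id, -1):
            return False  # already committed, so the retry is skipped
        self.rows.extend(rows)
        self.last_version[app_id] = version
        return True


table = ToyDeltaTable()
first = table.write(["r1", "r2"], APP_ID, 0)
retry = table.write(["r1", "r2"], APP_ID, 0)  # same batch replayed
following = table.write(["r3"], APP_ID, 1)

print(first, retry, following)  # True False True
print(table.rows)               # ['r1', 'r2', 'r3']
```

Unlike a keyed MERGE, this guard works for plain appends, which makes it a natural fit when a foreachBatch function writes the same batch to more than one Delta table.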

Mechanism	Guarantee	Use
Checkpoint	exactly-once recovery of stream offsets	every stream
Auto Loader file tracking	each file ingested once	file ingestion
MERGE on key	idempotent upsert	foreachBatch / Gold
txnAppId + txnVersion	dedup Delta writes on retry	writes inside foreachBatch

Together these patterns let a pipeline fail, retry, and be re-run without ever double-writing — the definition of a recoverable, idempotent production pipeline.

Delivery Guarantees You Must Know

The exam contrasts three delivery semantics. Pin them down:

Guarantee	Meaning	Example
At-most-once	May lose data, never duplicates	fire-and-forget, rarely used
At-least-once	Never loses, may duplicate	foreachBatch, must add idempotency
Exactly-once	No loss, no duplicates	Delta sink + checkpoint, or MERGE/txn dedup

Structured Streaming with a Delta sink and a checkpoint gives exactly-once end-to-end for that sink. The moment you step outside it — writing to an external system, or to multiple tables inside foreachBatch — you drop to at-least-once and must restore exactly-once yourself with keyed MERGE or txnAppId/txnVersion.
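The three guarantees can be told apart with a small deterministic simulation in plain Python. This is a toy model with hypothetical names: each message may suffer one fault, either `"lost"` (it never arrived) or `"ack_lost"` (it arrived, but the sender never heard back), and the only difference between the modes is whether the sender retries and whether the sink is keyed.

```python
def deliver(messages, faults, retry: bool):
    """Return what reaches the sink, in arrival order.

    faults maps a message id to "lost" or "ack_lost". With retry=True
    the sender resends after any fault, since it cannot tell them apart.
    """
    arrived = []
    for msg in messages:
        fault = faults.get(msg["id"])
        if fault != "lost":
            arrived.append(msg)  # first attempt reached the sink
        if fault is not None and retry:
            arrived.append(msg)  # resend after a fault
    return arrived


messages = [{"id": i, "value": i * 10} for i in range(1, 5)]
faults = {2: "lost", 3: "ack_lost"}

at_most_once = deliver(messages, faults, retry=False)
at_least_once = deliver(messages, faults, retry=True)
# Same at-least-once delivery, but written into a keyed (idempotent) sink.
exactly_once = {m["id"]: m for m in at_least_once}

print([m["id"] for m in at_most_once])   # [1, 3, 4]       message 2 is lost
print([m["id"] for m in at_least_once])  # [1, 2, 3, 3, 4] message 3 duplicated
print(sorted(exactly_once))              # [1, 2, 3, 4]
```

Nothing about the delivery changes between the second and third result; only the sink does, which is exactly the move from at-least-once to exactly-once effective results.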

Designing Backfills and Reruns

Reprocessing is a first-class scenario, not an emergency. A well-designed pipeline lets an operator rerun any day's load safely because every write is keyed:

Backfills replay source data; the MERGE-on-key upsert ensures the second run overwrites rather than duplicates.
Repair runs rerun failed tasks; idempotent writes make the partial first attempt harmless.
Schema-stable keys (a natural business key, not a row-arrival timestamp) are what make MERGE deduplication reliable across reruns.
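The point about key choice in the list above is easy to demonstrate with a toy model in plain Python. The order ids, amounts, and run timestamps are made up for illustration. The same source rows are loaded twice, once by the original run and once by a backfill, and upserted under two different keys.

```python
def ingest(source_rows, ingested_at: str):
    """Stamp each source row with its arrival time, as a load job might."""
    return [{**row, "ingested_at": ingested_at} for row in source_rows]


def upsert(table: dict, rows, key_fn) -> None:
    """Keyed upsert: the last write for a key wins."""
    for row in rows:
        table[key_fn(row)] = row


source = [
    {"order_id": "A-1", "amount": 20},
    {"order_id": "A-2", "amount": 35},
]

by_business_key: dict = {}
by_arrival_time: dict = {}

# The original load, then a backfill of the same day two days later.
for run_time in ("2026-06-01T02:00", "2026-06-03T09:30"):
    rows = ingest(source, run_time)
    upsert(by_business_key, rows, key_fn=lambda r: r["order_id"])
    upsert(by_arrival_time, rows,
           key_fn=lambda r: (r["order_id"], r["ingested_at"]))

print(len(by_business_key))  # 2: the backfill overwrote the first load
print(len(by_arrival_time))  # 4: every rerun looks like new data
```

A key that includes anything assigned at load time changes on every rerun, so the upsert can no longer recognize rows it has already written.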

Idempotency is therefore a design property you build into every layer of the medallion flow, from Bronze to Gold, not a patch you add after the first duplicate-row incident.

Exam pointers

Lock in: a checkpoint records stream offsets and progress and gives exactly-once recovery, with one checkpoint per stream on storage without a lifecycle policy; Auto Loader tracks discovered files in its checkpoint (RocksDB) for exactly-once ingestion; foreachBatch is at-least-once, so the batch must be idempotent via a MERGE on a business key; and Delta idempotent writes using txnAppId + txnVersion dedup retried writes. Together these make a pipeline safe to fail, retry, and reprocess without duplicates.

When in doubt, ask whether re-running the job would change the final table: if it would create duplicates, the write is not idempotent and needs a keyed MERGE or txnAppId/txnVersion guard before it is production-ready.

Test Your Knowledge

What does the checkpoint location provide for a Structured Streaming query?

It caches the entire source dataset in memory

It durably records offsets and progress so the query resumes exactly where it stopped after a restart

It compresses the output files

It disables retries

Test Your Knowledge

Because foreachBatch provides only at-least-once delivery, what is the recommended way to write to a Gold table without creating duplicates on retry?

Use a Delta MERGE keyed on a business key so reprocessing updates existing rows instead of inserting duplicates

Use a plain INSERT and deduplicate later

Disable the checkpoint

Switch the sink to a CSV file

Test Your Knowledge

Which Delta feature makes a write inside foreachBatch skip data it has already committed when a batch is retried?

Z-ORDER

VACUUM

Idempotent writes via txnAppId and txnVersion

Liquid clustering

Test Your Knowledge

Which checkpoint practice is correct for production streaming?

Share one checkpoint across all streams to save storage

Store the checkpoint where a lifecycle policy deletes old files

Give each stream its own checkpoint on storage without an expiration/lifecycle policy

Delete the checkpoint nightly to keep it small

Databricks Certified Data Engineer Associate
