4.6 Designing Idempotent and Recoverable Pipelines
Key Takeaways
- An idempotent pipeline produces the same result whether it runs once or is retried, preventing duplicates after failures and reruns.
- Structured Streaming uses a checkpoint location to record offsets and progress, enabling exactly-once recovery after a restart when paired with a supported sink such as Delta.
- Auto Loader tracks discovered files in its checkpoint to guarantee exactly-once ingestion; never delete or share a checkpoint across streams.
- foreachBatch provides only at-least-once delivery, so the batch function must be idempotent — use a Delta MERGE keyed on a business key to upsert without duplicates.
- Setting txnAppId and txnVersion (idempotent writes) makes Delta writes inside foreachBatch deduplicate on retry, restoring exactly-once behavior.
Idempotency: Why It Matters
A pipeline is idempotent when running it twice produces the same final state as running it once. In production, jobs fail mid-run, retries fire, and operators rerun yesterday's load — without idempotency, each of those creates duplicate rows or double-counted metrics. Designing for idempotency is therefore the core of recoverable pipeline design.
The two failure realities you design around:
- Retries and repair runs re-execute work that may have partially completed.
- Reprocessing (a manual rerun or backfill) replays the same source data.
The goal is exactly-once effective results even when underlying delivery is at-least-once. You achieve it by making writes deduplicate themselves — via checkpoints, keyed MERGEs, and Delta's idempotent-write transaction markers.
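To make the contrast concrete, here is a minimal pure-Python sketch with no Spark involved. It models a table as in-memory data and compares a blind append with an upsert on a key when the same batch runs twice. The function names `append_write` and `keyed_upsert` and the sample rows are hypothetical, chosen only for illustration.

```python
# Toy model: a "table" is plain Python data; all names here are illustrative.

def append_write(table, batch):
    """Blind append: every run adds the batch again."""
    return table + list(batch)

def keyed_upsert(table, batch, key="id"):
    """Upsert on a key: a rerun overwrites the same rows instead of adding new ones."""
    merged = {row[key]: row for row in table}
    for row in batch:
        merged[row[key]] = row
    return list(merged.values())

batch = [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}]

# Simulate a run followed by a retry of the same batch.
appended = append_write(append_write([], batch), batch)
upserted = keyed_upsert(keyed_upsert([], batch), batch)

print(len(appended))  # 4 rows: the retry duplicated everything
print(len(upserted))  # 2 rows: same state as a single run
```

The keyed version reaches the same final state no matter how many times the batch is replayed, which is the property the rest of this section builds on.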
Checkpoints and Exactly-Once Recovery
Every Structured Streaming query needs a checkpoint location: a durable path that records the stream's offsets (what source data has been read) and commit progress as write-ahead logs, alongside any state kept by stateful operators (which can be held in a RocksDB state store). On restart, the query reads the checkpoint and resumes where it stopped, giving exactly-once processing for supported sinks such as Delta.
Rules the exam tests:
- Give each stream its own checkpoint; never share or copy one between queries.
- Put the checkpoint on storage without a lifecycle/expiration policy so files are not deleted out from under the stream.
- Deleting the checkpoint resets the stream to reprocess from the start (or your configured starting offset).
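The rules above translate into a stream definition along the following lines. This is a sketch that assumes a PySpark environment with Delta Lake available; the function name, table names, and checkpoint path are placeholders, and the `spark` session is passed in rather than created here.

```python
def start_bronze_to_silver(spark, source_table, target_table, checkpoint_path):
    """Start one streaming query with its own dedicated checkpoint location.

    Assumes `spark` is an active SparkSession with Delta Lake available.
    """
    return (
        spark.readStream.table(source_table)
        .writeStream.format("delta")
        .outputMode("append")
        # One checkpoint path per query; never reuse it for another stream.
        .option("checkpointLocation", checkpoint_path)
        .toTable(target_table)
    )

# Hypothetical usage (paths and table names are placeholders):
# query = start_bronze_to_silver(
#     spark, "bronze.events", "silver.events", "/checkpoints/silver_events"
# )
```

Keeping the checkpoint path as an explicit parameter makes it harder to accidentally point two queries at the same location.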
Auto Loader builds on this: it records every discovered file in its checkpoint (RocksDB), so a file is ingested exactly once even across restarts — you never reprocess a file already seen.
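An Auto Loader stream might look like the sketch below. It assumes a Databricks runtime (the `cloudFiles` source is not part of open-source Spark), JSON input files, and a schema location stored under the same path as the checkpoint; the function name and all paths are hypothetical.

```python
def start_auto_loader(spark, source_path, target_table, checkpoint_path):
    """Ingest new files incrementally with Auto Loader (Databricks only).

    Assumes JSON input; the inferred schema is stored under the checkpoint path.
    """
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(source_path)
        .writeStream
        # File-discovery state lives under this path; if it is lost,
        # already-ingested files can be picked up again.
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)
        .toTable(target_table)
    )

# Hypothetical usage:
# start_auto_loader(spark, "/landing/events", "bronze.events", "/checkpoints/bronze_events")
```

The `availableNow` trigger processes whatever files are pending and then stops, which suits scheduled jobs; a continuously running stream would omit it.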
foreachBatch, MERGE, and Idempotent Upserts
Not every operation is exactly-once. foreachBatch — which lets you run arbitrary batch logic (including writes to multiple tables) on each micro-batch — provides only at-least-once delivery, because a batch can be reprocessed after a failure. Therefore your foreachBatch function must be idempotent.
The canonical idempotent write is a Delta MERGE keyed on a business/natural key:
```sql
-- Upsert the micro-batch (updates) into gold, matching on the business key id.
MERGE INTO gold g
USING updates u ON g.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
Reprocessing the same micro-batch simply re-updates existing rows instead of inserting duplicates. For Gold aggregation tables on large data, update mode with foreachBatch + MERGE is typically faster than complete mode, because complete mode rewrites the whole result table on every micro-batch while update mode passes only the changed rows to the MERGE. This MERGE-on-key pattern is also how you implement deduplicated upserts and Slowly Changing Dimensions in the medallion architecture.
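Wired into a stream, the MERGE could be run from a foreachBatch function roughly as follows. This is a sketch under stated assumptions: the `gold` table and `id` key come from the SQL above, while the function names, the `updates` view name, and the checkpoint path are illustrative. It also assumes each micro-batch carries at most one row per `id`; if not, deduplicate first, because MERGE rejects several source rows matching the same target row.

```python
MERGE_SQL = """
MERGE INTO gold g
USING updates u ON g.id = u.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

def upsert_to_gold(micro_batch_df, batch_id):
    """Batch function for foreachBatch: safe to call again for the same batch.

    Assumes at most one row per id in each micro-batch.
    """
    micro_batch_df.createOrReplaceTempView("updates")
    # Use the micro-batch's own session so the temp view is visible to the SQL.
    micro_batch_df.sparkSession.sql(MERGE_SQL)

def start_gold_stream(aggregated_stream_df, checkpoint_path):
    """Wire the batch function into a streaming write in update mode."""
    return (
        aggregated_stream_df.writeStream
        .foreachBatch(upsert_to_gold)
        .outputMode("update")
        .option("checkpointLocation", checkpoint_path)
        .start()
    )
```

Because the write is keyed on `id`, a micro-batch that is replayed after a failure leaves `gold` in the same state as a single successful run.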
Delta Idempotent Writes (txnAppId / txnVersion)
When you write to Delta inside foreachBatch, a retried batch could write the same data twice. Delta solves this with idempotent writes: tag each write with txnAppId (a stable application identifier) and txnVersion (a monotonically increasing number, typically the batchId passed to foreachBatch). Delta records the latest txnVersion committed for each txnAppId; if a retry presents a txnVersion less than or equal to that value, Delta skips the write, restoring exactly-once semantics.
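A foreachBatch function using these options might look like the sketch below. The `txnAppId` and `txnVersion` writer options are the ones described above; the application id string, the function name, and the two table names are hypothetical.

```python
APP_ID = "orders-gold-loader"  # hypothetical; must stay stable across restarts

def write_two_tables(micro_batch_df, batch_id):
    """Append one micro-batch to two Delta tables with idempotent-write options.

    Table names are placeholders. On a retry of the same batch_id, Delta is
    expected to skip whichever of the two writes already committed.
    """
    for table_name in ("gold.orders_detail", "gold.orders_audit"):
        (
            micro_batch_df.write.format("delta")
            .option("txnAppId", APP_ID)
            .option("txnVersion", batch_id)
            .mode("append")
            .saveAsTable(table_name)
        )

# Hypothetical wiring:
# stream_df.writeStream.foreachBatch(write_two_tables) \
#     .option("checkpointLocation", "/checkpoints/orders_gold").start()
```

One caveat worth remembering: batch ids restart from 0 when a stream is started against a fresh checkpoint, so in that situation a new `txnAppId` is needed or the early batches would be skipped as already committed.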
| Mechanism | Guarantee | Use |
|---|---|---|
| Checkpoint | exactly-once recovery of stream offsets | every stream |
| Auto Loader file tracking | each file ingested once | file ingestion |
| MERGE on key | idempotent upsert | foreachBatch / Gold |
| txnAppId + txnVersion | dedup Delta writes on retry | writes inside foreachBatch |
Together these patterns let a pipeline fail, retry, and be re-run without ever double-writing — the definition of a recoverable, idempotent production pipeline.
Delivery Guarantees You Must Know
The exam contrasts three delivery semantics. Pin them down:
| Guarantee | Meaning | Example |
|---|---|---|
| At-most-once | May lose data, never duplicates | fire-and-forget, rarely used |
| At-least-once | Never loses, may duplicate | foreachBatch, must add idempotency |
| Exactly-once | No loss, no duplicates | Delta sink + checkpoint, or MERGE/txn dedup |
Structured Streaming with a Delta sink and a checkpoint gives exactly-once end-to-end for that sink. The moment you step outside it — writing to an external system, or to multiple tables inside foreachBatch — you drop to at-least-once and must restore exactly-once yourself with keyed MERGE or txnAppId/txnVersion.
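The step from at-least-once delivery to exactly-once results can be illustrated with a small pure-Python model. This is a toy, not Delta's implementation: the `IdempotentSink` class and the delivery list are hypothetical, and the version check only mirrors the idea behind txnAppId/txnVersion.

```python
class IdempotentSink:
    """Toy sink that remembers the latest committed version per application id."""

    def __init__(self):
        self.rows = []
        self.latest_version = {}  # app_id -> highest committed version

    def write(self, app_id, version, batch):
        # Skip any version at or below what this app already committed.
        if version <= self.latest_version.get(app_id, -1):
            return False
        self.rows.extend(batch)
        self.latest_version[app_id] = version
        return True

# At-least-once delivery: batch 1 arrives twice (a retry after a failure).
deliveries = [(0, ["a", "b"]), (1, ["c"]), (1, ["c"]), (2, ["d"])]

sink = IdempotentSink()
outcomes = [sink.write("demo-app", version, rows) for version, rows in deliveries]

print(outcomes)   # [True, True, False, True]
print(sink.rows)  # ['a', 'b', 'c', 'd']
```

The delivery layer still sends batch 1 twice; the sink's bookkeeping is what turns that into a single effective write.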
Designing Backfills and Reruns
Reprocessing is a first-class scenario, not an emergency. A well-designed pipeline lets an operator rerun any day's load safely because every write is keyed:
- Backfills replay source data; the MERGE-on-key upsert ensures the second run overwrites rather than duplicates.
- Repair runs rerun failed tasks; idempotent writes make the partial first attempt harmless.
- Schema-stable keys (a natural business key, not a row-arrival timestamp) are what make MERGE deduplication reliable across reruns.
Idempotency is therefore a design property you build in from the medallion Bronze-to-Gold flow, not a patch you add after the first duplicate-row incident.
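The point about stable keys in the list above can be checked with a short pure-Python sketch. It replays the same source day twice and compares a business key with a key that includes the arrival timestamp. The order ids, timestamps, and helper names are made up for illustration.

```python
def upsert(table, rows, key_fn):
    """Upsert rows into a dict-backed table using the supplied key function."""
    for row in rows:
        table[key_fn(row)] = row
    return table

def load_day(arrival_ts):
    """Pretend to read the same source day; only the arrival stamp differs per run."""
    return [
        {"order_id": "A-1", "total": 40, "arrived_at": arrival_ts},
        {"order_id": "A-2", "total": 15, "arrived_at": arrival_ts},
    ]

first_run, rerun = load_day("2024-01-02T01:00"), load_day("2024-01-03T09:30")

by_business_key = {}
by_arrival = {}
for rows in (first_run, rerun):
    upsert(by_business_key, rows, lambda r: r["order_id"])
    upsert(by_arrival, rows, lambda r: (r["order_id"], r["arrived_at"]))

print(len(by_business_key))  # 2: the backfill overwrote the original rows
print(len(by_arrival))       # 4: the unstable key let duplicates in
```

Any column that changes between runs of the same source data, such as an ingestion timestamp, defeats the deduplication when it is part of the match condition.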
Exam pointers
Lock in: a checkpoint records stream offsets and progress and gives exactly-once recovery (one per stream, on storage without a lifecycle policy); Auto Loader tracks discovered files in its checkpoint (RocksDB) for exactly-once ingestion; foreachBatch is at-least-once, so the batch function must be idempotent via a MERGE on a business key; and Delta idempotent writes using txnAppId + txnVersion deduplicate retried writes. Together these make a pipeline safe to fail, retry, and reprocess without duplicates.
When in doubt, ask whether re-running the job would change the final table: if it would create duplicates, the write is not idempotent and needs a keyed MERGE or txnAppId/txnVersion guard before it is production-ready.
What does the checkpoint location provide for a Structured Streaming query?
Because foreachBatch provides only at-least-once delivery, what is the recommended way to write to a Gold table without creating duplicates on retry?
Which Delta feature makes a write inside foreachBatch skip data it has already committed when a batch is retried?
Which checkpoint practice is correct for production streaming?