4.6 Designing Idempotent and Recoverable Pipelines
Key Takeaways
- Idempotent pipelines produce the same result regardless of how many times they are executed, which is critical for production reliability.
- MERGE INTO is inherently idempotent for upserts — running it multiple times with the same data does not create duplicates.
- INSERT OVERWRITE is idempotent because it replaces all data with the same result each time, regardless of prior state.
- Checkpointing in Structured Streaming, combined with a transactional sink such as Delta Lake, provides end-to-end exactly-once guarantees and makes streaming pipelines recoverable.
- Design pipelines to handle partial failures gracefully using atomic transactions, idempotent operations, and proper error handling.
Designing Idempotent and Recoverable Pipelines
Quick Answer: Idempotent pipelines produce the same result when run multiple times. Use MERGE INTO for upserts, INSERT OVERWRITE for full refreshes, and checkpoints for streaming recovery. Avoid INSERT INTO without deduplication logic.
What Is Idempotency?
An operation is idempotent if executing it multiple times produces the same result as executing it once.
| Operation | Idempotent? | Why |
|---|---|---|
| MERGE INTO (upsert) | Yes | Updates existing rows, inserts new — rerunning has no additional effect |
| INSERT OVERWRITE | Yes | Replaces all data with the same result each time |
| CREATE OR REPLACE TABLE | Yes | Atomically replaces table — same result on rerun |
| INSERT INTO (append) | No | Creates duplicate rows on rerun |
| Auto Loader | Yes | Checkpoints track processed files — skips already-processed files |
| COPY INTO | Yes | Tracks processed files — skips duplicates |
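The contrast in the table can be made concrete with a minimal pure-Python sketch (no Spark involved; `upsert` and `append` are hypothetical helpers, not Databricks APIs): a write keyed on an identifier converges to the same state on rerun, while a blind append does not.

```python
def upsert(table: dict, rows: list) -> None:
    """Idempotent: rerunning with the same rows changes nothing."""
    for row in rows:
        table[row["order_id"]] = row  # keyed write overwrites in place

def append(table: list, rows: list) -> None:
    """Non-idempotent: every rerun adds the rows again."""
    table.extend(rows)

rows = [{"order_id": 1, "amount": 10}, {"order_id": 2, "amount": 20}]

keyed = {}
upsert(keyed, rows)
upsert(keyed, rows)   # rerun: still 2 rows
print(len(keyed))     # 2

blind = []
append(blind, rows)
append(blind, rows)   # rerun: duplicates
print(len(blind))     # 4
```

This is exactly why MERGE INTO (a keyed write) is rerun-safe while INSERT INTO (an append) is not.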
Idempotent Patterns
Pattern 1: MERGE INTO for Incremental Updates
```sql
-- Safe to rerun: updates existing, inserts new, no duplicates
MERGE INTO silver.orders t
USING bronze.raw_orders s ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
Pattern 2: INSERT OVERWRITE for Full Refresh
```sql
-- Safe to rerun: replaces everything with the same result
INSERT OVERWRITE gold.daily_summary
SELECT order_date, SUM(amount) AS total
FROM silver.orders
GROUP BY order_date;
```
Pattern 3: CREATE OR REPLACE TABLE
```sql
-- Safe to rerun: atomic table replacement
CREATE OR REPLACE TABLE gold.customer_summary AS
SELECT customer_id, COUNT(*) AS orders, SUM(amount) AS total
FROM silver.orders
GROUP BY customer_id;
```
Anti-Pattern: Non-Idempotent INSERT INTO
```sql
-- DANGEROUS: creates duplicates if rerun
INSERT INTO silver.orders
SELECT * FROM bronze.raw_orders;
-- First run:  1000 rows
-- Second run: 2000 rows (duplicates!)
```
Fix: Add Deduplication
```sql
-- Fix with MERGE (dedup on insert)
MERGE INTO silver.orders t
USING bronze.raw_orders s ON t.order_id = s.order_id
WHEN NOT MATCHED THEN INSERT *;
```
Recoverability
Streaming Recovery
```python
# Checkpoints enable exactly-once recovery
(stream_df.writeStream
    .option("checkpointLocation", "/checkpoints/my_stream")
    .trigger(availableNow=True)
    .toTable("silver.events")
)

# If the stream fails midway:
# - Restart the same code
# - The checkpoint tells Spark where it left off
# - Processing resumes from the last committed batch
```
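The resume-from-checkpoint behavior can be illustrated with a toy, non-Spark sketch (all names here are hypothetical): the writer records the last committed batch id in a checkpoint file, and a restart skips batches that were already committed.

```python
import json
import os
import tempfile

def process_batches(batches, checkpoint_path, sink, fail_after=None):
    """Toy micro-batch writer: commit a checkpoint after each batch."""
    last = -1
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            last = json.load(f)["last_batch"]
    for i, batch in enumerate(batches):
        if i <= last:
            continue  # already committed: skip on restart
        if fail_after is not None and i > fail_after:
            raise RuntimeError("simulated mid-stream failure")
        sink.extend(batch)  # "write" the batch to the sink
        with open(checkpoint_path, "w") as f:
            json.dump({"last_batch": i}, f)  # commit the checkpoint

ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
sink = []
batches = [[1, 2], [3, 4], [5, 6]]

try:
    process_batches(batches, ckpt, sink, fail_after=0)  # dies before batch 1
except RuntimeError:
    pass

process_batches(batches, ckpt, sink)  # restart: resumes at batch 1
print(sink)  # [1, 2, 3, 4, 5, 6] -- no duplicates
```

In real Structured Streaming, the write and the checkpoint commit are coordinated with the Delta transaction log, which is what closes the gap between "wrote the batch" and "recorded the batch" that this toy version leaves open.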
Batch Recovery
```python
# For batch pipelines, Delta Lake transactions provide atomicity
try:
    # This write is atomic: either all rows are committed or none
    df.write.mode("overwrite").saveAsTable("silver.daily_data")
except Exception as e:
    # The table remains in its previous state
    print(f"Write failed: {e}")
    # Safe to retry: overwrite is idempotent
```
Design Principles for Production Pipelines
| Principle | Implementation |
|---|---|
| Idempotent writes | Use MERGE INTO, INSERT OVERWRITE, or CRAS |
| Atomic transactions | Delta Lake ensures all-or-nothing writes |
| Checkpoint state | Streaming checkpoints for exactly-once guarantees |
| Retry safety | Operations that are safe to retry after failure |
| Incremental processing | Process only new/changed data to minimize blast radius |
| Data validation | Expectations and assertions catch issues before they propagate |
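The retry-safety principle in the table can be sketched as a small wrapper (hypothetical code, not a Databricks API): blind retries are only safe when the wrapped operation is idempotent, because a retry may re-execute work that partially succeeded.

```python
import time

def retry(fn, attempts=3, backoff_s=0.0):
    """Retry an operation; only safe if fn is idempotent."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the failure
            time.sleep(backoff_s)

# Simulate a transient failure: succeeds on the third attempt
calls = {"n": 0}
def flaky_overwrite():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "committed"

result = retry(flaky_overwrite)
print(result)  # committed
```

Wrapping an INSERT OVERWRITE or MERGE INTO in a retry like this is safe; wrapping a plain INSERT INTO is not, because each failed-then-retried run may append rows again.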
On the Exam: Know which operations are idempotent (MERGE, INSERT OVERWRITE, CRAS, Auto Loader, COPY INTO) and which are not (INSERT INTO without dedup). Understand that checkpoints enable exactly-once streaming recovery.
1. A data engineer's batch pipeline fails midway and needs to be restarted. The pipeline uses INSERT INTO to append data. What problem will occur on restart?
2. Which of the following write operations is NOT idempotent?
3. How does a Structured Streaming pipeline recover after a failure?