2.3 Writing Data to Delta Tables
Key Takeaways
- INSERT INTO appends data to an existing Delta table, while INSERT OVERWRITE replaces all data in the table (or partition).
- CREATE OR REPLACE TABLE (CRAS) atomically replaces the entire table definition and data in one operation.
- MERGE INTO performs upsert operations — inserting new rows and updating existing rows in a single atomic transaction.
- COPY INTO is a SQL command for idempotent file ingestion that tracks which files have already been loaded.
- Write modes in PySpark include append, overwrite, error/errorifexists (default), and ignore.
Writing Data to Delta Tables
Quick Answer: Use INSERT INTO for appending, INSERT OVERWRITE for replacing, MERGE INTO for upserts, COPY INTO for file ingestion, and CREATE OR REPLACE TABLE for atomic table replacement. Each has specific use cases and behavior differences.
INSERT INTO (Append)
```sql
-- Append new rows to an existing table
INSERT INTO my_catalog.my_schema.orders
VALUES (1001, 'Alice', 299.99, '2026-03-31');

-- Append query results
INSERT INTO my_catalog.my_schema.orders
SELECT * FROM staging.new_orders;
```

```python
# PySpark append
df.write.mode("append").saveAsTable("my_catalog.my_schema.orders")
```
- Adds new rows to the table
- Does NOT check for duplicates
- Schema must match the target table
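Because no duplicate check happens, re-running the same append simply adds the rows again. A plain-Python sketch of this behavior (the rows here are made up):

```python
# In-memory sketch: INSERT INTO appends rows as-is, with no
# duplicate check against the data already in the table.
table = [(1001, "Alice", 299.99)]
new_rows = [(1001, "Alice", 299.99), (1002, "Bob", 150.00)]

table.extend(new_rows)  # append: row 1001 now appears twice

print(len(table))  # 3 rows, including the duplicate
```

Deduplication, if needed, has to happen upstream or via MERGE INTO (covered below).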
INSERT OVERWRITE (Replace Data)
```sql
-- Replace ALL data in the table with new data
INSERT OVERWRITE my_catalog.my_schema.orders
SELECT * FROM staging.refreshed_orders;
```

```python
# PySpark overwrite
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.orders")
```
- Replaces all existing data with the new data
- The table schema is preserved (unlike CREATE OR REPLACE TABLE)
- Can overwrite specific partitions with dynamic partition overwrite
Dynamic Partition Overwrite
```sql
-- Only overwrite the partitions present in the new data
SET spark.sql.sources.partitionOverwriteMode = dynamic;

INSERT OVERWRITE my_catalog.my_schema.orders
SELECT * FROM staging.updated_orders;
```
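The difference between the default (static) mode and dynamic mode can be sketched in plain Python, modeling the table as a dict of partitions. Names here are illustrative, not Spark APIs:

```python
# Sketch: static overwrite replaces the whole table; dynamic overwrite
# replaces only the partitions that appear in the incoming data.
table = {"2026-03-30": ["old_a"], "2026-03-31": ["old_b"]}
incoming = {"2026-03-31": ["new_b"]}  # new data touches only one partition

def overwrite(table, incoming, mode):
    if mode == "static":
        return dict(incoming)   # every existing partition is replaced
    # dynamic: untouched partitions survive
    merged = dict(table)
    merged.update(incoming)
    return merged

print(overwrite(table, incoming, "static"))   # only 2026-03-31 remains
print(overwrite(table, incoming, "dynamic"))  # 2026-03-30 is preserved
```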
CREATE OR REPLACE TABLE (CRAS)
```sql
-- Atomically replace the entire table (schema + data)
CREATE OR REPLACE TABLE my_catalog.my_schema.orders AS
SELECT * FROM staging.new_data;
```
- Replaces both the schema and the data in one atomic operation
- If the table does not exist, it creates it
- Keeps the table history (previous versions accessible via time travel)
- Useful for full table refreshes where the schema might change
| Operation | Replaces Data? | Replaces Schema? | Creates Table? |
|---|---|---|---|
| INSERT INTO | No (appends) | No | No |
| INSERT OVERWRITE | Yes | No | No |
| CREATE OR REPLACE TABLE | Yes | Yes | Yes (if needed) |
MERGE INTO (Upsert)
MERGE INTO is the most powerful write operation — it combines INSERT, UPDATE, and DELETE in a single atomic transaction:
```sql
MERGE INTO my_catalog.my_schema.target AS t
USING my_catalog.my_schema.source AS s
ON t.id = s.id
WHEN MATCHED AND s.is_deleted = true THEN DELETE
WHEN MATCHED THEN UPDATE SET
  t.name = s.name,
  t.amount = s.amount,
  t.updated_at = current_timestamp()
WHEN NOT MATCHED THEN INSERT (id, name, amount, updated_at)
  VALUES (s.id, s.name, s.amount, current_timestamp());
```
MERGE Clauses
| Clause | When Used | Action |
|---|---|---|
| WHEN MATCHED | Source row matches target row (ON condition) | UPDATE or DELETE |
| WHEN NOT MATCHED | Source row has no match in target | INSERT |
| WHEN NOT MATCHED BY SOURCE | Target row has no match in source | UPDATE or DELETE |
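The clause logic above can be sketched in plain Python on dicts keyed by id. This models only the matching rules, not Delta's atomic transaction; the `delete_unmatched_by_source` flag stands in for a `WHEN NOT MATCHED BY SOURCE THEN DELETE` clause (that clause can also UPDATE):

```python
# Sketch of MERGE INTO clause semantics on {id: row} dicts.
def merge(target, source, delete_unmatched_by_source=False):
    result = {}
    for key, row in target.items():
        if key in source:
            result[key] = source[key]   # WHEN MATCHED: UPDATE
        elif not delete_unmatched_by_source:
            result[key] = row           # target-only row is kept
        # else: WHEN NOT MATCHED BY SOURCE: DELETE (row dropped)
    for key, row in source.items():
        if key not in target:
            result[key] = row           # WHEN NOT MATCHED: INSERT
    return result

target = {1: "old-1", 2: "old-2"}
source = {2: "new-2", 3: "new-3"}
print(merge(target, source))
# {1: 'old-1', 2: 'new-2', 3: 'new-3'}
print(merge(target, source, delete_unmatched_by_source=True))
# {2: 'new-2', 3: 'new-3'}
```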
Common MERGE Patterns
Simple Upsert (Insert or Update):

```sql
MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Deduplication on Insert:

```sql
-- Only inserts rows that don't already exist
MERGE INTO target t USING source s ON t.id = s.id
WHEN NOT MATCHED THEN INSERT *;
```

SCD Type 1 (Overwrite with latest):

```sql
MERGE INTO customers t USING updates s ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
On the Exam: MERGE INTO is heavily tested. Know all three clauses (MATCHED, NOT MATCHED, NOT MATCHED BY SOURCE) and common patterns like upserts, deduplication, and SCD Type 1.
COPY INTO (Idempotent File Ingestion)
```sql
COPY INTO my_catalog.my_schema.raw_events
FROM '/data/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('mergeSchema' = 'true', 'inferColumnTypes' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```
- SQL-based file ingestion command
- Idempotent: Files are tracked and not re-processed on subsequent runs
- Supports JSON, CSV, Parquet, Avro, ORC, text, and binary files
- Less powerful than Auto Loader but simpler for SQL-only environments
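The idempotency guarantee can be sketched in plain Python with a load log that records already-ingested files (file names here are made up):

```python
# Sketch of COPY INTO idempotency: a load log tracks which files have
# been ingested, so a re-run skips them instead of duplicating rows.
loaded_files = set()
table_rows = []

def copy_into(files):
    new_files = [f for f in files if f not in loaded_files]
    for f in new_files:
        table_rows.append(f"rows from {f}")  # pretend we parsed the file
        loaded_files.add(f)
    return len(new_files)

print(copy_into(["events_01.json", "events_02.json"]))  # 2 (both new)
print(copy_into(["events_01.json", "events_02.json", "events_03.json"]))  # 1 (only the new file)
```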
PySpark Write Modes
| Mode | Behavior |
|---|---|
| append | Add new rows to the table |
| overwrite | Replace all existing data |
| error / errorifexists | Fail if the table already exists (default) |
| ignore | Do nothing if the table already exists |
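The four modes can be sketched in plain Python against an in-memory "table"; `save` here is a stand-in for illustration, not the real DataFrameWriter API:

```python
# Sketch of DataFrameWriter save-mode semantics when the target exists.
def save(table_exists, existing, new, mode="errorifexists"):
    if mode == "append":
        return existing + new
    if mode == "overwrite":
        return new
    if mode == "ignore":
        return existing if table_exists else new
    if mode in ("error", "errorifexists"):  # the default mode
        if table_exists:
            raise ValueError("table already exists")
        return new
    raise ValueError(f"unknown mode: {mode}")

existing, new = [1, 2], [3]
print(save(True, existing, new, "append"))     # [1, 2, 3]
print(save(True, existing, new, "overwrite"))  # [3]
print(save(True, existing, new, "ignore"))     # [1, 2]
```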
Review Questions
- A data engineer needs to update existing records and insert new records from a source table into a target table in a single atomic operation. Which command should they use?
- What is the key difference between INSERT OVERWRITE and CREATE OR REPLACE TABLE?
- The COPY INTO command is run multiple times against the same source directory. What happens to files that were already processed in previous runs?
- Which MERGE INTO clause handles rows that exist in the target table but have no matching row in the source?