2.3 Writing Data to Delta Tables

Key Takeaways

  • INSERT INTO appends data to an existing Delta table, while INSERT OVERWRITE replaces all data in the table (or partition).
  • CREATE OR REPLACE TABLE (CRAS) atomically replaces the entire table definition and data in one operation.
  • MERGE INTO performs upsert operations — inserting new rows and updating existing rows in a single atomic transaction.
  • COPY INTO is a SQL command for idempotent file ingestion that tracks which files have already been loaded.
  • Write modes in PySpark include append, overwrite, error/errorifexists (default), and ignore.
Last updated: March 2026


Quick Answer: Use INSERT INTO for appending, INSERT OVERWRITE for replacing, MERGE INTO for upserts, COPY INTO for file ingestion, and CREATE OR REPLACE TABLE for atomic table replacement. Each has specific use cases and behavior differences.

INSERT INTO (Append)

-- Append new rows to an existing table
INSERT INTO my_catalog.my_schema.orders
VALUES (1001, 'Alice', 299.99, '2026-03-31');

-- Append query results
INSERT INTO my_catalog.my_schema.orders
SELECT * FROM staging.new_orders;
# PySpark append
df.write.mode("append").saveAsTable("my_catalog.my_schema.orders")
  • Adds new rows to the table
  • Does NOT check for duplicates
  • Schema must match the target table
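Since INSERT INTO performs no duplicate check, re-running the same statement doubles the data. A tiny plain-Python sketch (not Spark code) of that behavior:

```python
# INSERT INTO appends blindly: re-running the same insert duplicates rows.
# Toy model of a table as a list of (order_id, customer) tuples.
table = [(1001, "Alice")]
incoming = [(1001, "Alice")]  # same row arrives again

table += incoming  # append with no duplicate check, unlike MERGE
print(len(table))  # 2 -- the same row now appears twice
```

Deduplication on write requires MERGE INTO (covered below), not INSERT INTO.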

INSERT OVERWRITE (Replace Data)

-- Replace ALL data in the table with new data
INSERT OVERWRITE my_catalog.my_schema.orders
SELECT * FROM staging.refreshed_orders;
# PySpark overwrite
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.orders")
  • Replaces all existing data with the new data
  • The table schema is preserved (unlike CREATE OR REPLACE TABLE)
  • Can overwrite specific partitions with dynamic partition overwrite

Dynamic Partition Overwrite

-- Only overwrite partitions present in the new data
SET spark.sql.sources.partitionOverwriteMode = dynamic;

INSERT OVERWRITE my_catalog.my_schema.orders
SELECT * FROM staging.updated_orders;
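To see what the `dynamic` setting changes, here is a hypothetical plain-Python model (dicts mapping partition key to rows, not actual Spark internals) contrasting the two overwrite modes:

```python
# Toy model: a partitioned table as {partition_key: [rows]}.

def static_overwrite(table, new_data):
    """Static mode (default): the entire table is replaced by the new data."""
    return dict(new_data)

def dynamic_overwrite(table, new_data):
    """Dynamic mode: only partitions present in new_data are replaced;
    every other partition is left untouched."""
    result = dict(table)
    result.update(new_data)
    return result

table = {"2026-03-30": ["order-1"], "2026-03-31": ["order-2", "order-3"]}
updates = {"2026-03-31": ["order-2-fixed"]}

print(static_overwrite(table, updates))   # only the 2026-03-31 partition survives
print(dynamic_overwrite(table, updates))  # 2026-03-30 kept, 2026-03-31 replaced
```

With `dynamic`, a daily refresh job that only produces yesterday's partition will not wipe out the rest of the table.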

CREATE OR REPLACE TABLE (CRAS)

-- Atomically replace the entire table (schema + data)
CREATE OR REPLACE TABLE my_catalog.my_schema.orders AS
SELECT * FROM staging.new_data;
  • Replaces both the schema and the data in one atomic operation
  • If the table does not exist, it creates it
  • Keeps the table history (previous versions accessible via time travel)
  • Useful for full table refreshes where the schema might change
Operation               | Replaces Data? | Replaces Schema? | Creates Table?
------------------------|----------------|------------------|-----------------
INSERT INTO             | No (appends)   | No               | No
INSERT OVERWRITE        | Yes            | No               | No
CREATE OR REPLACE TABLE | Yes            | Yes              | Yes (if needed)

MERGE INTO (Upsert)

MERGE INTO is the most powerful write operation — it combines INSERT, UPDATE, and DELETE in a single atomic transaction:

MERGE INTO my_catalog.my_schema.target AS t
USING my_catalog.my_schema.source AS s
ON t.id = s.id
WHEN MATCHED AND s.is_deleted = true THEN DELETE
WHEN MATCHED THEN UPDATE SET
    t.name = s.name,
    t.amount = s.amount,
    t.updated_at = current_timestamp()
WHEN NOT MATCHED THEN INSERT (id, name, amount, updated_at)
VALUES (s.id, s.name, s.amount, current_timestamp());

MERGE Clauses

Clause                     | When Used                                     | Action
---------------------------|-----------------------------------------------|------------------
WHEN MATCHED               | Source row matches target row (ON condition)  | UPDATE or DELETE
WHEN NOT MATCHED           | Source row has no match in target             | INSERT
WHEN NOT MATCHED BY SOURCE | Target row has no match in source             | UPDATE or DELETE
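The three clauses can be modeled with plain Python over id-keyed dicts. This is an illustrative sketch of the semantics, not Delta's implementation (the real MERGE runs as one atomic transaction):

```python
# Toy model of MERGE INTO over tables represented as {id: row} dicts.

def merge(target, source, delete_missing=False):
    """Upsert source into target, keyed by id.

    - WHEN MATCHED                 -> target row is updated from source
    - WHEN NOT MATCHED             -> source row is inserted
    - WHEN NOT MATCHED BY SOURCE   -> (delete_missing=True) target rows
      with no matching source row are deleted
    """
    if delete_missing:
        # keep only target rows that also appear in the source
        result = {k: v for k, v in target.items() if k in source}
    else:
        result = dict(target)   # unmatched target rows are kept
    result.update(source)       # matched -> update, not matched -> insert
    return result

target = {1: "Alice", 2: "Bob"}
source = {2: "Bobby", 3: "Cara"}

print(merge(target, source))
# {1: 'Alice', 2: 'Bobby', 3: 'Cara'}  -- plain upsert
print(merge(target, source, delete_missing=True))
# {2: 'Bobby', 3: 'Cara'}  -- row 1 removed via NOT MATCHED BY SOURCE
```

The `delete_missing=True` path corresponds to a `WHEN NOT MATCHED BY SOURCE THEN DELETE` clause, useful when the source is a full snapshot and the target must mirror it exactly.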

Common MERGE Patterns

Simple Upsert (Insert or Update):

MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

Deduplication on Insert:

MERGE INTO target t USING source s ON t.id = s.id
WHEN NOT MATCHED THEN INSERT *;
-- Only inserts rows that don't already exist

SCD Type 1 (Overwrite with latest):

MERGE INTO customers t USING updates s ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

On the Exam: MERGE INTO is heavily tested. Know all three clauses (MATCHED, NOT MATCHED, NOT MATCHED BY SOURCE) and common patterns like upserts, deduplication, and SCD Type 1.

COPY INTO (Idempotent File Ingestion)

COPY INTO my_catalog.my_schema.raw_events
FROM '/data/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('mergeSchema' = 'true', 'inferColumnTypes' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
  • SQL-based file ingestion command
  • Idempotent: Files are tracked and not re-processed on subsequent runs
  • Supports JSON, CSV, Parquet, Avro, ORC, text, and binary files
  • Less powerful than Auto Loader but simpler for SQL-only environments
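The idempotency guarantee comes from tracking already-loaded files per target table. A minimal sketch of that mechanism, with hypothetical names (not the actual Databricks implementation):

```python
# Sketch of COPY INTO's file tracking: files loaded once are skipped on re-runs.

class Table:
    def __init__(self):
        self.rows = []
        self._loaded_files = set()   # load history kept per target table

    def copy_into(self, files):
        """Ingest only files not seen before; return how many were loaded."""
        new_files = [f for f in files if f not in self._loaded_files]
        for f in new_files:
            self.rows.append(f"data from {f}")
            self._loaded_files.add(f)
        return len(new_files)

t = Table()
print(t.copy_into(["a.json", "b.json"]))            # 2 -- first run loads both
print(t.copy_into(["a.json", "b.json", "c.json"]))  # 1 -- only c.json is new
```

Because re-runs are no-ops for already-processed files, COPY INTO can safely be scheduled repeatedly against the same directory.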

PySpark Write Modes

Mode                  | Behavior
----------------------|--------------------------------------------
append                | Add new rows to the table
overwrite             | Replace all existing data
error / errorifexists | Fail if the table already exists (default)
ignore                | Do nothing if the table already exists
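The four modes can be summarized in a toy helper (a hypothetical plain-Python model, where an existing table is a list of rows and a missing table is `None`, not Spark itself):

```python
# Toy model of PySpark save modes against a possibly-existing table.

def save(existing, new_rows, mode="error"):
    exists = existing is not None
    if mode == "append":
        return (existing or []) + new_rows      # add rows, keep old data
    if mode == "overwrite":
        return list(new_rows)                   # replace everything
    if mode in ("error", "errorifexists"):      # the default mode
        if exists:
            raise ValueError("table already exists")
        return list(new_rows)
    if mode == "ignore":
        return existing if exists else list(new_rows)  # silent no-op if present
    raise ValueError(f"unknown mode: {mode}")

table = [1, 2]
print(save(table, [3], mode="append"))     # [1, 2, 3]
print(save(table, [3], mode="overwrite"))  # [3]
print(save(table, [3], mode="ignore"))     # [1, 2] -- untouched
```

Note that `error`/`errorifexists` being the default is a common exam trap: a bare `df.write.saveAsTable(...)` fails against an existing table.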
Test Your Knowledge

A data engineer needs to update existing records and insert new records from a source table into a target table in a single atomic operation. Which command should they use?

Test Your Knowledge

What is the key difference between INSERT OVERWRITE and CREATE OR REPLACE TABLE?

Test Your Knowledge

The COPY INTO command is run multiple times against the same source directory. What happens to files that were already processed in previous runs?

Test Your Knowledge

Which MERGE INTO clause handles rows that exist in the target table but have no matching row in the source?
