2.3 Writing Data to Delta Tables
Key Takeaways
- INSERT INTO appends data to an existing Delta table, while INSERT OVERWRITE replaces all data in the table (or partition).
- CREATE OR REPLACE TABLE (CRAS) atomically replaces the entire table definition and data in one operation.
- MERGE INTO performs upsert operations — inserting new rows and updating existing rows in a single atomic transaction.
- COPY INTO is a SQL command for idempotent file ingestion that tracks which files have already been loaded.
- Write modes in PySpark include append, overwrite, error/errorifexists (default), and ignore.
Writing Data to Delta Tables
Quick Answer: Use INSERT INTO for appending, INSERT OVERWRITE for replacing, MERGE INTO for upserts, COPY INTO for file ingestion, and CREATE OR REPLACE TABLE for atomic table replacement. Each has specific use cases and behavior differences.
INSERT INTO (Append)
```sql
-- Append new rows to an existing table
INSERT INTO my_catalog.my_schema.orders
VALUES (1001, 'Alice', 299.99, '2026-03-31');

-- Append query results
INSERT INTO my_catalog.my_schema.orders
SELECT * FROM staging.new_orders;
```

```python
# PySpark append
df.write.mode("append").saveAsTable("my_catalog.my_schema.orders")
```
- Adds new rows to the table
- Does NOT check for duplicates
- Schema must match the target table
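Because no duplicate check happens, re-running the same append simply adds the rows again. A plain-Python sketch of this behavior (the rows here are made up):

```python
# In-memory sketch: INSERT INTO appends rows as-is, with no
# duplicate check against the data already in the table.
table = [(1001, "Alice", 299.99)]
new_rows = [(1001, "Alice", 299.99), (1002, "Bob", 150.00)]

table.extend(new_rows)  # append: row 1001 now appears twice

print(len(table))  # 3 rows, including the duplicate
```

Deduplication, if needed, has to happen upstream or via MERGE INTO (covered below).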
INSERT OVERWRITE (Replace Data)
```sql
-- Replace ALL data in the table with new data
INSERT OVERWRITE my_catalog.my_schema.orders
SELECT * FROM staging.refreshed_orders;
```

```python
# PySpark overwrite
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.orders")
```
- Replaces all existing data with the new data
- The table schema is preserved (unlike CREATE OR REPLACE TABLE)
- Can overwrite specific partitions with dynamic partition overwrite
Dynamic Partition Overwrite
```sql
-- Only overwrite the partitions present in the new data
SET spark.sql.sources.partitionOverwriteMode = dynamic;

INSERT OVERWRITE my_catalog.my_schema.orders
SELECT * FROM staging.updated_orders;
```
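The difference between the default (static) mode and dynamic mode can be sketched in plain Python, modeling the table as a dict of partitions. Names here are illustrative, not Spark APIs:

```python
# Sketch: static overwrite replaces the whole table; dynamic overwrite
# replaces only the partitions that appear in the incoming data.
table = {"2026-03-30": ["old_a"], "2026-03-31": ["old_b"]}
incoming = {"2026-03-31": ["new_b"]}  # new data touches only one partition

def overwrite(table, incoming, mode):
    if mode == "static":
        return dict(incoming)   # every existing partition is replaced
    # dynamic: untouched partitions survive
    merged = dict(table)
    merged.update(incoming)
    return merged

print(overwrite(table, incoming, "static"))   # only 2026-03-31 remains
print(overwrite(table, incoming, "dynamic"))  # 2026-03-30 is preserved
```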
CREATE OR REPLACE TABLE (CRAS)
```sql
-- Atomically replace the entire table (schema + data)
CREATE OR REPLACE TABLE my_catalog.my_schema.orders AS
SELECT * FROM staging.new_data;
```
- Replaces both the schema and the data in one atomic operation
- If the table does not exist, it creates it
- Keeps the table history (previous versions accessible via time travel)
- Useful for full table refreshes where the schema might change
| Operation | Replaces Data? | Replaces Schema? | Creates Table? |
|---|---|---|---|
| INSERT INTO | No (appends) | No | No |
| INSERT OVERWRITE | Yes | No | No |
| CREATE OR REPLACE TABLE | Yes | Yes | Yes (if needed) |
MERGE INTO (Upsert)
MERGE INTO is the most powerful write operation — it combines INSERT, UPDATE, and DELETE in a single atomic transaction:
```sql
MERGE INTO my_catalog.my_schema.target AS t
USING my_catalog.my_schema.source AS s
ON t.id = s.id
WHEN MATCHED AND s.is_deleted = true THEN DELETE
WHEN MATCHED THEN UPDATE SET
  t.name = s.name,
  t.amount = s.amount,
  t.updated_at = current_timestamp()
WHEN NOT MATCHED THEN INSERT (id, name, amount, updated_at)
  VALUES (s.id, s.name, s.amount, current_timestamp());
```
MERGE Clauses
| Clause | When Used | Action |
|---|---|---|
| WHEN MATCHED | Source row matches target row (ON condition) | UPDATE or DELETE |
| WHEN NOT MATCHED | Source row has no match in target | INSERT |
| WHEN NOT MATCHED BY SOURCE | Target row has no match in source | UPDATE or DELETE |
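The clause logic above can be sketched in plain Python on dicts keyed by id. This models only the matching rules, not Delta's atomic transaction; the `delete_unmatched_by_source` flag stands in for a `WHEN NOT MATCHED BY SOURCE THEN DELETE` clause (that clause can also UPDATE):

```python
# Sketch of MERGE INTO clause semantics on {id: row} dicts.
def merge(target, source, delete_unmatched_by_source=False):
    result = {}
    for key, row in target.items():
        if key in source:
            result[key] = source[key]   # WHEN MATCHED: UPDATE
        elif not delete_unmatched_by_source:
            result[key] = row           # target-only row is kept
        # else: WHEN NOT MATCHED BY SOURCE: DELETE (row dropped)
    for key, row in source.items():
        if key not in target:
            result[key] = row           # WHEN NOT MATCHED: INSERT
    return result

target = {1: "old-1", 2: "old-2"}
source = {2: "new-2", 3: "new-3"}
print(merge(target, source))
# {1: 'old-1', 2: 'new-2', 3: 'new-3'}
print(merge(target, source, delete_unmatched_by_source=True))
# {2: 'new-2', 3: 'new-3'}
```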
Common MERGE Patterns
Simple Upsert (Insert or Update):

```sql
MERGE INTO target t USING source s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

Deduplication on Insert:

```sql
-- Only inserts rows that don't already exist
MERGE INTO target t USING source s ON t.id = s.id
WHEN NOT MATCHED THEN INSERT *;
```

SCD Type 1 (Overwrite with latest):

```sql
MERGE INTO customers t USING updates s ON t.customer_id = s.customer_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```
On the Exam: MERGE INTO is heavily tested. Know all three clauses (MATCHED, NOT MATCHED, NOT MATCHED BY SOURCE) and common patterns like upserts, deduplication, and SCD Type 1.
COPY INTO (Idempotent File Ingestion)
```sql
COPY INTO my_catalog.my_schema.raw_events
FROM '/data/raw/events/'
FILEFORMAT = JSON
FORMAT_OPTIONS ('mergeSchema' = 'true', 'inferColumnTypes' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');
```
- SQL-based file ingestion command
- Idempotent: Files are tracked and not re-processed on subsequent runs
- Supports JSON, CSV, Parquet, Avro, ORC, text, and binary files
- Less powerful than Auto Loader but simpler for SQL-only environments
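The idempotency guarantee can be sketched in plain Python with a load log that records already-ingested files (file names here are made up):

```python
# Sketch of COPY INTO idempotency: a load log tracks which files have
# been ingested, so a re-run skips them instead of duplicating rows.
loaded_files = set()
table_rows = []

def copy_into(files):
    new_files = [f for f in files if f not in loaded_files]
    for f in new_files:
        table_rows.append(f"rows from {f}")  # pretend we parsed the file
        loaded_files.add(f)
    return len(new_files)

print(copy_into(["events_01.json", "events_02.json"]))  # 2 (both new)
print(copy_into(["events_01.json", "events_02.json", "events_03.json"]))  # 1 (only the new file)
```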
PySpark Write Modes
| Mode | Behavior |
|---|---|
| append | Add new rows to the table |
| overwrite | Replace all existing data |
| error / errorifexists | Fail if the table already exists (default) |
| ignore | Do nothing if the table already exists |
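The four modes can be sketched in plain Python against an in-memory "table"; `save` here is a stand-in for illustration, not the real DataFrameWriter API:

```python
# Sketch of DataFrameWriter save-mode semantics when the target exists.
def save(table_exists, existing, new, mode="errorifexists"):
    if mode == "append":
        return existing + new
    if mode == "overwrite":
        return new
    if mode == "ignore":
        return existing if table_exists else new
    if mode in ("error", "errorifexists"):  # the default mode
        if table_exists:
            raise ValueError("table already exists")
        return new
    raise ValueError(f"unknown mode: {mode}")

existing, new = [1, 2], [3]
print(save(True, existing, new, "append"))     # [1, 2, 3]
print(save(True, existing, new, "overwrite"))  # [3]
print(save(True, existing, new, "ignore"))     # [1, 2]
```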
Review Questions
- A data engineer needs to update existing records and insert new records from a source table into a target table in a single atomic operation. Which command should they use?
- What is the key difference between INSERT OVERWRITE and CREATE OR REPLACE TABLE?
- The COPY INTO command is run multiple times against the same source directory. What happens to files that were already processed in previous runs?
- Which MERGE INTO clause handles rows that exist in the target table but have no matching row in the source?