2.3 Writing Data to Delta Tables

Key Takeaways

DataFrameWriter.saveAsTable() registers a table in the catalog/metastore, while save(path) only writes files to a location without a catalog entry.
Save modes are append, overwrite, ignore, and errorIfExists (the default); overwrite replaces all data, append adds rows.
MERGE INTO performs upserts in one atomic operation using WHEN MATCHED / WHEN NOT MATCHED clauses, ideal for change-data-capture and deduplication.
INSERT OVERWRITE rewrites a table's data while preserving its schema/identity, whereas CREATE OR REPLACE TABLE replaces the table object entirely.
Delta writes are ACID: concurrent readers always see a consistent snapshot and partial writes never become visible.

Last updated: June 2026

saveAsTable vs save

The DataFrameWriter (returned by df.write) persists a DataFrame. Two terminal methods differ in whether the catalog learns about the table:

Method	Result
`.saveAsTable("cat.schema.tbl")`	Writes data and registers a table in the metastore/Unity Catalog so it is queryable by name
`.save("/path")`	Writes Delta files to a path only; no catalog entry, query via `delta.\`/path``

(df.write.format("delta")
  .mode("append")
  .saveAsTable("main.silver.orders"))

Use saveAsTable for governed, name-addressable tables (the normal case on Databricks). Use save only when you deliberately want path-based files outside the catalog. Delta is the default provider for saveAsTable on Databricks, so .format("delta") is often optional.

Save Modes

The save mode controls what happens when the target already has data:

Mode	Behavior
`append`	Add new rows to existing data
`overwrite`	Replace all existing data (schema kept unless overwriteSchema set)
`ignore`	Silently do nothing if the target exists
`errorIfExists`	Default — throw an error if the target exists

df.write.mode("overwrite").saveAsTable("main.bronze.daily")

In SQL the parallel statements are INSERT INTO (append) and INSERT OVERWRITE (replace data, preserve the table object and schema). A related but different statement is CREATE OR REPLACE TABLE (often abbreviated CRAS when paired with a SELECT), which atomically replaces the entire table definition. Choosing INSERT OVERWRITE over CREATE OR REPLACE preserves the table's history and identity while swapping its contents — important for downstream readers and time travel.

MERGE INTO for Upserts

MERGE INTO is the workhorse for upserts (update-or-insert), CDC, and deduplication. It matches a target table against a source on a condition and applies different actions per match outcome — all in a single atomic transaction:

MERGE INTO main.silver.customers AS t
USING updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

WHEN MATCHED — rows present in both: usually UPDATE SET or DELETE.
WHEN NOT MATCHED — rows only in the source: INSERT.
WHEN NOT MATCHED BY SOURCE — rows only in the target: useful for deletes in full-refresh syncs.

MERGE is atomic and idempotent when the join key is unique, making it the standard pattern for incrementally applying changes to a silver/gold table. Because Delta is ACID, concurrent readers never see a half-applied merge — they read the snapshot before or after the commit, never in between, and a failed write leaves no partial files visible.

Schema Enforcement, Evolution, and Partitioning

Delta enforces schema-on-write: by default a write whose columns or types do not match the table is rejected, preventing silent corruption. When you genuinely intend to change the shape, opt in explicitly:

Need	How
Add new columns on append	`.option("mergeSchema", "true")`
Replace schema on overwrite	`.option("overwriteSchema", "true")`
Append-only protection	`TBLPROPERTIES (delta.appendOnly = true)`

Without mergeSchema, appending a DataFrame that has an extra column raises an AnalysisException — a guardrail the exam expects you to recognize.

partitionBy("date")(SQL:PARTITIONED BY), which lays data out in per-value directories so that filtering on the partition column prunes whole directories. Over-partitioning on a high-cardinality column creates many tiny files and hurts performance, so partition on low-cardinality, frequently-filtered columns only; modern Delta often prefers **liquid clustering** or Z-ordering over manual partitioning. foreachBatch() lets you run arbitrary batch logic — including a MERGE — on each micro-batch, which is the canonical way to do streaming upserts into a Delta table.

Whichever method you use, every Delta write produces a new atomic commit in the _delta_log, enabling time travel back to any prior version with VERSION AS OF.

CTAS, INSERT, and the Transaction Log

In SQL, the write paths mirror the DataFrame API. CREATE TABLE AS SELECT (CTAS) creates and populates a table from a query in one atomic statement, inferring the schema from the query result. INSERT INTO target SELECT ... appends, and INSERT OVERWRITE target SELECT ... replaces the data. A subtle rule: INSERT INTO matches columns by position, so column order in the SELECT matters, whereas a DataFrame saveAsTable matches by name — a common source of silent data-misplacement bugs the exam may probe.

Every successful write appends a JSON commit to the Delta transaction log (_delta_log), which records the added and removed files for that version. This log is what delivers Delta's headline features:

Feature	Enabled by
ACID transactions	Atomic log commits
Time travel	Versioned log history (`VERSION AS OF` / `TIMESTAMP AS OF`)
Audit history	`DESCRIBE HISTORY`
Concurrent safety	Optimistic concurrency on commits

Because readers resolve the table by replaying the log up to the latest committed version, a writer can add files while readers continue on the prior snapshot, and a failed write simply never commits — leaving orphaned files that VACUUM later cleans up. This log-centric design is why Delta is the default sink for every write pattern on Databricks, from a one-off CTAS to a continuous streaming MERGE.

Test Your Knowledge

What is the difference between df.write.saveAsTable("sales") and df.write.save("/mnt/sales")?

saveAsTable registers the table in the catalog; save only writes files to a path with no catalog entry

saveAsTable writes CSV while save writes Delta

save is transactional but saveAsTable is not

There is no difference; they are aliases

Test Your Knowledge

Which save mode is the default when none is specified, throwing an error if the target already exists?

append

overwrite

ignore

errorIfExists

Test Your Knowledge

A pipeline must apply incoming changes by updating existing customer rows and inserting brand-new ones in a single atomic step. Which statement fits best?

INSERT INTO customers

CREATE OR REPLACE TABLE customers AS SELECT * FROM updates

MERGE INTO customers USING updates ON ... WHEN MATCHED THEN UPDATE WHEN NOT MATCHED THEN INSERT

COPY INTO customers

Test Your Knowledge

Why does INSERT OVERWRITE often beat CREATE OR REPLACE TABLE when refreshing a table's data?

INSERT OVERWRITE deletes the underlying files permanently

INSERT OVERWRITE replaces the data while preserving the table object, its schema, and its history

CREATE OR REPLACE TABLE cannot be used with a SELECT

INSERT OVERWRITE bypasses the transaction log for speed

Up Next

2.4 Spark SQL Query Fundamentals

Continue learning

Databricks Certified Data Engineer Associate

Databricks Certified Data Engineer Associate

2.3 Writing Data to Delta Tables

Key Takeaways

saveAsTable vs save

Save Modes

MERGE INTO for Upserts

Schema Enforcement, Evolution, and Partitioning

CTAS, INSERT, and the Transaction Log

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

Databricks Certified Data Engineer Associate

2.3 Writing Data to Delta Tables

Key Takeaways

saveAsTable vs save

Save Modes

MERGE INTO for Upserts

Schema Enforcement, Evolution, and Partitioning

CTAS, INSERT, and the Transaction Log