2.3 Writing Data to Delta Tables
Key Takeaways
- DataFrameWriter.saveAsTable() registers a table in the catalog/metastore, while save(path) only writes files to a location without a catalog entry.
- Save modes are append, overwrite, ignore, and errorIfExists (the default); overwrite replaces all data, append adds rows.
- MERGE INTO performs upserts in one atomic operation using WHEN MATCHED / WHEN NOT MATCHED clauses, ideal for change-data-capture and deduplication.
- INSERT OVERWRITE rewrites a table's data while preserving its schema/identity, whereas CREATE OR REPLACE TABLE replaces the table object entirely.
- Delta writes are ACID: concurrent readers always see a consistent snapshot and partial writes never become visible.
saveAsTable vs save
The DataFrameWriter (returned by df.write) persists a DataFrame. Two terminal methods differ in whether the catalog learns about the table:
| Method | Result |
|---|---|
.saveAsTable("cat.schema.tbl") | Writes data and registers a table in the metastore/Unity Catalog so it is queryable by name |
.save("/path") | Writes Delta files to a path only; no catalog entry, query via delta.\/path`` |
(df.write.format("delta")
.mode("append")
.saveAsTable("main.silver.orders"))
Use saveAsTable for governed, name-addressable tables (the normal case on Databricks). Use save only when you deliberately want path-based files outside the catalog. Delta is the default provider for saveAsTable on Databricks, so .format("delta") is often optional.
Save Modes
The save mode controls what happens when the target already has data:
| Mode | Behavior |
|---|---|
append | Add new rows to existing data |
overwrite | Replace all existing data (schema kept unless overwriteSchema set) |
ignore | Silently do nothing if the target exists |
errorIfExists | Default — throw an error if the target exists |
df.write.mode("overwrite").saveAsTable("main.bronze.daily")
In SQL the parallel statements are INSERT INTO (append) and INSERT OVERWRITE (replace data, preserve the table object and schema). A related but different statement is CREATE OR REPLACE TABLE (often abbreviated CRAS when paired with a SELECT), which atomically replaces the entire table definition. Choosing INSERT OVERWRITE over CREATE OR REPLACE preserves the table's history and identity while swapping its contents — important for downstream readers and time travel.
MERGE INTO for Upserts
MERGE INTO is the workhorse for upserts (update-or-insert), CDC, and deduplication. It matches a target table against a source on a condition and applies different actions per match outcome — all in a single atomic transaction:
MERGE INTO main.silver.customers AS t
USING updates AS s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
WHEN MATCHED— rows present in both: usuallyUPDATE SETorDELETE.WHEN NOT MATCHED— rows only in the source:INSERT.WHEN NOT MATCHED BY SOURCE— rows only in the target: useful for deletes in full-refresh syncs.
MERGE is atomic and idempotent when the join key is unique, making it the standard pattern for incrementally applying changes to a silver/gold table. Because Delta is ACID, concurrent readers never see a half-applied merge — they read the snapshot before or after the commit, never in between, and a failed write leaves no partial files visible.
Schema Enforcement, Evolution, and Partitioning
Delta enforces schema-on-write: by default a write whose columns or types do not match the table is rejected, preventing silent corruption. When you genuinely intend to change the shape, opt in explicitly:
| Need | How |
|---|---|
| Add new columns on append | .option("mergeSchema", "true") |
| Replace schema on overwrite | .option("overwriteSchema", "true") |
| Append-only protection | TBLPROPERTIES (delta.appendOnly = true) |
Without mergeSchema, appending a DataFrame that has an extra column raises an AnalysisException — a guardrail the exam expects you to recognize.
partitionBy("date")(SQL:PARTITIONED BY), which lays data out in per-value directories so that filtering on the partition column prunes whole directories. Over-partitioning on a high-cardinality column creates many tiny files and hurts performance, so partition on low-cardinality, frequently-filtered columns only; modern Delta often prefers **liquid clustering** or Z-ordering over manual partitioning. foreachBatch() lets you run arbitrary batch logic — including a MERGE — on each micro-batch, which is the canonical way to do streaming upserts into a Delta table.
Whichever method you use, every Delta write produces a new atomic commit in the _delta_log, enabling time travel back to any prior version with VERSION AS OF.
CTAS, INSERT, and the Transaction Log
In SQL, the write paths mirror the DataFrame API. CREATE TABLE AS SELECT (CTAS) creates and populates a table from a query in one atomic statement, inferring the schema from the query result. INSERT INTO target SELECT ... appends, and INSERT OVERWRITE target SELECT ... replaces the data. A subtle rule: INSERT INTO matches columns by position, so column order in the SELECT matters, whereas a DataFrame saveAsTable matches by name — a common source of silent data-misplacement bugs the exam may probe.
Every successful write appends a JSON commit to the Delta transaction log (_delta_log), which records the added and removed files for that version. This log is what delivers Delta's headline features:
| Feature | Enabled by |
|---|---|
| ACID transactions | Atomic log commits |
| Time travel | Versioned log history (VERSION AS OF / TIMESTAMP AS OF) |
| Audit history | DESCRIBE HISTORY |
| Concurrent safety | Optimistic concurrency on commits |
Because readers resolve the table by replaying the log up to the latest committed version, a writer can add files while readers continue on the prior snapshot, and a failed write simply never commits — leaving orphaned files that VACUUM later cleans up. This log-centric design is why Delta is the default sink for every write pattern on Databricks, from a one-off CTAS to a continuous streaming MERGE.
What is the difference between df.write.saveAsTable("sales") and df.write.save("/mnt/sales")?
Which save mode is the default when none is specified, throwing an error if the target already exists?
A pipeline must apply incoming changes by updating existing customer rows and inserting brand-new ones in a single atomic step. Which statement fits best?
Why does INSERT OVERWRITE often beat CREATE OR REPLACE TABLE when refreshing a table's data?