1.3 Delta Lake Fundamentals

Key Takeaways

A Delta table is Parquet data files plus a _delta_log directory of JSON commits that provide ACID transactions.
Each transaction adds an ordered JSON commit recording add/remove file actions; every 10 commits a Parquet checkpoint speeds log reads.
Time travel lets you query prior versions by VERSION AS OF or TIMESTAMP AS OF, useful for audits and rollbacks.
Schema enforcement rejects writes that do not match the table schema; mergeSchema / overwriteSchema permit intentional evolution.
OPTIMIZE compacts small files (with ZORDER for data skipping), VACUUM removes unreferenced files, and MERGE performs upserts.

Last updated: June 2026

Anatomy of a Delta Table

A Delta table is just two things in object storage: the actual data as Parquet files, and a _delta_log/ directory — the transaction log — that records every change. The log is what upgrades plain files into a reliable, ACID-compliant table.

Each write creates an ordered commit file, 00000000000000000000.json, ...0001.json, and so on. A commit does not rewrite data; it records actions: add (a new data file is part of the table), remove (a file is logically deleted), plus metadata and protocol entries. The current table state is the replay of all commits in order. Because readers determine which files belong to a version from the log — not from listing the directory — Delta gives:

Atomicity: a commit is all-or-nothing; partial writes never appear.
Consistency & Isolation: readers see a consistent snapshot; concurrent writers use optimistic concurrency and a commit conflict aborts the loser.
Durability: committed data persists in cloud storage.

To avoid replaying thousands of JSONs, Delta writes a checkpoint (a Parquet summary of state) every 10 commits, so readers load the latest checkpoint plus any newer JSON commits.

Time Travel

Because old files stay referenced in the log, you can query prior versions:

SELECT * FROM sales VERSION AS OF 12;
SELECT * FROM sales TIMESTAMP AS OF '2026-06-01';

Use DESCRIBE HISTORY sales to see the version, timestamp, and operation for each commit. Time travel powers audits, reproducible reports, and rollbacks (RESTORE TABLE sales TO VERSION AS OF 12). Note: once VACUUM removes the underlying files, time travel to those older versions fails with file-not-found.

Schema Enforcement and Evolution

Delta tables have a defined schema, and schema enforcement (schema-on-write) rejects any write whose columns or types do not match — protecting downstream consumers from corruption. When a change is intentional, schema evolution lets it through:

Goal	Mechanism
Add new columns from an append	`.option("mergeSchema", "true")` or `spark.databricks.delta.schema.autoMerge.enabled`
Replace the schema on overwrite	`.option("overwriteSchema", "true")`
Evolve schema during a MERGE	`WITH SCHEMA EVOLUTION` clause

Without these options, a mismatched write fails fast rather than silently dropping or coercing data — a frequent exam point.

MERGE: Upserts in One Statement

MERGE combines insert, update, and delete against a target using a source and a match condition — the standard pattern for change-data-capture and Silver-layer upserts:

MERGE INTO customers t
USING updates s ON t.id = s.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

This updates millions of rows as simply as a database UPDATE, all within one atomic commit.

Table Maintenance: OPTIMIZE, ZORDER, VACUUM

Streaming and frequent small writes create the small-file problem, hurting read performance. Maintenance commands fix layout and reclaim space:

OPTIMIZE table — compacts many small Parquet files into fewer right-sized files (no data change).
OPTIMIZE table ZORDER BY (col) — co-locates rows sharing column values so queries filtering on col skip far more files (data skipping). Best for high-cardinality filter columns.
Liquid clustering (CLUSTER BY) — the modern, stateful alternative to partitioning and ZORDER; it incrementally maintains layout as data evolves without recomputing the whole table.
VACUUM table — permanently deletes data files no longer referenced by the log, older than the retention threshold (default 7 days). VACUUM frees storage but breaks time travel to versions whose files it removes.

A common trap: running VACUUM ... RETAIN 0 HOURS deletes files immediately and can corrupt readers/time travel; Databricks guards this with a safety check you must explicitly disable.

Partitioning, Clustering, and Data Skipping

Delta accelerates reads by skipping files it can prove are irrelevant. For each data file, Delta records min/max statistics per column in the log; a query with a WHERE filter consults these stats and reads only files whose ranges overlap the predicate. Three layout strategies improve this:

Strategy	What it does	When to use
Partitioning	Splits data into directories by a low-cardinality column (e.g., `date`)	Large tables with a clear, low-cardinality filter
Z-ordering	Co-locates related values across multiple columns within files	High-cardinality filter columns; run via OPTIMIZE
Liquid clustering	Stateful, incremental clustering on chosen keys	Modern default; avoids over-partitioning and reclustering cost

Over-partitioning on a high-cardinality column is a classic anti-pattern: it produces thousands of tiny files and hurts performance. Liquid clustering is now preferred because it adapts as data grows without the rigid directory layout of partitioning or the full-table recompute that Z-ordering can require.

Managed vs External Tables and Constraints

Delta tables come in two storage flavors under Unity Catalog. A managed table has both its metadata and its data files managed by Databricks in the catalog's storage location; dropping it deletes the data. An external table stores data at a path you specify (an external location); dropping it removes only the metadata, leaving the files in place. Delta also supports CHECK constraints and NOT NULL to reject invalid rows at write time, and generated columns computed from other columns.

Why Delta Beats Plain Parquet

Plain Parquet on a lake has no atomic commits, so a failed multi-file write leaves the table half-updated and readers can see partial results. Delta's transaction log makes every write atomic and isolated, adds time travel and schema enforcement, and enables efficient upserts, deletes, and updates that raw Parquet cannot express. That reliability is precisely why every managed table on Databricks defaults to Delta.

Test Your Knowledge

What does the _delta_log directory of a Delta table contain, and why is it essential?

The compressed Parquet data files that hold the table's rows

A backup copy of the data used only for disaster recovery

Cluster configuration settings for the jobs that wrote the table

An ordered set of JSON commit files recording add/remove file actions, which provides ACID transactions by replaying table state

Test Your Knowledge

A query needs the state of the orders table as it existed yesterday for an audit. Which feature provides this?

OPTIMIZE with ZORDER

VACUUM with a 0-hour retention

Time travel using TIMESTAMP AS OF

Schema evolution with mergeSchema

Test Your Knowledge

An append job tries to write a DataFrame that has an extra column not in the target Delta table, and no schema options are set. What happens?

The write fails because schema enforcement rejects the mismatched columns

The extra column is silently dropped and the write succeeds

The table is automatically overwritten with the new schema

All existing rows are deleted to match the new schema

Test Your Knowledge

Which maintenance command compacts many small Parquet files into fewer, larger files to improve read performance without changing the data?

VACUUM

MERGE

RESTORE

OPTIMIZE

Up Next

1.4 Databricks Repos, Secrets, and Development Tools

Continue learning

Databricks Certified Data Engineer Associate

Databricks Certified Data Engineer Associate

1.3 Delta Lake Fundamentals

Key Takeaways

Anatomy of a Delta Table

Time Travel

Schema Enforcement and Evolution

MERGE: Upserts in One Statement

Table Maintenance: OPTIMIZE, ZORDER, VACUUM

Partitioning, Clustering, and Data Skipping

Managed vs External Tables and Constraints

Why Delta Beats Plain Parquet

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

Databricks Certified Data Engineer Associate

1.3 Delta Lake Fundamentals

Key Takeaways

Anatomy of a Delta Table

Time Travel

Schema Enforcement and Evolution

MERGE: Upserts in One Statement

Table Maintenance: OPTIMIZE, ZORDER, VACUUM

Partitioning, Clustering, and Data Skipping

Managed vs External Tables and Constraints

Why Delta Beats Plain Parquet