1.3 Delta Lake Fundamentals
Key Takeaways
- Delta Lake is an open-source storage layer that brings ACID transactions and reliability to data lakes.
- Every Delta table has a transaction log (_delta_log directory) that records every change as an ordered, atomic commit.
- Delta Lake provides four key guarantees: atomicity, consistency, isolation, and durability (ACID).
- Time travel allows querying any historical version of a Delta table using VERSION AS OF or TIMESTAMP AS OF syntax.
- The DESCRIBE HISTORY command shows the full audit log of all operations performed on a Delta table.
Delta Lake Fundamentals
Quick Answer: Delta Lake is an open-source storage layer that adds ACID transactions, schema enforcement, and time travel to data stored in Parquet format on cloud object storage. Every change is recorded in a transaction log (_delta_log), enabling reliable, auditable data management.
What Is Delta Lake?
Delta Lake is the default table format in Databricks. When you create a table without specifying a format, it is automatically a Delta table. Delta Lake adds a transaction log on top of Parquet files stored in cloud object storage.
Delta Lake = Parquet Files + Transaction Log
my_table/
├── _delta_log/ # Transaction log directory
│ ├── 00000000000000000000.json # Commit 0
│ ├── 00000000000000000001.json # Commit 1
│ ├── 00000000000000000002.json # Commit 2
│ └── 00000000000000000010.checkpoint.parquet # Checkpoint
├── part-00000-xxxx.parquet # Data file
├── part-00001-xxxx.parquet # Data file
└── part-00002-xxxx.parquet # Data file
ACID Transactions
Delta Lake provides full ACID compliance:
| Property | Guarantee | Example |
|---|---|---|
| Atomicity | Transactions fully complete or fully fail — no partial writes | A multi-file write either adds all files or none |
| Consistency | Data always meets defined constraints after each transaction | Schema enforcement rejects invalid writes |
| Isolation | Concurrent operations don't interfere with each other | Readers see a consistent snapshot even during writes |
| Durability | Committed changes are permanent and survive failures | Committed data persists in cloud object storage |
How ACID Works in Delta Lake
- Write operations create new Parquet data files
- A new JSON commit is atomically added to the transaction log
- The commit references which files to add and which to remove
- Readers always see the latest committed version (snapshot isolation)
- Failed writes leave no trace — uncommitted files are eventually cleaned up
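The atomicity step above can be sketched in Python. This is a toy model, not the real implementation (Delta actually relies on the object store's put-if-absent semantics to publish commit files); the `commit` helper and file layout are illustrative assumptions:

```python
import json
import os
import tempfile

def commit(log_dir, version, actions):
    """Toy Delta-style commit: the JSON log entry appears in _delta_log
    all at once, or not at all. Uncommitted data files are invisible to
    readers because no log entry references them."""
    os.makedirs(log_dir, exist_ok=True)
    final = os.path.join(log_dir, f"{version:020d}.json")
    # Write to a temp file first, then atomically rename into place.
    fd, tmp = tempfile.mkstemp(dir=log_dir)
    with os.fdopen(fd, "w") as f:
        for action in actions:
            f.write(json.dumps(action) + "\n")
    os.rename(tmp, final)  # atomic on POSIX: readers see all lines or none
    return final

log = os.path.join(tempfile.mkdtemp(), "_delta_log")
commit(log, 0, [{"add": {"path": "part-00000.parquet"}}])
print(sorted(os.listdir(log)))  # ['00000000000000000000.json']
```

If the process dies before the rename, no commit file exists and readers still see the previous version, which is exactly the "no partial state" behavior the exam tests.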
On the Exam: A common question pattern asks what happens when a write fails midway. The answer is always: the transaction is atomic, so readers see either the fully completed write or the previous version. There is no corrupted or partial state.
The Transaction Log (_delta_log)
The transaction log is a directory called _delta_log within every Delta table. It is an ordered record of every transaction performed on the table.
Transaction Log Entries
Each commit is a JSON file containing:
- Add file actions: New Parquet files to include in the table
- Remove file actions: Old Parquet files to exclude (logically deleted)
- Metadata changes: Schema updates, table properties changes
- Protocol changes: Reader/writer protocol version updates
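Computing the table's current state from these actions can be sketched as a simple replay. This toy model keeps only file paths; real log entries also carry statistics, partition values, schema metadata, and a commitInfo action:

```python
def current_files(commits):
    """Toy replay of a Delta transaction log: apply each commit's
    add/remove actions in version order to get the live file set."""
    files = set()
    for actions in commits:  # commits ordered by version number
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])  # logical delete
    return files

log = [
    [{"add": {"path": "part-0.parquet"}}],    # commit 0
    [{"add": {"path": "part-1.parquet"}}],    # commit 1
    [{"remove": {"path": "part-0.parquet"}},  # commit 2: rewrite part-0
     {"add": {"path": "part-2.parquet"}}],
]
print(sorted(current_files(log)))  # ['part-1.parquet', 'part-2.parquet']
```

Note that "removed" files are only excluded from the live snapshot; the physical Parquet files stay on storage, which is what makes time travel possible.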
Checkpoints
By default, every 10 commits Delta Lake writes a checkpoint file (in Parquet format) that summarizes the full state of the table at that version. Readers can start from the most recent checkpoint and replay only the commits after it, instead of replaying every individual commit file from version 0.
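The speedup can be shown by extending the toy replay model: start from a checkpoint's snapshot and skip every commit the checkpoint already covers. The function and data shapes here are illustrative, not the actual checkpoint format:

```python
def state_with_checkpoint(checkpoint_version, checkpoint_files, commits):
    """Toy model: resume from a checkpoint's snapshot and replay only
    the commits newer than the checkpoint."""
    files = set(checkpoint_files)
    for version, actions in commits:
        if version <= checkpoint_version:
            continue  # already folded into the checkpoint
        for action in actions:
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return files

# Checkpoint at version 10 summarizes commits 0-10; only 11 is replayed.
commits = [(10, [{"add": {"path": "a.parquet"}}]),
           (11, [{"add": {"path": "b.parquet"}}])]
print(sorted(state_with_checkpoint(10, {"a.parquet"}, commits)))
# ['a.parquet', 'b.parquet']
```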
Time Travel
Time travel lets you query historical versions of a Delta table:
-- Query by version number
SELECT * FROM my_table VERSION AS OF 5;
-- Query by timestamp
SELECT * FROM my_table TIMESTAMP AS OF '2026-03-01';
-- Using the @ shortcut syntax
SELECT * FROM my_table@v5;
SELECT * FROM my_table@20260301000000000;  -- timestamp as yyyyMMddHHmmssSSS
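How TIMESTAMP AS OF picks a version can be sketched as follows: it resolves to the latest commit whose timestamp is at or before the requested time. The helper name and history shape are hypothetical, but the resolution rule matches Delta's behavior:

```python
from datetime import datetime

def version_as_of_timestamp(history, ts):
    """Toy resolver: TIMESTAMP AS OF maps to the newest commit at or
    before the requested timestamp."""
    eligible = [v for v, commit_ts in history if commit_ts <= ts]
    if not eligible:
        raise ValueError("timestamp precedes the table's first commit")
    return max(eligible)

# (version, commit timestamp) pairs, as DESCRIBE HISTORY would report
history = [(0, datetime(2026, 2, 1)),
           (1, datetime(2026, 2, 15)),
           (2, datetime(2026, 3, 5))]
print(version_as_of_timestamp(history, datetime(2026, 3, 1)))  # 1
```

Querying a timestamp earlier than the first commit is an error, while a timestamp between two commits resolves to the earlier one.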
Use Cases for Time Travel
- Audit and compliance: See exactly what the data looked like at any point
- Debugging: Compare current data to a previous version to find when a bug was introduced
- Reproducibility: Rerun ML training on the exact same dataset version
- Rollback: Restore a table to a previous version if a bad write occurs
Restoring a Previous Version
-- Restore table to version 5
RESTORE TABLE my_table TO VERSION AS OF 5;
-- Restore table to a timestamp
RESTORE TABLE my_table TO TIMESTAMP AS OF '2026-03-01';
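A point worth internalizing about RESTORE: it does not rewind or delete history. It appends a new commit whose snapshot equals the target version, so the bad versions remain queryable via time travel. A toy model of that behavior:

```python
def restore(versions, target_version):
    """Toy RESTORE: append a new version whose file set equals the
    target version's snapshot. History is preserved, not truncated."""
    versions.append(set(versions[target_version]))
    return len(versions) - 1  # the new, current version number

# versions[v] = file set of the table at version v (toy model)
versions = [{"a"}, {"a", "b"}, {"bad"}]   # version 2 was a bad write
new_version = restore(versions, 1)
print(new_version, sorted(versions[-1]))  # 3 ['a', 'b']
```

After the restore, version 2 (the bad write) is still reachable with VERSION AS OF 2, and DESCRIBE HISTORY shows a RESTORE operation as the newest entry.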
DESCRIBE HISTORY
The DESCRIBE HISTORY command shows the audit log of all operations:
DESCRIBE HISTORY my_table;
This returns a table with columns including:
- version: The version number
- timestamp: When the operation occurred
- operation: The type of operation (WRITE, MERGE, DELETE, etc.)
- operationParameters: Details of the operation
- userIdentity: Who performed the operation
- operationMetrics: Rows affected, files added/removed
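This audit view is derived from the commitInfo action each commit records in the log, listed newest first. A toy sketch of that derivation (field names simplified from the real output):

```python
def describe_history(commit_infos):
    """Toy DESCRIBE HISTORY: build audit rows from per-commit
    commitInfo records, newest version first."""
    rows = [
        {"version": v,
         "operation": info["operation"],
         "userIdentity": info.get("userName", "unknown")}
        for v, info in enumerate(commit_infos)
    ]
    return sorted(rows, key=lambda r: r["version"], reverse=True)

infos = [{"operation": "WRITE", "userName": "alice"},
         {"operation": "MERGE", "userName": "bob"}]
print(describe_history(infos)[0])
# {'version': 1, 'operation': 'MERGE', 'userIdentity': 'bob'}
```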
Schema Enforcement and Evolution
Schema Enforcement (Schema on Write)
When writing to a Delta table, Delta Lake checks that the incoming data matches the table schema:
- Column names must match
- Data types must be compatible
- Extra columns in the write data are rejected by default
- Missing columns receive null values
-- This will fail if new_data has columns not in the table schema
INSERT INTO my_table SELECT * FROM new_data;
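The enforcement rules above can be sketched as a single check per write: extra columns are rejected, missing columns become NULL. A toy model, not Delta's actual validation code:

```python
def enforce_schema(table_columns, incoming_row):
    """Toy schema-on-write check: reject columns not in the table
    schema; fill columns missing from the write with NULL (None)."""
    extra = set(incoming_row) - set(table_columns)
    if extra:
        raise ValueError(f"rejected: not in table schema: {sorted(extra)}")
    return {col: incoming_row.get(col) for col in table_columns}

# Missing column 'ts' is filled with NULL
print(enforce_schema(["id", "name", "ts"], {"id": 1, "name": "a"}))
# {'id': 1, 'name': 'a', 'ts': None}
```

A row carrying a column the table does not have raises an error here, mirroring how the INSERT above fails instead of silently dropping or adding columns.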
Schema Evolution
When source data legitimately has new columns, you can enable schema evolution:
-- Enable automatic schema evolution for the session
SET spark.databricks.delta.schema.autoMerge.enabled = true;
-- With autoMerge enabled, new columns in new_data are added
-- to the table schema instead of being rejected
INSERT INTO my_table
SELECT * FROM new_data;
For DataFrame API writes, the equivalent per-write setting is the mergeSchema writer option: .option("mergeSchema", "true").
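The evolution rule itself is simple: the table schema becomes the union of the old schema and the incoming columns, with new columns appended. A toy sketch (column names only; real evolution also handles type widening and nested fields):

```python
def evolve_schema(table_columns, incoming_columns):
    """Toy schema evolution: append incoming columns the table does
    not already have, preserving the existing column order."""
    merged = list(table_columns)
    for col in incoming_columns:
        if col not in merged:
            merged.append(col)
    return merged

print(evolve_schema(["id", "name"], ["id", "name", "country"]))
# ['id', 'name', 'country']
```

Existing rows simply read NULL for the newly added column; no data files are rewritten, because the schema lives in the transaction log's metadata.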
On the Exam: Know the difference between schema enforcement (rejects mismatched writes) and schema evolution (automatically adds new columns when enabled). Schema enforcement is the default behavior; schema evolution must be explicitly enabled.
Review Questions
What happens when a write operation to a Delta table fails midway through execution?
How does Delta Lake speed up reads after many transactions have occurred?
Which SQL syntax correctly queries version 10 of a Delta table named "orders"?
A data engineer writes data to a Delta table, but the incoming DataFrame has two extra columns not in the table schema. What is the default behavior?
Which command shows the complete history of all operations performed on a Delta table, including who performed each operation?
A data engineer accidentally updates the wrong rows in a Delta table. How can they undo this change?