3.2 Delta Lake Optimization: OPTIMIZE, VACUUM, and Liquid Clustering
Key Takeaways
- OPTIMIZE compacts small files into larger ones (bin-packing), improving read performance for downstream queries.
- Z-ORDER co-locates related data in the same files based on specified columns, enabling efficient data skipping.
- Liquid Clustering is the modern replacement for partitioning and Z-ORDER, allowing dynamic clustering key changes without full table rewrites.
- VACUUM removes data files no longer referenced by the Delta transaction log, reclaiming storage space.
- VACUUM has a default retention of 7 days — files older than the retention period are permanently deleted, preventing time travel to those versions.
Quick Answer: OPTIMIZE compacts small files into larger ones for faster reads. Z-ORDER co-locates related data for efficient data skipping. Liquid Clustering is the modern, dynamic replacement for both partitioning and Z-ORDER. VACUUM cleans up old files to reclaim storage (default 7-day retention).
The Small Files Problem
Streaming ingestion and frequent appends create many small files. This degrades query performance because Spark must open and read each file individually, creating I/O overhead.
Example: A streaming job that runs every minute might create 1,440 small files per day. After a month, the table has 43,200 small files, causing queries to be extremely slow.
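The cost of that file explosion can be pictured with a toy model: total scan time is per-file overhead times file count, plus the time to stream the bytes themselves. The overhead and throughput numbers below are illustrative assumptions, not Databricks measurements:

```python
# Toy model of the small-files problem: once files are tiny, per-file
# open/footer/scheduling overhead dominates the scan.
# The constants are assumed for illustration only.

PER_FILE_OVERHEAD_MS = 20      # assumed cost to open and plan each file
SCAN_RATE_MB_PER_MS = 0.1      # assumed sequential scan throughput

def scan_time_ms(total_mb: float, num_files: int) -> float:
    """Estimate scan time for the same data split across num_files files."""
    return num_files * PER_FILE_OVERHEAD_MS + total_mb / SCAN_RATE_MB_PER_MS

# Same 10 GB of data: 43,200 small files vs. 10 compacted ~1 GB files
small = scan_time_ms(10_240, 43_200)
compacted = scan_time_ms(10_240, 10)
```

Under these assumptions the byte-scanning cost is identical in both cases; the difference is entirely the per-file overhead, which is exactly what OPTIMIZE removes.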
OPTIMIZE (File Compaction)
OPTIMIZE compacts small files into larger, optimally-sized files (typically ~1 GB):
-- Compact all files in the table
OPTIMIZE my_catalog.my_schema.orders;
-- Compact only files matching a predicate
OPTIMIZE my_catalog.my_schema.orders
WHERE order_date >= '2026-01-01';
How OPTIMIZE Works
- Identifies files below the target size
- Reads and combines small files into larger files
- Writes new, larger Parquet files
- Atomically updates the transaction log to reference the new files
- Old small files become eligible for VACUUM
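The compaction step above can be sketched as a greedy bin-packing pass over file sizes. This is an illustration of the idea only, not the actual OPTIMIZE implementation, which also rewrites Parquet data and commits to the Delta transaction log:

```python
# Sketch of OPTIMIZE-style bin-packing: group files below the target
# size into bins that approach the target; files already at or above
# the target are left alone.

TARGET_MB = 1024  # ~1 GB target output file size

def bin_pack(file_sizes_mb, target=TARGET_MB):
    """Group small files into bins whose total size approaches the target."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target:               # already optimally sized
            bins.append([size])
            continue
        if current and current_size + size > target:
            bins.append(current)         # close the current bin
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

files = [1200, 300, 300, 300, 200, 100, 50]   # file sizes in MB
packed = bin_pack(files)                       # 7 files -> 3 output files
```

Each bin becomes one rewritten output file; the inputs it absorbed are the "old small files" that become eligible for VACUUM.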
OPTIMIZE with Z-ORDER
Z-ORDER re-organizes data within files so that related values are co-located:
-- Compact and Z-ORDER by customer_id
OPTIMIZE my_catalog.my_schema.orders
ZORDER BY (customer_id);
-- Z-ORDER by multiple columns
OPTIMIZE my_catalog.my_schema.orders
ZORDER BY (customer_id, order_date);
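Conceptually, Z-ordering maps multiple columns onto a single sort key by interleaving the bits of each column value (a Morton or Z-order curve), so rows that are close on every clustered column end up close in the layout. A minimal two-column sketch of the bit interleaving (not Delta's actual implementation, which operates on value ranges rather than raw bits):

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # bits of x in even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # bits of y in odd positions
    return key

# Sorting rows by the interleaved key keeps rows with similar values in
# BOTH columns near each other, so both columns retain skipping power.
rows = [(3, 1), (0, 0), (2, 2), (1, 3)]
ordered = sorted(rows, key=lambda r: z_order_key(r[0], r[1]))
```

This is also why Z-ORDER degrades beyond a few columns: each extra column dilutes how tightly any single column is clustered.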
How Z-ORDER Helps Queries:
- Delta Lake stores min/max statistics for each column in each file
- When a query filters by a Z-ORDER column, Spark skips files where the value cannot exist
- Without Z-ORDER, values are randomly distributed across files (no skipping benefit)
- Z-ORDER is most effective for high-cardinality columns used in WHERE clauses
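The skipping mechanism above can be sketched with per-file min/max statistics. The file names and ranges are hypothetical; in Delta Lake these statistics live in the transaction log:

```python
# Sketch of stats-based data skipping: a file can be skipped whenever
# the filter value falls outside its [min, max] range for that column.

# Hypothetical per-file customer_id stats after OPTIMIZE ... ZORDER BY (customer_id)
file_stats = {
    "part-000.parquet": (1, 250),      # (min, max) customer_id in this file
    "part-001.parquet": (251, 500),
    "part-002.parquet": (501, 750),
}

def files_to_read(stats, value):
    """Return only the files whose min/max range could contain value."""
    return [f for f, (lo, hi) in stats.items() if lo <= value <= hi]

# WHERE customer_id = 300 touches one file instead of all three
hits = files_to_read(file_stats, 300)
```

Without Z-ORDER, a given customer_id could appear in any file, so every file's range would span nearly the full domain and nothing could be skipped.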
Z-ORDER Limitations
- Operates on the entire table (or partition) each time — expensive for large tables
- Limited to existing partitioning boundaries
- Cannot dynamically change Z-ORDER columns without full rewrite
- Maximum benefit with 1-4 columns
Liquid Clustering (Modern Approach)
Liquid Clustering is the recommended approach for new Delta tables, replacing both partitioning and Z-ORDER:
-- Create a table with Liquid Clustering
CREATE TABLE my_catalog.my_schema.orders (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
order_date DATE
)
CLUSTER BY (customer_id, order_date);
-- Change clustering keys without rewriting data
ALTER TABLE my_catalog.my_schema.orders
CLUSTER BY (order_date, product_category);
-- Trigger clustering optimization
OPTIMIZE my_catalog.my_schema.orders;
-- Remove clustering
ALTER TABLE my_catalog.my_schema.orders CLUSTER BY NONE;
Liquid Clustering vs. Z-ORDER vs. Partitioning
| Feature | Partitioning | Z-ORDER | Liquid Clustering |
|---|---|---|---|
| Data layout | Physical directories | Co-located within files | Dynamic data layout |
| Change keys | Requires full rewrite | Requires full re-OPTIMIZE | ALTER TABLE (no rewrite) |
| Incremental | N/A | No — full table re-OPTIMIZE | Yes — only clusters new data |
| Write performance | Overhead per partition | Overhead during OPTIMIZE | Minimal overhead |
| Best for | Low-cardinality columns | High-cardinality filter columns | All use cases |
| Compatibility | Can combine with Z-ORDER | Can combine with partitioning | Mutually exclusive with partitioning and Z-ORDER |
When to Use Liquid Clustering
Databricks recommends Liquid Clustering for:
- All new tables being created
- Tables with frequently changing query patterns (key flexibility)
- Tables with high-cardinality filter columns
- Tables with frequent writes (incremental clustering)
- Tables where concurrent write conflicts need to be minimized
VACUUM (Storage Cleanup)
VACUUM removes data files no longer referenced by the Delta transaction log:
-- Remove files older than the default retention (7 days)
VACUUM my_catalog.my_schema.orders;
-- Remove files older than 24 hours
VACUUM my_catalog.my_schema.orders RETAIN 24 HOURS;
-- Dry run: see which files would be deleted without actually deleting
VACUUM my_catalog.my_schema.orders DRY RUN;
VACUUM Rules and Behavior
| Rule | Detail |
|---|---|
| Default retention | 7 days (set per table via delta.deletedFileRetentionDuration) |
| Minimum retention | 7 days, enforced by a safety check (spark.databricks.delta.retentionDurationCheck.enabled) |
| Files affected | Only files NOT referenced by the current transaction log |
| Time travel impact | Versions older than the retention period become inaccessible |
| Concurrency | Safe to run while other operations are happening |
| Cannot undo | VACUUM deletes files permanently |
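These rules can be sketched as a cutoff computation: a file is deleted only if it is both unreferenced by the current transaction log and older than the retention window. An illustration only; the real VACUUM lists files in cloud storage and reconciles them against the Delta log:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # default retention window

def vacuum_candidates(files, referenced, now, retention=RETENTION):
    """Files VACUUM would delete: unreferenced AND older than the cutoff.

    files: dict of path -> last-modified timestamp
    referenced: set of paths still in the current transaction log
    """
    cutoff = now - retention
    return sorted(
        path for path, modified in files.items()
        if path not in referenced and modified < cutoff
    )

now = datetime(2026, 2, 1)
files = {
    "old_small_1.parquet": datetime(2026, 1, 10),   # unreferenced, 22 days old
    "old_small_2.parquet": datetime(2026, 1, 28),   # unreferenced, 4 days old
    "compacted.parquet":   datetime(2026, 1, 10),   # still referenced
}
doomed = vacuum_candidates(files, referenced={"compacted.parquet"}, now=now)
```

Note that old_small_2.parquet survives even though it is unreferenced: it is inside the retention window, which is what keeps recent time travel queries working.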
Setting Retention Period
-- Set table-level retention to 30 days
ALTER TABLE my_catalog.my_schema.orders
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '30 days');
WARNING: Setting retention below 7 days requires disabling the safety check:
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
This is risky because concurrent readers may still need the old files.
OPTIMIZE + VACUUM Workflow
-- Step 1: Compact small files
OPTIMIZE my_catalog.my_schema.orders;
-- Step 2: Clean up old files (the small files that were compacted)
VACUUM my_catalog.my_schema.orders;
On the Exam: VACUUM is irreversible — once files are deleted, you cannot time travel to those versions. Always understand the relationship between VACUUM retention and time travel capability. Also know that Liquid Clustering is the recommended modern approach over partitioning + Z-ORDER.
What is the default retention period for the VACUUM command in Delta Lake?
A data engineer runs VACUUM with a 3-day retention on a Delta table and later tries to query VERSION AS OF from 5 days ago. What happens?
What is the primary advantage of Liquid Clustering over traditional Z-ORDER?
Which command should be run BEFORE VACUUM to ensure optimal file sizes?