3.2 Delta Lake Optimization: OPTIMIZE, VACUUM, and Liquid Clustering

Key Takeaways

  • OPTIMIZE compacts small files into larger ones (bin-packing), improving read performance for downstream queries.
  • Z-ORDER co-locates related data in the same files based on specified columns, enabling efficient data skipping.
  • Liquid Clustering is the modern replacement for partitioning and Z-ORDER, allowing dynamic clustering key changes without full table rewrites.
  • VACUUM removes data files no longer referenced by the Delta transaction log, reclaiming storage space.
  • VACUUM has a default retention of 7 days — files older than the retention period are permanently deleted, preventing time travel to those versions.
Last updated: March 2026

Delta Lake Optimization: OPTIMIZE, VACUUM, and Liquid Clustering

Quick Answer: OPTIMIZE compacts small files into larger ones for faster reads. Z-ORDER co-locates related data for efficient data skipping. Liquid Clustering is the modern, dynamic replacement for both partitioning and Z-ORDER. VACUUM cleans up old files to reclaim storage (default 7-day retention).

The Small Files Problem

Streaming ingestion and frequent appends create many small files. This degrades query performance because Spark must open and read each file individually, creating I/O overhead.

Example: A streaming job that runs every minute might create 1,440 small files per day. After a month, the table has 43,200 small files, causing queries to be extremely slow.

OPTIMIZE (File Compaction)

OPTIMIZE compacts small files into larger, optimally-sized files (typically ~1 GB):

-- Compact all files in the table
OPTIMIZE my_catalog.my_schema.orders;

-- Compact only files matching a predicate
OPTIMIZE my_catalog.my_schema.orders
WHERE order_date >= '2026-01-01';

How OPTIMIZE Works

  1. Identifies files below the target size
  2. Reads and combines small files into larger files
  3. Writes new, larger Parquet files
  4. Atomically updates the transaction log to reference the new files
  5. Old small files become eligible for VACUUM

OPTIMIZE with Z-ORDER

Z-ORDER re-organizes data within files so that related values are co-located:

-- Compact and Z-ORDER by customer_id
OPTIMIZE my_catalog.my_schema.orders
ZORDER BY (customer_id);

-- Z-ORDER by multiple columns
OPTIMIZE my_catalog.my_schema.orders
ZORDER BY (customer_id, order_date);

How Z-ORDER Helps Queries:

  • Delta Lake stores min/max statistics for each column in each file
  • When a query filters by a Z-ORDER column, Spark skips files where the value cannot exist
  • Without Z-ORDER, values are randomly distributed across files (no skipping benefit)
  • Z-ORDER is most effective for high-cardinality columns used in WHERE clauses

Z-ORDER Limitations

  • Operates on the entire table (or partition) each time — expensive for large tables
  • Limited to existing partitioning boundaries
  • Cannot dynamically change Z-ORDER columns without full rewrite
  • Maximum benefit with 1-4 columns

Liquid Clustering (Modern Approach)

Liquid Clustering is the recommended approach for new Delta tables, replacing both partitioning and Z-ORDER:

-- Create a table with Liquid Clustering
CREATE TABLE my_catalog.my_schema.orders (
    order_id BIGINT,
    customer_id BIGINT,
    amount DECIMAL(10,2),
    order_date DATE
)
CLUSTER BY (customer_id, order_date);

-- Change clustering keys without rewriting data
ALTER TABLE my_catalog.my_schema.orders
CLUSTER BY (order_date, product_category);

-- Trigger clustering optimization
OPTIMIZE my_catalog.my_schema.orders;

-- Remove clustering
ALTER TABLE my_catalog.my_schema.orders CLUSTER BY NONE;

Liquid Clustering vs. Z-ORDER vs. Partitioning

FeaturePartitioningZ-ORDERLiquid Clustering
Data layoutPhysical directoriesCo-located within filesDynamic data layout
Change keysRequires full rewriteRequires full re-OPTIMIZEALTER TABLE (no rewrite)
IncrementalN/ANo — full table re-OPTIMIZEYes — only clusters new data
Write performanceOverhead per partitionOverhead during OPTIMIZEMinimal overhead
Best forLow-cardinality columnsHigh-cardinality filter columnsAll use cases
CompatibilityWith Z-ORDERWith partitioningCannot use with partitioning or Z-ORDER

When to Use Liquid Clustering

Databricks recommends Liquid Clustering for:

  • All new tables being created
  • Tables with frequently changing query patterns (key flexibility)
  • Tables with high-cardinality filter columns
  • Tables with frequent writes (incremental clustering)
  • Tables where concurrent write conflicts need to be minimized

VACUUM (Storage Cleanup)

VACUUM removes data files no longer referenced by the Delta transaction log:

-- Remove files older than the default retention (7 days)
VACUUM my_catalog.my_schema.orders;

-- Remove files older than 24 hours
VACUUM my_catalog.my_schema.orders RETAIN 24 HOURS;

-- Dry run: see which files would be deleted without actually deleting
VACUUM my_catalog.my_schema.orders DRY RUN;

VACUUM Rules and Behavior

RuleDetail
Default retention7 days
Minimum retention7 days (configurable with delta.deletedFileRetentionDuration)
Files affectedOnly files NOT referenced by the current transaction log
Time travel impactVersions older than the retention period become inaccessible
ConcurrencySafe to run while other operations are happening
Cannot undoVACUUM deletes files permanently

Setting Retention Period

-- Set table-level retention to 30 days
ALTER TABLE my_catalog.my_schema.orders
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '30 days');

WARNING: Setting retention below 7 days requires disabling the safety check: SET spark.databricks.delta.retentionDurationCheck.enabled = false;. This is risky because concurrent readers may still need the old files.

OPTIMIZE + VACUUM Workflow

-- Step 1: Compact small files
OPTIMIZE my_catalog.my_schema.orders;

-- Step 2: Clean up old files (the small files that were compacted)
VACUUM my_catalog.my_schema.orders;

On the Exam: VACUUM is irreversible — once files are deleted, you cannot time travel to those versions. Always understand the relationship between VACUUM retention and time travel capability. Also know that Liquid Clustering is the recommended modern approach over partitioning + Z-ORDER.

Test Your Knowledge

What is the default retention period for the VACUUM command in Delta Lake?

A
B
C
D
Test Your Knowledge

A data engineer runs VACUUM with a 3-day retention on a Delta table and later tries to query VERSION AS OF from 5 days ago. What happens?

A
B
C
D
Test Your Knowledge

What is the primary advantage of Liquid Clustering over traditional Z-ORDER?

A
B
C
D
Test Your Knowledge

Which command should be run BEFORE VACUUM to ensure optimal file sizes?

A
B
C
D