3.2 Delta Lake Optimization: OPTIMIZE, VACUUM, and Liquid Clustering

Key Takeaways

OPTIMIZE performs bin-packing, compacting many small files into fewer larger ones to cut read overhead and the small-file problem.
Z-ORDER co-locates related data across files on chosen columns for data skipping, but cannot be combined with Liquid Clustering.
Liquid Clustering (CLUSTER BY) is the recommended replacement for partitioning and Z-ORDER, re-clusters incrementally, and lets you change keys without rewriting the table.
VACUUM deletes data files no longer referenced by the transaction log; the default retention is 7 days (168 hours), and removed files break time travel to those versions.
Predictive Optimization automatically runs OPTIMIZE, VACUUM, and ANALYZE on Unity Catalog managed tables using serverless compute, and deletion vectors mark rows deleted without rewriting whole files.

Last updated: June 2026

OPTIMIZE and the Small-File Problem

Streaming and frequent micro-batch writes create many tiny files, which slows reads because each file carries open/close and metadata overhead. OPTIMIZE solves this with bin-packing: it compacts numerous small files into a smaller number of larger files (targeting roughly 1 GB each) without changing the data itself.

OPTIMIZE sales;
OPTIMIZE sales WHERE order_date >= '2026-01-01';

OPTIMIZE is idempotent — running it again on an already-optimized table does little work. It is a maintenance command, not a transformation, so the row count and values are unchanged; only the physical file layout improves.

Z-ORDER: Data Skipping via Co-location

Z-ORDER BY rewrites data so that rows with similar values in the specified columns sit in the same files. Combined with Delta's per-file min/max statistics, this enables data skipping: queries that filter on a Z-ordered column read only the relevant files. Z-ordering is most useful on high-cardinality columns you frequently filter on (for example customer_id).

OPTIMIZE events ZORDER BY (device_id, event_date);

Key limitation: Z-ORDER cannot be used together with Liquid Clustering on the same table.

Liquid Clustering: The Modern Replacement

Liquid Clustering is Databricks' current recommendation for all new tables, replacing both fixed partitioning and Z-ORDER. You declare clustering keys with CLUSTER BY, and Delta automatically lays out data to skip efficiently on those keys.

CREATE TABLE orders (id BIGINT, region STRING, ts TIMESTAMP)
CLUSTER BY (region, ts);

-- change keys later without rewriting the whole table:
ALTER TABLE orders CLUSTER BY (region);

-- let Databricks pick keys automatically:
CREATE TABLE orders CLUSTER BY AUTO;

The advantages the exam highlights:

Incremental re-clustering — only modified data is re-clustered, not the whole table, so maintenance is cheaper than a full Z-ORDER rewrite.
Flexible keys — clustering columns can be changed with ALTER TABLE without rewriting existing files.
No partition skew — it avoids the over- and under-partitioning problems of static PARTITIONED BY.

Technique	Change keys cheaply?	Incremental?	Combine with each other
Partitioning	No (rewrite)	No	—
Z-ORDER	No	No (full OPTIMIZE)	Not with Liquid Clustering
Liquid Clustering	Yes	Yes	Replaces both above

VACUUM, Deletion Vectors, and Predictive Optimization

VACUUM permanently removes data files that are no longer referenced by the Delta transaction log — files left behind by updates, deletes, merges, and OPTIMIZE. It reclaims storage but is destructive to history:

VACUUM sales;            -- uses default 7-day retention
VACUUM sales RETAIN 168 HOURS;

The default retention is 7 days (168 hours). Files newer than the threshold are kept so that in-flight readers and time travel still work; once VACUUM deletes older files you can no longer time-travel to versions that depended on them. Databricks blocks retention shorter than 7 days unless you explicitly disable the safety check — a common trap in exam scenarios about losing time-travel history.

Deletion vectors make deletes and updates efficient: instead of rewriting an entire data file to remove a few rows, Delta records which rows are logically deleted in a separate deletion-vector file. The underlying Parquet stays put, and a later OPTIMIZE physically applies the changes. This is the merge-on-read optimization.

Predictive Optimization automatically schedules OPTIMIZE, VACUUM, and ANALYZE for Unity Catalog managed tables using serverless compute, and for liquid-clustered tables it works with CLUSTER BY AUTO to choose keys and run incremental clustering — removing the need to hand-schedule maintenance jobs.

Putting the Maintenance Tools Together

These commands address different problems, and a frequent exam trap is swapping their roles. Use this decision guide:

Symptom / Goal	Command
Slow reads from thousands of tiny files	OPTIMIZE (bin-packing)
Queries filter one high-cardinality column on a legacy table	OPTIMIZE ZORDER BY
New table needs flexible, incremental data skipping	CLUSTER BY (Liquid Clustering)
Storage bill rising from old unreferenced files	VACUUM
Frequent small deletes/updates without full rewrites	Deletion vectors (merge-on-read)
Don't want to schedule any of the above manually	Predictive Optimization

Order of operations and gotchas

OPTIMIZE before VACUUM is the typical maintenance cadence: compact first, then reclaim the now-orphaned small files (respecting retention).
VACUUM is irreversible. Once files are gone, time travel and RESTORE to those versions are impossible. Treat the 7-day default as a safety floor, not a number to lower casually.
Z-ORDER and Liquid Clustering are mutually exclusive on a table — you migrate from one to the other, you don't run both.
Deletion vectors plus OPTIMIZE: deletes are fast because the file isn't rewritten immediately; a later OPTIMIZE physically purges the soft-deleted rows.
Statistics matter for skipping. Data skipping relies on per-file min/max stats collected on the first 32 columns by default, which is why clustering and Z-ordering only help on columns that have usable statistics.
Auto-compaction and optimized writes are table properties (delta.autoOptimize.autoCompact, optimizeWrite) that reduce the small-file problem at write time, complementing manual OPTIMIZE.
ANALYZE TABLE ... COMPUTE STATISTICS refreshes table and column statistics the cost-based optimizer uses to plan joins, and Predictive Optimization can run it automatically. Keeping stats fresh is what lets the optimizer choose broadcast versus shuffle joins correctly on silver and gold tables.

Test Your Knowledge

A team wants efficient data skipping on a new Delta table, the ability to change the clustering column later without rewriting the table, and incremental maintenance. What should they use?

Static PARTITIONED BY on the column

OPTIMIZE ZORDER BY on the column after every load

Liquid Clustering with CLUSTER BY

VACUUM with a 0-hour retention

Test Your Knowledge

What does OPTIMIZE do to a Delta table?

Compacts many small files into fewer larger files (bin-packing) without changing the data

Permanently deletes unreferenced data files to reclaim storage

Removes duplicate rows based on a primary key

Marks rows as deleted using a deletion vector

Test Your Knowledge

A user runs VACUUM with the default settings, then tries to time travel to a table version from two weeks ago and it fails. Why?

OPTIMIZE disables time travel permanently

VACUUM's default 7-day retention removed the older data files those versions depended on

Time travel requires Liquid Clustering to be enabled

Deletion vectors overwrote the transaction log

Test Your Knowledge

Which statement about Predictive Optimization is correct?

It only works on Hive metastore external tables

It replaces the need to write any SQL transformations

It automatically schedules OPTIMIZE, VACUUM, and ANALYZE for Unity Catalog managed tables using serverless compute

It permanently disables VACUUM to protect time travel

Up Next

3.3 Advanced SQL Transformations

Continue learning

Databricks Certified Data Engineer Associate

Databricks Certified Data Engineer Associate

3.2 Delta Lake Optimization: OPTIMIZE, VACUUM, and Liquid Clustering

Key Takeaways

OPTIMIZE and the Small-File Problem

Z-ORDER: Data Skipping via Co-location

Liquid Clustering: The Modern Replacement

VACUUM, Deletion Vectors, and Predictive Optimization

Putting the Maintenance Tools Together

Order of operations and gotchas

Databricks Certified Data Engineer Associate

1Introduction

2Domain 1: Databricks Intelligence Platform (10%)

3Domain 2: Development and Ingestion (30%)

4Domain 3: Data Processing & Transformations (31%)

5Domain 4: Productionizing Data Pipelines (18%)

6Domain 5: Data Governance & Quality (11%)

Databricks Certified Data Engineer Associate

3.2 Delta Lake Optimization: OPTIMIZE, VACUUM, and Liquid Clustering

Key Takeaways

OPTIMIZE and the Small-File Problem

Z-ORDER: Data Skipping via Co-location

Liquid Clustering: The Modern Replacement

VACUUM, Deletion Vectors, and Predictive Optimization

Putting the Maintenance Tools Together

Order of operations and gotchas