3.2 Delta Lake Optimization: OPTIMIZE, VACUUM, and Liquid Clustering
Key Takeaways
- OPTIMIZE compacts small files into larger ones (bin-packing), improving read performance for downstream queries.
- Z-ORDER co-locates related data in the same files based on specified columns, enabling efficient data skipping.
- Liquid Clustering is the modern replacement for partitioning and Z-ORDER, allowing dynamic clustering key changes without full table rewrites.
- VACUUM removes data files no longer referenced by the Delta transaction log, reclaiming storage space.
- VACUUM has a default retention of 7 days — files older than the retention period are permanently deleted, preventing time travel to those versions.
Quick Answer: OPTIMIZE compacts small files into larger ones for faster reads. Z-ORDER co-locates related data for efficient data skipping. Liquid Clustering is the modern, dynamic replacement for both partitioning and Z-ORDER. VACUUM cleans up old files to reclaim storage (default 7-day retention).
The Small Files Problem
Streaming ingestion and frequent appends create many small files. This degrades query performance because Spark must open and read each file individually, creating I/O overhead.
Example: A streaming job that runs every minute might create 1,440 small files per day. After a month, the table has 43,200 small files, causing queries to be extremely slow.
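The cost of that file explosion can be pictured with a toy model: total scan time is per-file overhead times file count, plus the time to stream the bytes themselves. The overhead and throughput numbers below are illustrative assumptions, not Databricks measurements:

```python
# Toy model of the small-files problem: once files are tiny, per-file
# open/footer/scheduling overhead dominates the scan.
# The constants are assumed for illustration only.

PER_FILE_OVERHEAD_MS = 20      # assumed cost to open and plan each file
SCAN_RATE_MB_PER_MS = 0.1      # assumed sequential scan throughput

def scan_time_ms(total_mb: float, num_files: int) -> float:
    """Estimate scan time for the same data split across num_files files."""
    return num_files * PER_FILE_OVERHEAD_MS + total_mb / SCAN_RATE_MB_PER_MS

# Same 10 GB of data: 43,200 small files vs. 10 compacted ~1 GB files
small = scan_time_ms(10_240, 43_200)
compacted = scan_time_ms(10_240, 10)
```

Under these assumptions the byte-scanning cost is identical in both cases; the difference is entirely the per-file overhead, which is exactly what OPTIMIZE removes.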
OPTIMIZE (File Compaction)
OPTIMIZE compacts small files into larger, optimally-sized files (typically ~1 GB):
-- Compact all files in the table
OPTIMIZE my_catalog.my_schema.orders;
-- Compact only files matching a predicate
OPTIMIZE my_catalog.my_schema.orders
WHERE order_date >= '2026-01-01';
How OPTIMIZE Works
- Identifies files below the target size
- Reads and combines small files into larger files
- Writes new, larger Parquet files
- Atomically updates the transaction log to reference the new files
- Old small files become eligible for VACUUM
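The compaction step above can be sketched as a greedy bin-packing pass over file sizes. This is an illustration of the idea only, not the actual OPTIMIZE implementation, which also rewrites Parquet data and commits to the Delta transaction log:

```python
# Sketch of OPTIMIZE-style bin-packing: group files below the target
# size into bins that approach the target; files already at or above
# the target are left alone.

TARGET_MB = 1024  # ~1 GB target output file size

def bin_pack(file_sizes_mb, target=TARGET_MB):
    """Group small files into bins whose total size approaches the target."""
    bins, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb, reverse=True):
        if size >= target:               # already optimally sized
            bins.append([size])
            continue
        if current and current_size + size > target:
            bins.append(current)         # close the current bin
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)
    return bins

files = [1200, 300, 300, 300, 200, 100, 50]   # file sizes in MB
packed = bin_pack(files)                       # 7 files -> 3 output files
```

Each bin becomes one rewritten output file; the inputs it absorbed are the "old small files" that become eligible for VACUUM.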
OPTIMIZE with Z-ORDER
Z-ORDER re-organizes data within files so that related values are co-located:
-- Compact and Z-ORDER by customer_id
OPTIMIZE my_catalog.my_schema.orders
ZORDER BY (customer_id);
-- Z-ORDER by multiple columns
OPTIMIZE my_catalog.my_schema.orders
ZORDER BY (customer_id, order_date);
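Conceptually, Z-ordering maps multiple columns onto a single sort key by interleaving the bits of each column value (a Morton or Z-order curve), so rows that are close on every clustered column end up close in the layout. A minimal two-column sketch of the bit interleaving (not Delta's actual implementation, which operates on value ranges rather than raw bits):

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into one Morton (Z-order) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # bits of x in even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # bits of y in odd positions
    return key

# Sorting rows by the interleaved key keeps rows with similar values in
# BOTH columns near each other, so both columns retain skipping power.
rows = [(3, 1), (0, 0), (2, 2), (1, 3)]
ordered = sorted(rows, key=lambda r: z_order_key(r[0], r[1]))
```

This is also why Z-ORDER degrades beyond a few columns: each extra column dilutes how tightly any single column is clustered.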
How Z-ORDER Helps Queries:
- Delta Lake stores min/max statistics for each column in each file
- When a query filters by a Z-ORDER column, Spark skips files where the value cannot exist
- Without Z-ORDER, values are randomly distributed across files (no skipping benefit)
- Z-ORDER is most effective for high-cardinality columns used in WHERE clauses
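The skipping mechanism above can be sketched with per-file min/max statistics. The file names and ranges are hypothetical; in Delta Lake these statistics live in the transaction log:

```python
# Sketch of stats-based data skipping: a file can be skipped whenever
# the filter value falls outside its [min, max] range for that column.

# Hypothetical per-file customer_id stats after OPTIMIZE ... ZORDER BY (customer_id)
file_stats = {
    "part-000.parquet": (1, 250),      # (min, max) customer_id in this file
    "part-001.parquet": (251, 500),
    "part-002.parquet": (501, 750),
}

def files_to_read(stats, value):
    """Return only the files whose min/max range could contain value."""
    return [f for f, (lo, hi) in stats.items() if lo <= value <= hi]

# WHERE customer_id = 300 touches one file instead of all three
hits = files_to_read(file_stats, 300)
```

Without Z-ORDER, a given customer_id could appear in any file, so every file's range would span nearly the full domain and nothing could be skipped.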
Z-ORDER Limitations
- Operates on the entire table (or partition) each time — expensive for large tables
- Limited to existing partitioning boundaries
- Cannot dynamically change Z-ORDER columns without full rewrite
- Maximum benefit with 1-4 columns
Liquid Clustering (Modern Approach)
Liquid Clustering is the recommended approach for new Delta tables, replacing both partitioning and Z-ORDER:
-- Create a table with Liquid Clustering
CREATE TABLE my_catalog.my_schema.orders (
order_id BIGINT,
customer_id BIGINT,
amount DECIMAL(10,2),
order_date DATE
)
CLUSTER BY (customer_id, order_date);
-- Change clustering keys without rewriting data
ALTER TABLE my_catalog.my_schema.orders
CLUSTER BY (order_date, product_category);
-- Trigger clustering optimization
OPTIMIZE my_catalog.my_schema.orders;
-- Remove clustering
ALTER TABLE my_catalog.my_schema.orders CLUSTER BY NONE;
Liquid Clustering vs. Z-ORDER vs. Partitioning
| Feature | Partitioning | Z-ORDER | Liquid Clustering |
|---|---|---|---|
| Data layout | Physical directories | Co-located within files | Dynamic data layout |
| Change keys | Requires full rewrite | Requires full re-OPTIMIZE | ALTER TABLE (no rewrite) |
| Incremental | N/A | No — full table re-OPTIMIZE | Yes — only clusters new data |
| Write performance | Overhead per partition | Overhead during OPTIMIZE | Minimal overhead |
| Best for | Low-cardinality columns | High-cardinality filter columns | All use cases |
| Compatibility | Can combine with Z-ORDER | Can combine with partitioning | Mutually exclusive with partitioning and Z-ORDER |
When to Use Liquid Clustering
Databricks recommends Liquid Clustering for:
- All new tables being created
- Tables with frequently changing query patterns (key flexibility)
- Tables with high-cardinality filter columns
- Tables with frequent writes (incremental clustering)
- Tables where concurrent write conflicts need to be minimized
VACUUM (Storage Cleanup)
VACUUM removes data files no longer referenced by the Delta transaction log:
-- Remove files older than the default retention (7 days)
VACUUM my_catalog.my_schema.orders;
-- Remove files older than 24 hours
VACUUM my_catalog.my_schema.orders RETAIN 24 HOURS;
-- Dry run: see which files would be deleted without actually deleting
VACUUM my_catalog.my_schema.orders DRY RUN;
VACUUM Rules and Behavior
| Rule | Detail |
|---|---|
| Default retention | 7 days (set per table via delta.deletedFileRetentionDuration) |
| Minimum retention | 7 days, enforced by a safety check (spark.databricks.delta.retentionDurationCheck.enabled) |
| Files affected | Only files NOT referenced by the current transaction log |
| Time travel impact | Versions older than the retention period become inaccessible |
| Concurrency | Safe to run while other operations are happening |
| Cannot undo | VACUUM deletes files permanently |
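These rules can be sketched as a cutoff computation: a file is deleted only if it is both unreferenced by the current transaction log and older than the retention window. An illustration only; the real VACUUM lists files in cloud storage and reconciles them against the Delta log:

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # default retention window

def vacuum_candidates(files, referenced, now, retention=RETENTION):
    """Files VACUUM would delete: unreferenced AND older than the cutoff.

    files: dict of path -> last-modified timestamp
    referenced: set of paths still in the current transaction log
    """
    cutoff = now - retention
    return sorted(
        path for path, modified in files.items()
        if path not in referenced and modified < cutoff
    )

now = datetime(2026, 2, 1)
files = {
    "old_small_1.parquet": datetime(2026, 1, 10),   # unreferenced, 22 days old
    "old_small_2.parquet": datetime(2026, 1, 28),   # unreferenced, 4 days old
    "compacted.parquet":   datetime(2026, 1, 10),   # still referenced
}
doomed = vacuum_candidates(files, referenced={"compacted.parquet"}, now=now)
```

Note that old_small_2.parquet survives even though it is unreferenced: it is inside the retention window, which is what keeps recent time travel queries working.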
Setting Retention Period
-- Set table-level retention to 30 days
ALTER TABLE my_catalog.my_schema.orders
SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = '30 days');
WARNING: Setting retention below 7 days requires disabling the safety check:
SET spark.databricks.delta.retentionDurationCheck.enabled = false;
This is risky because concurrent readers may still need the old files.
OPTIMIZE + VACUUM Workflow
-- Step 1: Compact small files
OPTIMIZE my_catalog.my_schema.orders;
-- Step 2: Clean up old files (the small files that were compacted)
VACUUM my_catalog.my_schema.orders;
On the Exam: VACUUM is irreversible — once files are deleted, you cannot time travel to those versions. Always understand the relationship between VACUUM retention and time travel capability. Also know that Liquid Clustering is the recommended modern approach over partitioning + Z-ORDER.
What is the default retention period for the VACUUM command in Delta Lake?
A data engineer runs VACUUM with a 3-day retention on a Delta table and later tries to query VERSION AS OF from 5 days ago. What happens?
What is the primary advantage of Liquid Clustering over traditional Z-ORDER?
Which command should be run BEFORE VACUUM to ensure optimal file sizes?