4.5 Pipeline Performance and Cost Optimization
Key Takeaways
- Adaptive Query Execution (AQE) in Spark dynamically optimizes query plans at runtime based on actual data statistics.
- Broadcast joins automatically send small tables to all executors, avoiding expensive shuffle operations for joins with one small table.
- Partition pruning and data skipping reduce I/O by reading only relevant files based on WHERE clause filters.
- Right-sizing clusters (matching worker count and node types to workload) prevents both over-provisioning (waste) and under-provisioning (slowness).
- Caching frequently accessed DataFrames with .cache() or CACHE TABLE avoids redundant recomputation in iterative workflows.
Pipeline Performance and Cost Optimization
Quick Answer: Enable Adaptive Query Execution (AQE) for automatic runtime optimization. Use broadcast joins for small tables. Leverage partition pruning and data skipping. Right-size clusters with autoscaling. Cache frequently accessed DataFrames.
Adaptive Query Execution (AQE)
AQE dynamically optimizes query plans during execution based on actual runtime statistics:
| Feature | What It Does |
|---|---|
| Coalescing partitions | Merges small shuffle partitions to reduce overhead |
| Skew handling | Splits skewed partitions to balance work across tasks |
| Broadcast join conversion | Switches to broadcast join if a table is small enough |
| Dynamic filtering | Prunes data based on join results |
```sql
-- AQE is enabled by default in Databricks Runtime
-- Verify it is enabled
SET spark.sql.adaptive.enabled; -- Should return true
```
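To make the "coalescing partitions" row of the table above concrete, here is a rough pure-Python sketch of the idea: adjacent small shuffle partitions are greedily merged until each group reaches an advisory target size. The 64 MB target mirrors the default of `spark.sql.adaptive.advisoryPartitionSizeInBytes`, but this is an illustration, not Spark's actual algorithm.

```python
# Illustrative sketch of AQE partition coalescing (not Spark's real code):
# greedily merge adjacent small shuffle partitions until each merged
# group reaches a target size, reducing the number of downstream tasks.

TARGET_BYTES = 64 * 1024 * 1024  # 64 MB advisory partition size (Spark default)

def coalesce_partitions(sizes, target=TARGET_BYTES):
    """Group adjacent partition sizes so each merged group is ~target bytes."""
    groups, current, current_size = [], [], 0
    for size in sizes:
        current.append(size)
        current_size += size
        if current_size >= target:
            groups.append(current)
            current, current_size = [], 0
    if current:  # leftover small tail becomes its own group
        groups.append(current)
    return groups

mb = 1024 * 1024
sizes = [2 * mb, 3 * mb, 60 * mb, 5 * mb, 70 * mb, 1 * mb]
groups = coalesce_partitions(sizes)
print(len(sizes), "shuffle partitions ->", len(groups), "coalesced partitions")
```

Fewer, larger partitions mean fewer tasks and less per-task scheduling overhead, which is exactly what AQE's coalescing buys you on real workloads.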
Broadcast Joins
When joining a large table with a small table, broadcast the small table to all executors:
```python
from pyspark.sql.functions import broadcast

# Explicitly broadcast the small table
result = large_df.join(broadcast(small_df), "join_key")
```
```sql
-- SQL hint for broadcast join
SELECT /*+ BROADCAST(dim_products) */
  o.*, p.product_name
FROM orders o
JOIN dim_products p ON o.product_id = p.product_id;
```
When to Broadcast
- The small table fits in memory; Spark auto-broadcasts tables below spark.sql.autoBroadcastJoinThreshold (10 MB by default), and an explicit broadcast hint works up to Spark's 8 GB broadcast hard limit
- One table is significantly smaller than the other (>10x difference)
- AQE may automatically broadcast if it detects the table is small enough
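The decision logic behind these bullets can be sketched in plain Python. This is an illustration of the eligibility rules, not Spark's actual planner code; the function name is made up, but the 10 MB default threshold and 8 GB hard limit are real Spark values.

```python
# Sketch of Spark's broadcast-eligibility rules (illustrative only).
# 10 MB is the default spark.sql.autoBroadcastJoinThreshold;
# 8 GB is Spark's hard upper limit for any broadcast table.

AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024    # 10 MB default
BROADCAST_HARD_LIMIT = 8 * 1024 * 1024 * 1024  # 8 GB hard limit

def choose_join_strategy(small_table_bytes, explicit_hint=False):
    """Pick a join strategy from the smaller table's estimated size."""
    if small_table_bytes > BROADCAST_HARD_LIMIT:
        return "sort_merge_join"       # too large to broadcast at all
    if explicit_hint or small_table_bytes <= AUTO_BROADCAST_THRESHOLD:
        return "broadcast_hash_join"   # ship the table to every executor
    return "sort_merge_join"           # default shuffle-based join

print(choose_join_strategy(5 * 1024 * 1024))                  # small dim table
print(choose_join_strategy(2 * 1024**3, explicit_hint=True))  # hinted 2 GB table
```

Note the asymmetry: the automatic threshold is conservative (10 MB), while an explicit hint lets you broadcast much larger tables as long as they still fit comfortably in executor memory.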
Partition Pruning and Data Skipping
Delta Lake Data Skipping
Delta Lake stores min/max statistics per file for the first 32 columns by default (configurable via the delta.dataSkippingNumIndexedCols table property). When a query has a WHERE clause, Spark skips files where the filter value falls outside the min/max range.
```sql
-- This query skips files whose order_date range doesn't include 2026-03-01
SELECT * FROM orders WHERE order_date = '2026-03-01';
```
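A minimal sketch of what the query above does under the hood: each file carries min/max stats, and a point filter only scans files whose range covers the value. The file names and date ranges here are made up for illustration.

```python
from datetime import date

# Sketch of Delta Lake data skipping: each data file records min/max
# column statistics, and a filter prunes any file whose range cannot
# contain the requested value. File names and stats are illustrative.

file_stats = [
    {"file": "part-000.parquet", "min": date(2026, 1, 1), "max": date(2026, 1, 31)},
    {"file": "part-001.parquet", "min": date(2026, 2, 1), "max": date(2026, 2, 28)},
    {"file": "part-002.parquet", "min": date(2026, 3, 1), "max": date(2026, 3, 31)},
]

def files_to_scan(stats, value):
    """Keep only files whose [min, max] range could contain the value."""
    return [s["file"] for s in stats if s["min"] <= value <= s["max"]]

print(files_to_scan(file_stats, date(2026, 3, 1)))  # only the March file is read
```

The payoff is proportional to how well related values are clustered into the same files, which is why the next subsection focuses on Z-ORDER and Liquid Clustering.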
Maximizing Data Skipping
- OPTIMIZE + Z-ORDER co-locates related values in the same files
- Liquid Clustering achieves the same with dynamic, incremental optimization
- Filter on clustering key columns for maximum skip benefit
Cluster Right-Sizing
| Workload Type | Recommended Configuration |
|---|---|
| Small ETL jobs | 2-4 workers, standard instance types |
| Large batch processing | 8-20+ workers with autoscaling |
| SQL analytics | Serverless SQL warehouse (auto-managed) |
| Streaming | Fixed-size cluster (autoscaling can cause instability) |
| ML training | GPU instances, memory-optimized nodes |
Cost Optimization Tips
- Use spot instances for worker nodes (often 60-90% cheaper, at the cost of possible interruption)
- Enable autoscaling for variable workloads
- Set auto-termination to shut down idle clusters (e.g., 30 minutes)
- Use job clusters instead of all-purpose clusters for automated jobs
- Monitor cluster utilization via the compute metrics page
- Schedule heavy jobs during off-peak hours for lower spot prices
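Several of the tips above come together in a single cluster specification. The sketch below builds one as a Python dict; the field names follow the Databricks Clusters API (AWS flavor), but the runtime version, node type, and worker counts are made-up examples; check the current API reference for your cloud before relying on them.

```python
import json

# Illustrative cluster spec combining the cost tips above: autoscaling,
# spot workers with on-demand fallback, and auto-termination.
# Field names follow the Databricks Clusters API; values are examples only.

cluster_spec = {
    "spark_version": "14.3.x-scala2.12",      # example runtime version
    "node_type_id": "i3.xlarge",              # example AWS node type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,            # shut down idle clusters
    "aws_attributes": {
        "first_on_demand": 1,                 # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK", # spot workers, fall back if unavailable
    },
}

print(json.dumps(cluster_spec, indent=2))
```

Keeping the driver on-demand (`first_on_demand: 1`) is a common compromise: worker loss is recoverable, but losing a spot driver kills the whole job.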
Caching
```python
# Cache a frequently accessed DataFrame
filtered_df = spark.table("large_table").filter("date >= '2026-01-01'")
filtered_df.cache()

# Force materialization of the cache
filtered_df.count()

# Now subsequent actions use the cached data
filtered_df.groupBy("category").count().show()
filtered_df.filter("amount > 100").count()

# Uncache when done
filtered_df.unpersist()
```
```sql
-- SQL caching
CACHE TABLE frequent_data AS
SELECT * FROM large_table WHERE date >= '2026-01-01';

-- Uncache
UNCACHE TABLE frequent_data;
```
When to Cache
- The same DataFrame is used multiple times in a notebook
- The transformation is expensive (complex joins, aggregations)
- The result fits in memory across the cluster
When NOT to Cache
- DataFrame is used only once
- Data changes frequently (cache becomes stale)
- The dataset is too large for cluster memory
On the Exam: Know that AQE is enabled by default and handles skew, small partitions, and join optimization. Understand broadcast joins for small-to-large table joins. Know that caching is useful for repeated DataFrame access.
Review Questions
What does Adaptive Query Execution (AQE) do when it detects a skewed partition during a shuffle operation?
When is it appropriate to use a broadcast join?
A data engineer caches a large DataFrame but notices the cluster is running out of memory. What should they do?
Which Spark optimization automatically coalesces small shuffle partitions into fewer, larger partitions to reduce task overhead?