4.5 Pipeline Performance and Cost Optimization

Key Takeaways

  • Adaptive Query Execution (AQE) in Spark dynamically optimizes query plans at runtime based on actual data statistics.
  • Broadcast joins automatically send small tables to all executors, avoiding expensive shuffle operations for joins with one small table.
  • Partition pruning and data skipping reduce I/O by reading only relevant files based on WHERE clause filters.
  • Right-sizing clusters (matching worker count and node types to workload) prevents both over-provisioning (waste) and under-provisioning (slowness).
  • Caching frequently accessed DataFrames with .cache() or CACHE TABLE avoids redundant recomputation in iterative workflows.
Last updated: March 2026


Quick Answer: Enable Adaptive Query Execution (AQE) for automatic runtime optimization. Use broadcast joins for small tables. Leverage partition pruning and data skipping. Right-size clusters with autoscaling. Cache frequently accessed DataFrames.

Adaptive Query Execution (AQE)

AQE dynamically optimizes query plans during execution based on actual runtime statistics:

Feature                    | What It Does
Coalescing partitions      | Merges small shuffle partitions to reduce overhead
Skew handling              | Splits skewed partitions to balance work across tasks
Broadcast join conversion  | Switches to broadcast join if a table is small enough
Dynamic filtering          | Prunes data based on join results
-- AQE is enabled by default in Databricks Runtime
-- Verify it is enabled
SET spark.sql.adaptive.enabled;  -- Should be true
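To build intuition for what partition coalescing does, here is a plain-Python sketch (not Spark internals, and `coalesce_partitions` and the sizes are hypothetical): adjacent small shuffle partitions are greedily merged until each group approaches a target size, so fewer tasks carry the same data.

```python
# Hypothetical illustration of AQE-style partition coalescing:
# merge adjacent small shuffle partitions into groups near target_mb.

def coalesce_partitions(sizes_mb, target_mb=64):
    """Greedily group adjacent partitions so each group is ~target_mb."""
    groups, current, current_size = [], [], 0
    for i, size in enumerate(sizes_mb):
        current.append(i)
        current_size += size
        if current_size >= target_mb:
            groups.append(current)
            current, current_size = [], 0
    if current:  # leftover small group
        groups.append(current)
    return groups

# Ten tiny 8 MB partitions collapse into two tasks instead of ten
print(coalesce_partitions([8] * 10, target_mb=64))
```

Fewer, larger tasks means less per-task scheduling and shuffle-read overhead, which is exactly the problem AQE's coalescing addresses.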

Broadcast Joins

When joining a large table with a small table, broadcast the small table to all executors:

from pyspark.sql.functions import broadcast

# Explicitly broadcast the small table
result = large_df.join(broadcast(small_df), "join_key")
-- SQL hint for broadcast join
SELECT /*+ BROADCAST(dim_products) */
    o.*, p.product_name
FROM orders o
JOIN dim_products p ON o.product_id = p.product_id;

When to Broadcast

  • The small table fits comfortably in memory (the default spark.sql.autoBroadcastJoinThreshold is 10 MB; it can be raised, but Spark caps broadcast tables at 8 GB)
  • One table is significantly smaller than the other (>10x difference)
  • AQE may automatically broadcast if it detects the table is small enough
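A broadcast hash join in miniature, as a plain-Python sketch (the tables and keys are made up): the small side is materialized as a hash map and "broadcast" to every worker, so each worker joins its slice of the large side with a local lookup and no shuffle.

```python
# Small dimension table "broadcast" as an in-memory hash map
small = {1: "widget", 2: "gadget"}          # product_id -> product_name

# Large fact table streamed row by row
large = [(101, 1), (102, 2), (103, 1)]      # (order_id, product_id)

# Inner join via local hash lookup -- no shuffle of the large side
joined = [(order_id, pid, small[pid]) for order_id, pid in large if pid in small]
print(joined)  # [(101, 1, 'widget'), (102, 2, 'gadget'), (103, 1, 'widget')]
```

This is why broadcasting only pays off when one side is small: every executor must hold the whole hash map in memory.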

Partition Pruning and Data Skipping

Delta Lake Data Skipping

Delta Lake stores min/max statistics per column (the first 32 columns by default) for each file. When a query has a WHERE clause, Spark skips files whose min/max range for the filtered column cannot contain the filter value.

-- This query skips files where order_date range doesn't include 2026-03-01
SELECT * FROM orders WHERE order_date = '2026-03-01';

Maximizing Data Skipping

  • OPTIMIZE + Z-ORDER co-locates related values in the same files
  • Liquid Clustering achieves the same with dynamic, incremental optimization
  • Filter on clustering key columns for maximum skip benefit
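The skipping logic itself is simple interval arithmetic. A plain-Python sketch with hypothetical per-file statistics (the file names and date ranges are made up, mimicking what Delta Lake records):

```python
from datetime import date

# Hypothetical file-level min/max stats for order_date
files = [
    {"path": "part-0", "min": date(2026, 1, 1), "max": date(2026, 1, 31)},
    {"path": "part-1", "min": date(2026, 2, 1), "max": date(2026, 2, 28)},
    {"path": "part-2", "min": date(2026, 3, 1), "max": date(2026, 3, 31)},
]

def files_to_read(files, value):
    """Keep only files whose [min, max] range can contain the filter value."""
    return [f["path"] for f in files if f["min"] <= value <= f["max"]]

# WHERE order_date = '2026-03-01' touches one file out of three
print(files_to_read(files, date(2026, 3, 1)))  # ['part-2']
```

Z-ORDER and Liquid Clustering make this effective by keeping each file's min/max ranges narrow and non-overlapping for the clustering columns.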

Cluster Right-Sizing

Workload Type          | Recommended Configuration
Small ETL jobs         | 2-4 workers, standard instance types
Large batch processing | 8-20+ workers with autoscaling
SQL analytics          | Serverless SQL warehouse (auto-managed)
Streaming              | Fixed-size cluster (autoscaling can cause instability)
ML training            | GPU instances, memory-optimized nodes

Cost Optimization Tips

  • Use spot instances for worker nodes (60-90% savings)
  • Enable autoscaling for variable workloads
  • Set auto-termination to shut down idle clusters (e.g., 30 minutes)
  • Use job clusters instead of all-purpose clusters for automated jobs
  • Monitor cluster utilization via the compute metrics page
  • Schedule heavy jobs during off-peak hours for lower spot prices
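The combined effect of these tips is easy to estimate. A back-of-the-envelope sketch with entirely hypothetical prices (real rates vary by cloud, region, and instance type):

```python
# Hypothetical cost comparison: always-on on-demand cluster vs.
# spot workers + auto-termination. All rates are assumptions.
workers = 10
on_demand_rate = 1.00   # $/worker-hour (assumed)
spot_discount = 0.70    # assume spot is ~70% cheaper
busy_hours = 8          # auto-termination shuts down the other 16 h/day

always_on_cost = workers * on_demand_rate * 24
optimized_cost = workers * on_demand_rate * (1 - spot_discount) * busy_hours
print(always_on_cost, optimized_cost)  # 240.0 24.0
```

Even with rough numbers, spot pricing and auto-termination compound: each multiplies the savings of the other.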

Caching

# Cache a frequently accessed DataFrame
filtered_df = spark.table("large_table").filter("date >= '2026-01-01'")
filtered_df.cache()

# Force materialization of the cache
filtered_df.count()

# Now subsequent actions use the cached data
filtered_df.groupBy("category").count().show()
filtered_df.filter("amount > 100").count()

# Uncache when done
filtered_df.unpersist()
-- SQL caching
CACHE TABLE frequent_data AS
SELECT * FROM large_table WHERE date >= '2026-01-01';

-- Uncache
UNCACHE TABLE frequent_data;

When to Cache

  • The same DataFrame is used multiple times in a notebook
  • The transformation is expensive (complex joins, aggregations)
  • The result fits in memory across the cluster

When NOT to Cache

  • DataFrame is used only once
  • Data changes frequently (cache becomes stale)
  • The dataset is too large for cluster memory
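The cache-or-not decision is a break-even calculation. A sketch with hypothetical timings (the numbers are illustrative, not measured):

```python
# Rough break-even reasoning for caching (all timings assumed)
recompute_s = 60     # seconds to recompute the DataFrame from source
cached_read_s = 2    # seconds to read it back from cache
uses = 4             # times the DataFrame is reused in the notebook

without_cache = uses * recompute_s                  # recompute every time
with_cache = recompute_s + uses * cached_read_s     # compute once, read 4x
print(without_cache, with_cache)  # 240 68
```

Caching wins only when the transformation is expensive and reused; with `uses = 1` the cached path costs more than simply recomputing, which is why single-use DataFrames should not be cached.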

On the Exam: Know that AQE is enabled by default and handles skew, small partitions, and join optimization. Understand broadcast joins for small-to-large table joins. Know that caching is useful for repeated DataFrame access.

Test Your Knowledge

  • What does Adaptive Query Execution (AQE) do when it detects a skewed partition during a shuffle operation?
  • When is it appropriate to use a broadcast join?
  • A data engineer caches a large DataFrame but notices the cluster is running out of memory. What should they do?
  • Which Spark optimization automatically sends small shuffle partitions to the same executor to reduce overhead?