4.5 Pipeline Performance and Cost Optimization
Key Takeaways
- Adaptive Query Execution (AQE) in Spark dynamically optimizes query plans at runtime based on actual data statistics.
- Broadcast joins automatically send small tables to all executors, avoiding expensive shuffle operations for joins with one small table.
- Partition pruning and data skipping reduce I/O by reading only relevant files based on WHERE clause filters.
- Right-sizing clusters (matching worker count and node types to workload) prevents both over-provisioning (waste) and under-provisioning (slowness).
- Caching frequently accessed DataFrames with .cache() or CACHE TABLE avoids redundant recomputation in iterative workflows.
Pipeline Performance and Cost Optimization
Quick Answer: Enable Adaptive Query Execution (AQE) for automatic runtime optimization. Use broadcast joins for small tables. Leverage partition pruning and data skipping. Right-size clusters with autoscaling. Cache frequently accessed DataFrames.
Adaptive Query Execution (AQE)
AQE dynamically optimizes query plans during execution based on actual runtime statistics:
| Feature | What It Does |
|---|---|
| Coalescing partitions | Merges small shuffle partitions to reduce overhead |
| Skew handling | Splits skewed partitions to balance work across tasks |
| Broadcast join conversion | Switches to broadcast join if a table is small enough |
| Dynamic filtering | Prunes data based on join results |
```sql
-- AQE is enabled by default in Databricks Runtime
-- Verify it is enabled
SET spark.sql.adaptive.enabled; -- Should return true
```
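To make the "coalescing partitions" row of the table above concrete, here is a rough pure-Python sketch of the idea: adjacent small shuffle partitions are greedily merged until each group reaches an advisory target size. The 64 MB target mirrors the default of `spark.sql.adaptive.advisoryPartitionSizeInBytes`, but this is an illustration, not Spark's actual algorithm.

```python
# Illustrative sketch of AQE partition coalescing (not Spark's real code):
# greedily merge adjacent small shuffle partitions until each merged
# group reaches a target size, reducing the number of downstream tasks.

TARGET_BYTES = 64 * 1024 * 1024  # 64 MB advisory partition size (Spark default)

def coalesce_partitions(sizes, target=TARGET_BYTES):
    """Group adjacent partition sizes so each merged group is ~target bytes."""
    groups, current, current_size = [], [], 0
    for size in sizes:
        current.append(size)
        current_size += size
        if current_size >= target:
            groups.append(current)
            current, current_size = [], 0
    if current:  # leftover small tail becomes its own group
        groups.append(current)
    return groups

mb = 1024 * 1024
sizes = [2 * mb, 3 * mb, 60 * mb, 5 * mb, 70 * mb, 1 * mb]
groups = coalesce_partitions(sizes)
print(len(sizes), "shuffle partitions ->", len(groups), "coalesced partitions")
```

Fewer, larger partitions mean fewer tasks and less per-task scheduling overhead, which is exactly what AQE's coalescing buys you on real workloads.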
Broadcast Joins
When joining a large table with a small table, broadcast the small table to all executors:
```python
from pyspark.sql.functions import broadcast

# Explicitly broadcast the small table
result = large_df.join(broadcast(small_df), "join_key")
```
```sql
-- SQL hint for broadcast join
SELECT /*+ BROADCAST(dim_products) */
  o.*, p.product_name
FROM orders o
JOIN dim_products p ON o.product_id = p.product_id;
```
When to Broadcast
- The small table fits in memory; Spark auto-broadcasts tables below spark.sql.autoBroadcastJoinThreshold (10 MB by default), and an explicit broadcast hint works up to Spark's 8 GB broadcast hard limit
- One table is significantly smaller than the other (>10x difference)
- AQE may automatically broadcast if it detects the table is small enough
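The decision logic behind these bullets can be sketched in plain Python. This is an illustration of the eligibility rules, not Spark's actual planner code; the function name is made up, but the 10 MB default threshold and 8 GB hard limit are real Spark values.

```python
# Sketch of Spark's broadcast-eligibility rules (illustrative only).
# 10 MB is the default spark.sql.autoBroadcastJoinThreshold;
# 8 GB is Spark's hard upper limit for any broadcast table.

AUTO_BROADCAST_THRESHOLD = 10 * 1024 * 1024    # 10 MB default
BROADCAST_HARD_LIMIT = 8 * 1024 * 1024 * 1024  # 8 GB hard limit

def choose_join_strategy(small_table_bytes, explicit_hint=False):
    """Pick a join strategy from the smaller table's estimated size."""
    if small_table_bytes > BROADCAST_HARD_LIMIT:
        return "sort_merge_join"       # too large to broadcast at all
    if explicit_hint or small_table_bytes <= AUTO_BROADCAST_THRESHOLD:
        return "broadcast_hash_join"   # ship the table to every executor
    return "sort_merge_join"           # default shuffle-based join

print(choose_join_strategy(5 * 1024 * 1024))                  # small dim table
print(choose_join_strategy(2 * 1024**3, explicit_hint=True))  # hinted 2 GB table
```

Note the asymmetry: the automatic threshold is conservative (10 MB), while an explicit hint lets you broadcast much larger tables as long as they still fit comfortably in executor memory.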
Partition Pruning and Data Skipping
Delta Lake Data Skipping
Delta Lake stores min/max statistics per file for the first 32 columns by default (configurable via the delta.dataSkippingNumIndexedCols table property). When a query has a WHERE clause, Spark skips files where the filter value falls outside the min/max range.
```sql
-- This query skips files whose order_date range doesn't include 2026-03-01
SELECT * FROM orders WHERE order_date = '2026-03-01';
```
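A minimal sketch of what the query above does under the hood: each file carries min/max stats, and a point filter only scans files whose range covers the value. The file names and date ranges here are made up for illustration.

```python
from datetime import date

# Sketch of Delta Lake data skipping: each data file records min/max
# column statistics, and a filter prunes any file whose range cannot
# contain the requested value. File names and stats are illustrative.

file_stats = [
    {"file": "part-000.parquet", "min": date(2026, 1, 1), "max": date(2026, 1, 31)},
    {"file": "part-001.parquet", "min": date(2026, 2, 1), "max": date(2026, 2, 28)},
    {"file": "part-002.parquet", "min": date(2026, 3, 1), "max": date(2026, 3, 31)},
]

def files_to_scan(stats, value):
    """Keep only files whose [min, max] range could contain the value."""
    return [s["file"] for s in stats if s["min"] <= value <= s["max"]]

print(files_to_scan(file_stats, date(2026, 3, 1)))  # only the March file is read
```

The payoff is proportional to how well related values are clustered into the same files, which is why the next subsection focuses on Z-ORDER and Liquid Clustering.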
Maximizing Data Skipping
- OPTIMIZE + Z-ORDER co-locates related values in the same files
- Liquid Clustering achieves the same with dynamic, incremental optimization
- Filter on clustering key columns for maximum skip benefit
Cluster Right-Sizing
| Workload Type | Recommended Configuration |
|---|---|
| Small ETL jobs | 2-4 workers, standard instance types |
| Large batch processing | 8-20+ workers with autoscaling |
| SQL analytics | Serverless SQL warehouse (auto-managed) |
| Streaming | Fixed-size cluster (autoscaling can cause instability) |
| ML training | GPU instances, memory-optimized nodes |
Cost Optimization Tips
- Use spot instances for worker nodes (often 60-90% cheaper, at the cost of possible interruption)
- Enable autoscaling for variable workloads
- Set auto-termination to shut down idle clusters (e.g., 30 minutes)
- Use job clusters instead of all-purpose clusters for automated jobs
- Monitor cluster utilization via the compute metrics page
- Schedule heavy jobs during off-peak hours for lower spot prices
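Several of the tips above come together in a single cluster specification. The sketch below builds one as a Python dict; the field names follow the Databricks Clusters API (AWS flavor), but the runtime version, node type, and worker counts are made-up examples; check the current API reference for your cloud before relying on them.

```python
import json

# Illustrative cluster spec combining the cost tips above: autoscaling,
# spot workers with on-demand fallback, and auto-termination.
# Field names follow the Databricks Clusters API; values are examples only.

cluster_spec = {
    "spark_version": "14.3.x-scala2.12",      # example runtime version
    "node_type_id": "i3.xlarge",              # example AWS node type
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,            # shut down idle clusters
    "aws_attributes": {
        "first_on_demand": 1,                 # keep the driver on-demand
        "availability": "SPOT_WITH_FALLBACK", # spot workers, fall back if unavailable
    },
}

print(json.dumps(cluster_spec, indent=2))
```

Keeping the driver on-demand (`first_on_demand: 1`) is a common compromise: worker loss is recoverable, but losing a spot driver kills the whole job.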
Caching
```python
# Cache a frequently accessed DataFrame
filtered_df = spark.table("large_table").filter("date >= '2026-01-01'")
filtered_df.cache()

# Force materialization of the cache
filtered_df.count()

# Now subsequent actions use the cached data
filtered_df.groupBy("category").count().show()
filtered_df.filter("amount > 100").count()

# Uncache when done
filtered_df.unpersist()
```
```sql
-- SQL caching
CACHE TABLE frequent_data AS
SELECT * FROM large_table WHERE date >= '2026-01-01';

-- Uncache
UNCACHE TABLE frequent_data;
```
When to Cache
- The same DataFrame is used multiple times in a notebook
- The transformation is expensive (complex joins, aggregations)
- The result fits in memory across the cluster
When NOT to Cache
- DataFrame is used only once
- Data changes frequently (cache becomes stale)
- The dataset is too large for cluster memory
On the Exam: Know that AQE is enabled by default and handles skew, small partitions, and join optimization. Understand broadcast joins for small-to-large table joins. Know that caching is useful for repeated DataFrame access.
Review Questions
What does Adaptive Query Execution (AQE) do when it detects a skewed partition during a shuffle operation?
When is it appropriate to use a broadcast join?
A data engineer caches a large DataFrame but notices the cluster is running out of memory. What should they do?
Which Spark optimization automatically coalesces small shuffle partitions into fewer, larger partitions to reduce task overhead?