Which statement about Photon is correct?

Photon is a vectorized C++ engine that accelerates SQL/DataFrame work and often improves price/performance despite a higher DBU rate. It speeds up SQL and DataFrame operations transparently; even though Photon DBUs cost more per hour, the shorter runtime usually yields better overall price/performance.

An ingestion workload has no low-latency requirement. Which choice minimizes compute cost while still processing new files?

Schedule Auto Loader as a batch job using Trigger.AvailableNow so it processes new files and then stops. Each run wakes, processes all newly arrived files, and shuts down, so you pay only while data is being processed instead of keeping a stream and cluster running continuously.

Why is standard autoscaling a poor fit for Structured Streaming jobs?

It struggles to scale down because the streaming query keeps tasks busy, leaving clusters oversized. Because a continuous query keeps tasks occupied, the cluster stays large and wastes money. Lakeflow Declarative Pipelines enhanced autoscaling is designed to downscale aggressively during low load.

When using spot/preemptible instances for a batch job, what is the recommended configuration?

Keep the driver on-demand and use spot for workers, relying on Spark to recompute lost work. Spot workers can be reclaimed, but Spark recomputes the lost work, so spot is ideal for fault-tolerant batch. The driver must stay on-demand, because losing the driver (unlike losing a worker) kills the entire job.

Pipeline Performance and Cost Optimization | Free Guide 2026

Key Takeaways

Photon is a vectorized C++ query engine that speeds up SQL and DataFrame work; enable it on production clusters and SQL warehouses for better price/performance.
Right-size clusters and use autoscaling so you pay for needed compute; spot/preemptible instances cut cost for fault-tolerant batch but can be reclaimed.
Standard autoscaling scales down poorly for streaming; Lakeflow Declarative Pipelines enhanced autoscaling is purpose-built to downscale aggressively during low load.
For non-low-latency ingestion, schedule Auto Loader as batch with Trigger.AvailableNow (or a file-arrival trigger) instead of running it continuously to slash idle compute cost.
Job clusters at the Jobs Compute rate, Delta optimizations (file compaction, liquid clustering, data skipping), and Photon together reduce both runtime and DBU spend.

Photon

Photon is Databricks' vectorized query engine, written in C++ and built to exploit modern CPUs (SIMD vectorization). It transparently accelerates SQL and DataFrame operations (joins, aggregations, writes) with no code change: you enable Photon on the cluster, and SQL warehouses and serverless compute run it by default. Because Photon runs the same query in less time, it usually improves price/performance even though Photon DBUs cost more per hour: for analytic workloads, the shorter runtime more than offsets the higher rate.

Photon accelerates the work it supports (most SQL/DataFrame operations and Delta writes) and falls back to standard Spark for the rest, so enabling it is generally safe. For production pipelines doing heavy joins, aggregations, and MERGEs, Photon is a default recommendation.
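
The price/performance argument comes down to simple break-even arithmetic, sketched below. The DBU rates, price per DBU, and runtimes are made-up illustrative numbers, not Databricks pricing, and `run_cost` and `photon_pays_off` are hypothetical helper names.

```python
def run_cost(dbu_per_hour: float, price_per_dbu: float, runtime_hours: float) -> float:
    """Cost of one run: DBUs consumed per hour x price per DBU x hours."""
    return dbu_per_hour * price_per_dbu * runtime_hours


def photon_pays_off(rate_multiplier: float, speedup: float) -> bool:
    """Photon is cheaper per run when its speedup exceeds its rate multiplier."""
    return speedup > rate_multiplier


# Hypothetical numbers for illustration only (not real pricing).
standard = run_cost(dbu_per_hour=10.0, price_per_dbu=0.15, runtime_hours=2.0)
# Assume Photon doubles the DBU rate but finishes the same job 3x faster.
photon = run_cost(dbu_per_hour=20.0, price_per_dbu=0.15, runtime_hours=2.0 / 3.0)

print(f"standard: ${standard:.2f}, photon: ${photon:.2f}")
```

Under these assumed numbers the Photon run is cheaper because a 3x speedup exceeds the 2x rate; a workload that Photon barely accelerates would not clear that bar.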

Cluster Sizing, Autoscaling, and Spot

Right-sizing means matching worker count and instance type to the workload. Levers:

Lever	Effect	Trade-off
Autoscaling (min/max workers)	Adds/removes workers with load	Scale-up lag on bursty work
Spot / preemptible instances	Big discount on worker VMs	Cloud can reclaim them mid-run
Job clusters	Jobs Compute DBU rate (cheaper than all-purpose)	Startup time per run
Instance type	Memory- vs compute-optimized fit	Wrong type wastes spend

Spot instances are ideal for fault-tolerant batch because Spark recomputes work lost to a reclaimed spot worker; keep the driver on-demand, since losing the driver would kill the whole job. Set the autoscaling maximum to bound cost and keep the minimum low so idle clusters shrink. Prefer job clusters for scheduled work to get the lower Jobs Compute rate.
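
As a rough sketch, the levers above could be combined in a job-cluster spec like the one below, written as a Python dict in the shape the Databricks Clusters API uses on AWS. The runtime version, node type, and worker bounds are assumptions chosen for illustration, `build_job_cluster_spec` is a hypothetical helper, and Azure and GCP use different attribute blocks.

```python
def build_job_cluster_spec(min_workers: int, max_workers: int) -> dict:
    """Return a job-cluster spec: autoscaling, spot workers, on-demand driver."""
    if not 0 < min_workers <= max_workers:
        raise ValueError("need 0 < min_workers <= max_workers")
    return {
        "spark_version": "15.4.x-scala2.12",  # assumed runtime version
        "node_type_id": "i3.xlarge",          # assumed instance type
        "runtime_engine": "PHOTON",
        # max_workers bounds cost; a low min_workers lets idle clusters shrink.
        "autoscale": {"min_workers": min_workers, "max_workers": max_workers},
        "aws_attributes": {
            # first_on_demand=1 places the first node (the driver) on-demand.
            "first_on_demand": 1,
            # Remaining nodes use spot and fall back to on-demand if none is available.
            "availability": "SPOT_WITH_FALLBACK",
        },
    }


spec = build_job_cluster_spec(min_workers=1, max_workers=8)
print(spec["aws_attributes"])
```

Attaching a spec like this to a job as a new job cluster, rather than pointing the job at an all-purpose cluster, is what earns the Jobs Compute rate.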

Streaming Autoscaling and Batch Triggering

Streaming changes the autoscaling story. Standard autoscaling does not work well with Structured Streaming — it struggles to scale down because a streaming query keeps tasks busy, so clusters stay large and waste money. Lakeflow Declarative Pipelines enhanced autoscaling is purpose-built for streaming: it downscales aggressively during low-volume or idle periods while preserving throughput under load, especially on Photon.
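
For orientation, a pipeline settings fragment that requests enhanced autoscaling might look like the sketch below, expressed as a Python dict. The pipeline name and worker bounds are hypothetical, and the field names follow the pipeline settings JSON as commonly documented; check them against the current Databricks reference before relying on them.

```python
import json

pipeline_settings = {
    "name": "example_ingest_pipeline",  # hypothetical pipeline name
    "photon": True,
    "clusters": [
        {
            "label": "default",
            # mode "ENHANCED" asks for enhanced rather than legacy autoscaling.
            "autoscale": {"min_workers": 1, "max_workers": 5, "mode": "ENHANCED"},
        }
    ],
}

print(json.dumps(pipeline_settings, indent=2))
```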

The biggest cost lever for ingestion is to stop running streams continuously when nothing requires it. If the workload has no sub-minute latency requirement, schedule Auto Loader as a batch job with Trigger.AvailableNow: the stream wakes, processes all newly arrived files, then stops, so you pay only while data is processed. Pairing this with a file-arrival trigger on the job starts a run soon after new files land, which keeps latency low without an always-on cluster.
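
A minimal PySpark sketch of this pattern follows. It assumes a Databricks runtime (the `cloudFiles` source is Auto Loader and is not available in open-source Spark), JSON input files, and caller-supplied paths and table name; `ingest_new_files` is a hypothetical function name.

```python
def ingest_new_files(spark, source_path: str, checkpoint_path: str, target_table: str) -> None:
    """Process all files that arrived since the last run, then stop."""
    stream = (
        spark.readStream.format("cloudFiles")                  # Auto Loader source
        .option("cloudFiles.format", "json")                   # assumed file format
        .option("cloudFiles.schemaLocation", checkpoint_path)  # where the inferred schema is stored
        .load(source_path)
    )
    query = (
        stream.writeStream
        .option("checkpointLocation", checkpoint_path)  # remembers which files were ingested
        .trigger(availableNow=True)                     # drain the backlog, then stop
        .toTable(target_table)
    )
    query.awaitTermination()  # returns once the backlog has been processed
```

Scheduled as a job task, each run picks up where the checkpoint left off, so no file is processed twice and the cluster can terminate between runs.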

Delta-Layer Optimizations

Faster queries are also cheaper queries, so storage-layer tuning is cost tuning:

File compaction — OPTIMIZE (and Auto Optimize / Optimized Writes) merges many small files into right-sized ones, fixing the small-files problem that slows reads.
Liquid clustering — CLUSTER BY adapts data layout to query patterns; it has largely replaced manual partitioning + Z-ORDER for new tables.
Data skipping — Delta keeps min/max stats per file so queries prune files they do not need; clustering on filter columns maximizes skipping.
VACUUM — removes stale files past the retention window to control storage cost (default 7-day retention protects time travel).
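
The maintenance commands in this list can be scripted. The sketch below builds the SQL statements in Python and runs them through an existing SparkSession; the table and column names in the usage line are hypothetical, and `maintenance_statements` and `run_maintenance` are illustrative helpers rather than a Databricks API.

```python
def maintenance_statements(table: str, cluster_cols: list[str], retain_hours: int = 168) -> list[str]:
    """Build SQL for liquid clustering, compaction, and cleanup of one Delta table."""
    if not cluster_cols:
        raise ValueError("provide at least one clustering column")
    cols = ", ".join(cluster_cols)
    return [
        f"ALTER TABLE {table} CLUSTER BY ({cols})",     # set liquid clustering keys
        f"OPTIMIZE {table}",                            # compact small files, apply clustering
        f"VACUUM {table} RETAIN {retain_hours} HOURS",  # 168 hours = the default 7 days
    ]


def run_maintenance(spark, table: str, cluster_cols: list[str]) -> None:
    """Execute each statement through an existing SparkSession."""
    for statement in maintenance_statements(table, cluster_cols):
        spark.sql(statement)


# Hypothetical table and columns, shown without a Spark session.
for statement in maintenance_statements("sales.orders", ["order_date", "customer_id"]):
    print(statement)
```

Choosing clustering columns that match the most common filter predicates is what lets the per-file min/max statistics prune the most files.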

Combined, Photon + job clusters + right-sized autoscaling + Trigger.AvailableNow + Delta optimization cut both wall-clock runtime and DBU spend, which is exactly what the exam means by cost optimization.

Serverless Compute and DBU Economics

Serverless compute removes cluster startup time and management: Databricks provisions capacity instantly and bills only for what runs. For short or bursty jobs the eliminated startup latency and per-second billing often beat a classic job cluster, while for long steady jobs a right-sized classic job cluster can still be cheaper. The exam frames cost as a deliberate choice among serverless, job clusters, and all-purpose clusters — never default to the always-on all-purpose cluster for scheduled work, since it bills at the highest rate even while idle.

A quick mental cost model:

Cost ≈ DBU rate × cluster size × runtime.
Photon shrinks runtime; right-sizing/autoscaling shrinks cluster size; job/serverless compute shrinks the DBU rate versus all-purpose; Trigger.AvailableNow shrinks runtime to near zero when idle.
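
The mental model translates directly into a few lines of Python. In this sketch the rates, node counts, and hours are invented for illustration (they are not Databricks prices), `ComputeOption` is a hypothetical type, and the rate is treated as a flat cost per node-hour to keep the arithmetic simple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeOption:
    name: str
    dbu_rate: float       # hypothetical cost per node-hour
    nodes: int            # cluster size
    hours_per_day: float  # runtime

    def daily_cost(self) -> float:
        # Cost ~ DBU rate x cluster size x runtime
        return self.dbu_rate * self.nodes * self.hours_per_day


options = [
    ComputeOption("all-purpose, always on", dbu_rate=0.5, nodes=8, hours_per_day=24.0),
    ComputeOption("job cluster, scheduled batch", dbu_rate=0.25, nodes=8, hours_per_day=4.0),
    ComputeOption("job cluster, right-sized", dbu_rate=0.25, nodes=4, hours_per_day=4.0),
]

for option in sorted(options, key=ComputeOption.daily_cost):
    print(f"{option.name}: {option.daily_cost():.2f}")

cheapest = min(options, key=ComputeOption.daily_cost)
```

Each row changes one factor at a time, which makes it easy to see which lever is responsible for which share of the saving.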

Tuning the Spark Workload

Runtime also depends on how Spark executes the job. Watch for data skew (one partition far larger than the rest stalls a stage) and the small-files problem (thousands of tiny files create per-file overhead). Adaptive Query Execution (on by default) re-plans joins and coalesces shuffle partitions at runtime. Caching a reused DataFrame avoids recomputation. Each of these reduces wall-clock time, and on Databricks less time is directly less cost — performance tuning and cost optimization are the same activity.
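
One simple way to spot skew is to compare the largest partition with the typical one. The sketch below does this on a plain list of per-partition row counts; the counts are made up, `skew_ratio` is a hypothetical helper, and the configuration keys are the standard Spark settings for Adaptive Query Execution.

```python
from statistics import median


def skew_ratio(partition_rows: list[int]) -> float:
    """Largest partition size divided by the median partition size."""
    if not partition_rows:
        raise ValueError("no partitions")
    mid = median(partition_rows)
    if mid == 0:
        return float("inf") if max(partition_rows) > 0 else 1.0
    return max(partition_rows) / mid


# AQE settings that handle skewed joins and tiny shuffle partitions.
AQE_CONF = {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
}

rows = [1_000, 1_100, 950, 1_050, 40_000]  # made-up per-partition row counts
print(f"skew ratio: {skew_ratio(rows):.1f}")
```

In PySpark, per-partition counts can be obtained by grouping on `spark_partition_id()` from `pyspark.sql.functions`; a ratio far above 1 suggests one task will dominate the stage.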

Exam pointers

Trigger.AvailableNow turns continuous ingestion into cheap batch when low latency is not required. Delta OPTIMIZE, liquid clustering, and data skipping speed reads, which on Databricks directly lowers cost. Treat cost as the product of DBU rate, cluster size, and runtime, then pull each lever deliberately rather than over-provisioning a large always-on all-purpose cluster, which is the single most common source of avoidable spend.

Optimization quick recap

Photon (vectorized C++ engine) accelerates SQL and DataFrame workloads; it is always on for serverless and SQL warehouses.
Right-size clusters and use autoscaling; for streaming, prefer Lakeflow enhanced autoscaling over standard.
Reduce cost with Trigger.AvailableNow batch ingestion, spot workers (on-demand driver), and liquid clustering + OPTIMIZE for data skipping.

Databricks Certified Data Engineer Associate

4.5 Pipeline Performance and Cost Optimization

1. Introduction
2. Domain 1: Databricks Intelligence Platform (10%)
3. Domain 2: Development and Ingestion (30%)
4. Domain 3: Data Processing & Transformations (31%)
5. Domain 4: Productionizing Data Pipelines (18%)
6. Domain 5: Data Governance & Quality (11%)
