1.6 Databricks Runtime and Compute Configuration
Key Takeaways
- Databricks Runtime (DBR) includes Apache Spark, Delta Lake, and pre-installed libraries optimized for the Databricks platform.
- Runtime versions follow a Long-Term Support (LTS) model — LTS versions receive security patches for an extended period.
- Cluster autoscaling automatically adds or removes worker nodes based on workload demand, optimizing cost and performance.
- Spot instances (preemptible VMs) offer significant cost savings (60-90%) but may be interrupted by the cloud provider.
- Init scripts run custom code during cluster startup, useful for installing additional libraries or configuring environment settings.
Quick Answer: Databricks Runtime (DBR) bundles Spark, Delta Lake, and optimized libraries. Use LTS versions for production stability. Configure autoscaling for dynamic workloads, spot instances for cost savings, and init scripts for custom setup.
Databricks Runtime Versions
| Runtime Type | Includes | Best For |
|---|---|---|
| Standard Runtime | Spark, Delta Lake, Python, SQL, Scala, R | General data engineering |
| ML Runtime | Standard + ML libraries (TensorFlow, PyTorch, scikit-learn) | Machine learning workloads |
| Photon Runtime | Standard + Photon C++ engine | SQL-heavy and ETL workloads |
| GPU Runtime | ML + GPU drivers and libraries | Deep learning, GPU compute |
LTS (Long-Term Support)
- LTS versions (e.g., 15.4 LTS) receive security updates and bug fixes for 2+ years
- Recommended for production workloads where stability is critical
- Non-LTS versions have shorter support periods
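In the Clusters API, the runtime is selected with the `spark_version` field of the cluster spec. A minimal sketch (the cluster name, node type, and exact version string are illustrative; version strings vary by release and cloud):

```json
{
  "cluster_name": "prod-etl",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "num_workers": 4
}
```

ML and GPU runtimes typically use variant version strings such as `15.4.x-cpu-ml-scala2.12` and `15.4.x-gpu-ml-scala2.12`.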
Cluster Autoscaling
Cluster configuration:

```
Min workers: 2
Max workers: 10
Autoscaling: enabled
```
How Autoscaling Works
- Cluster starts with the minimum number of workers
- As tasks queue up, workers are added up to the maximum
- When tasks complete and resources are idle, workers are removed
- Optimizes cost by matching resources to actual demand
Autoscaling Considerations
- Scale-up time: Adding nodes takes 1-5 minutes (cloud VM provisioning)
- Scale-down delay: Workers are removed after a configurable idle period
- Minimum workers: Set to handle base workload without scaling
- Maximum workers: Set to cap costs during peak processing
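The min/max settings above map onto the `autoscale` block of the Clusters API cluster spec. A sketch (cluster name, node type, and version string are illustrative):

```json
{
  "cluster_name": "autoscaling-etl",
  "spark_version": "15.4.x-scala2.12",
  "node_type_id": "i3.xlarge",
  "autoscale": {
    "min_workers": 2,
    "max_workers": 10
  }
}
```

When `autoscale` is set, the fixed `num_workers` field is omitted; the cluster manager picks the worker count within the configured range.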
Spot Instances
| Instance Type | Cost | Interruption Risk | Best For |
|---|---|---|---|
| On-demand | Full price | None | Production, time-sensitive jobs |
| Spot | 60-90% discount | May be interrupted by cloud provider | Fault-tolerant batch processing |
Best Practice: Mixed Instance Pool
- Driver node: Always on-demand (interruption would fail the entire job)
- Worker nodes: Use spot instances with on-demand fallback
- Databricks automatically handles spot interruptions and reassigns tasks
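On AWS, this mixed pattern can be expressed through the cluster spec's `aws_attributes` block: `first_on_demand` keeps the first N nodes (including the driver) on-demand, and `SPOT_WITH_FALLBACK` falls back to on-demand capacity when spot capacity is unavailable. A sketch (values illustrative; Azure and GCP expose analogous attributes):

```json
{
  "aws_attributes": {
    "first_on_demand": 1,
    "availability": "SPOT_WITH_FALLBACK",
    "spot_bid_price_percent": 100
  }
}
```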
Init Scripts
Init scripts run during cluster startup, before Spark starts on each node:

```bash
#!/bin/bash
# Example: install a custom Python library on every node
pip install custom-library==1.2.3

# Example: set an environment variable
# (avoid hardcoding secrets in init scripts; prefer Databricks secret scopes)
export MY_API_KEY="secret-value"
```
- Stored in DBFS, Unity Catalog volumes, or workspace files
- Global init scripts: Run on every cluster in the workspace
- Cluster-scoped init scripts: Run only on a specific cluster
- Use for: custom library installation, driver configuration, environment setup
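A cluster-scoped init script is attached through the `init_scripts` field of the cluster spec. A sketch assuming the script is stored in a Unity Catalog volume (the path is hypothetical):

```json
{
  "init_scripts": [
    {
      "volumes": {
        "destination": "/Volumes/main/default/scripts/install-libs.sh"
      }
    }
  ]
}
```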
On the Exam: Know that LTS runtimes are recommended for production, autoscaling optimizes cost by matching workers to demand, and spot instances offer significant savings but risk interruption. The driver node should always be on-demand.
Why should the driver node of a Databricks cluster always use on-demand instances rather than spot instances?
Which Databricks Runtime type should be used for a production ETL pipeline that requires long-term stability and support?