5.2 Databricks Model Serving Endpoints

Key Takeaways

  • Model Serving delivers managed, low-latency HTTPS endpoints on serverless compute for real-time inference, the correct choice over DLT, DBFS, or Catalog Explorer.
  • Custom model serving endpoints run on CPU or GPU, support scale-to-zero, and are the only endpoint type you can manually stop and restart.
  • Scale-to-zero cuts idle cost but adds cold-start latency, which can time out the first request after an idle period.
  • One endpoint can host multiple served entities of the same API format with traffic splitting, enabling A/B tests, canary rollout, and instant rollback by shifting traffic back.
  • Route-optimized serving endpoints accept OAuth tokens only, not personal access tokens (PATs).
Last updated: July 2026

What Model Serving Provides

Databricks Model Serving exposes your registered models as managed, low-latency HTTPS endpoints for online (real-time) inference, running on serverless compute so you never manage clusters. When a scenario needs an interactive web or mobile app to send a prompt and receive a completion over REST, Model Serving is the answer, not Delta Live Tables, DBFS mounts, or Catalog Explorer, which solve storage and ETL problems rather than serving.

Three endpoint families appear on the exam and you must distinguish them:

Endpoint typeBacksKey traits
Custom model servingYour registered pyfunc/langchain models, agents, fine-tuned modelsCPU or GPU compute; scale-to-zero; can be stopped/started
Foundation Model APIDatabricks-hosted base modelsPay-per-token or provisioned throughput (see 5.3)
External modelThird-party providers (OpenAI, Anthropic, ...)Databricks-governed proxy (see 5.3)

Deploying a Registered Model

To deploy, you create a serving endpoint that references a served entity, a specific Unity Catalog model version or alias (for example catalog.schema.model@Champion). Databricks provisions serverless compute, installs the logged dependencies, and validates payloads against the model signature from 5.1. For agents, databricks.agents.deploy() performs this deployment and adds tracing, the Review App, and monitoring in one step.

CPU vs GPU

Custom model endpoints run on CPU or GPU compute. Small wrappers, routing logic, and endpoints that mostly delegate to a Foundation Model API run fine on CPU. GPU compute is for hosting model weights yourself, such as a fine-tuned open LLM or a self-hosted embedding model, where inference is compute-heavy. Choosing GPU when the endpoint only orchestrates external calls wastes money; choosing CPU to host a multi-billion-parameter model will not meet latency targets.

Scale-to-Zero and Autoscaling

Serving endpoints autoscale compute up and down with traffic, so spiky load does not queue indefinitely. Scale-to-zero goes further: an idle custom endpoint releases all compute to save cost, then spins back up on the next request. The tested tradeoff is cold-start latency, meaning the first request after an idle period can be slow or even time out. This is exactly why a production Vector Search flow whose managed-embedding endpoint is scaled to zero shows intermittent first-request timeouts after long idle periods; the fix is to keep the endpoint warm. Scale-to-zero is a property of custom endpoints, not of pay-per-token or external endpoints.

Only custom model serving endpoints can be explicitly stopped and restarted to free resources when idle (when eligible and not mid-update). Pay-per-token, external, and route-optimized external endpoints do not offer that manual stop and start control.

Querying an Endpoint

You can call an endpoint several ways, all hitting the same OpenAI-compatible surface:

  • REST: POST /serving-endpoints/{name}/invocations with a JSON payload.
  • SDK / Python: the Databricks SDK or the MLflow deployments client.
  • LangChain: the ChatDatabricks class calls a Databricks-hosted chat model whether it is a Foundation Model API or a custom serving endpoint.
  • SQL: ai_query() invokes a serving endpoint row-wise, for example inside a Lakeflow materialized view running batch inference over 50 million rows.
  • AI Playground for interactive testing.

Authentication Note

Standard endpoints accept personal access tokens (PATs) or OAuth. Route-optimized endpoints, a lower-latency and higher-throughput option, support OAuth tokens only, not PATs. A route-optimized endpoint that rejects an otherwise valid PAT is behaving as designed; switch the caller to OAuth.

A/B Testing and Traffic Splitting

Databricks serving supports multiple served entities behind one endpoint as long as they expose the same API format, such as two chat models. You assign a traffic percentage to each served entity. This enables:

  • A/B testing: split traffic 50/50 to compare two chat models under one client integration.
  • Canary or gradual rollout: send 5% to a new version, watch live quality, latency, and cost, then increase.
  • Shadow deployment to evaluate a candidate on real traffic without returning its responses to users.
  • Fast rollback: keep the previous version deployed and shift traffic back the moment a new version regresses (for example a hallucination spike or a doubled p95 latency). Rollback by traffic shift is the exam's preferred incident response, restore service first and diagnose afterward.

A served entity set to 0% traffic remains deployed as a fallback-only target. (For external models, automatic failover on 429 or 5xx is an AI Gateway feature ordered by the served-entity list, covered in 5.3.)

Cost and Governance Controls

  • Serverless budget policies attach custom tags to a serving endpoint for granular billing attribution; they do not change inference behavior or autoscaling.
  • Endpoints are consumed by a service principal, and access to Unity Catalog tables the endpoint reads is governed by least-privilege grants on that principal.
  • Removing a serving-endpoint resource from a Databricks App revokes the app service principal's access to the endpoint; it does not delete the endpoint itself.

Common Traps

  • Picking DLT, DBFS, or Catalog Explorer instead of Model Serving for a real-time HTTPS need.
  • Forgetting that scale-to-zero causes cold starts.
  • Using a PAT against a route-optimized endpoint.
  • Assuming traffic splitting needs different clients; it is one endpoint, one API, and multiple served entities.
Test Your Knowledge

A customer-facing web app needs a low-latency HTTPS endpoint to send prompts and receive completions from a production model in real time. Which Databricks capability should it call?

A
B
C
D
Test Your Knowledge

You want one production endpoint to compare two chat models under the same API and gradually shift traffic between them. Which deployment pattern fits best?

A
B
C
D
Test Your Knowledge

What is the main operational tradeoff of enabling scale-to-zero on a custom model serving endpoint?

A
B
C
D