A customer-facing web app needs a low-latency HTTPS endpoint to send prompts and receive completions from a production model in real time. Which Databricks capability should it call?

Model Serving. Model Serving exposes managed HTTPS endpoints for online inference, which is what an interactive application needs. Storage and ETL features such as DLT, DBFS, and Catalog Explorer handle other layers of the stack but are not the real-time serving interface.

You want one production endpoint to compare two chat models under the same API and gradually shift traffic between them. Which deployment pattern fits best?

Serve both models as separate served entities behind one endpoint and use traffic splitting. Databricks serving supports multiple served entities behind a single endpoint when they share the same API format, such as chat. Traffic splitting then lets you A/B compare or gradually shift traffic without changing the client integration point.

What is the main operational tradeoff of enabling scale-to-zero on a custom model serving endpoint?

The first request after an idle period can incur cold-start latency. Scale-to-zero releases compute when the endpoint is idle to cut cost, so the next request after idle can be slow while compute restarts. Permissions, inference logging, and the model signature are unaffected.

Databricks Model Serving Endpoints — Free Study Guide 2026

What Model Serving Provides

Databricks Model Serving exposes your registered models as managed, low-latency HTTPS endpoints for online (real-time) inference, running on serverless compute so you never manage clusters. When a scenario needs an interactive web or mobile app to send a prompt and receive a completion over REST, Model Serving is the answer, not Delta Live Tables, DBFS mounts, or Catalog Explorer, which solve storage and ETL problems rather than serving.

Three endpoint families appear on the exam and you must distinguish them:

Endpoint type	Backs	Key traits
Custom model serving	Your registered pyfunc/langchain models, agents, fine-tuned models	CPU or GPU compute; scale-to-zero; can be stopped/started
Foundation Model API	Databricks-hosted base models	Pay-per-token or provisioned throughput (see 5.3)
External model	Third-party providers (OpenAI, Anthropic, ...)	Databricks-governed proxy (see 5.3)

Deploying a Registered Model

To deploy, you create a serving endpoint that references a served entity, a specific Unity Catalog model version or alias (for example catalog.schema.model@Champion). Databricks provisions serverless compute, installs the logged dependencies, and validates payloads against the model signature from 5.1. For agents, databricks.agents.deploy() performs this deployment and adds tracing, the Review App, and monitoring in one step.

CPU vs GPU

Custom model endpoints run on CPU or GPU compute. Small wrappers, routing logic, and endpoints that mostly delegate to a Foundation Model API run fine on CPU. GPU compute is for hosting model weights yourself, such as a fine-tuned open LLM or a self-hosted embedding model, where inference is compute-heavy. Choosing GPU when the endpoint only orchestrates external calls wastes money; choosing CPU to host a multi-billion-parameter model will not meet latency targets.

Scale-to-Zero and Autoscaling

Serving endpoints autoscale compute up and down with traffic, so spiky load does not queue indefinitely. Scale-to-zero goes further: an idle custom endpoint releases all compute to save cost, then spins back up on the next request. The tested tradeoff is cold-start latency, meaning the first request after an idle period can be slow or even time out. This is exactly why a production Vector Search flow whose managed-embedding endpoint is scaled to zero shows intermittent first-request timeouts after long idle periods; the fix is to keep the endpoint warm. Scale-to-zero is a property of custom endpoints, not of pay-per-token or external endpoints.

Only custom model serving endpoints can be explicitly stopped and restarted to free resources when idle (when eligible and not mid-update). Pay-per-token, external, and route-optimized external endpoints do not offer that manual stop and start control.

Querying an Endpoint

You can call an endpoint several ways, all hitting the same OpenAI-compatible surface:

REST: POST /serving-endpoints/{name}/invocations with a JSON payload.
SDK / Python: the Databricks SDK or the MLflow deployments client.
LangChain: the ChatDatabricks class calls a Databricks-hosted chat model whether it is a Foundation Model API or a custom serving endpoint.
SQL: ai_query() invokes a serving endpoint row-wise, for example inside a Lakeflow materialized view running batch inference over 50 million rows.
AI Playground for interactive testing.

Authentication Note

Standard endpoints accept personal access tokens (PATs) or OAuth. Route-optimized endpoints, a lower-latency and higher-throughput option, support OAuth tokens only, not PATs. A route-optimized endpoint that rejects an otherwise valid PAT is behaving as designed; switch the caller to OAuth.

A/B Testing and Traffic Splitting

Databricks serving supports multiple served entities behind one endpoint as long as they expose the same API format, such as two chat models. You assign a traffic percentage to each served entity. This enables:

A/B testing: split traffic 50/50 to compare two chat models under one client integration.
Canary or gradual rollout: send 5% to a new version, watch live quality, latency, and cost, then increase.
Shadow deployment to evaluate a candidate on real traffic without returning its responses to users.
Fast rollback: keep the previous version deployed and shift traffic back the moment a new version regresses (for example a hallucination spike or a doubled p95 latency). Rollback by traffic shift is the exam's preferred incident response, restore service first and diagnose afterward.

A served entity set to 0% traffic remains deployed as a fallback-only target. (For external models, automatic failover on 429 or 5xx is an AI Gateway feature ordered by the served-entity list, covered in 5.3.)

Cost and Governance Controls

Serverless budget policies attach custom tags to a serving endpoint for granular billing attribution; they do not change inference behavior or autoscaling.
Endpoints are consumed by a service principal, and access to Unity Catalog tables the endpoint reads is governed by least-privilege grants on that principal.
Removing a serving-endpoint resource from a Databricks App revokes the app service principal's access to the endpoint; it does not delete the endpoint itself.

Common Traps

Picking DLT, DBFS, or Catalog Explorer instead of Model Serving for a real-time HTTPS need.
Forgetting that scale-to-zero causes cold starts.
Using a PAT against a route-optimized endpoint.
Assuming traffic splitting needs different clients; it is one endpoint, one API, and multiple served entities.

Databricks Generative AI Engineer Associate Certification

Databricks Generative AI Engineer Associate

5.2 Databricks Model Serving Endpoints

Key Takeaways

What Model Serving Provides

Deploying a Registered Model

CPU vs GPU

Scale-to-Zero and Autoscaling

Querying an Endpoint

Authentication Note

A/B Testing and Traffic Splitting

Cost and Governance Controls

Common Traps

Databricks Generative AI Engineer Associate Certification

1Introduction & Exam Strategy

2Design Applications

3Data Preparation

4Application Development

5Assembling & Deploying Applications

6Governance, Evaluation & Monitoring

Databricks Generative AI Engineer Associate

5.2 Databricks Model Serving Endpoints

Key Takeaways

What Model Serving Provides

Deploying a Registered Model

CPU vs GPU

Scale-to-Zero and Autoscaling

Querying an Endpoint

Authentication Note

A/B Testing and Traffic Splitting

Cost and Governance Controls

Common Traps