A healthcare workload on Foundation Model APIs must satisfy HIPAA plus additional regulatory commitments and has steady production traffic. Which serving mode is the safest fit?

Provisioned throughput Foundation Model APIs. Provisioned throughput supports the broader compliance profile available for Model Serving and provides dedicated, predictable capacity for production. Pay-per-token has more limited compliance coverage, and ad hoc external routing or driver execution do not offer the same regulatory commitments.

Your company uses both OpenAI and Anthropic but wants a single Databricks-governed invocation surface with consistent monitoring and policy. What should you create?

An external model serving endpoint. External model serving endpoints let Databricks act as the governed access layer for third-party providers, giving one endpoint surface where you can apply AI Gateway features such as rate limits, logging, and guardrails. The other options do not centralize provider governance.

A shared LLM endpoint is getting expensive because one team sends far more requests than everyone else. Which AI Gateway feature most directly addresses this?

Rate limits. AI Gateway rate limits cap request or token consumption per user or group so one team cannot monopolize endpoint capacity, making them a direct control for spend and throughput. Tracing, prompt versioning, and vector search do not throttle usage.

Foundation Model APIs & External Models | Free Guide 2026

Foundation Model APIs (FMAPI)

Foundation Model APIs (FMAPI) let you call Databricks-hosted state-of-the-art foundation models from a Model Serving endpoint without deploying weights yourself. FMAPI comes in two capacity and billing modes, and choosing between them is one of the most frequently tested decisions in the deploy domain:

Mode	Bills by	Capacity	Best for
Pay-per-token	Tokens consumed	No reservation; shared	Prototyping, spiky or low volume, evaluation harnesses, first production deploy
Provisioned throughput	Reserved capacity over time	Dedicated inference capacity	Steady production traffic, latency SLOs, fine-tuned or custom base models, stricter compliance (HIPAA)

Pay-per-token is the simplest way to start: no capacity planning, and you pay only for what you use. It is the right answer for a small team prototyping a summarizer with low, spiky traffic, or an evaluation harness that calls a model only a few hundred times a week. Databricks documents that pay-per-token can serve production workloads, but it is not designed for high, guaranteed throughput.

Provisioned throughput allocates dedicated inference capacity so throughput and latency are predictable. It is the recommended choice for essentially all steady production workloads, and it is required when you need to serve fine-tuned models, need performance guarantees, or must satisfy a broader compliance profile. For example, a HIPAA healthcare workload should use provisioned throughput, not pay-per-token. A customer-support chatbot with steady weekday traffic and a strict latency SLO is the canonical provisioned-throughput scenario.

Both FMAPI modes and custom endpoints share an OpenAI-compatible request format, so you can move a prototype from AI Playground to a pay-per-token endpoint and later to provisioned throughput with minimal client change. Note one exam-relevant lifecycle reality: Databricks retires specific hosted models on published dates, so pin and monitor the model you depend on.

External Models

External models let Databricks act as a governed proxy in front of third-party providers such as OpenAI, Anthropic, Azure OpenAI, Cohere, Google, and AWS Bedrock. You create an external model serving endpoint, and every call to a provider flows through that one Databricks-governed surface. This gives a company that uses both OpenAI and Anthropic a single invocation surface with consistent monitoring, policies, and access control, the tested answer over 'a Unity Catalog function that proxies both providers.' Databricks does not charge for the provider tokens themselves; you still owe the provider's fees, but you gain unified governance.

Provider credentials are never hardcoded. They live in Databricks secrets and are referenced by the external endpoint; AI Gateway then centralizes authentication so applications never see raw API keys. Putting a token in a prompt, a notebook cell, or source control is a security antipattern the exam expects you to reject.

AI Gateway: Unified Access, Rate Limiting, and Fallbacks

Mosaic AI Gateway is the central control plane for LLM endpoints, spanning external, FMAPI, and custom endpoints. When several teams call several endpoints and leadership wants centralized usage tracking, policy enforcement, and monitoring, AI Gateway is the answer, not per-application configuration scattered across projects. Its exam-relevant features:

Rate limits: cap requests or tokens per user or group so one heavy team cannot monopolize a shared endpoint or blow the budget. This is the direct fix for 'one team is sending far more requests than everyone else.'
Usage tables: aggregate token usage and cost across served entities for chargeback and budgeting. Distinguish these from inference tables, which capture per-request payloads for auditing and monitoring; usage tables summarize consumption.
Payload logging to inference tables in Unity Catalog for auditing and offline analysis.
Guardrails: input and output safety filters, including personally identifiable information (PII) controls.
Fallbacks: for external models served on one endpoint, AI Gateway automatically fails over to the next served entity when a provider returns 429 or 5xx. Fallback order follows the served-entity list, not traffic percentage, so a 0%-traffic external model can act as a fallback-only target. This is how you make requests fail over from one provider to another for resilience.

Because AI Gateway centralizes authentication, rate limiting, and usage logging, routing multiple providers through it is the standard governance pattern.

Choosing the Serving Surface: Decision Guide

Databricks-hosted base model, simplest start or spiky volume: FMAPI pay-per-token.
Databricks-hosted base or fine-tuned model, steady production, SLO, or HIPAA: FMAPI provisioned throughput.
Third-party provider (OpenAI/Anthropic) under Databricks governance: external model endpoint plus AI Gateway.
Your own custom chain, agent, or fine-tuned weights: custom model serving endpoint (from 5.2).
Central policy, cost tracking, rate limits, or provider failover: AI Gateway over any of the above.

Common Traps

Reaching for provisioned throughput for a tiny evaluation harness; that is overkill, use pay-per-token.
Using pay-per-token for a strict-SLO or HIPAA workload; use provisioned throughput.
Building a custom function to proxy OpenAI or Anthropic instead of an external model endpoint.
Confusing usage tables (aggregated cost and tokens for chargeback) with inference tables (per-request payloads).
Assuming AI Gateway fallback is traffic-percentage based; it follows served-entity order.

Databricks Generative AI Engineer Associate Certification

Databricks Generative AI Engineer Associate

5.3 Foundation Model APIs & External Models

Key Takeaways

Foundation Model APIs (FMAPI)

External Models

AI Gateway: Unified Access, Rate Limiting, and Fallbacks

Choosing the Serving Surface: Decision Guide

Common Traps

Databricks Generative AI Engineer Associate Certification

1Introduction & Exam Strategy

2Design Applications

3Data Preparation

4Application Development

5Assembling & Deploying Applications

6Governance, Evaluation & Monitoring

Databricks Generative AI Engineer Associate

5.3 Foundation Model APIs & External Models

Key Takeaways

Foundation Model APIs (FMAPI)

External Models

AI Gateway: Unified Access, Rate Limiting, and Fallbacks

Choosing the Serving Surface: Decision Guide

Common Traps