5.3 Foundation Model APIs & External Models
Key Takeaways
- Foundation Model APIs pay-per-token needs no capacity reservation and suits prototyping, evaluation harnesses, and spiky low-volume traffic.
- Provisioned throughput reserves dedicated inference capacity for steady production, latency SLOs, fine-tuned models, and HIPAA-class compliance.
- External models place a Databricks-governed proxy in front of OpenAI, Anthropic, and other providers, giving a single monitored invocation surface.
- AI Gateway rate limits cap per-user or per-group consumption, while usage tables aggregate token cost for chargeback, distinct from per-request inference tables.
- AI Gateway fallbacks fail external-model requests over on 429 or 5xx by served-entity order, so a 0%-traffic model can be a fallback-only target.
Foundation Model APIs (FMAPI)
Foundation Model APIs (FMAPI) let you call Databricks-hosted state-of-the-art foundation models from a Model Serving endpoint without deploying weights yourself. FMAPI comes in two capacity and billing modes, and choosing between them is one of the most frequently tested decisions in the deploy domain:
| Mode | Bills by | Capacity | Best for |
|---|---|---|---|
| Pay-per-token | Tokens consumed | No reservation; shared | Prototyping, spiky or low volume, evaluation harnesses, first production deploy |
| Provisioned throughput | Reserved capacity over time | Dedicated inference capacity | Steady production traffic, latency SLOs, fine-tuned or custom base models, stricter compliance (HIPAA) |
Pay-per-token is the simplest way to start: no capacity planning, and you pay only for what you use. It is the right answer for a small team prototyping a summarizer with low, spiky traffic, or an evaluation harness that calls a model only a few hundred times a week. Databricks documents that pay-per-token can serve production workloads, but it is not designed for high, guaranteed throughput.
Provisioned throughput allocates dedicated inference capacity so throughput and latency are predictable. It is the recommended choice for essentially all steady production workloads, and it is required when you need to serve fine-tuned models, need performance guarantees, or must satisfy a broader compliance profile. For example, a HIPAA healthcare workload should use provisioned throughput, not pay-per-token. A customer-support chatbot with steady weekday traffic and a strict latency SLO is the canonical provisioned-throughput scenario.
Both FMAPI modes and custom endpoints share an OpenAI-compatible request format, so you can move a prototype from AI Playground to a pay-per-token endpoint and later to provisioned throughput with minimal client change. Note one exam-relevant lifecycle reality: Databricks retires specific hosted models on published dates, so pin and monitor the model you depend on.
External Models
External models let Databricks act as a governed proxy in front of third-party providers such as OpenAI, Anthropic, Azure OpenAI, Cohere, Google, and AWS Bedrock. You create an external model serving endpoint, and every call to a provider flows through that one Databricks-governed surface. This gives a company that uses both OpenAI and Anthropic a single invocation surface with consistent monitoring, policies, and access control, the tested answer over 'a Unity Catalog function that proxies both providers.' Databricks does not charge for the provider tokens themselves; you still owe the provider's fees, but you gain unified governance.
Provider credentials are never hardcoded. They live in Databricks secrets and are referenced by the external endpoint; AI Gateway then centralizes authentication so applications never see raw API keys. Putting a token in a prompt, a notebook cell, or source control is a security antipattern the exam expects you to reject.
AI Gateway: Unified Access, Rate Limiting, and Fallbacks
Mosaic AI Gateway is the central control plane for LLM endpoints, spanning external, FMAPI, and custom endpoints. When several teams call several endpoints and leadership wants centralized usage tracking, policy enforcement, and monitoring, AI Gateway is the answer, not per-application configuration scattered across projects. Its exam-relevant features:
- Rate limits: cap requests or tokens per user or group so one heavy team cannot monopolize a shared endpoint or blow the budget. This is the direct fix for 'one team is sending far more requests than everyone else.'
- Usage tables: aggregate token usage and cost across served entities for chargeback and budgeting. Distinguish these from inference tables, which capture per-request payloads for auditing and monitoring; usage tables summarize consumption.
- Payload logging to inference tables in Unity Catalog for auditing and offline analysis.
- Guardrails: input and output safety filters, including personally identifiable information (PII) controls.
- Fallbacks: for external models served on one endpoint, AI Gateway automatically fails over to the next served entity when a provider returns 429 or 5xx. Fallback order follows the served-entity list, not traffic percentage, so a 0%-traffic external model can act as a fallback-only target. This is how you make requests fail over from one provider to another for resilience.
Because AI Gateway centralizes authentication, rate limiting, and usage logging, routing multiple providers through it is the standard governance pattern.
Choosing the Serving Surface: Decision Guide
- Databricks-hosted base model, simplest start or spiky volume: FMAPI pay-per-token.
- Databricks-hosted base or fine-tuned model, steady production, SLO, or HIPAA: FMAPI provisioned throughput.
- Third-party provider (OpenAI/Anthropic) under Databricks governance: external model endpoint plus AI Gateway.
- Your own custom chain, agent, or fine-tuned weights: custom model serving endpoint (from 5.2).
- Central policy, cost tracking, rate limits, or provider failover: AI Gateway over any of the above.
Common Traps
- Reaching for provisioned throughput for a tiny evaluation harness; that is overkill, use pay-per-token.
- Using pay-per-token for a strict-SLO or HIPAA workload; use provisioned throughput.
- Building a custom function to proxy OpenAI or Anthropic instead of an external model endpoint.
- Confusing usage tables (aggregated cost and tokens for chargeback) with inference tables (per-request payloads).
- Assuming AI Gateway fallback is traffic-percentage based; it follows served-entity order.
A healthcare workload on Foundation Model APIs must satisfy HIPAA plus additional regulatory commitments and has steady production traffic. Which serving mode is the safest fit?
Your company uses both OpenAI and Anthropic but wants a single Databricks-governed invocation surface with consistent monitoring and policy. What should you create?
A shared LLM endpoint is getting expensive because one team sends far more requests than everyone else. Which AI Gateway feature most directly addresses this?