2.3 Selecting Models & Components for a Use Case

Key Takeaways

Right-size the model: pick the smallest, cheapest, fastest model that meets the task's quality bar, not the biggest available.
Route decomposed subtasks to different-sized models - a tiny model for routing, a strong model only for final generation - to cut cost and latency.
Query and index must use the same embedding model, and its output dimension must match the Vector Search index.
For multilingual retrieval, the embedding model must support every language in both the corpus and the queries.
Foundation Model APIs serve hosted base models; AI Gateway adds usage tracking and governance in front of endpoints.

Last updated: July 2026

Selecting Foundation Models, Embeddings, and Components

Once the architecture is chosen, you pick the concrete pieces: which foundation model generates answers, which embedding model powers retrieval, and which retrievers, tools, and chains wire them together. Databricks tests this as a right-sizing exercise. The best model is rarely the biggest one; it is the smallest, cheapest, fastest model that clears the quality bar for the specific task.

Match the foundation model to the task

Different subtasks need different capabilities. A narrow classification task, such as labeling support tickets into five categories with tight latency and cost limits and no open-ended reasoning, is best served by a small, fast model, not a frontier chat model. Conversely, multi-step reasoning or nuanced drafting justifies a larger model. When you decompose an app into intent classification, retrieval, and answer generation, a proven cost strategy is to route each subtask to the smallest capable model: a tiny model for routing and a strong model only for final generation. This lowers cost and latency, and quality holds as long as each model still clears the bar for its own subtask.
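
As a rough illustration of the routing idea, the sketch below maps each decomposed subtask to a model tier and compares the cost of the routed pipeline with one that sends every step to the large model. The tier names and relative costs are hypothetical placeholders, not real Databricks endpoints or prices, and retrieval is left out because it is assumed to use an embedding model, not an LLM.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical endpoint name, for illustration only
    relative_cost: float  # made-up unitless cost per call


# Hypothetical tiers; swap in the endpoints and prices from your own workspace.
SMALL = ModelTier("small-fast-model", 1.0)
LARGE = ModelTier("large-reasoning-model", 20.0)

# Each subtask goes to the smallest tier assumed capable of handling it.
SUBTASK_ROUTES = {
    "intent_classification": SMALL,
    "answer_generation": LARGE,
}


def route(subtask: str) -> ModelTier:
    """Return the tier for a subtask; unknown subtasks fall back to LARGE."""
    return SUBTASK_ROUTES.get(subtask, LARGE)


def pipeline_cost(subtasks: list[str]) -> float:
    """Sum the illustrative cost of running each subtask on its routed tier."""
    return sum(route(s).relative_cost for s in subtasks)


steps = ["intent_classification", "answer_generation"]
routed = pipeline_cost(steps)
all_large = LARGE.relative_cost * len(steps)
print(f"routed={routed}, all-large={all_large}")
```

Falling back to the large tier for unknown subtasks is a conservative design choice here: it trades cost for safety until an evaluation shows a smaller model is good enough.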

Modality matters too. A document-QA assistant that must answer questions about charts embedded in PDFs needs a model with vision or multimodal capability; a text-only model cannot read the chart. Always confirm whether the inputs are purely text before assuming a text model suffices.

Foundation model selection criteria

Weigh these dimensions for every generation step:

Criterion	Question to ask	Typical trap
Task fit	Does the model do this task well (classification, chat, reasoning, code)?	Using a chat model for a trivial label
Context length	Will the prompt plus retrieved context plus output fit the window?	Retrieved context overflowing the window
Cost	What is the per-token or per-query cost at scale?	Paying frontier prices for easy tasks
Latency	Can it meet the p95 SLA?	A large model breaking a real-time SLA
Quality / groundedness	Does it stay grounded on your data?	Choosing on offline helpfulness alone
Modality	Text only, or images, tables, audio?	A text model on chart or image inputs
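
The context-length row can be checked with simple arithmetic before any model is called. The sketch below is a minimal version under stated assumptions: token counts are supplied by the caller (a real tokenizer would be needed to compute them), and the function names and the 4,096-token window are hypothetical.

```python
def fits_context_window(
    prompt_tokens: int,
    retrieved_tokens: list[int],
    max_output_tokens: int,
    context_window: int,
) -> bool:
    """True if prompt + all retrieved chunks + reserved output fit the window."""
    total = prompt_tokens + sum(retrieved_tokens) + max_output_tokens
    return total <= context_window


def trim_chunks_to_fit(
    prompt_tokens: int,
    retrieved_tokens: list[int],
    max_output_tokens: int,
    context_window: int,
) -> list[int]:
    """Keep chunks in ranked order until the remaining budget is exhausted."""
    budget = context_window - prompt_tokens - max_output_tokens
    kept: list[int] = []
    for size in retrieved_tokens:
        if size > budget:
            break  # stop at the first chunk that no longer fits
        kept.append(size)
        budget -= size
    return kept


# Hypothetical numbers: a 4,096-token window, 500-token prompt, 500 reserved for output.
chunks = [1000] * 8
print(fits_context_window(500, chunks, 500, 4096))
print(trim_chunks_to_fit(500, chunks, 500, 4096))
```

Reserving the output budget up front is the part that is easiest to forget: the window has to hold the answer as well as the prompt and retrieved context.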

The recurring exam scenario: Model A is slightly more helpful offline, but Model B has similar groundedness with much lower latency and cost. The strongest next step is not to ship A on a single offline number; it is to evaluate B against your real success criteria, ideally an online A/B or task-specific evaluation, because a tiny offline edge rarely justifies large cost and latency penalties. Right-sizing beats chasing leaderboard helpfulness.
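
One way to make the Model A versus Model B decision explicit is to filter candidates by the success criteria first and only then compare cost. The following is a sketch, not a prescribed method: the `Candidate` fields, thresholds, and all metric values are invented for illustration and would come from your own evaluation runs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    groundedness: float         # score from your own evaluation, 0 to 1
    p95_latency_ms: float
    cost_per_1k_queries: float  # illustrative currency units


def right_size(
    candidates: list[Candidate],
    min_groundedness: float,
    max_p95_latency_ms: float,
) -> Optional[Candidate]:
    """Return the cheapest candidate that clears every bar, or None if none do."""
    eligible = [
        c for c in candidates
        if c.groundedness >= min_groundedness
        and c.p95_latency_ms <= max_p95_latency_ms
    ]
    return min(eligible, key=lambda c: c.cost_per_1k_queries, default=None)


# Made-up numbers mirroring the scenario: A is slightly stronger but slow and costly.
model_a = Candidate("model-a", groundedness=0.91, p95_latency_ms=2400.0, cost_per_1k_queries=30.0)
model_b = Candidate("model-b", groundedness=0.90, p95_latency_ms=600.0, cost_per_1k_queries=4.0)

choice = right_size([model_a, model_b], min_groundedness=0.88, max_p95_latency_ms=1000.0)
print(choice.name if choice else "no candidate meets the bar")
```

Returning `None` when nothing qualifies is deliberate: it signals that the criteria or the candidate list need revisiting, which is different from silently shipping the least-bad model.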

How you access models on Databricks

Foundation Model APIs provide hosted, ready-to-call base models (pay-per-token or provisioned throughput), the default way to call an LLM without managing infrastructure.
External models route to third-party providers through the same interface.
AI Gateway sits in front of endpoints for usage tracking, rate limiting, and governance, an area the March 18, 2026 blueprint emphasizes.
Model Serving hosts your own packaged model or chain as a real-time endpoint (covered in the deployment domain).

Selecting the embedding model

Retrieval quality starts with the embedding model, which converts chunks and queries into vectors for Mosaic AI Vector Search. The capability that matters most is producing semantically meaningful vectors for your domain and content type. Two hard rules the exam checks:

Query and index must use the same embedding model. Embeddings from different models are not comparable, so the model that indexes documents must also embed the incoming query.
Output dimension must match the Vector Search index. The index is created for a fixed vector dimension, so the embedding model's output size has to match it.
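
Both rules can be enforced with a small guard that runs before a query vector is sent to the index. This is a minimal sketch: `IndexConfig`, `validate_query_embedding`, the model name, and the 4-dimensional vector are hypothetical stand-ins, not part of any Databricks SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexConfig:
    embedding_model: str  # model used to embed the documents at index time
    dimension: int        # fixed vector dimension the index was created with


def validate_query_embedding(
    index: IndexConfig, query_model: str, query_vector: list[float]
) -> None:
    """Raise ValueError if the query embedding is incompatible with the index."""
    if query_model != index.embedding_model:
        raise ValueError(
            f"query embedded with {query_model!r}, "
            f"but index was built with {index.embedding_model!r}"
        )
    if len(query_vector) != index.dimension:
        raise ValueError(
            f"query vector has {len(query_vector)} dimensions, "
            f"index expects {index.dimension}"
        )


# Toy 4-dimensional example; real embedding models emit far larger vectors.
index = IndexConfig(embedding_model="embed-model-v1", dimension=4)
validate_query_embedding(index, "embed-model-v1", [0.1, 0.2, 0.3, 0.4])
print("query embedding is compatible with the index")
```

The model-name check comes first because two different models can share an output dimension, and a dimension check alone would miss that mismatch.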

Domain and language drive the choice. For multilingual semantic search, the single most important factor is that the embedding model supports the languages in the knowledge base and the query languages; an English-only model retrieves poorly on non-English content no matter how strong it is in English. Also weigh the maximum input length, which must cover your chunk size, and cost at indexing scale.
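
The language and input-length checks lend themselves to a simple screening step before any retrieval benchmark is run. In the sketch below, `EmbeddingModelSpec` and both model entries are hypothetical; the language lists and token limits would come from each model's documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingModelSpec:
    name: str
    languages: frozenset[str]  # language codes the model is documented to support
    max_input_tokens: int


def unsupported_languages(
    spec: EmbeddingModelSpec, corpus_langs: set[str], query_langs: set[str]
) -> set[str]:
    """Languages in the corpus or the queries that the model does not cover."""
    return (corpus_langs | query_langs) - spec.languages


def is_suitable(
    spec: EmbeddingModelSpec,
    corpus_langs: set[str],
    query_langs: set[str],
    chunk_tokens: int,
) -> bool:
    """True if every language is covered and a chunk fits the model's input limit."""
    covered = not unsupported_languages(spec, corpus_langs, query_langs)
    return covered and chunk_tokens <= spec.max_input_tokens


# Hypothetical specs for illustration only.
english_only = EmbeddingModelSpec("english-embedder", frozenset({"en"}), 512)
multilingual = EmbeddingModelSpec("multilingual-embedder", frozenset({"en", "de", "ja"}), 512)

corpus, queries = {"en", "de"}, {"en", "ja"}
print(sorted(unsupported_languages(english_only, corpus, queries)))
print(is_suitable(multilingual, corpus, queries, chunk_tokens=400))
```

Passing this screen only shows that a model is eligible; retrieval quality on your own domain still has to be measured.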

Choosing components: retrievers, tools, and chains

With models chosen, assemble the components:

Retriever: wraps Vector Search with a top-k value and optional metadata filters and re-ranking, feeding context into the prompt template. When users need the exact wording of a passage, precise retrieval (right chunking and top-k) matters more than the generator.
Tools: functions an agent can call, such as a database lookup, an API, or a ticket creator. Describe each tool clearly so the model calls the right one, and when steps depend on each other, define and invoke the tools in that dependency order: for 'look up customer, fetch orders, draft email', the customer lookup must run before the order fetch, and both must finish before the email is drafted.
Chain: the fixed wiring of prompt template plus retriever plus LLM that produces a grounded, cited answer. The minimum RAG chain is a prompt template, retrieved context, and the LLM.
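
The minimum RAG chain described above can be sketched in a few lines of plain Python. Everything here is a stand-in chosen so the example runs without external services: the two documents are invented, `retrieve` ranks by word overlap where a real app would query Vector Search, and `fake_llm` replaces a call to a model endpoint.

```python
from typing import Callable

PROMPT_TEMPLATE = (
    "Answer using only the context below and cite the source ids.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)

# Invented documents standing in for indexed chunks.
DOCS = {
    "doc-1": "Refunds are processed within five business days.",
    "doc-2": "Support is available on weekdays from nine to five.",
}


def retrieve(question: str, k: int = 1) -> list[tuple[str, str]]:
    """Toy retriever: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        DOCS.items(),
        key=lambda item: len(q_words & set(item[1].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Fill the template with id-tagged chunks so the answer can cite them."""
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def rag_chain(question: str, llm: Callable[[str], str], k: int = 1) -> str:
    """Fixed wiring: retrieve, build the prompt, call the LLM."""
    return llm(build_prompt(question, retrieve(question, k)))


def fake_llm(prompt: str) -> str:
    """Stub LLM; a real chain would call a model serving endpoint here."""
    return f"(stub answer based on {len(prompt)} prompt characters)"


print(rag_chain("When are refunds processed?", fake_llm))
```

Passing the LLM in as a callable keeps the wiring fixed while letting the model be swapped, which is what makes right-sizing the generator a configuration change instead of a rewrite.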

The right-sizing principle

Every selection is a trade-off among quality, cost, and latency. Start from the success criteria in your design spec, pick the smallest model and simplest components that clear the bar, and scale up only where evaluation shows a real gap. This disciplined, measurement-driven selection is exactly the engineering judgment the Design Applications domain rewards.

Test Your Knowledge

A company is building semantic search over a multilingual knowledge base. Which factor matters most when choosing the embedding model?

The model's maximum output token limit

Whether the model supports function calling

Whether the embedding model supports all the languages in the knowledge base and the queries

The model's chat helpfulness leaderboard score

Test Your Knowledge

A team needs to classify support tickets into five labels with tight latency and cost limits, and the task does not require open-ended reasoning. Which model choice is usually best?

A small, fast model sized to the narrow classification task

The largest available frontier chat model for maximum quality

A multimodal vision model to be safe

A model fine-tuned and retrained from scratch every night

Databricks Generative AI Engineer Associate Certification
