2.3 Selecting Models & Components for a Use Case

Key Takeaways

  • Right-size the model: pick the smallest, cheapest, fastest model that meets the task's quality bar, not the biggest available.
  • Route decomposed subtasks to different-sized models - a tiny model for routing, a strong model only for final generation - to cut cost and latency.
  • Query and index must use the same embedding model, and its output dimension must match the Vector Search index.
  • For multilingual retrieval, the embedding model must support every language in both the corpus and the queries.
  • Foundation Model APIs serve hosted base models; AI Gateway adds usage tracking and governance in front of endpoints.

Last updated: July 2026

Selecting Foundation Models, Embeddings, and Components

Once the architecture is chosen, you pick the concrete pieces: which foundation model generates answers, which embedding model powers retrieval, and which retrievers, tools, and chains wire them together. Databricks tests this as a right-sizing exercise. The best model is rarely the biggest one; it is the smallest, cheapest, fastest model that clears the quality bar for the specific task.

Match the foundation model to the task

Different subtasks need different capabilities. A narrow classification task, such as labeling support tickets into five categories under tight latency and cost limits with no open-ended reasoning, is best served by a small, fast model, not a frontier chat model. Conversely, multi-step reasoning or nuanced drafting justifies a larger model. When you decompose an app into intent classification, retrieval, and answer generation, a common cost strategy is to route each subtask to the smallest capable model: a tiny model for routing and a strong model only for final generation. Because each step still runs on a model that clears its own quality bar, this lowers cost and latency without degrading the final answer.
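The routing idea can be sketched in a few lines of Python. This is a minimal illustration, not a Databricks API: the tier names, the relative cost units, and the `ROUTES` table are all hypothetical placeholders chosen to show how sending only the generation step to the strong model changes the per-request cost.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str             # hypothetical endpoint name, not a real model
    relative_cost: float  # illustrative cost units per call


# Assumed tiers; names and costs are placeholders for illustration only.
SMALL = ModelTier("small-fast-model", 1.0)
LARGE = ModelTier("large-strong-model", 20.0)

# Each decomposed subtask maps to the smallest tier assumed capable of it.
ROUTES = {
    "intent_classification": SMALL,
    "answer_generation": LARGE,
}


def pick_model(subtask: str) -> ModelTier:
    """Return the tier for a subtask; unknown subtasks fall back to LARGE."""
    return ROUTES.get(subtask, LARGE)


def pipeline_cost(subtasks: list[str]) -> float:
    """Sum the illustrative cost of running each subtask on its routed tier."""
    return sum(pick_model(s).relative_cost for s in subtasks)


steps = ["intent_classification", "answer_generation"]
routed_cost = pipeline_cost(steps)               # small + large
all_large_cost = LARGE.relative_cost * len(steps)  # everything on the big model
print(routed_cost, all_large_cost)
```

Falling back to the larger tier for unknown subtasks is a deliberately conservative choice: an unrouted step costs more, but it does not silently lose quality.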

Modality matters too. A document-QA assistant that must answer questions about charts embedded in PDFs needs a model with vision or multimodal capability; a text-only model cannot read the chart. Always confirm whether the inputs are purely text before assuming a text model suffices.

Foundation model selection criteria

Weigh these dimensions for every generation step:

Criterion | Question to ask | Typical trap
Task fit | Does the model do this task well (classification, chat, reasoning, code)? | Using a chat model for a trivial label
Context length | Will the prompt plus retrieved context plus output fit the window? | Retrieved context overflowing the window
Cost | What is the per-token or per-query cost at scale? | Paying frontier prices for easy tasks
Latency | Can it meet the p95 SLA? | A large model breaking a real-time SLA
Quality / groundedness | Does it stay grounded on your data? | Choosing on offline helpfulness alone
Modality | Text only, or images, tables, audio? | A text model on chart or image inputs
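
The context-length and cost questions reduce to simple arithmetic, sketched below. All numbers (window size, token counts, per-million-token prices) are illustrative assumptions, not published figures for any model, and the function names are hypothetical helpers.

```python
def fits_context(prompt_tokens: int, retrieved_tokens: int,
                 max_output_tokens: int, context_window: int) -> bool:
    """True if prompt + retrieved context + reserved output fit the window."""
    return prompt_tokens + retrieved_tokens + max_output_tokens <= context_window


def max_chunks(prompt_tokens: int, max_output_tokens: int,
               context_window: int, chunk_tokens: int) -> int:
    """Largest number of equal-size chunks that still fits the window."""
    remaining = context_window - prompt_tokens - max_output_tokens
    return max(0, remaining // chunk_tokens)


def monthly_cost(queries: int, input_tokens: int, output_tokens: int,
                 price_in_per_million: float,
                 price_out_per_million: float) -> float:
    """Estimated spend for `queries` calls at per-million-token prices."""
    per_query_units = (input_tokens * price_in_per_million
                       + output_tokens * price_out_per_million)
    return queries * per_query_units / 1_000_000


# Assumed scenario: 8,192-token window, 500-token prompt, 1,000 tokens
# reserved for output, 512-token chunks.
k = max_chunks(500, 1000, 8192, 512)
ok = fits_context(500, k * 512, 1000, 8192)
overflow = fits_context(500, (k + 1) * 512, 1000, 8192)

# Assumed prices: 0.50 per million input tokens, 1.50 per million output tokens.
spend = monthly_cost(1_000_000, 2000, 300, 0.50, 1.50)
print(k, ok, overflow, spend)
```

Reserving the output budget before counting chunks is the step that guards against the "retrieved context overflowing the window" trap in the table.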

The recurring exam scenario: Model A is slightly more helpful offline, but Model B has similar groundedness with much lower latency and cost. The strongest next step is not to ship A on a single offline number; it is to evaluate B against your real success criteria, ideally an online A/B or task-specific evaluation, because a tiny offline edge rarely justifies large cost and latency penalties. Right-sizing beats chasing leaderboard helpfulness.
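
A rough way to make the Model A versus Model B comparison explicit is to screen candidates against the success criteria first and only then prefer the cheapest survivor. The sketch below assumes hypothetical metric values and thresholds; it covers only the offline screening step, and the winner would still need the online A/B or task-specific evaluation described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    groundedness: float         # 0..1, from a task-specific evaluation
    helpfulness: float          # 0..1, offline score (not used for gating)
    p95_latency_ms: float
    cost_per_1k_queries: float


@dataclass(frozen=True)
class Criteria:
    min_groundedness: float
    max_p95_latency_ms: float
    max_cost_per_1k_queries: float


def clears_bar(c: Candidate, crit: Criteria) -> bool:
    """A candidate passes only if it meets every success criterion."""
    return (c.groundedness >= crit.min_groundedness
            and c.p95_latency_ms <= crit.max_p95_latency_ms
            and c.cost_per_1k_queries <= crit.max_cost_per_1k_queries)


def right_size(candidates: list[Candidate], crit: Criteria) -> Optional[Candidate]:
    """Return the cheapest candidate that clears the bar, or None."""
    passing = [c for c in candidates if clears_bar(c, crit)]
    return min(passing, key=lambda c: c.cost_per_1k_queries) if passing else None


# Illustrative numbers only: A is slightly more helpful, B is far cheaper and faster.
model_a = Candidate("model-a", 0.90, 0.82, 2400.0, 12.0)
model_b = Candidate("model-b", 0.89, 0.80, 600.0, 2.0)
criteria = Criteria(min_groundedness=0.85, max_p95_latency_ms=1000.0,
                    max_cost_per_1k_queries=5.0)

choice = right_size([model_a, model_b], criteria)
print(choice.name if choice else "no candidate clears the bar")
```

Helpfulness is recorded but not used as a gate here, which mirrors the point that a small offline edge should not override latency and cost limits.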

How you access models on Databricks

  • Foundation Model APIs provide hosted, ready-to-call base models (pay-per-token or provisioned throughput), the default way to call an LLM without managing infrastructure.
  • External models route to third-party providers through the same interface.
  • AI Gateway sits in front of endpoints for usage tracking, rate limiting, and governance, an area the March 18, 2026 blueprint emphasizes.
  • Model Serving hosts your own packaged model or chain as a real-time endpoint (covered in the deployment domain).

Selecting the embedding model

Retrieval quality starts with the embedding model, which converts chunks and queries into vectors for Mosaic AI Vector Search. The capability that matters most is producing semantically meaningful vectors for your domain and content type. Two hard rules the exam checks:

  1. Query and index must use the same embedding model. Embeddings from different models are not comparable, so the model that indexes documents must also embed the incoming query.
  2. Output dimension must match the Vector Search index. The index is created for a fixed vector dimension, so the embedding model's output size has to match it.
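
Both rules can be expressed as a pre-flight check run before indexing or querying. This is a minimal sketch with hypothetical names (`EmbeddingConfig`, `check_compatibility`) and made-up model identifiers; it does not call Vector Search and only validates configuration values you supply.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingConfig:
    model_name: str  # hypothetical identifier for the embedding model
    dimension: int   # length of the vectors the model outputs


def check_compatibility(index_time: EmbeddingConfig,
                        query_time: EmbeddingConfig,
                        index_dimension: int) -> list[str]:
    """Return a list of problems; an empty list means the setup is consistent."""
    problems = []
    if index_time.model_name != query_time.model_name:
        problems.append("query and index use different embedding models")
    if index_time.dimension != index_dimension:
        problems.append("indexing model dimension does not match the index")
    if query_time.dimension != index_dimension:
        problems.append("query model dimension does not match the index")
    return problems


# Illustrative configurations; names and dimensions are assumptions.
indexer = EmbeddingConfig("embed-model-x", 1024)
good_query = EmbeddingConfig("embed-model-x", 1024)
bad_query = EmbeddingConfig("embed-model-y", 768)

print(check_compatibility(indexer, good_query, index_dimension=1024))
print(check_compatibility(indexer, bad_query, index_dimension=1024))
```

Note that matching dimensions alone is not enough: two different models can share an output size and still produce vectors that are not comparable, which is why the model-name check is separate.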

Domain and language drive the choice. For multilingual semantic search, the single most important factor is that the embedding model supports the languages in the knowledge base and the query languages; an English-only model retrieves poorly on non-English content no matter how strong it is in English. Also weigh the maximum input length, which must cover your chunk size, and cost at indexing scale.
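
The language and input-length checks are easy to automate during model selection. The sketch below assumes you have a set of language codes the candidate model claims to support (taken from its documentation) plus the languages seen in your corpus and queries; the helper names and example values are hypothetical.

```python
def missing_languages(model_langs: set[str], corpus_langs: set[str],
                      query_langs: set[str]) -> set[str]:
    """Languages needed by the corpus or the queries that the model lacks."""
    return (corpus_langs | query_langs) - model_langs


def chunk_fits(chunk_tokens: int, max_input_tokens: int) -> bool:
    """True if a chunk fits the embedding model's maximum input length."""
    return chunk_tokens <= max_input_tokens


# Illustrative inputs: an English-only model against a multilingual workload.
english_only = {"en"}
multilingual = {"en", "de", "es", "ja"}
corpus = {"en", "de", "ja"}
queries = {"en", "es"}

gaps_english_only = missing_languages(english_only, corpus, queries)
gaps_multilingual = missing_languages(multilingual, corpus, queries)
print(sorted(gaps_english_only), sorted(gaps_multilingual))
print(chunk_fits(chunk_tokens=512, max_input_tokens=8192))
```

Checking the union of corpus and query languages matters because a model that covers the documents but not the users' query language still retrieves poorly.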

Choosing components: retrievers, tools, and chains

With models chosen, assemble the components:

  • Retriever: wraps Vector Search with a top-k value and optional metadata filters and re-ranking, feeding context into the prompt template. When users need the exact wording of a passage, precise retrieval (right chunking and top-k) matters more than the generator.
  • Tools: functions an agent can call, such as a database lookup, an API, or a ticket creator. Describe each tool clearly, including what it takes and returns, so the model calls the right one. When steps depend on each other, as in 'look up customer, fetch orders, draft email', define and invoke the tools in that dependency order.
  • Chain: the fixed wiring of prompt template plus retriever plus LLM that produces a grounded, cited answer. The minimum RAG chain is a prompt template, retrieved context, and the LLM.
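
The minimum chain described above (prompt template, retrieved context, LLM) can be sketched without any framework. Everything here is a stand-in: the in-memory `DOCS` list replaces a Vector Search index, the word-overlap `retrieve` function replaces embedding similarity, and `stub_llm` replaces a real model call. The names are hypothetical and chosen only to show the wiring.

```python
from typing import Callable

PROMPT_TEMPLATE = (
    "Answer using only the context below and cite sources by id.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\n"
)

# Toy corpus standing in for an indexed knowledge base.
DOCS = [
    {"id": "doc-1", "text": "Refunds are processed within 5 business days."},
    {"id": "doc-2", "text": "Shipping is free for orders over 50 dollars."},
    {"id": "doc-3", "text": "Refunds require the original receipt."},
]


def retrieve(question: str, top_k: int = 2) -> list[dict]:
    """Stand-in retriever: rank documents by word overlap with the question."""
    q_words = set(question.lower().split())
    ranked = sorted(
        DOCS,
        key=lambda d: len(q_words & set(d["text"].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Fill the template with id-tagged chunks so the answer can cite them."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


def run_chain(question: str, llm: Callable[[str], str], top_k: int = 2) -> str:
    """Retrieve, build the prompt, and pass it to the supplied LLM callable."""
    return llm(build_prompt(question, retrieve(question, top_k)))


def stub_llm(prompt: str) -> str:
    """Placeholder for a real model call; reports how many chunks it saw."""
    return f"(stub answer based on {prompt.count('[doc-')} chunks)"


question = "How long do refunds take?"
retrieved_ids = [c["id"] for c in retrieve(question)]
prompt = build_prompt(question, retrieve(question))
answer = run_chain(question, stub_llm)
print(retrieved_ids, answer)
```

Passing the LLM in as a callable keeps the wiring fixed while letting you swap a small model for a larger one during evaluation, which fits the right-sizing approach.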

The right-sizing principle

Every selection is a trade-off among quality, cost, and latency. Start from the success criteria in your design spec, pick the smallest model and simplest components that clear the bar, and scale up only where evaluation shows a real gap. This disciplined, measurement-driven selection is exactly the engineering judgment the Design Applications domain rewards.

Test Your Knowledge

A company is building semantic search over a multilingual knowledge base. Which factor matters most when choosing the embedding model?


A team needs to classify support tickets into five labels with tight latency and cost limits, and the task does not require open-ended reasoning. Which model choice is usually best?
