2.1 Translating Business Requirements to a GenAI Design
Key Takeaways
- Design Applications is 14% of the exam (about 6 of 45 scored questions) and covers use-case decomposition, architecture choice, and input/output specification.
- RAG adds fresh, cited knowledge; fine-tuning shifts behavior; prompt-only handles format and style; agents add multi-step tool use and real actions.
- Choose the simplest architecture that meets the requirement: a chain beats an agent for a fixed retrieve-read-answer flow under a tight latency SLA.
- Handle deterministic work such as validation, routing, and exact lookups in plain code or SQL before any LLM call.
- Write the input/output specification and measurable success criteria first, because they drive every later design and evaluation decision.
From Business Problem to GenAI Blueprint
The Design Applications domain is 14% of the Databricks Certified Generative AI Engineer Associate exam, roughly 6 of the 45 scored questions, but it anchors every other domain. A wrong architecture decision here wastes work in Data Preparation, Application Development, and Deployment. The exam hands you a plain-English business requirement, such as 'answer HR policy questions with citations' or 'extract invoice fields into a table', and asks you to translate it into a concrete pipeline of model tasks, components, inputs, outputs, and measurable success criteria. You are graded on judgment, not memorization.
Step 1 - Decompose the use case into subtasks
Databricks expects you to break a use case into discrete subtasks instead of aiming one giant prompt at a single model. A typical enterprise assistant decomposes into input validation, intent classification or routing, retrieval, answer generation, and post-processing such as formatting and guardrails. Decomposition makes a system easier to debug and cheaper to operate, because you can inspect and fix each stage in isolation and send each stage to the smallest model that can do the job.
A heavily tested pattern: handle deterministic work deterministically, before any LLM call. Input validation, routing on a known field, exact-match lookups, and schema checks are cheaper, faster, and far more reliable as ordinary Python or SQL than as an LLM call. Reserve the model for genuinely generative or reasoning steps. When a question asks which step is best handled deterministically before calling an LLM, the answer is almost always validation, routing, or a lookup, not the generation step.
| Subtask | Typical implementation | LLM needed? |
|---|---|---|
| Input validation / schema check | Python / SQL | No |
| Intent classification / routing | Small cheap model or classifier | Sometimes |
| Retrieval | Vector Search + metadata filters | No (uses embeddings) |
| Answer generation | Foundation model | Yes |
| Formatting / guardrails | Code plus light LLM | Sometimes |
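The "deterministic first" rule can be sketched in a few lines of Python. This is a minimal illustration, not a Databricks API: the `OFFICE_HOURS` table, the `MAX_QUESTION_CHARS` limit, and the `call_llm` callable are all hypothetical stand-ins for whatever validation rules, lookup tables, and model endpoint a real application would use.

```python
import re
from typing import Callable, Optional

# Hypothetical exact-match data; in practice this could be a SQL table.
OFFICE_HOURS = {"london": "09:00-17:30", "austin": "08:30-17:00"}

MAX_QUESTION_CHARS = 2000  # assumed limit for illustration only


def validate(question: str) -> Optional[str]:
    """Return an error message for unusable input, or None when it is fine."""
    if not question or not question.strip():
        return "empty question"
    if len(question) > MAX_QUESTION_CHARS:
        return "question too long"
    return None


def exact_lookup(question: str) -> Optional[str]:
    """Answer 'office hours in <city>' questions straight from the table."""
    match = re.search(r"office hours (?:in|for) (\w+)", question.lower())
    if match and match.group(1) in OFFICE_HOURS:
        return OFFICE_HOURS[match.group(1)]
    return None


def handle(question: str, call_llm: Callable[[str], str]) -> str:
    """Run the deterministic steps first; call the model only as a last resort."""
    error = validate(question)
    if error:
        return f"rejected: {error}"
    answer = exact_lookup(question)
    if answer is not None:
        return answer
    return call_llm(question)
```

In this sketch an empty question or an office-hours lookup never reaches the model, so those paths cost nothing in tokens and behave identically on every call.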
Step 2 - Choose the architecture pattern
This is the single most tested judgment in the domain. Learn this decision table cold:
| Signal in the requirement | Best pattern | Why |
|---|---|---|
| Only formatting, tone, or style; no external facts | Prompt-only | No training or retrieval required |
| Proprietary or frequently changing knowledge, with citations | RAG | Retrieval injects fresh, governed context at query time |
| A consistent behavior, skill, or domain style that prompting alone cannot reliably produce | Fine-tuning | Behavior is baked into weights via LoRA or QLoRA |
| Multiple steps, external tools, or taking real actions | Agent | Tool calling and planning across systems |
RAG (Retrieval-Augmented Generation) is the default when the knowledge is private, changes often, and the answer must cite sources. An HR assistant answering from policy documents that change every week is a textbook RAG case: retrieval always uses the latest indexed documents, and citations fall out naturally because each retrieved chunk carries its source. Fine-tuning would freeze last week's policy into the weights, and a fine-tuned model cannot cite where an answer came from; hardcoding answers into the system prompt does not scale and goes stale with the next policy update.
Fine-tuning earns its place when you need to change how the model behaves, such as a consistent tone, a domain style, or a specialized classification skill, not to add facts. The rule to remember is 'RAG adds knowledge; fine-tuning shifts behavior.' The recurring trap is choosing fine-tuning to add facts: if a scenario says 'needs the newest data' or 'must cite', pick RAG, never fine-tuning.
Agents fit when the app must take multiple steps or reach into external systems. A claims-operations app that answers policy questions, retrieves supporting documents, and creates a follow-up task in another system only when needed calls for a tool-calling agent with a retrieval tool and a workflow tool. A prompt-only bot cannot retrieve, and a plain RAG chain cannot create the task.
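As a study aid, the decision table can be written as a small function. This is only a sketch: the `Requirement` fields and the order of the checks are assumptions made for illustration, and real requirements often combine patterns (for example, an agent that uses retrieval as one of its tools).

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    """Hypothetical flags summarizing the signals in a business requirement."""
    needs_private_or_fresh_knowledge: bool = False
    needs_citations: bool = False
    needs_new_behavior: bool = False
    needs_tools_or_actions: bool = False


def choose_pattern(req: Requirement) -> str:
    """Map requirement signals to the pattern named in the decision table."""
    # Actions in external systems need an agent, even if retrieval is also needed.
    if req.needs_tools_or_actions:
        return "agent"
    # Fresh or proprietary knowledge, or mandatory citations, points to RAG.
    if req.needs_private_or_fresh_knowledge or req.needs_citations:
        return "rag"
    # A behavior change that must live in the weights points to fine-tuning.
    if req.needs_new_behavior:
        return "fine-tuning"
    # Only format, tone, or style: a prompt is enough.
    return "prompt-only"


hr_assistant = Requirement(needs_private_or_fresh_knowledge=True, needs_citations=True)
claims_app = Requirement(needs_private_or_fresh_knowledge=True, needs_tools_or_actions=True)
```

Under these assumptions, `choose_pattern(hr_assistant)` returns `"rag"` and `choose_pattern(claims_app)` returns `"agent"`, matching the two scenarios described above.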
Step 3 - Prefer the simplest architecture that works
Databricks rewards choosing the least complex pattern that satisfies the requirement, because extra machinery adds latency, cost, and failure modes. If an assistant mostly answers policy questions and only rarely needs a live value, such as an employee's current PTO balance, the best design is RAG for the common path plus a single tool for the occasional lookup, not a fully general multi-tool agent.
Chains versus agents is a favorite comparison. A chain runs a fixed sequence (retrieve, read, answer). An agent decides dynamically which tools to call. When a FAQ assistant has a strict retrieve-read-answer flow and a tight latency SLA, a chain is preferable because its path is deterministic, predictable, and faster; an agent's dynamic tool selection adds unpredictable extra LLM turns and latency. Reach for an agent only when the flow genuinely branches across tools.
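The difference in control flow can be made concrete with two toy functions. This is a framework-free sketch: `retrieve`, `generate`, `decide`, and `tools` are hypothetical callables standing in for a retriever, a model endpoint, an LLM-driven planner, and registered tools.

```python
from typing import Callable, Dict, List, Tuple


def run_chain(question: str,
              retrieve: Callable[[str], List[str]],
              generate: Callable[[str], str]) -> str:
    """Fixed retrieve-read-answer flow: one retrieval and one model call, always."""
    passages = retrieve(question)                    # step 1: retrieve
    context = "\n".join(passages)                    # step 2: read (assemble context)
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)                          # step 3: answer


def run_agent(question: str,
              decide: Callable[[List[str]], Tuple[str, str]],
              tools: Dict[str, Callable[[str], str]],
              max_turns: int = 5) -> str:
    """Dynamic flow: `decide` (normally an LLM) picks the next action each turn."""
    history = [question]
    for _ in range(max_turns):
        action, argument = decide(history)           # one planner turn per iteration
        if action == "final":
            return argument
        history.append(tools[action](argument))      # run the chosen tool
    return "stopped: turn limit reached"
```

The chain's cost is known in advance: exactly one call to `generate`. The agent's cost depends on how many turns `decide` takes before returning `"final"`, which is why a strict latency SLA favors the chain.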
Step 4 - Define inputs, outputs, constraints, and success criteria
Before writing code, write down the pipeline specification: the expected inputs, the required outputs (including exact schema when structured), the constraints (latency SLA, cost ceiling, governance and PII rules), and the success criteria. When a question asks which artifact to produce first when translating a use case, the answer is the input/output specification, not a prompt or a model choice, because the spec drives every later decision.
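One way to make the specification concrete is to capture it as a typed object before any prompt exists. The `PipelineSpec` class and its field names below are hypothetical, and the cost ceiling is a placeholder figure rather than a recommended value.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class PipelineSpec:
    """Hypothetical container for the input/output spec of a GenAI pipeline."""
    inputs: Dict[str, str]                 # field name -> type
    outputs: Dict[str, str]                # field name -> type (exact schema)
    max_p95_latency_s: float               # latency SLA
    max_cost_per_1k_queries: float         # cost ceiling
    governance_rules: List[str] = field(default_factory=list)
    success_criteria: List[str] = field(default_factory=list)


hr_spec = PipelineSpec(
    inputs={"question": "str"},
    outputs={"answer": "str", "citations": "list[str]"},
    max_p95_latency_s=3.0,
    max_cost_per_1k_queries=5.0,           # placeholder amount for illustration
    governance_rules=["no PII in prompts or logs"],
    success_criteria=["answers grounded in retrieved context with a citation"],
)
```

Writing the spec down this way forces the questions that drive architecture: a `citations` output field already implies retrieval, and the latency ceiling already argues against an open-ended agent loop.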
Success criteria must be measurable. 'Helpful' is not testable; 'at least 90% of answers grounded in retrieved context with a citation, p95 latency under 3 seconds, cost under a set amount per 1,000 queries' is. These criteria feed directly into the Evaluation and Monitoring domain later, so making them explicit up front is what separates a designed application from a demo.
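Measurable criteria like these can be checked mechanically against logged responses. The sketch below assumes a hypothetical record format (`grounded`, `cited`, `latency_s`, `cost` per query) and uses a nearest-rank percentile; production evaluation tooling may define these metrics differently.

```python
from typing import Dict, List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile, using integer math to avoid rounding surprises."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n)
    return ordered[rank - 1]


def meets_criteria(records: List[dict],
                   max_cost_per_1k: float,
                   min_grounded: float = 0.90,
                   max_p95_s: float = 3.0) -> Dict[str, bool]:
    """Check logged responses against the three example success criteria."""
    n = len(records)
    grounded_rate = sum(1 for r in records if r["grounded"] and r["cited"]) / n
    latency_p95 = p95([r["latency_s"] for r in records])
    cost_per_1k = 1000 * sum(r["cost"] for r in records) / n
    return {
        "grounded": grounded_rate >= min_grounded,
        "latency": latency_p95 < max_p95_s,
        "cost": cost_per_1k < max_cost_per_1k,
    }
```

Each criterion yields a plain pass or fail, which is what makes the requirement testable where 'helpful' is not.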
Practice Questions
1. An internal HR assistant must answer questions from policy documents that change every week and must cite the source material. Which approach is the best starting point?
2. A claims-operations app must answer policy questions, retrieve supporting documents, and create a follow-up task in another system only when needed. Which architecture best fits this use case?
3. When decomposing a GenAI application, which step is usually best handled deterministically before any LLM call?