2.2 Prompt Engineering Strategies
Key Takeaways
- System prompts set durable behavior, persona, output format, and grounding rules; user prompts carry the per-turn request.
- Zero-shot is cheapest, few-shot fixes format and style problems, and chain-of-thought improves multi-step reasoning at higher token cost.
- For confident off-context answers in RAG, instruct the model to answer only from retrieved context and abstain otherwise - a prompt fix, not a model swap.
- Lower temperature toward 0 for run-to-run consistency; specify an explicit schema plus an example for reliable JSON or XML output.
- Iterate prompts empirically in the AI Playground, changing one variable at a time and keeping prompts under version control.
Prompt Engineering for Reliable, Grounded Output
Prompt engineering is the cheapest lever in the GenAI toolbox: no training, no infrastructure, and instant iteration. The exam treats prompting as a design skill, matching a prompting technique to a requirement and knowing the failure modes each technique introduces. Databricks' AI Playground is the recommended place to iterate prompts quickly before wiring them into a chain or endpoint.
System prompts versus user prompts
Every chat request is built from roles. The system prompt sets durable behavior, including persona, rules, tone, output format, and grounding constraints, and it persists across the conversation. The user prompt carries the specific request or question for that turn. Put stable instructions such as 'You are an HR assistant; answer only from the provided context and cite the source' in the system prompt, and the variable question in the user prompt. A common design error is stuffing everything into one user message, which makes rules easy for the model to ignore and hard to reuse across requests.
Zero-shot, few-shot, and chain-of-thought
- Zero-shot: instruction only, no examples. Best for simple, unambiguous tasks and lowest token cost.
- Few-shot: include a handful of input-output examples in the prompt. Best when you need a specific format, label set, or style the model keeps missing; examples teach by demonstration without any training.
- Chain-of-thought (CoT): ask the model to reason step by step before answering. Improves accuracy on multi-step reasoning and math, at the cost of more tokens and latency. For structured or classification tasks you usually do not want visible reasoning in the final payload.
Choosing among them is a trade-off: zero-shot is cheapest and fastest, few-shot buys reliability of format, and CoT buys reasoning accuracy but costs tokens. Match the technique to what is actually failing rather than applying the most elaborate option by default.
A worked example clarifies the escalation. Suppose a classifier must return one of three sentiment labels. Start zero-shot with a clear instruction; if the model occasionally invents a fourth label or wraps the answer in a sentence, add three few-shot examples showing exactly the label-only output you want, which usually fixes it without any code change. Only if the task also needed multi-step judgment, say weighing conflicting signals, would you add chain-of-thought - and even then you would keep the reasoning internal and return just the label. Escalate technique only as far as the observed failure requires.
Structured output
Many production steps must return machine-readable output, such as JSON or XML that downstream code parses. To make output stable, do three things: state the exact schema and field names, show a concrete example of the desired output, and instruct the model to return only that structure with no surrounding prose. To get a stable JSON object with fields like issue_type, priority, and needs_human_review, the most effective single change is to specify the schema explicitly and include an example, not to raise temperature or add more prose. The same holds for a strict XML envelope such as a summary tag followed by a tags tag: the most reliable prompt shows the exact template and a filled-in example so the model mirrors it. Where the platform supports it, structured-output or function-calling constraints enforce the schema even more reliably than instructions alone.
Prompt templates
A prompt template is a reusable scaffold with placeholders, for example a system template plus a user template that inserts the retrieved context and the question and instructs the model to answer only from that context and cite its sources. Templates are how RAG chains inject retrieved context at run time and how teams keep prompts version-controlled and testable. On the exam, the minimum RAG chain is a prompt template, the retrieved context, and the LLM - the template is what binds the question to the context.
Failure modes and mitigations
The exam loves 'users complain that X; which prompt change helps?' Learn these pairs:
| Failure mode | Symptom | Primary mitigation |
|---|---|---|
| Hallucination / ungrounded answer | Confident answers outside the retrieved context | Instruct 'answer only from the provided context; if it is not there, say you do not know'; ground and cite |
| Format drift | Output shape varies; JSON breaks parsers | Specify schema plus an example; use structured output or function calling |
| Run-to-run inconsistency | Same input yields different answers | Lower temperature toward 0 for deterministic tasks |
| Verbosity / leaked reasoning | Extra prose around the payload | Ask for only the structure; keep CoT out of the returned object |
| Ignored rules | Model disregards stated constraints | Move rules into the system prompt; make them explicit and ordered |
| Prompt injection | Retrieved or user text overrides instructions | Separate instructions from data; add guardrails (Governance domain) |
Two of these appear again and again. First, grounding: in a RAG app the fix for confident off-context answers is a prompt instruction to use only retrieved context and to abstain when the answer is not present - a prompt fix, not a model swap. Second, temperature: when outputs drift between runs and the team wants consistency, the first move is to lower the temperature, which reduces sampling randomness; leave few-shot examples and schema for format problems, not consistency problems. Distinguishing 'format is wrong' (schema and examples) from 'answer varies' (temperature) from 'answer is invented' (grounding) is exactly the discrimination the exam is testing.
Iterate empirically
Prompt engineering is empirical. Change one variable at a time, test against representative inputs in the AI Playground, and keep the prompt under version control so you can evaluate and roll back. This connects directly to the CI/CD-for-prompts theme that the March 18, 2026 blueprint update emphasizes.
In a RAG app, users complain that the model confidently answers questions that are not covered by the retrieved context. Which prompt update is best?
Prompt outputs drift between runs for the same task, and the team wants more consistency. Which change usually helps first?
You need an LLM to return a stable JSON object with fields issue_type, priority, and needs_human_review. Which prompt change is most effective?