6.2 Chat Completions API and Prompt Engineering
Key Takeaways
- The Chat Completions API takes a messages array with three roles: system (behavior and guardrails), user (input), and assistant (prior model turns or few-shot examples).
- On Azure the model parameter is the DEPLOYMENT name, not the model name — a top recurring AI-102 distractor.
- Sampling is controlled by temperature (0-2) OR top_p (0-1); Microsoft says tune one, not both. Use max_tokens, frequency_penalty, presence_penalty, and stop to shape output.
- Prompt-engineering patterns the exam tests: zero-shot, few-shot (user/assistant example pairs), chain-of-thought, and strict output-format specification.
- Streaming (stream=True) returns tokens as deltas to cut perceived latency; tool/function calling lets the model request a structured call your code executes.
Quick Answer: Chat Completions uses system (behavior/guardrails), user (input), and assistant (prior turns or few-shot examples) messages. Steer output with temperature (creativity) or top_p (diversity), adjusting one rather than both, plus max_tokens, frequency_penalty, presence_penalty, and stop. On Azure, the model argument is the deployment name.
Minimal Call
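from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Keyless auth: exchange the managed identity (or developer login) for Entra ID tokens.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    api_version="2024-10-21",
    azure_endpoint="https://my-openai.openai.azure.com/",
    azure_ad_token_provider=token_provider,  # managed identity preferred
)

resp = client.chat.completions.create(
    model="gpt4o-chat",  # DEPLOYMENT name, not "gpt-4o"
    messages=[
        {"role": "system", "content": "You are a concise Azure expert."},
        {"role": "user", "content": "What is Azure AI Search?"},
    ],
    temperature=0.7,
    max_tokens=500,
    stop=None,
)
print(resp.choices[0].message.content)  # the assistant's reply text
print(resp.usage.total_tokens)  # prompt + completion tokens for this call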
On the Exam: With direct OpenAI you pass model="gpt-4o"; with Azure OpenAI you pass the deployment name you chose. Expect at least one item where the wrong option is the literal model name.
Message Roles
| Role | Purpose | Example |
|---|---|---|
| system | Persona, scope, format, guardrails | "You are a triage bot. Only answer health questions." |
| user | Human input / data | "What are flu symptoms?" |
| assistant | Prior responses or few-shot answers | "Fever, cough, body aches." |
Multi-turn chat is stateless: you resend the full history on every call because the API keeps no memory between requests. Long conversations therefore inflate input-token cost and can overflow the context window. A common exam scenario asks why costs rise turn over turn.
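The growth in input size can be seen in a minimal sketch. The `estimate_tokens` heuristic (about four characters per token) and the `trim_history` helper are assumptions made for illustration, not part of the SDK; a real tokenizer would give exact counts.

```python
def estimate_tokens(messages):
    """Crude estimate: about 4 characters per token (an assumption, not an API rule)."""
    return sum(len(m["content"]) for m in messages) // 4


def trim_history(messages, max_tokens):
    """Keep the system message plus the most recent turns that fit the budget."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    kept = []
    for msg in reversed(turns):  # walk from newest to oldest
        if estimate_tokens(system + [msg] + kept) > max_tokens:
            break
        kept.insert(0, msg)
    return system + kept


history = [{"role": "system", "content": "You are a concise Azure expert."}]
sizes = []
for turn in range(1, 4):
    history.append({"role": "user", "content": f"Question {turn}: " + "x" * 80})
    # The whole list is what would be sent as `messages` on this call.
    sizes.append(estimate_tokens(history))
    history.append({"role": "assistant", "content": "Short reply."})

print(sizes)  # estimated input size grows on every turn
trimmed = trim_history(history, max_tokens=40)
print(len(history), len(trimmed))
```

Dropping the oldest turns is only one possible strategy; summarizing earlier turns is another, at the cost of an extra model call.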
Sampling and Penalty Parameters
| Parameter | Range | Default | Effect |
|---|---|---|---|
| temperature | 0-2 | 1 | Higher = more random/creative; 0 = near-deterministic |
| top_p | 0-1 | 1 | Nucleus sampling: sample only from the most likely tokens whose cumulative probability reaches top_p |
| max_tokens | 1-limit | model-set | Caps completion length |
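| frequency_penalty | -2 to 2 | 0 | Positive values penalize tokens by how often they have already appeared, reducing repetition |
| presence_penalty | -2 to 2 | 0 | Positive values penalize any token that has already appeared, pushing toward new topics |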
| stop | up to 4 strings | none | Halts generation at a sequence |
| n | 1+ | 1 | Number of alternative completions |
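To make temperature and top_p concrete, here is a toy illustration of the general techniques (temperature-scaled softmax and nucleus filtering) on four made-up token scores. It is a sketch of the idea only and does not claim to reproduce how the service implements sampling; the function names and logits are invented for the example.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by temperature (temperature must be > 0 here)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most-likely tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalize the survivors


logits = [2.0, 1.0, 0.5, 0.1]          # made-up scores for four candidate tokens
cool = apply_temperature(logits, 0.2)  # sharply peaked on the top token
warm = apply_temperature(logits, 1.5)  # flatter, so sampling is more random
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.8)
print(round(cool[0], 3), round(warm[0], 3), sorted(nucleus))
```

Both knobs reshape the same distribution, which is one way to see why the guidance is to adjust only one of them at a time.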
Choosing temperature
| Goal | Setting |
|---|---|
| Data extraction, code, factual Q&A | temperature 0 |
| Balanced helpfulness | 0.3-0.5 |
| Brainstorming, creative copy | 0.7-1.0 |
On the Exam: Microsoft explicitly recommends adjusting temperature OR top_p, not both simultaneously. A question that sets temperature=0.2, top_p=0.2 together is testing this rule; the correct guidance is to modify only one.
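One way to apply that guidance in application code is a small guard that builds the sampling arguments. `sampling_kwargs` is a hypothetical helper, not an SDK function; the API itself accepts both values, so this only enforces the recommendation locally.

```python
def sampling_kwargs(temperature=None, top_p=None, max_tokens=None):
    """Build keyword arguments for a chat completion, allowing only one sampling knob."""
    if temperature is not None and top_p is not None:
        raise ValueError("Set temperature or top_p, not both.")
    if temperature is not None and not 0 <= temperature <= 2:
        raise ValueError("temperature must be between 0 and 2.")
    if top_p is not None and not 0 <= top_p <= 1:
        raise ValueError("top_p must be between 0 and 1.")
    kwargs = {}
    if temperature is not None:
        kwargs["temperature"] = temperature
    if top_p is not None:
        kwargs["top_p"] = top_p
    if max_tokens is not None:
        kwargs["max_tokens"] = max_tokens
    return kwargs


extraction = sampling_kwargs(temperature=0, max_tokens=200)
print(extraction)

rejected = False
try:
    sampling_kwargs(temperature=0.2, top_p=0.2)  # the combination the exam item flags
except ValueError as err:
    rejected = True
    print(err)
```

The resulting dictionary could be unpacked into a call such as `client.chat.completions.create(..., **extraction)`.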
Prompt-Engineering Patterns
- Zero-shot — instruction only, no examples.
- Few-shot — supply user/assistant example pairs so the model infers the pattern.
- Chain-of-thought — "Think step by step" elicits intermediate reasoning, improving math/logic accuracy.
- Output-format spec — demand strict JSON; pair with low temperature for reliability.
messages = [
    {"role": "system", "content": "Classify sentiment as positive, negative, or neutral."},
    {"role": "user", "content": "The food was delicious!"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "Service was terrible and slow."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "It arrived on time and works perfectly!"},
]
Trap: few-shot examples are supplied as user/assistant message pairs, with each example answer in the assistant role, rather than stuffed into the system message. The exam distinguishes the two.
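When the examples live in data rather than in hand-written literals, the same message layout can be generated. The `build_few_shot` helper below is a hypothetical convenience function, sketched only to show where each piece goes.

```python
def build_few_shot(instruction, examples, query):
    """Return a messages list: system instruction, example pairs, then the real query."""
    messages = [{"role": "system", "content": instruction}]
    for text, label in examples:
        messages.append({"role": "user", "content": text})        # example input
        messages.append({"role": "assistant", "content": label})  # example answer
    messages.append({"role": "user", "content": query})           # the item to classify
    return messages


few_shot = build_few_shot(
    "Classify sentiment as positive, negative, or neutral.",
    [("The food was delicious!", "positive"),
     ("Service was terrible and slow.", "negative")],
    "It arrived on time and works perfectly!",
)
print([m["role"] for m in few_shot])
```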
Streaming
stream = client.chat.completions.create(
    model="gpt4o-chat",
    messages=[{"role": "user", "content": "Explain ML simply."}],
    stream=True,
)
for chunk in stream:
    # Some chunks carry no choices or an empty delta, so guard before printing.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
Streaming delivers tokens as deltas the moment they are produced, cutting perceived latency for chat UIs. It does not reduce total tokens or cost; the trade-off is that you cannot inspect the full response (or apply post-generation content checks) until the stream completes.
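The trade-off can be sketched without a live service. Here `fake_stream` is a stand-in that yields objects shaped like streamed chunks, and `collect` is a hypothetical consumer; both are assumptions for illustration. Each delta is forwarded immediately, but the complete text exists only after the loop ends.

```python
from types import SimpleNamespace


def fake_stream(pieces):
    """Stand-in for the SDK stream: yields objects shaped like streamed chunks."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
    # A chunk with no choices, which the consumer must skip.
    yield SimpleNamespace(choices=[])


def collect(stream, on_delta=None):
    """Forward each delta as it arrives and return the full text at the end."""
    parts = []
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            text = chunk.choices[0].delta.content
            parts.append(text)
            if on_delta:
                on_delta(text)  # e.g. push to the chat UI right away
    return "".join(parts)


seen = []
full_text = collect(
    fake_stream(["Machine ", "learning ", None, "finds patterns."]),
    on_delta=seen.append,
)
print(full_text)  # only now can the whole reply be inspected or checked
```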
Tool / Function Calling
Function calling lets the model emit a structured request that your code executes — the model never runs the function itself.
import json

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    },
}]
resp = client.chat.completions.create(
    model="gpt4o-chat",
    messages=[{"role": "user", "content": "Weather in Seattle?"}],
    tools=tools,
    tool_choice="auto",
)
if resp.choices[0].message.tool_calls:
    call = resp.choices[0].message.tool_calls[0]
    # Arguments arrive as a JSON string, e.g. {"location": "Seattle, WA"}
    args = json.loads(call.function.arguments)
The loop is: model returns tool_calls → your app runs the real function → you append a tool role message with the result → call the API again so the model writes the final natural-language answer.
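The middle of that loop (run the function, append the tool message) might look like the sketch below. It uses plain dictionaries shaped like a tool-calling assistant turn instead of a live response, and `get_weather`, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical names with canned data.

```python
import json


def get_weather(location, unit="celsius"):
    """Hypothetical local function; a real app would call a weather service."""
    return {"location": location, "unit": unit, "temperature": 18}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(messages, assistant_message):
    """Execute each requested call locally and append the results as tool messages."""
    messages.append(assistant_message)  # keep the turn that requested the calls
    for call in assistant_message["tool_calls"]:
        func = TOOL_REGISTRY[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])  # JSON string -> dict
        result = func(**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call["id"],  # ties the result to the request
            "content": json.dumps(result),
        })
    return messages


# Simulated assistant turn, shaped like the dict form of a tool-calling response.
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"location": "Seattle, WA"}'},
    }],
}
messages = [{"role": "user", "content": "Weather in Seattle?"}]
messages = run_tool_calls(messages, assistant_message)
print(messages[-1])
# Next step (not shown): send `messages` back so the model writes the final answer.
```

Because the function name comes from the model, looking it up in an explicit registry keeps the application in control of what can actually run.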
Two behaviors matter for the exam. First, the model only proposes the call; it serializes the arguments as a JSON string you must parse, and your application is solely responsible for executing the work. Second, tool_choice governs the decision: auto lets the model choose whether to call a tool, none forbids tool use, and naming a function forces that specific call. You can register several tools at once and the model may request more than one in a single turn.
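The three tool_choice forms described above can be written out as values. `tool_choice_for` and its "force" mode label are hypothetical, introduced only to line the options up side by side.

```python
def tool_choice_for(mode, function_name=None):
    """Map a simple mode label to the value passed as tool_choice."""
    if mode in ("auto", "none"):
        return mode  # plain strings: model decides, or tools are forbidden
    if mode == "force":
        if not function_name:
            raise ValueError("Forcing a call requires a function name.")
        # Naming a function makes the model request that specific call.
        return {"type": "function", "function": {"name": function_name}}
    raise ValueError(f"Unknown mode: {mode}")


print(tool_choice_for("auto"))
print(tool_choice_for("none"))
print(tool_choice_for("force", "get_weather"))
```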
When to use which technique
| Situation | Reach for |
|---|---|
| Pull live data or trigger an action | Tool / function calling |
| Enforce a structured task with no examples | Zero-shot + format spec |
| Teach a pattern from samples | Few-shot |
| Improve multi-step math or logic | Chain-of-thought |
| Reduce perceived wait in a chat UI | Streaming |
On the Exam: Function calling is the foundation for the agent pattern in section 6.6; the agent loop is just function calling plus managed threads and memory. Expect items contrasting tool_choice values and asking who executes the function (always your code, never the service).