6.2 Chat Completions API and Prompt Engineering

Key Takeaways

  • The Chat Completions API takes a messages array with three roles: system (behavior and guardrails), user (input), and assistant (prior model turns or few-shot examples).
  • On Azure the model parameter is the DEPLOYMENT name, not the model name — a top recurring AI-102 distractor.
  • Sampling is controlled by temperature (0-2) OR top_p (0-1); Microsoft says tune one, not both. Use max_tokens, frequency_penalty, presence_penalty, and stop to shape output.
  • Prompt-engineering patterns the exam tests: zero-shot, few-shot (user/assistant example pairs), chain-of-thought, and strict output-format specification.
  • Streaming (stream=True) returns tokens as deltas to cut perceived latency; tool/function calling lets the model request a structured call your code executes.
Last updated: June 2026

Quick Answer: Chat Completions uses system (behavior/guardrails), user (input), and assistant (prior turns or few-shot examples) messages. Steer output with temperature (creativity) or top_p (diversity), adjusting one rather than both as Microsoft recommends, plus max_tokens, frequency_penalty, presence_penalty, and stop. On Azure the model argument is the deployment name.

Minimal Call

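from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Keyless (Microsoft Entra ID) auth; DefaultAzureCredential uses a managed identity when one is available.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)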

client = AzureOpenAI(
    api_version="2024-10-21",
    azure_endpoint="https://my-openai.openai.azure.com/",
    azure_ad_token_provider=token_provider  # managed identity preferred
)

resp = client.chat.completions.create(
    model="gpt4o-chat",            # DEPLOYMENT name, not "gpt-4o"
    messages=[
        {"role": "system", "content": "You are a concise Azure expert."},
        {"role": "user", "content": "What is Azure AI Search?"}
    ],
    temperature=0.7,
    max_tokens=500,
    stop=None
)
print(resp.choices[0].message.content)
print(resp.usage.total_tokens)

On the Exam: With direct OpenAI you pass model="gpt-4o"; with Azure OpenAI you pass the deployment name you chose. Expect at least one item where the wrong option is the literal model name.

Message Roles

Role | Purpose | Example
system | Persona, scope, format, guardrails | "You are a triage bot. Only answer health questions."
user | Human input / data | "What are flu symptoms?"
assistant | Prior responses or few-shot answers | "Fever, cough, body aches."

Multi-turn chat is stateless: you resend the full history every call because the API keeps no memory between requests. Long conversations therefore inflate input-token cost and can overflow the context window. A common exam scenario asks why costs rise turn over turn.
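To make the cost growth concrete, here is a minimal sketch under simplifying assumptions: every turn adds a known number of tokens to the history, and each call resends everything so far. The helper names (cumulative_input_tokens, trim_history) are hypothetical and not part of any SDK, and the trimming strategy is a naive one chosen for illustration.

```python
def cumulative_input_tokens(tokens_per_turn):
    """Total input tokens sent across a conversation when the full
    history is resent on every call.

    tokens_per_turn[i] is the assumed number of tokens that turn i
    adds to the history. Call k sends turns 0..k as input.
    """
    total = 0
    history = 0
    for count in tokens_per_turn:
        history += count   # history now includes this turn
        total += history   # the whole history is sent as input
    return total


def trim_history(messages, max_recent):
    """Keep every system message plus the last max_recent other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if max_recent <= 0:
        return system
    return system + rest[-max_recent:]
```

With three turns of 100 tokens each, cumulative_input_tokens([100, 100, 100]) returns 600, even though only 300 tokens of new content were ever written. Trimming or summarizing older turns is one common way to cap that growth, at the price of the model losing earlier context.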

Sampling and Penalty Parameters

Parameter | Range | Default | Effect
temperature | 0-2 | 1 | Higher = more random/creative; 0 = near-deterministic
top_p | 0-1 | 1 | Nucleus sampling: keep tokens within cumulative top_p
max_tokens | 1 to limit | model-set | Caps completion length
frequency_penalty | -2 to 2 | 0 | Positive reduces repeating tokens already used
presence_penalty | -2 to 2 | 0 | Positive pushes toward new topics
stop | up to 4 strings | none | Halts generation at a sequence
n | 1+ | 1 | Number of alternative completions
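
What temperature and top_p do to a token distribution can be illustrated with a toy example. This is a sketch of the textbook definitions (temperature-scaled softmax and nucleus filtering) applied to made-up logits; it is not a description of how the service implements sampling internally, and both function names are hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the
    distribution, higher temperature flattens it."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nucleus_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize.

    Returns a dict mapping token index to renormalized probability.
    """
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, mass = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        mass += p
        if mass >= top_p:
            break
    return {index: p / mass for index, p in kept}
```

For logits [2.0, 1.0, 0.0] at temperature 1, the probabilities are roughly 0.67, 0.24, and 0.09; a top_p of 0.9 then keeps only the first two tokens. Both knobs narrow or widen the same distribution, which is the usual rationale for adjusting one and leaving the other at its default.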

Choosing temperature

Goal | Setting
Data extraction, code, factual Q&A | temperature 0
Balanced helpfulness | 0.3-0.5
Brainstorming, creative copy | 0.7-1.0

On the Exam: Microsoft explicitly recommends adjusting temperature OR top_p, not both simultaneously. A question that sets temperature=0.2, top_p=0.2 together is testing this rule — the correct guidance is to modify only one.

Prompt-Engineering Patterns

  • Zero-shot — instruction only, no examples.
  • Few-shot — supply user/assistant example pairs so the model infers the pattern.
  • Chain-of-thought — "Think step by step" elicits intermediate reasoning, improving math/logic accuracy.
  • Output-format spec — demand strict JSON; pair with low temperature for reliability.

# Few-shot sentiment classification: two worked user/assistant pairs, then the real input.
messages = [
  {"role": "system", "content": "Classify sentiment as positive, negative, or neutral."},
  {"role": "user", "content": "The food was delicious!"},
  {"role": "assistant", "content": "positive"},
  {"role": "user", "content": "Service was terrible and slow."},
  {"role": "assistant", "content": "negative"},
  {"role": "user", "content": "It arrived on time and works perfectly!"}
]

Trap: few-shot examples go in as user/assistant message pairs (the sample input as user, the ideal answer as assistant), not stuffed into the system message. The exam distinguishes the two.
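
The few-shot layout shown above can be generated from a list of example pairs rather than written by hand. A minimal sketch follows; build_few_shot_messages is a hypothetical helper, not an SDK function, and it assumes each example is an (input, ideal answer) tuple.

```python
def build_few_shot_messages(instruction, examples, query):
    """Assemble a few-shot messages list.

    examples is a list of (user_text, assistant_text) pairs; each pair
    becomes one user message followed by one assistant message.
    """
    messages = [{"role": "system", "content": instruction}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The real input goes last so the model completes the pattern.
    messages.append({"role": "user", "content": query})
    return messages


messages = build_few_shot_messages(
    "Classify sentiment as positive, negative, or neutral.",
    [
        ("The food was delicious!", "positive"),
        ("Service was terrible and slow.", "negative"),
    ],
    "It arrived on time and works perfectly!",
)
```

The resulting list has the same shape as the hand-written example: system instruction, alternating user/assistant pairs, then the unanswered user message.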

Streaming

stream = client.chat.completions.create(
    model="gpt4o-chat",
    messages=[{"role": "user", "content": "Explain ML simply."}],
    stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Streaming delivers tokens as deltas the moment they are produced, cutting perceived latency for chat UIs. It does not reduce total tokens or cost; the trade-off is that you cannot inspect the full response (or apply post-generation content checks) until the stream completes.
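
If the application needs the complete text for a check after generation, one option is to accumulate the deltas while still displaying them. The sketch below assumes chunks shaped like those in the loop above (choices[0].delta.content); collect_stream and fake_chunk are hypothetical names, and fake_chunk exists only so the example runs without a live service.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_delta=None):
    """Accumulate streamed deltas into the full response text.

    on_delta, if given, is called with each text fragment as it
    arrives (for example, to print it to a chat UI).
    """
    parts = []
    for chunk in chunks:
        # Some chunks carry no choices or an empty delta; skip them.
        if not chunk.choices:
            continue
        text = chunk.choices[0].delta.content
        if text:
            parts.append(text)
            if on_delta is not None:
                on_delta(text)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in that mimics the shape of a streamed chunk (assumption)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

With a real stream, collect_stream(stream, on_delta=lambda t: print(t, end="", flush=True)) would display tokens as they arrive and return the full text once the stream ends, at which point any whole-response check can run.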

Tool / Function Calling

Function calling lets the model emit a structured request that your code executes — the model never runs the function itself.

import json  # needed to parse the arguments string below

tools = [{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
      },
      "required": ["location"]
    }
  }
}]

resp = client.chat.completions.create(
    model="gpt4o-chat",
    messages=[{"role": "user", "content": "Weather in Seattle?"}],
    tools=tools, tool_choice="auto"
)
if resp.choices[0].message.tool_calls:
    call = resp.choices[0].message.tool_calls[0]
    args = json.loads(call.function.arguments)  # {"location": "Seattle, WA"}

The loop is: model returns tool_calls → your app runs the real function → you append a tool role message with the result → call the API again so the model writes the final natural-language answer.

Two behaviors matter for the exam. First, the model only proposes the call; it serializes the arguments as a JSON string you must parse, and your application is solely responsible for executing the work. Second, tool_choice governs the decision: auto lets the model choose whether to call a tool, none forbids tool use, and naming a function forces that specific call. You can register several tools at once and the model may request more than one in a single turn.
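
The "your app runs the real function" step of the loop might look like the sketch below. It assumes tool-call objects with id, function.name, and function.arguments attributes, as in the snippet above; get_weather, TOOL_REGISTRY, and run_tool_calls are hypothetical names, and the weather data is a placeholder rather than a real lookup.

```python
import json
from types import SimpleNamespace


def get_weather(location, unit="celsius"):
    """Hypothetical local implementation; a real app would call a weather API."""
    return {"location": location, "unit": unit, "forecast": "placeholder"}


# Maps tool names the model may request to the local functions that do the work.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each requested call locally and return tool-role messages."""
    results = []
    for call in tool_calls:
        func = registry.get(call.function.name)
        if func is None:
            output = {"error": f"unknown tool: {call.function.name}"}
        else:
            # Arguments arrive as a JSON string and must be parsed first.
            args = json.loads(call.function.arguments)
            output = func(**args)
        results.append({
            "role": "tool",
            "tool_call_id": call.id,  # ties the result to the model's request
            "content": json.dumps(output),
        })
    return results


# Simulated tool call so the sketch runs without a live service.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Seattle"}'),
)
tool_messages = run_tool_calls([fake_call])
```

In a live loop you would append the assistant message that contained tool_calls to the history, extend it with these tool messages, and call the API again for the final answer. To force a specific call instead of leaving the choice to the model, tool_choice takes the form {"type": "function", "function": {"name": "get_weather"}}.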

When to use which technique

Situation | Reach for
Pull live data or trigger an action | Tool / function calling
Enforce a structured task with no examples | Zero-shot + format spec
Teach a pattern from samples | Few-shot
Improve multi-step math or logic | Chain-of-thought
Reduce perceived wait in a chat UI | Streaming

On the Exam: Function calling is the foundation for the agent pattern in section 6.6 — the agent loop is just function calling plus managed threads and memory. Expect items contrasting tool_choice values and asking who executes the function (always your code, never the service).

Test Your Knowledge

In an Azure OpenAI Chat Completions call, what value belongs in the model parameter?

Microsoft's guidance on sampling parameters is to:

A developer notices cost climbing as a chat conversation gets longer, even though replies stay short. Why?

When the model returns a tool_calls object, what executes the requested function?
