6.2 Chat Completions API and Prompt Engineering

Key Takeaways

The Chat Completions API takes a messages array with three roles: system (behavior and guardrails), user (input), and assistant (prior model turns or few-shot examples).
On Azure the model parameter is the DEPLOYMENT name, not the model name — a top recurring AI-102 distractor.
Sampling is controlled by temperature (0-2) OR top_p (0-1); Microsoft says tune one, not both. Use max_tokens, frequency_penalty, presence_penalty, and stop to shape output.
Prompt-engineering patterns the exam tests: zero-shot, few-shot (user/assistant example pairs), chain-of-thought, and strict output-format specification.
Streaming (stream=True) returns tokens as deltas to cut perceived latency; tool/function calling lets the model request a structured call your code executes.

Last updated: June 2026

Quick Answer: Chat Completions uses system (behavior/guardrails), user (input), and assistant (prior turns or few-shot examples) messages. Steer output with temperature (creativity) or top_p (diversity); Microsoft recommends changing one, not both. Add max_tokens, frequency_penalty, presence_penalty, and stop to shape the result. On Azure the model argument is the deployment name.

Minimal Call

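from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# Microsoft Entra ID token provider (picks up a managed identity when running in Azure)
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    api_version="2024-10-21",
    azure_endpoint="https://my-openai.openai.azure.com/",
    azure_ad_token_provider=token_provider  # managed identity preferred
)

resp = client.chat.completions.create(
    model="gpt4o-chat",            # DEPLOYMENT name, not "gpt-4o"
    messages=[
        {"role": "system", "content": "You are a concise Azure expert."},
        {"role": "user", "content": "What is Azure AI Search?"}
    ],
    temperature=0.7,
    max_tokens=500,
    stop=None
)
print(resp.choices[0].message.content)  # the generated answer
print(resp.usage.total_tokens)          # prompt + completion tokens billed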

On the Exam: With direct OpenAI you pass model="gpt-4o"; with Azure OpenAI you pass the deployment name you chose. Expect at least one item where the wrong option is the literal model name.

Message Roles

Role	Purpose	Example
system	Persona, scope, format, guardrails	"You are a triage bot. Only answer health questions."
user	Human input / data	"What are flu symptoms?"
assistant	Prior responses or few-shot answers	"Fever, cough, body aches."

Multi-turn chat is stateless: the API keeps no memory between requests, so you resend the full history on every call. Long conversations therefore inflate input-token cost and can overflow the context window. A common exam scenario asks why costs rise turn over turn, and this is the reason.
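One way to keep the resent history bounded is to keep the system message and only the most recent turns. The sketch below is a minimal illustration, not a prescribed approach: `trim_history` and `max_turn_messages` are hypothetical names, and counting messages instead of tokens is a simplifying assumption.

```python
def trim_history(messages, max_turn_messages=6):
    """Keep all system messages plus the last `max_turn_messages` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    # Guard against turns[-0:], which would return the whole list.
    kept = turns[-max_turn_messages:] if max_turn_messages > 0 else []
    return system + kept


# Build a five-turn conversation to show how the payload grows.
history = [{"role": "system", "content": "You are a concise Azure expert."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_turn_messages=4)
print(len(history), len(trimmed))  # prints: 11 5
```

A production version would more likely budget by token count, since the context window and billing are both measured in tokens.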

Sampling and Penalty Parameters

Parameter	Range	Default	Effect
temperature	0-2	1	Higher = more random/creative; 0 = near-deterministic
top_p	0-1	1	Nucleus sampling: sample only from the smallest set of top tokens whose cumulative probability reaches top_p
max_tokens	1-limit	model-set	Caps completion length
frequency_penalty	-2 to 2	0	Positive reduces repeating tokens already used
presence_penalty	-2 to 2	0	Positive pushes toward new topics
stop	up to 4 strings	none	Halts generation at a sequence
n	1+	1	Number of alternative completions
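To make the temperature and top_p rows concrete, here is a toy sketch of both ideas on a three-token distribution. It illustrates the general technique only and is not a description of how the service implements sampling; the function names and the example logits are assumptions made for this illustration.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("this sketch needs a positive temperature")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


logits = [2.0, 1.0, 0.1]                        # hypothetical scores for three tokens
warm = softmax_with_temperature(logits, 1.0)    # default temperature
cold = softmax_with_temperature(logits, 0.2)    # low temperature
print([round(p, 3) for p in warm])
print([round(p, 3) for p in cold])
print(top_p_filter(warm, 0.5))                  # only the most likely token survives
```

Both knobs narrow or widen the pool of candidate tokens, which is why changing them together makes the effect hard to reason about.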

Choosing temperature

Goal	Setting
Data extraction, code, factual Q&A	temperature 0
Balanced helpfulness	0.3-0.5
Brainstorming, creative copy	0.7-1.0

On the Exam: Microsoft explicitly recommends adjusting temperature OR top_p, not both simultaneously. A question that sets temperature=0.2, top_p=0.2 together is testing this rule — the correct guidance is to modify only one.

Prompt-Engineering Patterns

Zero-shot — instruction only, no examples.
Few-shot — supply user/assistant example pairs so the model infers the pattern.
Chain-of-thought — "Think step by step" elicits intermediate reasoning, improving math/logic accuracy.
Output-format spec — demand strict JSON; pair with low temperature for reliability.
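The output-format pattern works best when the application also checks the reply against the format it asked for. The following is a minimal sketch under stated assumptions: `FORMAT_SPEC`, `build_messages`, and `parse_sentiment_reply` are hypothetical names, and `reply_text` stands in for `resp.choices[0].message.content` from a real call.

```python
import json

ALLOWED = {"positive", "negative", "neutral"}

# Instruction that pins the reply to a single JSON shape.
FORMAT_SPEC = (
    'Reply with JSON only, in the form {"sentiment": "<label>"}, '
    "where <label> is positive, negative, or neutral."
)


def build_messages(text):
    """Zero-shot prompt: a format instruction plus the input, no examples."""
    return [
        {"role": "system", "content": FORMAT_SPEC},
        {"role": "user", "content": text},
    ]


def parse_sentiment_reply(reply_text):
    """Return the label if the reply matches the spec, otherwise None."""
    try:
        data = json.loads(reply_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    label = data.get("sentiment")
    if isinstance(label, str) and label in ALLOWED:
        return label
    return None


print(parse_sentiment_reply('{"sentiment": "positive"}'))  # prints: positive
print(parse_sentiment_reply("positive"))                   # prints: None
```

Returning None on a malformed reply lets the caller decide whether to retry or fall back, instead of crashing on unexpected text.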

messages = [
  {"role": "system", "content": "Classify sentiment as positive, negative, or neutral."},
  {"role": "user", "content": "The food was delicious!"},
  {"role": "assistant", "content": "positive"},
  {"role": "user", "content": "Service was terrible and slow."},
  {"role": "assistant", "content": "negative"},
  {"role": "user", "content": "It arrived on time and works perfectly!"}
]

Trap: few-shot examples are supplied as user/assistant message pairs (the sample input in the user role, the desired answer in the assistant role), not stuffed into the system message. The exam distinguishes the two.
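A few-shot message list like the one above can be generated from a list of labeled samples. This is a small sketch; `build_few_shot_messages` is a hypothetical helper name, not part of any SDK.

```python
def build_few_shot_messages(instruction, examples, query):
    """Build system + alternating user/assistant example pairs + the real query."""
    messages = [{"role": "system", "content": instruction}]
    for user_text, assistant_text in examples:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": query})
    return messages


few_shot = build_few_shot_messages(
    "Classify sentiment as positive, negative, or neutral.",
    [
        ("The food was delicious!", "positive"),
        ("Service was terrible and slow.", "negative"),
    ],
    "It arrived on time and works perfectly!",
)
print([m["role"] for m in few_shot])
# prints: ['system', 'user', 'assistant', 'user', 'assistant', 'user']
```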

Streaming

stream = client.chat.completions.create(
    model="gpt4o-chat",
    messages=[{"role": "user", "content": "Explain ML simply."}],
    stream=True
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Streaming delivers tokens as deltas the moment they are produced, cutting perceived latency for chat UIs. It does not reduce total tokens or cost; the trade-off is that you cannot inspect the full response (or apply post-generation content checks) until the stream completes.
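If the application needs the complete text once the stream ends (for logging or a post-generation check), it can accumulate the deltas while displaying them. The sketch below uses stand-in chunk objects so it runs without a service call; `collect_stream` and `fake_chunk` are hypothetical helpers, and real chunks would come from the SDK's streaming response.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join every non-empty delta.content piece into the full response text."""
    parts = []
    for chunk in chunks:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            parts.append(chunk.choices[0].delta.content)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, shaped like the attributes read above."""
    if text is None:
        return SimpleNamespace(choices=[])
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


chunks = [fake_chunk(None), fake_chunk("Machine "), fake_chunk("learning"), fake_chunk("")]
full_text = collect_stream(chunks)
print(full_text)  # prints: Machine learning
```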

Tool / Function Calling

Function calling lets the model emit a structured request that your code executes — the model never runs the function itself.

tools = [{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {
        "location": {"type": "string"},
        "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
      },
      "required": ["location"]
    }
  }
}]

import json

resp = client.chat.completions.create(
    model="gpt4o-chat",
    messages=[{"role": "user", "content": "Weather in Seattle?"}],
    tools=tools,
    tool_choice="auto"
)
if resp.choices[0].message.tool_calls:
    call = resp.choices[0].message.tool_calls[0]
    # arguments is a JSON string, e.g. '{"location": "Seattle, WA"}'
    args = json.loads(call.function.arguments)

The loop is: model returns tool_calls → your app runs the real function → you append the assistant message that requested the call, followed by a tool role message carrying the result and the matching tool_call_id → call the API again so the model writes the final natural-language answer.
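The "your app runs the real function" step might look like the sketch below. It runs without a service call by using a stand-in tool call object; `get_weather`, `REGISTRY`, `run_tool_calls`, and the returned weather values are all hypothetical. In a real exchange you would append the assistant message that contained tool_calls, then these tool messages, and call the API again.

```python
import json
from types import SimpleNamespace


def get_weather(location, unit="celsius"):
    """Hypothetical local function; a real one would query a weather service."""
    return {"location": location, "unit": unit, "temperature": 18}


# Maps the tool names advertised to the model onto local callables.
REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry):
    """Execute each requested call locally and build the tool-role messages."""
    results = []
    for call in tool_calls:
        func = registry[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        output = func(**args)
        results.append({
            "role": "tool",
            "tool_call_id": call.id,   # ties the result to the request
            "content": json.dumps(output),
        })
    return results


# Stand-in shaped like an entry of resp.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Seattle"}'),
)
tool_messages = run_tool_calls([fake_call], REGISTRY)
print(tool_messages[0]["content"])
```

Looking functions up in a registry keeps the model's requested name from being executed blindly: an unknown name raises KeyError instead of running arbitrary code.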

Two behaviors matter for the exam. First, the model only proposes the call; it serializes the arguments as a JSON string you must parse, and your application is solely responsible for executing the work. Second, tool_choice governs the decision: auto lets the model choose whether to call a tool, none forbids tool use, and naming a function forces that specific call. You can register several tools at once and the model may request more than one in a single turn.
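The three tool_choice forms described above differ in shape: two are plain strings, while forcing a specific function takes an object. A small sketch, where `tool_choice_for` and its "force" mode label are hypothetical conveniences and not API values:

```python
def tool_choice_for(mode, function_name=None):
    """Return a tool_choice value for the request.

    mode is "auto", "none", or "force" (a label local to this helper).
    """
    if mode in ("auto", "none"):
        return mode  # passed to the API as a plain string
    if mode == "force":
        if not function_name:
            raise ValueError("forcing a call requires a function name")
        # Naming a function forces that specific call.
        return {"type": "function", "function": {"name": function_name}}
    raise ValueError(f"unknown mode: {mode}")


print(tool_choice_for("auto"))
print(tool_choice_for("none"))
print(tool_choice_for("force", "get_weather"))
```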

When to use which technique

Situation	Reach for
Pull live data or trigger an action	Tool / function calling
Enforce a structured task with no examples	Zero-shot + format spec
Teach a pattern from samples	Few-shot
Improve multi-step math or logic	Chain-of-thought
Reduce perceived wait in a chat UI	Streaming

On the Exam: Function calling is the foundation for the agent pattern in section 6.6 — the agent loop is just function calling plus managed threads and memory. Expect items contrasting tool_choice values and asking who executes the function (always your code, never the service).

Test Your Knowledge

In an Azure OpenAI Chat Completions call, what value belongs in the model parameter?

The deployment name you chose when deploying the model

The OpenAI model name such as gpt-4o

The Azure resource name

The api_version string

Test Your Knowledge

Microsoft's guidance on sampling parameters is to:

Always set both temperature and top_p to 0 for chat

Adjust temperature OR top_p, but not both at the same time

Never change temperature from its default of 1

Use frequency_penalty instead of temperature for creativity

Test Your Knowledge

A developer notices cost climbing as a chat conversation gets longer, even though replies stay short. Why?

Streaming charges extra per delta

The API stores history server-side and bills for storage

The full conversation history is resent as input tokens on every call because the API is stateless

Higher temperature increases token price

Test Your Knowledge

When the model returns a tool_calls object, what executes the requested function?

The Azure OpenAI service runs it automatically

Your application code runs the function and returns the result to the model

The function never actually runs; it is only logged

Azure AI Search executes it

Up Next

6.3 Embeddings and Vector Search for RAG
