6.2 Chat Completions API and Prompt Engineering
Key Takeaways
- The Chat Completions API uses a messages array with three roles: system (behavior instructions), user (human input), and assistant (model responses).
- The system message defines the AI assistant's persona, constraints, and behavior — it is the primary mechanism for controlling model output.
- Key parameters: temperature (0-2, creativity), max_tokens (output length), top_p (nucleus sampling), frequency_penalty, presence_penalty, and stop sequences.
- Prompt engineering techniques include few-shot examples, chain-of-thought reasoning, role-based instructions, and output format specification.
- Response streaming delivers tokens incrementally as they are generated, reducing perceived latency for real-time applications.
Chat Completions API and Prompt Engineering
Quick Answer: The Chat Completions API uses messages with system (behavior), user (input), and assistant (responses) roles. Control output with temperature (creativity), max_tokens (length), and top_p (diversity). Use prompt engineering techniques: few-shot examples, chain-of-thought, and output format specification.
Basic API Call
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-key>",
    api_version="2024-06-01",
    azure_endpoint="https://my-openai.openai.azure.com/"
)

response = client.chat.completions.create(
    model="gpt4o-deployment",  # deployment name, NOT model name
    messages=[
        {
            "role": "system",
            "content": "You are a helpful AI assistant that provides concise, accurate answers about Azure AI services."
        },
        {
            "role": "user",
            "content": "What is Azure AI Search?"
        }
    ],
    temperature=0.7,
    max_tokens=500,
    top_p=0.95,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
```
On the Exam: The model parameter in Azure OpenAI is the deployment name, not the model name. This differs from the direct OpenAI API, where you specify the model name (e.g., "gpt-4o"). On Azure, you use the name you chose when deploying the model.
Message Roles
| Role | Purpose | Example |
|---|---|---|
| system | Define AI behavior, persona, constraints, and output format | "You are a medical triage assistant. Only answer health-related questions." |
| user | Human input — questions, requests, data | "What are the symptoms of flu?" |
| assistant | Model responses (for conversation history or few-shot examples) | "Common flu symptoms include fever, cough, and body aches." |
Multi-Turn Conversation
```python
messages = [
    {"role": "system", "content": "You are a helpful Azure expert."},
    {"role": "user", "content": "What is Azure AI Search?"},
    {"role": "assistant", "content": "Azure AI Search is a cloud search service..."},
    {"role": "user", "content": "How does it compare to Elasticsearch?"}
]
# Each turn includes full conversation history
```
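Because the API is stateless, the client must resend the entire history on every call and append each new exchange to it. A minimal sketch of that bookkeeping, with a hypothetical `send` callback standing in for a real `client.chat.completions.create` call:

```python
def chat_turn(history, user_input, send):
    """Append the user message, get a reply via `send`, and record it.

    `send` is a stand-in for a real Chat Completions API call: it takes
    the full message list and returns the assistant's reply text.
    """
    history.append({"role": "user", "content": user_input})
    reply = send(history)  # in practice: call the Chat Completions API here
    history.append({"role": "assistant", "content": reply})
    return reply

# Usage with a stubbed backend that just echoes the last question
history = [{"role": "system", "content": "You are a helpful Azure expert."}]
echo = lambda msgs: f"You asked: {msgs[-1]['content']}"
chat_turn(history, "What is Azure AI Search?", echo)
chat_turn(history, "How does it compare to Elasticsearch?", echo)
print(len(history))  # system + 2 user + 2 assistant = 5 messages
```

The same `history` list is passed on every turn, which is why token usage grows with conversation length.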
Key Parameters
| Parameter | Range | Default | Effect |
|---|---|---|---|
| temperature | 0-2 | 1 | Higher = more creative/random; Lower = more focused/deterministic |
| max_tokens | 1 to model limit | Model-specific | Maximum tokens in the generated response |
| top_p | 0-1 | 1 | Nucleus sampling — consider tokens whose cumulative probability is within top_p |
| frequency_penalty | -2 to 2 | 0 | Positive values reduce repetition of tokens already generated |
| presence_penalty | -2 to 2 | 0 | Positive values encourage the model to talk about new topics |
| n | 1+ | 1 | Number of response alternatives to generate |
| stop | Array of strings | None | Sequences that cause the model to stop generating |
Temperature vs. Top-p
| Setting | When to Use |
|---|---|
| Temperature = 0 | Factual answers, code generation, data extraction |
| Temperature = 0.3-0.5 | Balanced creativity with accuracy |
| Temperature = 0.7-1.0 | Creative writing, brainstorming |
| Top_p = 0.1 | Very focused, deterministic output |
| Top_p = 0.9 | Diverse but still coherent output |
On the Exam: Do NOT set both temperature and top_p to non-default values simultaneously. Microsoft recommends adjusting one OR the other, not both. Questions may test this recommendation.
Prompt Engineering Techniques
1. Zero-Shot (No Examples)
```python
messages = [
    {"role": "system", "content": "Classify the following text as positive, negative, or neutral."},
    {"role": "user", "content": "The product arrived on time and works perfectly!"}
]
```
2. Few-Shot (With Examples)
```python
messages = [
    {"role": "system", "content": "Classify the following text as positive, negative, or neutral."},
    {"role": "user", "content": "The food was delicious!"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "The service was terrible and slow."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "The product arrived on time and works perfectly!"}
]
```
3. Chain-of-Thought
```python
messages = [
    {"role": "system", "content": "Think step-by-step before providing your final answer."},
    {"role": "user", "content": "A company has 150 employees. 30% work remotely. How many work in the office?"}
]
# Model outputs: "Step 1: 30% of 150 = 45 remote workers.
#                 Step 2: 150 - 45 = 105 office workers.
#                 Answer: 105 employees work in the office."
```
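When the chain-of-thought output ends in a predictable "Answer:" marker, the final value can be pulled out programmatically before showing it to the user. A sketch; the `extract_answer` helper and the answer format are illustrative assumptions, not part of the API:

```python
import re

def extract_answer(text):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", text)
    return matches[-1].strip() if matches else None

# Sample model output in the step-by-step format requested above
output = ("Step 1: 30% of 150 = 45 remote workers.\n"
          "Step 2: 150 - 45 = 105 office workers.\n"
          "Answer: 105 employees work in the office.")
print(extract_answer(output))  # → 105 employees work in the office.
```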
4. Output Format Specification
```python
messages = [
    {"role": "system", "content": """Extract information and return ONLY valid JSON:
{
    "name": "string",
    "date": "YYYY-MM-DD",
    "amount": number,
    "category": "string"
}"""},
    {"role": "user", "content": "Invoice from Contoso for $5,000 dated March 15, 2026 for consulting services."}
]
```
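Once the model is instructed to return only JSON, the response text can be parsed directly with `json.loads`. A sketch, using a sample string as a stand-in for `response.choices[0].message.content` (the field values are assumed for illustration):

```python
import json

# Stand-in for response.choices[0].message.content
raw = '{"name": "Contoso", "date": "2026-03-15", "amount": 5000, "category": "consulting"}'

invoice = json.loads(raw)
print(invoice["amount"])  # → 5000
print(invoice["date"])    # → 2026-03-15
```

In production, wrap the parse in a try/except, since the model can still occasionally emit invalid JSON.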
Response Streaming
```python
# Streaming response — tokens delivered incrementally
stream = client.chat.completions.create(
    model="gpt4o-deployment",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
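Streamed chunks are typically displayed and accumulated at the same time, so the full reply can be appended to the conversation history afterwards. A sketch of the accumulation logic, using a fake delta list in place of a live stream (each element plays the role of `chunk.choices[0].delta.content`):

```python
# Stand-in for the deltas a live stream would yield; the final chunk's
# delta.content is None, which is why the streaming loop checks for it.
fake_deltas = ["Machine ", "learning ", "finds ", "patterns ", "in ", "data.", None]

parts = []
for delta in fake_deltas:
    if delta:
        print(delta, end="", flush=True)  # display incrementally
        parts.append(delta)               # and keep for later

full_reply = "".join(parts)
# full_reply can now be appended to the history as an assistant message
```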
Function Calling (Tool Use)
Function calling allows the model to invoke external functions/APIs:
```python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g., San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt4o-deployment",
    messages=[
        {"role": "user", "content": "What's the weather in Seattle?"}
    ],
    tools=tools,
    tool_choice="auto"
)

# Check if the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    arguments = json.loads(tool_call.function.arguments)
    print(f"Call: {function_name}({arguments})")
    # → Call: get_weather({'location': 'Seattle, WA'})
```
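The model does not run the function itself; your code executes it, then sends the result back as a `tool` message tied to the call's id so the model can compose the final natural-language answer. A sketch of that return leg, with a hypothetical local `get_weather` implementation and stand-in values for `tool_call.id` and `tool_call.function.arguments`:

```python
import json

def get_weather(location, unit="celsius"):
    """Hypothetical local implementation of the declared function."""
    return {"location": location, "temperature": 18, "unit": unit}

# Stand-ins for tool_call.id and tool_call.function.arguments
tool_call_id = "call_abc123"
arguments = json.loads('{"location": "Seattle, WA"}')

result = get_weather(**arguments)

# Appended to the messages list before calling the API again (omitted),
# so the model can phrase the result for the user.
tool_message = {
    "role": "tool",
    "tool_call_id": tool_call_id,
    "content": json.dumps(result)
}
print(tool_message["content"])
```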
Review Questions
1. In Azure OpenAI, what does the "model" parameter in the API call refer to?
2. Which parameter should you set to 0 for the most deterministic, factual responses?
3. What is the purpose of the system message in the Chat Completions API?
4. What prompt engineering technique uses example input-output pairs in the messages to guide the model?