6.2 Chat Completions API and Prompt Engineering

Key Takeaways

  • The Chat Completions API uses a messages array with three roles: system (behavior instructions), user (human input), and assistant (model responses).
  • The system message defines the AI assistant's persona, constraints, and behavior — it is the primary mechanism for controlling model output.
  • Key parameters: temperature (0-2, creativity), max_tokens (output length), top_p (nucleus sampling), frequency_penalty, presence_penalty, and stop sequences.
  • Prompt engineering techniques include few-shot examples, chain-of-thought reasoning, role-based instructions, and output format specification.
  • Response streaming delivers tokens incrementally as they are generated, reducing perceived latency for real-time applications.
Last updated: March 2026

Quick Answer: The Chat Completions API uses messages with system (behavior), user (input), and assistant (responses) roles. Control output with temperature (creativity), max_tokens (length), and top_p (diversity). Use prompt engineering techniques: few-shot examples, chain-of-thought, and output format specification.

Basic API Call

from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-key>",
    api_version="2024-06-01",
    azure_endpoint="https://my-openai.openai.azure.com/"
)

response = client.chat.completions.create(
    model="gpt4o-deployment",  # deployment name, NOT model name
    messages=[
        {
            "role": "system",
            "content": "You are a helpful AI assistant that provides concise, accurate answers about Azure AI services."
        },
        {
            "role": "user",
            "content": "What is Azure AI Search?"
        }
    ],
    temperature=0.7,
    max_tokens=500,
    top_p=0.95,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

On the Exam: The model parameter in Azure OpenAI is the deployment name, not the model name. This is different from the direct OpenAI API where you specify the model name (e.g., "gpt-4o"). On Azure, you use the name you chose when deploying the model.

Message Roles

| Role | Purpose | Example |
|------|---------|---------|
| system | Define AI behavior, persona, constraints, and output format | "You are a medical triage assistant. Only answer health-related questions." |
| user | Human input — questions, requests, data | "What are the symptoms of flu?" |
| assistant | Model responses (for conversation history or few-shot examples) | "Common flu symptoms include fever, cough, and body aches." |

Multi-Turn Conversation

messages = [
    {"role": "system", "content": "You are a helpful Azure expert."},
    {"role": "user", "content": "What is Azure AI Search?"},
    {"role": "assistant", "content": "Azure AI Search is a cloud search service..."},
    {"role": "user", "content": "How does it compare to Elasticsearch?"}
]
# Each turn includes full conversation history
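Because every request resends the full history, long conversations grow the prompt (and token cost) on each turn. One way to bound this is to keep the system message and drop the oldest turns; `trim_history` below is a hypothetical helper written for illustration, not part of the SDK:

```python
def trim_history(messages, max_turns=10):
    """Keep the system message plus at most the last max_turns other messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-max_turns:]

history = [
    {"role": "system", "content": "You are a helpful Azure expert."},
    {"role": "user", "content": "What is Azure AI Search?"},
    {"role": "assistant", "content": "Azure AI Search is a cloud search service..."},
    {"role": "user", "content": "How does it compare to Elasticsearch?"},
]
print(len(trim_history(history, max_turns=2)))  # → 3 (system + last 2 messages)
```

Production systems often trim by token count rather than message count, or summarize older turns instead of dropping them.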

Key Parameters

| Parameter | Range | Default | Effect |
|-----------|-------|---------|--------|
| temperature | 0-2 | 1 | Higher = more creative/random; lower = more focused/deterministic |
| max_tokens | 1 to model limit | Model-specific | Maximum tokens in the generated response |
| top_p | 0-1 | 1 | Nucleus sampling — consider tokens whose cumulative probability is within top_p |
| frequency_penalty | -2 to 2 | 0 | Positive values reduce repetition of tokens already generated |
| presence_penalty | -2 to 2 | 0 | Positive values encourage the model to talk about new topics |
| n | 1+ | 1 | Number of response alternatives to generate |
| stop | Array of strings | None | Sequences that cause the model to stop generating |
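All of these are passed as keyword arguments to `chat.completions.create`. A minimal sketch of combining the less common ones (`n` and `stop`); the values shown are illustrative, not recommendations:

```python
# Illustrative values only; pass these alongside model and messages.
request_options = {
    "temperature": 0.2,        # low randomness for focused answers
    "max_tokens": 200,         # cap the length of the generated response
    "n": 2,                    # generate two alternative completions
    "stop": ["\n\n", "END"],   # stop as soon as either sequence appears
}
# response = client.chat.completions.create(
#     model="gpt4o-deployment", messages=messages, **request_options)
# for choice in response.choices:  # one entry per alternative when n > 1
#     print(choice.index, choice.message.content)
```

Note that with `n > 1` you are billed for the output tokens of every alternative.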

Temperature vs. Top-p

| Setting | When to Use |
|---------|-------------|
| Temperature = 0 | Factual answers, code generation, data extraction |
| Temperature = 0.3-0.5 | Balanced creativity with accuracy |
| Temperature = 0.7-1.0 | Creative writing, brainstorming |
| Top_p = 0.1 | Very focused, deterministic output |
| Top_p = 0.9 | Diverse but still coherent output |

On the Exam: Do NOT set both temperature and top_p to non-default values simultaneously. Microsoft recommends adjusting one OR the other, not both. Questions may test this recommendation.

Prompt Engineering Techniques

1. Zero-Shot (No Examples)

messages = [
    {"role": "system", "content": "Classify the following text as positive, negative, or neutral."},
    {"role": "user", "content": "The product arrived on time and works perfectly!"}
]

2. Few-Shot (With Examples)

messages = [
    {"role": "system", "content": "Classify the following text as positive, negative, or neutral."},
    {"role": "user", "content": "The food was delicious!"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "The service was terrible and slow."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "The product arrived on time and works perfectly!"}
]
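The few-shot pattern generalizes naturally: build the messages list from labelled example pairs. `few_shot_messages` is a hypothetical helper written for illustration:

```python
def few_shot_messages(instruction, examples, query):
    """Build a few-shot messages list from (input, label) example pairs."""
    messages = [{"role": "system", "content": instruction}]
    for text, label in examples:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user", "content": query})
    return messages

msgs = few_shot_messages(
    "Classify the following text as positive, negative, or neutral.",
    [("The food was delicious!", "positive"),
     ("The service was terrible and slow.", "negative")],
    "The product arrived on time and works perfectly!",
)
print(len(msgs))  # → 6 (system + 2 example pairs + query)
```

Keeping examples in code rather than inline strings makes it easy to add or swap them as the classifier's behavior is tuned.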

3. Chain-of-Thought

messages = [
    {"role": "system", "content": "Think step-by-step before providing your final answer."},
    {"role": "user", "content": "A company has 150 employees. 30% work remotely. How many work in the office?"}
]
# Model outputs: "Step 1: 30% of 150 = 45 remote workers.
#                 Step 2: 150 - 45 = 105 office workers.
#                 Answer: 105 employees work in the office."
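If the step-by-step reply follows the format shown in the comment above, the final line can be pulled out with simple post-processing. `final_answer` is a hypothetical helper, and the "Answer:" prefix is an assumption about the output format (reinforce it in the system message if you rely on it):

```python
def final_answer(reply: str) -> str:
    """Return the text after 'Answer:' if present, else the whole reply."""
    for line in reply.splitlines():
        if line.strip().lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return reply.strip()

reply = ("Step 1: 30% of 150 = 45 remote workers.\n"
         "Step 2: 150 - 45 = 105 office workers.\n"
         "Answer: 105 employees work in the office.")
print(final_answer(reply))  # → 105 employees work in the office.
```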

4. Output Format Specification

messages = [
    {"role": "system", "content": """Extract information and return ONLY valid JSON:
{
    "name": "string",
    "date": "YYYY-MM-DD",
    "amount": number,
    "category": "string"
}"""},
    {"role": "user", "content": "Invoice from Contoso for $5,000 dated March 15, 2026 for consulting services."}
]
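The JSON still arrives as a string, and models sometimes wrap it in a markdown code fence despite the instruction, so parse defensively. A sketch: `parse_json_reply` is a hypothetical helper, and `reply` stands in for `response.choices[0].message.content`:

```python
import json

def parse_json_reply(reply: str) -> dict:
    """Parse a JSON reply, stripping a surrounding markdown fence if present."""
    text = reply.strip()
    if text.startswith("```"):
        # drop the opening fence line (possibly "```json") and the closing fence
        text = text.split("\n", 1)[1].rsplit("```", 1)[0]
    return json.loads(text)

reply = ('```json\n{"name": "Contoso", "date": "2026-03-15", '
         '"amount": 5000, "category": "consulting"}\n```')
print(parse_json_reply(reply)["amount"])  # → 5000
```

Wrap the `json.loads` call in a try/except in production, since the model can still emit malformed JSON.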

Response Streaming

# Streaming response — tokens delivered incrementally
stream = client.chat.completions.create(
    model="gpt4o-deployment",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    stream=True
)

for chunk in stream:
    # Azure can emit chunks with an empty choices list (e.g., content-filter
    # results), so guard before indexing
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
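To keep the complete reply while still printing tokens as they arrive, accumulate the deltas as you go. A sketch with the joining step factored into a small function so it can be shown with simulated deltas:

```python
def collect_stream(deltas):
    """Join incremental deltas into the full reply, skipping None/empty chunks."""
    return "".join(d for d in deltas if d)

# In the real loop, append chunk.choices[0].delta.content to a list,
# then join afterwards. Simulated deltas for illustration:
print(collect_stream(["Machine ", "learning ", None, "finds patterns in data."]))
# → Machine learning finds patterns in data.
```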

Function Calling (Tool Use)

Function calling allows the model to invoke external functions/APIs:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g., San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

import json  # needed below to parse the tool-call arguments

response = client.chat.completions.create(
    model="gpt4o-deployment",
    messages=[
        {"role": "user", "content": "What's the weather in Seattle?"}
    ],
    tools=tools,
    tool_choice="auto"
)

# Check if the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    arguments = json.loads(tool_call.function.arguments)
    print(f"Call: {function_name}({arguments})")
    # → Call: get_weather({'location': 'Seattle, WA'})
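The round trip is completed by running the function locally and sending its result back as a `tool`-role message so the model can compose a natural-language answer. A hedged sketch: `get_weather` is a stand-in stub and the hard-coded temperature is fabricated for illustration:

```python
import json

def get_weather(location, unit="celsius"):
    # Stub standing in for a real weather API call; the values are made up.
    return {"location": location, "temperature": 12, "unit": unit}

def tool_result_message(tool_call_id, name, result):
    """Package a function's result as a 'tool' message for the follow-up request."""
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,  # must match the model's tool call id
        "name": name,
        "content": json.dumps(result),
    }

msg = tool_result_message("call_abc123", "get_weather",
                          get_weather("Seattle, WA"))
# messages.append(response.choices[0].message)  # the assistant's tool call
# messages.append(msg)                          # the function's result
# final = client.chat.completions.create(
#     model="gpt4o-deployment", messages=messages, tools=tools)
print(msg["role"])  # → tool
```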
Test Your Knowledge

In Azure OpenAI, what does the "model" parameter in the API call refer to?

Test Your Knowledge

Which parameter should you set to 0 for the most deterministic, factual responses?

Test Your Knowledge

What is the purpose of the system message in the Chat Completions API?

Test Your Knowledge

What prompt engineering technique uses example input-output pairs in the messages to guide the model?
