6.2 Chat Completions API and Prompt Engineering
Key Takeaways
- The Chat Completions API uses a messages array with three roles: system (behavior instructions), user (human input), and assistant (model responses).
- The system message defines the AI assistant's persona, constraints, and behavior — it is the primary mechanism for controlling model output.
- Key parameters: temperature (0-2, creativity), max_tokens (output length), top_p (nucleus sampling), frequency_penalty, presence_penalty, and stop sequences.
- Prompt engineering techniques include few-shot examples, chain-of-thought reasoning, role-based instructions, and output format specification.
- Response streaming delivers tokens incrementally as they are generated, reducing perceived latency for real-time applications.
Chat Completions API and Prompt Engineering
Quick Answer: The Chat Completions API uses messages with system (behavior), user (input), and assistant (responses) roles. Control output with temperature (creativity), max_tokens (length), and top_p (diversity). Use prompt engineering techniques: few-shot examples, chain-of-thought, and output format specification.
Basic API Call
```python
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key="<your-key>",
    api_version="2024-06-01",
    azure_endpoint="https://my-openai.openai.azure.com/"
)

response = client.chat.completions.create(
    model="gpt4o-deployment",  # deployment name, NOT model name
    messages=[
        {
            "role": "system",
            "content": "You are a helpful AI assistant that provides concise, accurate answers about Azure AI services."
        },
        {
            "role": "user",
            "content": "What is Azure AI Search?"
        }
    ],
    temperature=0.7,
    max_tokens=500,
    top_p=0.95,
    frequency_penalty=0,
    presence_penalty=0,
    stop=None
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")
```
On the Exam: The model parameter in Azure OpenAI is the deployment name, not the model name. This differs from the direct OpenAI API, where you specify the model name (e.g., "gpt-4o"). On Azure, you use the name you chose when deploying the model.
Message Roles
| Role | Purpose | Example |
|---|---|---|
| system | Define AI behavior, persona, constraints, and output format | "You are a medical triage assistant. Only answer health-related questions." |
| user | Human input — questions, requests, data | "What are the symptoms of flu?" |
| assistant | Model responses (for conversation history or few-shot examples) | "Common flu symptoms include fever, cough, and body aches." |
Multi-Turn Conversation
```python
messages = [
    {"role": "system", "content": "You are a helpful Azure expert."},
    {"role": "user", "content": "What is Azure AI Search?"},
    {"role": "assistant", "content": "Azure AI Search is a cloud search service..."},
    {"role": "user", "content": "How does it compare to Elasticsearch?"}
]
# Each turn includes full conversation history
```
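Because the API is stateless, the client must resend the entire history on every call and append each new exchange to it. A minimal sketch of that bookkeeping, with a hypothetical `send` callback standing in for a real `client.chat.completions.create` call:

```python
def chat_turn(history, user_input, send):
    """Append the user message, get a reply via `send`, and record it.

    `send` is a stand-in for a real Chat Completions API call: it takes
    the full message list and returns the assistant's reply text.
    """
    history.append({"role": "user", "content": user_input})
    reply = send(history)  # in practice: call the Chat Completions API here
    history.append({"role": "assistant", "content": reply})
    return reply

# Usage with a stubbed backend that just echoes the last question
history = [{"role": "system", "content": "You are a helpful Azure expert."}]
echo = lambda msgs: f"You asked: {msgs[-1]['content']}"
chat_turn(history, "What is Azure AI Search?", echo)
chat_turn(history, "How does it compare to Elasticsearch?", echo)
print(len(history))  # system + 2 user + 2 assistant = 5 messages
```

The same `history` list is passed on every turn, which is why token usage grows with conversation length.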
Key Parameters
| Parameter | Range | Default | Effect |
|---|---|---|---|
| temperature | 0-2 | 1 | Higher = more creative/random; Lower = more focused/deterministic |
| max_tokens | 1 to model limit | Model-specific | Maximum tokens in the generated response |
| top_p | 0-1 | 1 | Nucleus sampling — consider tokens whose cumulative probability is within top_p |
| frequency_penalty | -2 to 2 | 0 | Positive values reduce repetition of tokens already generated |
| presence_penalty | -2 to 2 | 0 | Positive values encourage the model to talk about new topics |
| n | 1+ | 1 | Number of response alternatives to generate |
| stop | Array of strings | None | Sequences that cause the model to stop generating |
Temperature vs. Top-p
| Setting | When to Use |
|---|---|
| Temperature = 0 | Factual answers, code generation, data extraction |
| Temperature = 0.3-0.5 | Balanced creativity with accuracy |
| Temperature = 0.7-1.0 | Creative writing, brainstorming |
| Top_p = 0.1 | Very focused, deterministic output |
| Top_p = 0.9 | Diverse but still coherent output |
On the Exam: Do NOT set both temperature and top_p to non-default values simultaneously. Microsoft recommends adjusting one OR the other, not both. Questions may test this recommendation.
Prompt Engineering Techniques
1. Zero-Shot (No Examples)
```python
messages = [
    {"role": "system", "content": "Classify the following text as positive, negative, or neutral."},
    {"role": "user", "content": "The product arrived on time and works perfectly!"}
]
```
2. Few-Shot (With Examples)
```python
messages = [
    {"role": "system", "content": "Classify the following text as positive, negative, or neutral."},
    {"role": "user", "content": "The food was delicious!"},
    {"role": "assistant", "content": "positive"},
    {"role": "user", "content": "The service was terrible and slow."},
    {"role": "assistant", "content": "negative"},
    {"role": "user", "content": "The product arrived on time and works perfectly!"}
]
```
3. Chain-of-Thought
```python
messages = [
    {"role": "system", "content": "Think step-by-step before providing your final answer."},
    {"role": "user", "content": "A company has 150 employees. 30% work remotely. How many work in the office?"}
]
# Model outputs: "Step 1: 30% of 150 = 45 remote workers.
#                 Step 2: 150 - 45 = 105 office workers.
#                 Answer: 105 employees work in the office."
```
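When the chain-of-thought output ends in a predictable "Answer:" marker, the final value can be pulled out programmatically before showing it to the user. A sketch; the `extract_answer` helper and the answer format are illustrative assumptions, not part of the API:

```python
import re

def extract_answer(text):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", text)
    return matches[-1].strip() if matches else None

# Sample model output in the step-by-step format requested above
output = ("Step 1: 30% of 150 = 45 remote workers.\n"
          "Step 2: 150 - 45 = 105 office workers.\n"
          "Answer: 105 employees work in the office.")
print(extract_answer(output))  # → 105 employees work in the office.
```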
4. Output Format Specification
```python
messages = [
    {"role": "system", "content": """Extract information and return ONLY valid JSON:
{
    "name": "string",
    "date": "YYYY-MM-DD",
    "amount": number,
    "category": "string"
}"""},
    {"role": "user", "content": "Invoice from Contoso for $5,000 dated March 15, 2026 for consulting services."}
]
```
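Once the model is instructed to return only JSON, the response text can be parsed directly with `json.loads`. A sketch, using a sample string as a stand-in for `response.choices[0].message.content` (the field values are assumed for illustration):

```python
import json

# Stand-in for response.choices[0].message.content
raw = '{"name": "Contoso", "date": "2026-03-15", "amount": 5000, "category": "consulting"}'

invoice = json.loads(raw)
print(invoice["amount"])  # → 5000
print(invoice["date"])    # → 2026-03-15
```

In production, wrap the parse in a try/except, since the model can still occasionally emit invalid JSON.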
Response Streaming
```python
# Streaming response — tokens delivered incrementally
stream = client.chat.completions.create(
    model="gpt4o-deployment",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in simple terms."}
    ],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
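Streamed chunks are typically displayed and accumulated at the same time, so the full reply can be appended to the conversation history afterwards. A sketch of the accumulation logic, using a fake delta list in place of a live stream (each element plays the role of `chunk.choices[0].delta.content`):

```python
# Stand-in for the deltas a live stream would yield; the final chunk's
# delta.content is None, which is why the streaming loop checks for it.
fake_deltas = ["Machine ", "learning ", "finds ", "patterns ", "in ", "data.", None]

parts = []
for delta in fake_deltas:
    if delta:
        print(delta, end="", flush=True)  # display incrementally
        parts.append(delta)               # and keep for later

full_reply = "".join(parts)
# full_reply can now be appended to the history as an assistant message
```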
Function Calling (Tool Use)
Function calling allows the model to invoke external functions/APIs:
```python
import json

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather in a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city and state, e.g., San Francisco, CA"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"]
                    }
                },
                "required": ["location"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="gpt4o-deployment",
    messages=[
        {"role": "user", "content": "What's the weather in Seattle?"}
    ],
    tools=tools,
    tool_choice="auto"
)

# Check if the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    arguments = json.loads(tool_call.function.arguments)
    print(f"Call: {function_name}({arguments})")
    # → Call: get_weather({'location': 'Seattle, WA'})
```
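The model does not run the function itself; your code executes it, then sends the result back as a `tool` message tied to the call's id so the model can compose the final natural-language answer. A sketch of that return leg, with a hypothetical local `get_weather` implementation and stand-in values for `tool_call.id` and `tool_call.function.arguments`:

```python
import json

def get_weather(location, unit="celsius"):
    """Hypothetical local implementation of the declared function."""
    return {"location": location, "temperature": 18, "unit": unit}

# Stand-ins for tool_call.id and tool_call.function.arguments
tool_call_id = "call_abc123"
arguments = json.loads('{"location": "Seattle, WA"}')

result = get_weather(**arguments)

# Appended to the messages list before calling the API again (omitted),
# so the model can phrase the result for the user.
tool_message = {
    "role": "tool",
    "tool_call_id": tool_call_id,
    "content": json.dumps(result)
}
print(tool_message["content"])
```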
Review Questions
1. In Azure OpenAI, what does the "model" parameter in the API call refer to?
2. Which parameter should you set to 0 for the most deterministic, factual responses?
3. What is the purpose of the system message in the Chat Completions API?
4. What prompt engineering technique uses example input-output pairs in the messages to guide the model?