A production chatbot has steady, high-volume traffic and a strict latency requirement. Which deployment choice best fits?

Provisioned-Managed (PTU) deployment. PTU reserves dedicated capacity, delivering guaranteed throughput and predictable latency for steady, high-volume production traffic. Standard is best-effort and can throttle, Batch is for asynchronous offline jobs with a 24-hour SLA, and a multi-service resource cannot host Azure OpenAI models at all.

Can Azure OpenAI Service be accessed through a multi-service Azure AI Services resource?

No. Azure OpenAI requires its own dedicated Cognitive Services account created with --kind OpenAI. It is not part of the multi-service Azure AI Services resource that covers Vision, Language, and Speech, so calling a GPT deployment on a multi-service resource fails.

A 30,000 TPM Standard deployment is throttling (HTTP 429) under load. What is the most appropriate fix?

Increase the deployment's TPM quota or move to Global Standard / PTU, and retry with Retry-After backoff. HTTP 429 means the deployment exceeded its rate limit (the tokens-per-minute quota or the requests-per-minute limit derived from it); it does not indicate a model defect. Raising the TPM capacity, moving to a higher-quota Global Standard or reserved PTU deployment, and honoring the Retry-After header are the correct remedies. Changing temperature has no effect on rate limits.

Approximately how many tokens does a 1,000-word English document consume?

About 1,333 tokens. In English, one token is roughly 0.75 words, so words divided by 0.75 gives tokens: 1,000 / 0.75 is about 1,333 tokens. This matters because both context-window limits and billing are measured in tokens, with input and output priced separately per 1,000 tokens.

Azure OpenAI Service: Models and Deployment

Key Takeaways

Azure OpenAI Service (now part of Microsoft Foundry Models) exposes OpenAI models such as GPT-4o, GPT-4.1, o-series reasoning models, GPT-image-1, DALL-E 3, Whisper, and text-embedding-3 inside Azure with enterprise security.
A model is unusable until you create a deployment: a named endpoint binding one model version to a capacity quota expressed in thousands of tokens per minute (TPM).
Deployment types in 2026 are Standard, Global Standard, Data Zone Standard, Provisioned-Managed (PTU), Global/Data Zone Provisioned, and Batch — they differ on data-residency scope and on pay-per-token versus reserved-capacity billing.
Azure OpenAI requires its own resource created with --kind OpenAI; it is NOT included in a multi-service Azure AI Services resource.
Tokens (~0.75 word each in English) drive both cost (priced per 1,000 input and output tokens separately) and the context-window and TPM limits you must size around.

Quick Answer: Azure OpenAI (now surfaced through Microsoft Foundry Models) gives you GPT-4o/GPT-4.1 (multimodal text+vision), o-series reasoning models, GPT-image-1 and DALL-E 3 (images), Whisper (speech-to-text), and text-embedding-3 (vectors). Nothing works until you create a deployment that binds one model version to a TPM quota. Pick a deployment type — Standard, Global Standard, Data Zone, Provisioned-Managed (PTU), or Batch — based on data residency and traffic predictability.

The AI-102 Context

Generative AI (Domain 2) makes up 15-20% of AI-102 (Designing and Implementing a Microsoft Azure AI Solution) and is among the most heavily revised areas in 2026. The exam has 40-60 questions in 100 minutes, a passing score of 700 out of 1,000, and costs about USD 165. Note that the live AI-102 retires on June 30, 2026 and is replaced by a refreshed version; every model name below is current as of June 2026.

Model Catalog You Must Recognize

Model	Type	What it does	Context window / input limit
GPT-4o / GPT-4o mini	Multimodal	Text + image input, fast general chat	128K tokens
GPT-4.1 / 4.1 mini	Text + vision	Long-context coding & instruction following	up to 1M tokens
o-series (o1, o3-mini)	Reasoning	Step-by-step reasoning; no temperature control	128K-200K
GPT-image-1 / DALL-E 3	Image generation	Text-to-image	N/A
Whisper	Audio	Speech-to-text & translation	25 MB / file
text-embedding-3-large	Embeddings	3,072-dim vectors	8,191 tokens
text-embedding-3-small	Embeddings	1,536-dim vectors, cheaper	8,191 tokens

On the Exam: o-series reasoning models do not support temperature/top_p and cap output with max_completion_tokens rather than max_tokens (reasoning tokens are billed but hidden). A question that sets temperature=0 on o1 is a distractor: the setting gives no control over a reasoning model's output.
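A minimal sketch of how a client might assemble request parameters for the two model families. The helper name `build_chat_params` and its arguments are hypothetical, and the use of `max_tokens` for non-reasoning models is an assumption about the chat completions request shape rather than something this guide states.

```python
from typing import Any, Dict, Optional


def build_chat_params(
    is_reasoning_model: bool,
    output_token_cap: int,
    temperature: Optional[float] = None,
) -> Dict[str, Any]:
    """Return only the parameters that make sense for the model family."""
    params: Dict[str, Any] = {}
    if is_reasoning_model:
        # o-series: cap output with max_completion_tokens and leave out
        # sampling controls such as temperature entirely.
        params["max_completion_tokens"] = output_token_cap
    else:
        # Assumption: non-reasoning chat models take max_tokens.
        params["max_tokens"] = output_token_cap
        if temperature is not None:
            params["temperature"] = temperature
    return params


# Hypothetical usage: temperature is dropped for a reasoning model.
reasoning_params = build_chat_params(True, 1000, temperature=0)
chat_params = build_chat_params(False, 500, temperature=0.2)
```

Dropping the parameter on the client side keeps the request valid regardless of how a given API version treats unsupported settings.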

Creating the Resource

Azure OpenAI must be its own Cognitive Services account. It is not bundled into the multi-service Azure AI Services resource that covers Vision, Language, and Speech.

az cognitiveservices account create \
  --name my-openai-service \
  --resource-group rg-ai-prod \
  --kind OpenAI \
  --sku S0 \
  --location eastus2 \
  --yes

The key flag is --kind OpenAI. A frequent trap: a scenario provisions a multi-service resource and then asks why a GPT-4o call returns 404. The fix is a dedicated OpenAI resource.

Deploying a Model

A deployment is a named alias mapping one model + version to a capacity quota. You call the deployment name, never the raw model name.

az cognitiveservices account deployment create \
  --name my-openai-service \
  --resource-group rg-ai-prod \
  --deployment-name gpt4o-chat \
  --model-name gpt-4o \
  --model-version "2024-11-20" \
  --model-format OpenAI \
  --sku-name GlobalStandard \
  --sku-capacity 50   # 50K tokens-per-minute quota
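
To illustrate "call the deployment name, never the raw model name", the sketch below builds the REST URL for a chat completions request. It assumes the resource exposes a custom subdomain equal to its name (a resource created without one may expose a regional endpoint instead, so check the resource's endpoint property), and the `api-version` value is only an example. The function name `chat_completions_url` is hypothetical.

```python
from urllib.parse import quote, urlencode


def chat_completions_url(resource_name: str, deployment_name: str, api_version: str) -> str:
    """Build the chat completions URL; the path names the deployment, not the model."""
    # Assumption: the endpoint follows https://<resource>.openai.azure.com
    base = f"https://{resource_name}.openai.azure.com"
    path = f"/openai/deployments/{quote(deployment_name, safe='')}/chat/completions"
    return f"{base}{path}?{urlencode({'api-version': api_version})}"


# "gpt4o-chat" is the deployment name from the command above, not "gpt-4o".
url = chat_completions_url("my-openai-service", "gpt4o-chat", "2024-10-21")
```

Swapping the underlying model version later means redeploying under the same deployment name, so calling code like this does not change.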

Deployment Types (2026)

Type	Data residency	Billing	Best for
Standard	Single Azure region you pick	Pay-per-token	Region-locked workloads
Global Standard	Routed to any global datacenter	Pay-per-token, highest quota	Default for dev/variable traffic
Data Zone Standard	Stays inside US or EU zone	Pay-per-token	EU/US data-residency rules
Provisioned-Managed (PTU)	Region / Data Zone / Global	Reserved capacity (hourly/monthly)	High-volume production, stable latency
Batch	Global	~50% cheaper, 24h SLA	Large async offline jobs

Provisioned Throughput Units (PTU): you reserve a fixed block of capacity for guaranteed throughput and predictable latency. PTUs lower per-token cost at high volume but require a commitment. Choose PTU when traffic is predictable and high-volume; choose Standard/Global for spiky or experimental loads.
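
The selection logic in the table can be sketched as a small decision helper. This is a simplification for study purposes, not an official decision tree: the function name, argument names, and the order of the checks are assumptions, and Batch is only offered for global residency because that is how the table lists it.

```python
def pick_deployment_type(
    residency: str,
    steady_high_volume: bool = False,
    offline_batch: bool = False,
) -> str:
    """Map workload traits to a deployment type from the table.

    residency is one of "region", "data_zone", or "global".
    """
    pay_per_token = {
        "region": "Standard",
        "data_zone": "Data Zone Standard",
        "global": "Global Standard",
    }
    if residency not in pay_per_token:
        raise ValueError(f"unknown residency: {residency!r}")
    # Async offline jobs with no residency constraint: cheapest option.
    if offline_batch and residency == "global":
        return "Batch"
    # Predictable, high-volume traffic justifies reserved capacity.
    if steady_high_volume:
        return "Provisioned-Managed (PTU)"
    # Otherwise pay per token, scoped by the residency requirement.
    return pay_per_token[residency]


# Hypothetical scenarios in the style of exam questions.
chatbot = pick_deployment_type("global", steady_high_volume=True)
eu_prototype = pick_deployment_type("data_zone")
nightly_job = pick_deployment_type("global", offline_batch=True)
```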

Azure OpenAI vs. OpenAI Direct

Feature	Azure OpenAI	OpenAI Direct
Auth	API key or Microsoft Entra ID / managed identity	API key only
Network	VNet, private endpoints, IP firewall	Internet only
Content safety	Built-in, configurable filters	Limited
Compliance	SOC 2, HIPAA, ISO, FedRAMP	Limited
Data use	Prompts/outputs not used to train models	Not used (API)
SLA	99.9% Standard / higher for PTU	Best effort
Region	You choose the Azure region	OpenAI-managed

On the Exam: "Why choose Azure OpenAI over OpenAI?" → enterprise security and compliance: managed identity, private endpoints, content filtering, and regional/data-residency control. Prefer managed identity over API keys for any production scenario the exam describes.

Tokens, Quota, and Cost

Tokens are the billing and limiting unit. In English, 1 token ≈ 0.75 words ≈ 4 characters, so 1,000 words ≈ 1,333 tokens. Total request tokens = input (prompt) + output (completion), each priced separately per 1,000 tokens.
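
The rule of thumb above can be turned into a quick estimator. This is a rough sketch: the 0.75 words-per-token ratio is only an approximation for English, and the per-1,000-token prices passed in the usage lines are hypothetical placeholders, not published rates.

```python
WORDS_PER_TOKEN = 0.75  # approximate ratio for English text


def estimate_tokens(word_count: int) -> int:
    """Approximate token count for an English text of word_count words."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Input and output tokens are priced separately per 1,000 tokens."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


tokens_for_1000_words = estimate_tokens(1000)
# Hypothetical prices, for illustration only.
example_cost = estimate_cost(6200, 600, input_price_per_1k=0.0025, output_price_per_1k=0.01)
```

For real sizing, count tokens with the model's actual tokenizer; the word-based estimate is only good enough for back-of-the-envelope planning.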

Limit	Meaning
Context window	Max input + output tokens per single request
Max output tokens	Cap on the completion length
TPM (tokens/min)	Throughput quota assigned to the deployment
RPM (requests/min)	Derived rate limit (roughly 6 RPM per 1,000 TPM)
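
Two of these limits are easy to check in code before sending a request. The sketch below assumes the approximate ratio of 6 requests per minute for every 1,000 TPM given in the table; both function names are hypothetical.

```python
def derived_rpm(tpm_quota: int) -> int:
    """Approximate requests-per-minute limit: about 6 RPM per 1,000 TPM."""
    return tpm_quota // 1000 * 6


def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """A request fits only if input plus requested output stays within the window."""
    return input_tokens + max_output_tokens <= context_window


rpm_for_30k_tpm = derived_rpm(30_000)
fits_128k_window = fits_context(6_200, 600, 128_000)
```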

Worked example: A summarizer sends a 6,000-token document plus a 200-token instruction and expects a 600-token summary. Per call ≈ 6,800 tokens. A 30K-TPM Standard deployment supports ≈ 30,000 / 6,800 ≈ 4 calls/minute before throttling (HTTP 429). To scale, raise --sku-capacity (quota) or move to Global Standard / PTU. Common trap: developers blame the model for 429s when the real fix is increasing TPM quota or using Retry-After backoff.
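
The arithmetic and the backoff advice above can be sketched together. This is an illustrative outline, not an SDK feature: `send` is a hypothetical callable standing in for one HTTP request and returning `(status, headers, body)`, Retry-After is assumed to carry a number of seconds, and the exponential fallback when the header is missing is an added assumption.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], str]  # (status, headers, body)


def calls_per_minute(tpm_quota: int, tokens_per_call: int) -> int:
    """Whole calls that fit into one minute of TPM quota."""
    return tpm_quota // tokens_per_call


def call_with_retry(
    send: Callable[[], Response],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry while throttled (HTTP 429), honoring Retry-After when present."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return body  # any non-429 response goes back to the caller
        if attempt == max_attempts - 1:
            break  # no point waiting after the final attempt
        # Header names are case-insensitive, so normalize before lookup.
        normalized = {name.lower(): value for name, value in headers.items()}
        try:
            wait = float(normalized["retry-after"])
        except (KeyError, ValueError):
            wait = float(2 ** attempt)  # assumed fallback: exponential backoff
        sleep(wait)
    raise RuntimeError(f"still throttled after {max_attempts} attempts")


# The worked example: 6,000 + 200 + 600 tokens per call on a 30K-TPM deployment.
summarizer_calls = calls_per_minute(30_000, 6_000 + 200 + 600)
```

Backoff only smooths short bursts. If the sustained call rate exceeds what `calls_per_minute` allows, the deployment needs more quota or a different deployment type.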

Azure AI Engineer Associate

Azure AI-102

6.1 Azure OpenAI Service — Models and Deployment
