6.1 Azure OpenAI Service — Models and Deployment
Key Takeaways
- Azure OpenAI Service (now part of Microsoft Foundry Models) exposes OpenAI models such as GPT-4o, GPT-4.1, o-series reasoning models, GPT-image-1, DALL-E 3, Whisper, and text-embedding-3 inside Azure with enterprise security.
- A model is unusable until you create a deployment: a named endpoint binding one model version to a capacity quota expressed in thousands of tokens per minute (TPM).
- Deployment types in 2026 are Standard, Global Standard, Data Zone Standard, Provisioned-Managed (PTU), Global/Data Zone Provisioned, and Batch — they differ on data-residency scope and on pay-per-token versus reserved-capacity billing.
- Azure OpenAI requires its own resource created with --kind OpenAI; it is NOT included in a multi-service Azure AI Services resource.
- Tokens (~0.75 word each in English) drive both cost (priced per 1,000 input and output tokens separately) and the context-window and TPM limits you must size around.
Quick Answer: Azure OpenAI (now surfaced through Microsoft Foundry Models) gives you GPT-4o/GPT-4.1 (multimodal text+vision), o-series reasoning models, GPT-image-1 and DALL-E 3 (images), Whisper (speech-to-text), and text-embedding-3 (vectors). Nothing works until you create a deployment that binds one model version to a TPM quota. Pick a deployment type — Standard, Global Standard, Data Zone, Provisioned-Managed (PTU), or Batch — based on data residency and traffic predictability.
The AI-102 Context
Generative AI (Domain 2) accounts for 15-20% of AI-102 (Designing and Implementing a Microsoft Azure AI Solution) and is among the most heavily revised areas in 2026. The exam has 40-60 questions in 100 minutes, with a passing score of 700 / 1000 and a fee of about USD 165. Note that the live AI-102 retires June 30, 2026 and is replaced by a refreshed version; every model name below is current as of June 2026.
Model Catalog You Must Recognize
| Model | Type | What it does | Context window |
|---|---|---|---|
| GPT-4o / GPT-4o mini | Multimodal | Text + image input, fast general chat | 128K tokens |
| GPT-4.1 / 4.1 mini | Text + vision | Long-context coding & instruction following | up to 1M tokens |
| o-series (o1, o3-mini) | Reasoning | Step-by-step reasoning; no temperature control | 128K-200K |
| GPT-image-1 / DALL-E 3 | Image generation | Text-to-image | N/A |
| Whisper | Audio | Speech-to-text & translation | N/A (25 MB per-file limit) |
| text-embedding-3-large | Embeddings | 3,072-dim vectors | 8,191 tokens |
| text-embedding-3-small | Embeddings | 1,536-dim vectors, cheaper | 8,191 tokens |
On the Exam: o-series reasoning models ignore temperature/top_p and use max_completion_tokens in place of max_tokens (reasoning tokens are billed but hidden). A question that sets temperature=0 on o1 is a distractor: reasoning models do not honor the parameter.
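The parameter difference can be sketched as a small request-shaping helper. This is a minimal illustration, not SDK code: `build_chat_params` and the `REASONING_PREFIXES` name check are hypothetical, and the helper takes the underlying model name (in a real call you would still address the deployment name).

```python
# Hypothetical helper: shapes chat parameters per model family.
# Assumption: reasoning models can be recognized by these name prefixes.
REASONING_PREFIXES = ("o1", "o3", "o4")


def build_chat_params(model_name, messages, max_output_tokens,
                      temperature=None, top_p=None):
    """Return a request-body dict suited to the model family."""
    params = {"messages": messages}
    if model_name.startswith(REASONING_PREFIXES):
        # Reasoning models: drop sampling controls and cap output with
        # max_completion_tokens, which also covers hidden reasoning tokens.
        params["max_completion_tokens"] = max_output_tokens
    else:
        params["max_tokens"] = max_output_tokens
        if temperature is not None:
            params["temperature"] = temperature
        if top_p is not None:
            params["top_p"] = top_p
    return params
```

Calling `build_chat_params("o1", msgs, 500, temperature=0)` yields a body with `max_completion_tokens` and no `temperature` key, while the same call for `gpt-4o` keeps `temperature` and uses `max_tokens`.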
Creating the Resource
Azure OpenAI must be its own Cognitive Services account. It is not bundled into the multi-service Azure AI Services resource that covers Vision, Language, and Speech.
```shell
az cognitiveservices account create \
  --name my-openai-service \
  --resource-group rg-ai-prod \
  --kind OpenAI \
  --sku S0 \
  --location eastus2 \
  --yes
```
The key flag is --kind OpenAI. A frequent trap: a scenario provisions a multi-service resource then asks why a GPT-4o call returns 404 — the fix is a dedicated OpenAI resource.
Deploying a Model
A deployment is a named alias mapping one model + version to a capacity quota. You call the deployment name, never the raw model name.
```shell
az cognitiveservices account deployment create \
  --name my-openai-service \
  --resource-group rg-ai-prod \
  --deployment-name gpt4o-chat \
  --model-name gpt-4o \
  --model-version "2024-11-20" \
  --model-format OpenAI \
  --sku-name GlobalStandard \
  --sku-capacity 50  # 50K tokens-per-minute quota
```
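To make "call the deployment name, never the raw model name" concrete, here is a sketch of how the REST URL for a chat completion is assembled. It assumes the resource has a custom subdomain, so its endpoint is `https://<resource>.openai.azure.com`; the helper name `chat_completions_url` is hypothetical, and the `api-version` value is an example you should replace with whichever version your project targets.

```python
from urllib.parse import quote


def chat_completions_url(resource_name, deployment_name, api_version):
    """Build the chat-completions URL for one deployment.

    The path carries the deployment name (gpt4o-chat), not the
    model name (gpt-4o).
    """
    base = f"https://{resource_name}.openai.azure.com"
    path = f"/openai/deployments/{quote(deployment_name, safe='')}/chat/completions"
    return f"{base}{path}?api-version={api_version}"


# Example values: resource and deployment names from the commands above;
# the api-version string is an assumption.
url = chat_completions_url("my-openai-service", "gpt4o-chat", "2024-10-21")
```

Because the model is bound at deployment time, swapping model versions means updating the deployment, while client code keeps calling the same deployment name.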
Deployment Types (2026)
| Type | Data residency | Billing | Best for |
|---|---|---|---|
| Standard | Single Azure region you pick | Pay-per-token | Region-locked workloads |
| Global Standard | Routed to any global datacenter | Pay-per-token, highest quota | Default for dev/variable traffic |
| Data Zone Standard | Stays inside US or EU zone | Pay-per-token | EU/US data-residency rules |
| Provisioned-Managed (PTU) | Region / Data Zone / Global | Reserved capacity (hourly/monthly) | High-volume production, stable latency |
| Batch | Global | ~50% cheaper, 24h SLA | Large async offline jobs |
Provisioned Throughput Units (PTU): you reserve a fixed block of capacity for guaranteed throughput and predictable latency. PTUs lower per-token cost at high volume but require a commitment. Choose PTU when traffic is predictable and high-volume; choose Standard/Global for spiky or experimental loads.
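The selection logic in the table can be condensed into a rough decision function. This is a simplification for study purposes, not official guidance: the function name, the argument names, and the `residency` labels ("region", "zone", "none") are all hypothetical.

```python
def choose_deployment_type(predictable_high_volume, async_ok, residency):
    """Pick a deployment type from three coarse workload traits.

    residency: "region" (single Azure region), "zone" (US or EU data
    zone), or "none" (no residency constraint). Labels are illustrative.
    """
    if residency not in ("region", "zone", "none"):
        raise ValueError(f"unknown residency label: {residency!r}")
    if async_ok and residency == "none":
        # Offline jobs that tolerate a 24h turnaround: cheapest option.
        return "Batch"
    if predictable_high_volume:
        # Steady, heavy traffic with latency needs: reserved capacity.
        return "Provisioned-Managed (PTU)"
    if residency == "region":
        return "Standard"
    if residency == "zone":
        return "Data Zone Standard"
    # Spiky or experimental traffic with no residency constraint.
    return "Global Standard"
```

For instance, steady high-volume chat traffic maps to Provisioned-Managed (PTU), while a development workload with no residency rule falls through to Global Standard.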
Azure OpenAI vs. OpenAI Direct
| Feature | Azure OpenAI | OpenAI Direct |
|---|---|---|
| Auth | API key or Microsoft Entra ID / managed identity | API key only |
| Network | VNet, private endpoints, IP firewall | Internet only |
| Content safety | Built-in, configurable filters | Limited |
| Compliance | SOC 2, HIPAA, ISO, FedRAMP | Limited |
| Data use | Prompts/outputs not used to train models | Not used (API) |
| SLA | 99.9% Standard / higher for PTU | Best effort |
| Region | You choose the Azure region | OpenAI-managed |
On the Exam: "Why choose Azure OpenAI over OpenAI?" → enterprise security and compliance: managed identity, private endpoints, content filtering, and regional/data-residency control. Prefer managed identity over API keys for any production scenario the exam describes.
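The two authentication modes differ only in the request header they send. The sketch below assumes you already hold either an API key or a Microsoft Entra ID access token; `auth_headers` is a hypothetical helper, and token acquisition (typically done through a credential library requesting the scope shown) is left out.

```python
# Scope requested when obtaining an Entra ID token for Azure AI services.
ENTRA_SCOPE = "https://cognitiveservices.azure.com/.default"


def auth_headers(api_key=None, entra_token=None):
    """Return the HTTP auth header for an Azure OpenAI REST call.

    Prefers the Entra ID token (keyless, e.g. from a managed identity)
    and falls back to the API key only if no token is supplied.
    """
    if entra_token:
        return {"Authorization": f"Bearer {entra_token}"}
    if api_key:
        return {"api-key": api_key}
    raise ValueError("supply either entra_token or api_key")
```

Preferring the token path mirrors the exam's guidance: a managed identity leaves no long-lived secret to store or rotate.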
Tokens, Quota, and Cost
Tokens are the billing and limiting unit. In English, 1 token ≈ 0.75 words ≈ 4 characters, so 1,000 words ≈ 1,333 tokens. Total request tokens = input (prompt) + output (completion), each priced separately per 1,000 tokens.
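The rule of thumb and the separate input/output pricing can be sketched as two small functions. The 0.75 words-per-token ratio comes from the text above; the function names are hypothetical, and any per-1,000-token prices you pass in are placeholders, since real prices vary by model and deployment type.

```python
WORDS_PER_TOKEN = 0.75  # English rule of thumb from the text


def estimate_tokens(word_count):
    """Approximate token count for English text of a given word count."""
    return round(word_count / WORDS_PER_TOKEN)


def estimate_cost(input_tokens, output_tokens,
                  input_price_per_1k, output_price_per_1k):
    """Cost of one request; input and output are priced separately."""
    return (input_tokens / 1000 * input_price_per_1k
            + output_tokens / 1000 * output_price_per_1k)


tokens_for_1000_words = estimate_tokens(1000)  # 1333
```

This is only an estimate: actual counts depend on the model's tokenizer and on the language and formatting of the text.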
| Limit | Meaning |
|---|---|
| Context window | Max input + output tokens per single request |
| Max output tokens | Cap on the completion length |
| TPM (tokens/min) | Throughput quota assigned to the deployment |
| RPM (requests/min) | Derived rate limit (roughly 6 RPM per 1,000 TPM) |
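The RPM rule of thumb in the last row works out as follows. This is a sketch of the approximation stated in the table, with a hypothetical function name; the exact ratio can differ by model and deployment type.

```python
def rpm_from_tpm(tpm):
    """Approximate requests-per-minute limit: 6 RPM per 1,000 TPM."""
    return tpm // 1000 * 6


rpm_30k = rpm_from_tpm(30_000)  # 180 requests per minute
rpm_50k = rpm_from_tpm(50_000)  # 300 requests per minute
```

Either limit can trigger throttling: many small requests can exhaust RPM well before TPM is reached.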
Worked example: A summarizer sends a 6,000-token document plus a 200-token instruction and expects a 600-token summary. Per call ≈ 6,800 tokens. A 30K-TPM Standard deployment supports ≈ 30,000 / 6,800 ≈ 4 calls/minute before throttling (HTTP 429). To scale, raise --sku-capacity (quota) or move to Global Standard / PTU. Common trap: developers blame the model for 429s when the real fix is increasing TPM quota or using Retry-After backoff.
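The arithmetic and the backoff advice from the worked example can be sketched together. Both function names are hypothetical, and `send` stands in for whatever function performs the HTTP call; it is assumed to return a `(status, headers, body)` tuple, with `Retry-After` given in seconds.

```python
import time


def calls_per_minute(tpm_quota, tokens_per_call):
    """Whole calls that fit inside the per-minute token quota."""
    return tpm_quota // tokens_per_call


def call_with_backoff(send, max_attempts=5, sleep=time.sleep):
    """Retry send() on HTTP 429, waiting Retry-After seconds if given.

    Falls back to exponential delays (1, 2, 4, ... seconds) when the
    header is absent. Returns (status, body) from the final attempt.
    """
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts - 1:
            return status, body
        sleep(float(headers.get("Retry-After", 2 ** attempt)))


# Worked example: 6,000 + 200 + 600 tokens per call on a 30K-TPM quota.
capacity = calls_per_minute(30_000, 6_000 + 200 + 600)  # 4
```

Backoff smooths over short bursts; if the workload steadily needs more than about four of these calls per minute, the durable fix is a higher TPM quota or a different deployment type.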
A production chatbot has steady, high-volume traffic and a strict latency requirement. Which deployment choice best fits?
Can Azure OpenAI Service be accessed through a multi-service Azure AI Services resource?
A 30,000 TPM Standard deployment is throttling (HTTP 429) under load. What is the most appropriate fix?
Approximately how many tokens does a 1,000-word English document consume?