6.1 Azure OpenAI Service — Models and Deployment

Key Takeaways

  • Azure OpenAI Service (now part of Microsoft Foundry Models) exposes OpenAI models such as GPT-4o, GPT-4.1, o-series reasoning models, GPT-image-1, DALL-E 3, Whisper, and text-embedding-3 inside Azure with enterprise security.
  • A model is unusable until you create a deployment: a named endpoint binding one model version to a capacity quota expressed in thousands of tokens per minute (TPM).
  • Deployment types in 2026 are Standard, Global Standard, Data Zone Standard, Provisioned-Managed (PTU), Global/Data Zone Provisioned, and Batch — they differ on data-residency scope and on pay-per-token versus reserved-capacity billing.
  • Azure OpenAI requires its own resource created with --kind OpenAI; it is NOT included in a multi-service Azure AI Services resource.
  • Tokens (roughly 0.75 words each in English) drive both cost (input and output tokens are priced separately, per 1,000 tokens) and the context-window and TPM limits you must size around.

Last updated: June 2026

Quick Answer: Azure OpenAI (now surfaced through Microsoft Foundry Models) gives you GPT-4o/GPT-4.1 (multimodal text+vision), o-series reasoning models, GPT-image-1 and DALL-E 3 (images), Whisper (speech-to-text), and text-embedding-3 (vectors). Nothing works until you create a deployment that binds one model version to a TPM quota. Pick a deployment type — Standard, Global Standard, Data Zone, Provisioned-Managed (PTU), or Batch — based on data residency and traffic predictability.

The AI-102 Context

Generative AI (Domain 2) accounts for 15-20% of AI-102 (Designing and Implementing a Microsoft Azure AI Solution) and is among the most heavily revised areas in 2026. The exam has 40-60 questions in 100 minutes, a passing score of 700 out of 1000, and costs about USD 165. Note that the live AI-102 retires on June 30, 2026 and is replaced by a refreshed version; every model name below is current as of June 2026.

Model Catalog You Must Recognize

Model | Type | What it does | Context window / input limit
GPT-4o / GPT-4o mini | Multimodal | Text + image input, fast general chat | 128K tokens
GPT-4.1 / 4.1 mini | Text + vision | Long-context coding & instruction following | up to 1M tokens
o-series (o1, o3-mini) | Reasoning | Step-by-step reasoning; no temperature control | 128K-200K tokens
GPT-image-1 / DALL-E 3 | Image generation | Text-to-image | N/A
Whisper | Audio | Speech-to-text & translation | 25 MB per file
text-embedding-3-large | Embeddings | 3,072-dim vectors | 8,191 tokens
text-embedding-3-small | Embeddings | 1,536-dim vectors, cheaper | 8,191 tokens

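On the Exam: o-series reasoning models do not support temperature/top_p and take max_completion_tokens instead of max_tokens (reasoning tokens are billed but hidden). A question that sets temperature=0 on o1 is a distractor: the parameter has no effect on these models, and the service may reject the request as unsupported rather than apply it.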
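The parameter difference can be illustrated with a minimal sketch of how a client might build its request body. The helper name build_chat_params and the is_reasoning_model flag are hypothetical, introduced only for illustration; the parameter names (temperature, max_tokens, max_completion_tokens) are the ones discussed above.

```python
from typing import Any


def build_chat_params(
    deployment: str,
    messages: list[dict[str, str]],
    is_reasoning_model: bool,
    max_output_tokens: int = 800,
    temperature: float = 0.2,
) -> dict[str, Any]:
    """Build chat-completion parameters (hypothetical helper, not an SDK function)."""
    # "model" carries the deployment name, not the raw model name.
    params: dict[str, Any] = {"model": deployment, "messages": messages}
    if is_reasoning_model:
        # Reasoning models: cap output with max_completion_tokens and
        # leave sampling parameters such as temperature out entirely.
        params["max_completion_tokens"] = max_output_tokens
    else:
        # Standard chat models accept max_tokens and temperature.
        params["max_tokens"] = max_output_tokens
        params["temperature"] = temperature
    return params
```

Keeping the branch in one place means a later switch from a GPT-4o deployment to an o-series deployment only changes the flag, not every call site.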

Creating the Resource

Azure OpenAI must be its own Cognitive Services account. It is not bundled into the multi-service Azure AI Services resource that covers Vision, Language, and Speech.

az cognitiveservices account create \
  --name my-openai-service \
  --resource-group rg-ai-prod \
  --kind OpenAI \
  --sku S0 \
  --location eastus2 \
  --yes

The key flag is --kind OpenAI. A frequent trap: a scenario provisions a multi-service resource then asks why a GPT-4o call returns 404 — the fix is a dedicated OpenAI resource.

Deploying a Model

A deployment is a named alias mapping one model + version to a capacity quota. You call the deployment name, never the raw model name.

az cognitiveservices account deployment create \
  --name my-openai-service \
  --resource-group rg-ai-prod \
  --deployment-name gpt4o-chat \
  --model-name gpt-4o \
  --model-version "2024-11-20" \
  --model-format OpenAI \
  --sku-name GlobalStandard \
  --sku-capacity 50   # 50K tokens-per-minute quota
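
To make the "call the deployment name" rule concrete, here is a small sketch that builds the REST URL for a chat completion. It assumes the resource has a custom subdomain equal to its name (resources created without one expose a regional endpoint instead) and uses api-version 2024-10-21 as an example value; the function name is hypothetical.

```python
def chat_completions_url(
    resource_name: str,
    deployment_name: str,
    api_version: str = "2024-10-21",  # example value; use the version your app targets
) -> str:
    """Return the chat-completions URL for a deployment (illustrative helper)."""
    # The path segment after /deployments/ is the deployment name
    # (for example gpt4o-chat), not the underlying model name (gpt-4o).
    return (
        f"https://{resource_name}.openai.azure.com/openai/deployments/"
        f"{deployment_name}/chat/completions?api-version={api_version}"
    )


if __name__ == "__main__":
    print(chat_completions_url("my-openai-service", "gpt4o-chat"))
```

Because the URL references only the deployment, the model version behind gpt4o-chat can be upgraded without changing client code.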

Deployment Types (2026)

Type | Data residency | Billing | Best for
Standard | Single Azure region you pick | Pay-per-token | Region-locked workloads
Global Standard | Routed to any global datacenter | Pay-per-token, highest quota | Default for dev/variable traffic
Data Zone Standard | Stays inside US or EU zone | Pay-per-token | EU/US data-residency rules
Provisioned-Managed (PTU) | Region / Data Zone / Global | Reserved capacity (hourly/monthly) | High-volume production, stable latency
Batch | Global | About 50% cheaper; 24-hour target turnaround | Large async offline jobs

Provisioned Throughput Units (PTU): you reserve a fixed block of capacity for guaranteed throughput and predictable latency. PTUs lower per-token cost at high volume but require a commitment. Choose PTU when traffic is predictable and high-volume; choose Standard/Global for spiky or experimental loads.
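
The selection guidance above can be written down as a small decision helper. This is only a study-aid sketch of the rules of thumb in this section, not an official sizing tool; the function name and the residency labels ("region", "data_zone", "global") are hypothetical.

```python
def pick_deployment_type(
    residency: str,
    steady_high_volume: bool,
    async_offline: bool = False,
) -> str:
    """Map workload traits to a deployment type, following this section's guidance."""
    if residency not in {"region", "data_zone", "global"}:
        raise ValueError(f"unknown residency scope: {residency!r}")
    # Batch is listed as a Global offering for large asynchronous jobs.
    if async_offline and residency == "global":
        return "Batch"
    # Predictable, high-volume traffic justifies reserved capacity.
    if steady_high_volume:
        return "Provisioned-Managed (PTU)"
    # Spiky or experimental traffic stays on pay-per-token,
    # with the residency scope choosing the variant.
    return {
        "region": "Standard",
        "data_zone": "Data Zone Standard",
        "global": "Global Standard",
    }[residency]
```

For example, a steady production chatbot with a strict latency target maps to Provisioned-Managed (PTU), while an EU-only prototype with bursty traffic maps to Data Zone Standard.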

Azure OpenAI vs. OpenAI Direct

Feature | Azure OpenAI | OpenAI Direct
Auth | API key or Microsoft Entra ID / managed identity | API key only
Network | VNet, private endpoints, IP firewall | Internet only
Content safety | Built-in, configurable filters | Limited
Compliance | SOC 2, HIPAA, ISO, FedRAMP | Limited
Data use | Prompts/outputs not used to train models | Not used (API)
SLA | 99.9% availability; Provisioned (PTU) adds a latency SLA | Best effort
Region | You choose the Azure region | OpenAI-managed

On the Exam: "Why choose Azure OpenAI over OpenAI?" → enterprise security and compliance: managed identity, private endpoints, content filtering, and regional/data-residency control. Prefer managed identity over API keys for any production scenario the exam describes.
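
As an illustration of keyless authentication, the sketch below creates a client that obtains Microsoft Entra ID tokens instead of using an API key. It assumes the azure-identity and openai Python packages are installed and that the calling identity has been granted a suitable role on the resource; the function name, the endpoint argument, and the api_version default are placeholders.

```python
# Token scope used when requesting Entra ID tokens for Azure AI services.
AZURE_OPENAI_SCOPE = "https://cognitiveservices.azure.com/.default"


def make_keyless_client(endpoint: str, api_version: str = "2024-10-21"):
    """Create an AzureOpenAI client that authenticates without an API key."""
    # Imported lazily so this module still loads where the SDKs are absent.
    from azure.identity import DefaultAzureCredential, get_bearer_token_provider
    from openai import AzureOpenAI

    # DefaultAzureCredential resolves to a managed identity when running in
    # Azure and to developer credentials (for example az login) locally.
    token_provider = get_bearer_token_provider(
        DefaultAzureCredential(), AZURE_OPENAI_SCOPE
    )
    return AzureOpenAI(
        azure_endpoint=endpoint,
        azure_ad_token_provider=token_provider,
        api_version=api_version,
    )
```

No secret appears in code or configuration, which is the property the exam scenarios are usually probing for.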

Tokens, Quota, and Cost

Tokens are the billing and limiting unit. In English, 1 token ≈ 0.75 words ≈ 4 characters, so 1,000 words ≈ 1,333 tokens. Total request tokens = input (prompt) + output (completion), each priced separately per 1,000 tokens.
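
The rule of thumb translates directly into arithmetic. The sketch below estimates tokens from a word count and prices a request; the per-1,000-token prices in the usage lines are made-up placeholders, not real Azure rates, and both function names are hypothetical.

```python
WORDS_PER_TOKEN = 0.75  # English rule of thumb from the text


def estimate_tokens(word_count: int) -> int:
    """Approximate token count for English text of the given word count."""
    return round(word_count / WORDS_PER_TOKEN)


def request_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_1k: float,
    output_price_per_1k: float,
) -> float:
    """Cost of one request, with input and output tokens priced separately."""
    return (
        input_tokens / 1000 * input_price_per_1k
        + output_tokens / 1000 * output_price_per_1k
    )


if __name__ == "__main__":
    print(estimate_tokens(1000))  # about 1,333 tokens for 1,000 words
    # Placeholder prices for illustration only.
    print(request_cost(6200, 600, input_price_per_1k=0.005, output_price_per_1k=0.015))
```

A real tokenizer gives exact counts; the ratio is only good enough for capacity and budget estimates.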

Limit | Meaning
Context window | Max input + output tokens per single request
Max output tokens | Cap on the completion length
TPM (tokens/min) | Throughput quota assigned to the deployment
RPM (requests/min) | Derived rate limit (roughly TPM / 1,000 × 6)
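
Two of these limits are easy to check in code before a request is ever sent. The following sketch derives the approximate RPM from a TPM quota using the ratio in the table and tests whether a request fits a context window; both helper names are hypothetical.

```python
def derived_rpm(tpm: int) -> int:
    """Approximate requests-per-minute limit: about 6 RPM per 1,000 TPM."""
    return tpm // 1000 * 6


def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if prompt plus requested completion stays within the context window."""
    return input_tokens + max_output_tokens <= context_window


if __name__ == "__main__":
    print(derived_rpm(30_000))
    print(fits_context(6_200, 600, 128_000))
```

Checking the context fit client-side avoids paying a round trip for a request the service would refuse anyway.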

Worked example: A summarizer sends a 6,000-token document plus a 200-token instruction and expects a 600-token summary. Per call ≈ 6,800 tokens. A 30K-TPM Standard deployment supports ≈ 30,000 / 6,800 ≈ 4 calls/minute before throttling (HTTP 429). To scale, raise --sku-capacity (quota) or move to Global Standard / PTU. Common trap: developers blame the model for 429s when the real fix is increasing TPM quota or using Retry-After backoff.
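
The sizing arithmetic and the backoff advice can be sketched as two small helpers. The first reproduces the calls-per-minute estimate; the second picks a wait time after a 429, preferring the server's Retry-After header and falling back to capped exponential backoff. Function names, the base delay, and the cap are illustrative assumptions.

```python
def max_calls_per_minute(tpm_quota: int, tokens_per_call: int) -> int:
    """Whole calls per minute that fit inside a TPM quota."""
    return tpm_quota // tokens_per_call


def retry_delay_seconds(
    headers: dict[str, str],
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retrying a throttled (HTTP 429) request."""
    retry_after = headers.get("Retry-After") or headers.get("retry-after")
    if retry_after is not None:
        try:
            # Honor the server's hint when it is a number of seconds.
            return min(float(retry_after), cap)
        except ValueError:
            pass  # Non-numeric value (HTTP-date form): use backoff instead.
    # Exponential backoff: base, 2*base, 4*base, ... limited by cap.
    return min(cap, base * 2 ** attempt)


if __name__ == "__main__":
    print(max_calls_per_minute(30_000, 6_800))
    print(retry_delay_seconds({"Retry-After": "7"}, attempt=0))
```

Production clients usually add random jitter to the fallback delay so that many throttled callers do not retry at the same instant; it is left out here to keep the sketch deterministic.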

Test Your Knowledge

A production chatbot has steady, high-volume traffic and a strict latency requirement. Which deployment choice best fits?


Can Azure OpenAI Service be accessed through a multi-service Azure AI Services resource?


A 30,000 TPM Standard deployment is throttling (HTTP 429) under load. What is the most appropriate fix?


Approximately how many tokens does a 1,000-word English document consume?
