1.5 Azure AI Solution Architecture Patterns

Key Takeaways

  • Microservices isolate each AI capability for independent scaling and fault containment; orchestration coordinates services in a pipeline via Functions, Logic Apps, or Durable Functions.
  • RAG (Azure AI Search retrieval + Azure OpenAI generation) is the most-tested architecture pattern on AI-102 and the cure for hallucination on enterprise data.
  • Edge deployment runs containerized AI on Azure IoT Edge for low-latency, intermittent-connectivity scenarios, but containers still need periodic connectivity for billing.
  • Language, Speech, Vision Read, Document Intelligence, and Translator offer Docker containers; not all cloud features are available in the container.
  • Multi-region deployment behind Front Door or Traffic Manager delivers HA and DR for AI workloads.
Last updated: June 2026

Quick Answer: The four patterns AI-102 tests are microservices (independent services), orchestration (pipeline coordination), RAG (Azure AI Search + Azure OpenAI), and edge/container (IoT Edge). Design for scalability, fault tolerance, cost, security, and compliance. RAG is the most heavily tested.

Pattern 1: Microservices

Each capability is its own deployable service behind an API gateway, scaling and failing independently.

[Client] -> [API Gateway / Azure API Management]
              |-- Vision Service   -> Azure AI Vision
              |-- Language Service -> Azure AI Language
              |-- Speech Service   -> Azure AI Speech
              |-- Search Service   -> Azure AI Search

Use when: large apps where capabilities have different scaling and release cadences and you want fault isolation (a Vision outage must not take down Language).

Pattern 2: Orchestration pipeline

A central coordinator — Azure Functions, Durable Functions, or Logic Apps — runs services in sequence or fan-out/fan-in.

[Input] -> [Orchestrator]
            |-- 1. OCR            (Document Intelligence)
            |-- 2. Entity + PII   (AI Language)
            |-- 3. Sentiment      (AI Language)
            |-- 4. Safety check   (Content Safety)
            |-- 5. Index results  (AI Search)

Use when: document-processing and content-enrichment workflows with multiple ordered steps. Durable Functions is the right pick when a question stresses long-running, stateful, or fan-out orchestration.

Pattern 3: RAG (Retrieval-Augmented Generation)

The flagship pattern. Retrieve grounded context, then generate.

[User query]
  -> embed query, search Azure AI Search (vector / hybrid + semantic ranking)
  -> build prompt: system message + retrieved chunks + user query
  -> Azure OpenAI chat completion (grounded answer + citations)
  -> Content Safety / groundedness check
  -> response to user

Use when: enterprise chatbots, knowledge bases, support assistants, document Q&A. RAG solves the core hallucination problem by forcing the model to answer from retrieved enterprise content. Exam questions pair Azure AI Search (retrieval) with Azure OpenAI (generation) — recognize that two-service signature instantly. Vector or hybrid search plus semantic ranking is the recommended retrieval configuration.

Pattern 4: Edge deployment with containers

Run inference locally on Azure IoT Edge for latency, offline tolerance, or data-residency.

[Camera / IoT device]
  -> Azure IoT Edge
       |-- Custom Vision container (local inference)
       |-- Speech container (local STT/TTS)
  -> Azure IoT Hub (sync results to cloud)

Use when: assembly-line defect detection, retail analytics, or remote sites with poor connectivity.

Container support and constraints

ServiceContainerizedTypical edge use
Azure AI LanguageYesSentiment, NER, key phrases on-prem
Azure AI SpeechYesSTT/TTS without internet dependency
Azure AI Vision (Read/OCR)YesOCR at the edge
Document IntelligenceYesForm processing on-prem
Azure AI TranslatorYesOffline translation

Four container facts the exam repeats: (1) the model runs locally, but the container still needs periodic connectivity to Azure for billing/metering and will stop after extended disconnection; (2) you must accept the EULA and pass Endpoint + ApiKey + Billing environment variables on docker run; (3) not every cloud feature is available in the container; (4) containers are chosen for latency, compliance, or connectivity, never for cheaper compute.

On the Exam: "Real-time inference with no reliable internet" => an edge container on IoT Edge, with the caveat that billing still requires periodic connectivity. A purely cloud answer fails the connectivity requirement.

Cost optimization

LeverHow it savesTypical savings
Commitment-tier pricingPre-purchase usage at a discount15-30%
Right-sizingMatch provisioned throughput to demandVariable
Batch APIsDefer non-urgent work to batch40-60%
CachingReuse repeated resultsVariable
F0 for devFree dev/test usage100% (dev only)
Lower-cost regionsDeploy where compliance allows5-20%

High availability and disaster recovery

  • Multi-region: deploy the AI resource in two or more regions and route with Azure Front Door or Traffic Manager, failing over automatically when a region is unhealthy.
  • Data protection: keep training data and exported custom models in geo-redundant storage (GRS), back up custom model configurations, and document training parameters for reproducibility.
  • Quota awareness: for Azure OpenAI, spread provisioned throughput or pay-as-you-go deployments across regions so a single-region quota limit cannot become a single point of failure.

Choosing between the patterns

The patterns are not mutually exclusive, and exam scenarios often blend them — a RAG chatbot is usually also a set of microservices behind an API gateway, with an orchestrator stitching retrieval, generation, and a safety check together. The skill is choosing the dominant pattern the scenario is really asking about. If the emphasis is independent scaling and fault isolation of distinct capabilities, the answer is microservices. If the emphasis is coordinating ordered, possibly stateful steps, the answer is orchestration (and Durable Functions for long-running or fan-out work).

If the emphasis is grounding generative answers in private enterprise data, the answer is RAG. If the emphasis is latency, offline operation, or keeping data on-premises, the answer is the edge/container pattern. Treat the requirement keywords as the discriminator: "scale each independently," "multi-step pipeline," "ground the answers," and "no reliable internet" each point to exactly one pattern.

Throughput, scaling, and resilience details

For Azure OpenAI specifically, two deployment models matter for architecture questions. Standard (pay-as-you-go) deployments share regional capacity and are billed per token — good for variable or low-volume workloads. Provisioned Throughput Units (PTUs) reserve dedicated capacity for predictable, high-volume, low-latency workloads with stable cost. When a scenario stresses consistent latency at scale or guaranteed capacity, PTUs are the answer; when it stresses cost efficiency for spiky traffic, standard is the answer.

Add retry with exponential backoff for 429 (throttling) responses, caching for repeated prompts, and a multi-region failover so a regional capacity shortfall degrades gracefully rather than failing outright. These resilience patterns combine with the cost levers above: batch APIs and caching cut spend, while PTUs and multi-region routing protect availability and latency.

On the Exam: When the requirement is grounded answers over enterprise data, the two-service RAG signature (Azure AI Search + Azure OpenAI) is the answer even if microservices or orchestration also appear in the diagram — the grounding requirement is what the question is really testing.

Test Your Knowledge

Which architecture pattern combines Azure AI Search retrieval with Azure OpenAI generation to ground answers in enterprise data?

A
B
C
D
Test Your Knowledge

Why does an Azure AI container deployed on-premises still require periodic internet connectivity?

A
B
C
D
Test Your Knowledge

A factory needs real-time defect detection on a line with unreliable internet. Which deployment best fits?

A
B
C
D
Test Your Knowledge

Which orchestration choice best fits a long-running, stateful, fan-out/fan-in AI processing workflow?

A
B
C
D