4.1 Workload Classification Scenarios

Key Takeaways

  • AI-901 service selection starts by classifying the workload before naming a product: generative, agentic, text analysis, speech, vision, information extraction, content safety, or predictive machine learning.
  • The fastest scenario clue is the type of data involved: written text points to Azure Language or Translator, audio points to Azure Speech, visual input points to Azure Vision or a multimodal model, and mixed-media extraction points to Content Understanding.
  • Use Microsoft Foundry models for open-ended generation and reasoning, but use Foundry Tools when the need is a specific, prebuilt capability such as transcription, entity extraction, OCR, or content moderation.
  • Agentic AI is different from a normal chat response because an agent works through steps and can use tools, memory, or external systems to complete a task.
  • Responsible AI, privacy, region, latency, cost, and human-review needs can change the right service even when the workload label looks obvious.
Last updated: June 2026

Classify Before You Pick

AI-901 scenario questions often look like service-name questions, but the underlying skill is workload classification. Microsoft explicitly lists common workloads in the AI-901 study guide: generative and agentic AI, text analysis, speech, computer vision, and information extraction. Treat those as the first sorting buckets. If you jump straight to a product name, you can miss a detail such as audio input, structured output, or tool use.

Start with the business action. Is the system creating new content, understanding existing content, extracting fields, predicting a category, or taking steps through tools? Then identify the input and output. The same customer-support workflow might use Azure Language to score sentiment, Azure Speech to transcribe calls, Content Understanding to extract call topics from recordings, and a Foundry model to draft a response.
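
The customer-support example can be written down as data to make the point concrete: one workflow, several workloads. The sketch below is illustrative only. The step descriptions, the dictionary layout, and the helper name `workloads_in` are assumptions; the workload and service names are the ones used in the paragraph above.

```python
# Hypothetical decomposition of one customer-support workflow.
# Step wording and data layout are assumptions for illustration.
SUPPORT_WORKFLOW = [
    {"step": "score sentiment of chat messages", "input": "text",
     "workload": "text analysis", "service": "Azure Language"},
    {"step": "transcribe support calls", "input": "audio",
     "workload": "speech", "service": "Azure Speech"},
    {"step": "extract call topics from recordings", "input": "audio",
     "workload": "information extraction", "service": "Content Understanding"},
    {"step": "draft a response to the customer", "input": "text",
     "workload": "generative AI", "service": "Foundry model"},
]


def workloads_in(workflow):
    """Return the distinct workloads in order of first appearance."""
    seen = []
    for item in workflow:
        if item["workload"] not in seen:
            seen.append(item["workload"])
    return seen


if __name__ == "__main__":
    for item in SUPPORT_WORKFLOW:
        print(f'{item["step"]:38} -> {item["workload"]} ({item["service"]})')
```

Note that two steps share the same input type (audio) but land in different workloads, because the action differs: one transcribes, the other extracts.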

Workload Map

Scenario signal | Classify as | Likely Azure fit | Main exam question
Write an answer, draft code, summarize with reasoning, or create an image | Generative AI | Foundry model or image-generation model | Is the output new content?
Decide when to call an API, search files, or complete a multi-step task | Agentic AI | Foundry Agent Service or agent client | Does the AI need tools or actions?
Sentiment, entities, key phrases, language detection, or PII in written text | Text analysis | Azure Language in Foundry Tools | Is the input already text?
Transcription, captions, spoken translation, neural voice, speaker identity | Speech | Azure Speech in Foundry Tools | Is audio the input or output?
Analyze an existing photo, read visible text, detect objects, or caption an image | Computer vision | Azure Vision or a multimodal model | Is the app interpreting visual input?
Pull fields, sections, topics, or JSON from documents, images, audio, or video | Information extraction | Azure Content Understanding | Is the goal structured output from messy media?
Detect harmful prompts, image content, groundedness, protected material, or prompt attacks | Content safety | Azure AI Content Safety or Foundry guardrails | Is the risk unsafe or policy-breaking content?
Predict a category, number, future value, or outlier from historical data | Predictive machine learning | Azure Machine Learning or a trained model path | Is it estimating rather than generating?
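
As a study aid, the map can be turned into a small lookup. This is a rough sketch, not an official classifier: the keyword lists in `signals` and the function name `classify` are assumptions chosen for illustration, and simple substring matching will miss or double-count real scenarios. The workload names and "fit" strings are copied from the map.

```python
# Toy lookup built from the workload map. Keyword lists are assumptions.
WORKLOAD_MAP = {
    "generative AI": {
        "signals": ("draft", "write", "summarize", "generate"),
        "fit": "Foundry model or image-generation model"},
    "agentic AI": {
        "signals": ("call an api", "multi-step", "use tools"),
        "fit": "Foundry Agent Service or agent client"},
    "text analysis": {
        "signals": ("sentiment", "entities", "key phrases", "pii"),
        "fit": "Azure Language in Foundry Tools"},
    "speech": {
        "signals": ("transcri", "spoken", "voice", "audio"),
        "fit": "Azure Speech in Foundry Tools"},
    "computer vision": {
        "signals": ("photo", "detect objects", "visible text"),
        "fit": "Azure Vision or a multimodal model"},
    "information extraction": {
        "signals": ("extract fields", "into json", "schema"),
        "fit": "Azure Content Understanding"},
    "content safety": {
        "signals": ("harmful", "prompt attack", "protected material"),
        "fit": "Azure AI Content Safety or Foundry guardrails"},
    "predictive machine learning": {
        "signals": ("predict", "forecast", "outlier"),
        "fit": "Azure Machine Learning or a trained model path"},
}


def classify(scenario: str) -> list[str]:
    """Return every workload whose signal keywords appear in the scenario."""
    text = scenario.lower()
    return [name for name, entry in WORKLOAD_MAP.items()
            if any(signal in text for signal in entry["signals"])]


if __name__ == "__main__":
    for name in classify("Transcribe recorded calls"):
        print(name, "->", WORKLOAD_MAP[name]["fit"])
```

The function returns a list because a scenario can legitimately touch more than one workload; an empty list means none of the assumed keywords matched.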

The Four-Question Read

Use this process every time:

  1. Input: What does the app receive: text, speech, image, video, document, table data, or a user goal?
  2. Verb: What must the AI do: generate, classify, translate, transcribe, extract, moderate, search, or act?
  3. Output: Does the business need text, audio, an image, a label, a transcript, a confidence score, or structured JSON?
  4. Risk: Does the output affect people, expose sensitive data, require human review, or need content filtering?

That last question matters because service fit is not only functional. A model that can answer a question may still be the wrong production choice if it cannot cite sources, keep private data protected, or provide confidence signals for review.
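
The four questions can be captured as a small record plus a rule list. The following is a minimal sketch under stated assumptions: the class `ScenarioRead`, the functions `suggest_workload` and `summarize`, the rule order, and the exact string values are all hypothetical, and a real scenario may need more nuance than first-match rules provide.

```python
from dataclasses import dataclass


@dataclass
class ScenarioRead:
    """Answers to the four questions for one scenario (hypothetical type)."""
    input_type: str  # what the app receives, e.g. "text", "speech", "document"
    verb: str        # what the AI must do, e.g. "transcribe", "extract", "act"
    output: str      # what the business needs, e.g. "transcript", "structured JSON"
    risk: bool       # True if people, sensitive data, or review needs are involved


def suggest_workload(read: ScenarioRead) -> str:
    """Apply assumed, ordered rules; the first matching rule wins."""
    if read.verb == "act":
        return "agentic AI"
    if read.verb == "moderate":
        return "content safety"
    if read.verb == "generate":
        return "generative AI"
    if read.verb == "extract" or read.output == "structured JSON":
        return "information extraction"
    if read.verb == "transcribe" or read.input_type == "speech":
        return "speech"
    if read.input_type in ("image", "video"):
        return "computer vision"
    if read.input_type == "text":
        return "text analysis"
    if read.input_type == "table data":
        return "predictive machine learning"
    return "unclassified"


def summarize(read: ScenarioRead) -> str:
    """Combine the workload guess with the answer to the risk question."""
    note = "plan human review or filtering" if read.risk else "no extra risk flagged"
    return f"{suggest_workload(read)}; {note}"


if __name__ == "__main__":
    print(summarize(ScenarioRead("document", "extract", "structured JSON", True)))
```

The risk answer is kept separate from the workload guess on purpose: it does not change the classification, but it can change which service or review step is acceptable.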

Traps That Separate Similar Workloads

OCR versus extraction: Optical character recognition reads visible text. Content Understanding can also organize fields, tables, confidence, grounding, classification, and summaries across documents, images, audio, and video. If the scenario says read one sign, OCR may be enough. If it says process applications, invoices, recordings, or videos into a schema, think Content Understanding.
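
The difference shows up in the shape of the result. The sketch below does not call any Azure service; it uses a regular expression on a made-up invoice string purely to contrast "text that was read" with "fields in a schema". The sample text, field names, and function names are all assumptions.

```python
import re

# Made-up sample text standing in for what OCR might read from an invoice.
RAW_TEXT = "Invoice INV-1042\nTotal due: 318.50 USD"


def ocr_style_result(text: str) -> str:
    """OCR-style output: the visible text, with no structure added."""
    return text


def extraction_style_result(text: str) -> dict:
    """Extraction-style output: named fields pulled from the same text."""
    invoice = re.search(r"INV-\d+", text)
    total = re.search(r"Total due:\s*([\d.]+)\s*([A-Z]{3})", text)
    return {
        "invoice_id": invoice.group(0) if invoice else None,
        "total": float(total.group(1)) if total else None,
        "currency": total.group(2) if total else None,
    }


if __name__ == "__main__":
    print(ocr_style_result(RAW_TEXT))
    print(extraction_style_result(RAW_TEXT))
```

If a scenario only needs the first kind of result, OCR is likely enough; if it needs the second kind across many documents or media types, that is the extraction signal.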

Speech recognition versus speaker recognition: Speech recognition finds the words. Speaker recognition identifies or verifies the voice. A captioning app needs speech to text. A voice-based access gate may need speaker verification, along with privacy controls for the voice data.

Vision analysis versus image generation: Vision interprets an existing visual. Image generation creates a new visual from a prompt. A product-photo description is analysis; a new product mockup is generation.

Chat versus agent: A chat client answers a prompt. An agent can choose tools and work through steps. If the scenario needs the AI to check inventory, create a ticket, or update a record, classify it as agentic.
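
A toy contrast makes the distinction visible. This sketch uses no Foundry API: the tools `check_inventory` and `create_ticket`, the in-memory `INVENTORY` and `TICKETS` stores, and the fixed decision rule are all hypothetical stand-ins. In a real agent, a model would choose which tool to call; here a plain `if` plays that role.

```python
# Hypothetical in-memory systems that the tools act on.
INVENTORY = {"blue mug": 3}
TICKETS = []


def chat_reply(prompt: str) -> str:
    """Chat: one prompt in, one answer out, no tools and no side effects."""
    return f"Here is some information about: {prompt}"


def check_inventory(item: str) -> int:
    """Tool: look up current stock for an item."""
    return INVENTORY.get(item, 0)


def create_ticket(summary: str) -> int:
    """Tool: record a ticket and return its number."""
    TICKETS.append(summary)
    return len(TICKETS)


def agent_run(item: str, quantity: int) -> list[str]:
    """Agent: work through steps, calling tools and acting on the results."""
    log = []
    stock = check_inventory(item)
    log.append(f"check_inventory({item!r}) -> {stock}")
    if stock >= quantity:
        ticket_id = create_ticket(f"Reserve {quantity} x {item}")
        log.append(f"create_ticket -> #{ticket_id}")
    else:
        log.append("no ticket: insufficient stock")
    return log


if __name__ == "__main__":
    print(chat_reply("blue mug availability"))
    for line in agent_run("blue mug", 2):
        print(line)
```

The chat function only returns text. The agent function changes state outside itself (a ticket now exists), which is the signal to classify a scenario as agentic.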

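For AI-901, write the workload name before the service name in your scratch notes. That habit guards against the most common service-selection error: naming a product before noticing a detail such as audio input, structured output, or tool use.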

Test Your Knowledge

A city inspection team uploads photos, short videos, and inspector voice notes. The app must return a JSON record of likely code violations, confidence scores, and source references for a reviewer. Which workload and service fit best?

Test Your Knowledge

A retail assistant must answer a customer, check live inventory, reserve two items, and create a pickup task if stock is available. How should this be classified?
