4.1 Workload Classification Scenarios
Key Takeaways
- AI-901 service selection starts by classifying the workload before naming a product: generative, agentic, text analysis, speech, vision, information extraction, content safety, or predictive machine learning.
- The fastest scenario clue is the input modality: written text points to Azure Language or Translator, audio points to Azure Speech, visual input points to Azure Vision or a multimodal model, and mixed-media extraction points to Azure Content Understanding.
- Use Microsoft Foundry models for open-ended generation and reasoning, but use Foundry Tools when the need is a purpose-built API such as transcription, entity extraction, OCR, or content moderation.
- An agentic AI solution differs from a normal chat exchange because an agent works through steps and can use tools, memory, or external systems to complete a task.
- Responsible AI, privacy, region, latency, cost, and human-review needs can change the right service even when the workload label looks obvious.
Classify Before You Pick
AI-901 scenario questions often look like service-name questions, but the underlying skill is workload classification. Microsoft explicitly lists common workloads in the AI-901 study guide: generative and agentic AI, text analysis, speech, computer vision, and information extraction. Treat those as the first sorting buckets. If you jump straight to a product name, you can miss a detail such as audio input, structured output, or tool use.
Start with the business action. Is the system creating new content, understanding existing content, extracting fields, predicting a category, or taking steps through tools? Then identify the input and output. The same customer-support workflow might use Azure Language to score sentiment, Azure Speech to transcribe calls, Content Understanding to extract call topics from recordings, and a Foundry model to draft a response.
Workload Map
| Scenario signal | Classify as | Likely Azure fit | Main exam question |
|---|---|---|---|
| Write an answer, draft code, summarize with reasoning, or create an image | Generative AI | Foundry model or image-generation model | Is the output new content? |
| Decide when to call an API, search files, or complete a multi-step task | Agentic AI | Foundry Agent Service or agent client | Does the AI need tools or actions? |
| Sentiment, entities, key phrases, language detection, or PII in written text | Text analysis | Azure Language in Foundry Tools | Is the input already text? |
| Transcription, captions, spoken translation, neural voice, speaker identity | Speech | Azure Speech in Foundry Tools | Is audio the input or output? |
| Analyze an existing photo, read visible text, detect objects, or caption an image | Computer vision | Azure Vision or a multimodal model | Is the app interpreting visual input? |
| Pull fields, sections, topics, or JSON from documents, images, audio, or video | Information extraction | Azure Content Understanding | Is the goal structured output from messy media? |
| Detect harmful prompts, image content, groundedness, protected material, or prompt attacks | Content safety | Azure AI Content Safety or Foundry guardrails | Is the risk unsafe or policy-breaking content? |
| Predict a category, number, future value, or outlier from historical data | Predictive machine learning | Azure Machine Learning or a trained model path | Is it estimating rather than generating? |
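For drilling, the map above can be written as a plain lookup table. This is a minimal study-aid sketch, not an Azure SDK call: the names `WORKLOAD_MAP` and `likely_fit` are hypothetical, and the values simply repeat the "Likely Azure fit" column.

```python
# Study aid only: a lookup table that mirrors the Workload Map.
# WORKLOAD_MAP and likely_fit are illustrative names, not part of any Azure SDK.
WORKLOAD_MAP = {
    "generative ai": "Foundry model or image-generation model",
    "agentic ai": "Foundry Agent Service or agent client",
    "text analysis": "Azure Language in Foundry Tools",
    "speech": "Azure Speech in Foundry Tools",
    "computer vision": "Azure Vision or a multimodal model",
    "information extraction": "Azure Content Understanding",
    "content safety": "Azure AI Content Safety or Foundry guardrails",
    "predictive machine learning": "Azure Machine Learning or a trained model path",
}


def likely_fit(workload: str) -> str:
    """Return the 'Likely Azure fit' entry for a workload label."""
    key = workload.strip().lower()
    if key not in WORKLOAD_MAP:
        # An unknown label means the scenario has not been classified yet.
        raise KeyError(f"Unknown workload: {workload!r}")
    return WORKLOAD_MAP[key]


print(likely_fit("Speech"))  # Azure Speech in Foundry Tools
```

The lookup only works once a workload label exists, which mirrors the order this section recommends: classify first, then name the service.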
The Four-Question Read
Use this process every time:
- Input: What does the app receive: text, speech, image, video, document, table data, or a user goal?
- Verb: What must the AI do: generate, classify, translate, transcribe, extract, moderate, search, or act?
- Output: Does the business need text, audio, an image, a label, a transcript, a confidence score, or structured JSON?
- Risk: Does the output affect people, expose sensitive data, require human review, or need content filtering?
That last question matters because service fit is not only functional. A model that can answer a question may still be the wrong production choice if it cannot cite sources, keep private data protected, or provide confidence signals for review.
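The four questions can be captured as a small record and a rule list. The sketch below is one possible simplification for practice, assuming a fixed vocabulary of input, verb, and output labels. `ScenarioRead`, `classify`, and `needs_review` are hypothetical names, and the rule order is an assumption rather than an official decision procedure.

```python
from dataclasses import dataclass


@dataclass
class ScenarioRead:
    """One pass of the four-question read (field values are illustrative labels)."""
    input_kind: str               # e.g. "text", "speech", "image", "mixed media"
    verb: str                     # e.g. "generate", "transcribe", "extract", "act"
    output_kind: str              # e.g. "text", "transcript", "structured JSON"
    risks: tuple[str, ...] = ()   # e.g. ("sensitive data", "human review")


def classify(read: ScenarioRead) -> str:
    """Map a scenario read to a workload name using simplified, ordered rules."""
    # The verb (business action) is checked first, then the input and output.
    if read.verb == "act":
        return "agentic AI"
    if read.verb == "moderate":
        return "content safety"
    if read.verb == "generate":
        return "generative AI"
    if read.verb == "predict":
        return "predictive machine learning"
    if read.verb == "extract" and read.output_kind == "structured JSON":
        return "information extraction"
    if read.input_kind == "speech" or read.output_kind in ("audio", "transcript"):
        return "speech"
    if read.input_kind in ("image", "video"):
        return "computer vision"
    if read.input_kind == "text":
        return "text analysis"
    return "unclassified"


def needs_review(read: ScenarioRead) -> bool:
    """The risk question: any listed risk means the service choice needs a second look."""
    return len(read.risks) > 0


captioning = ScenarioRead("speech", "transcribe", "transcript")
print(classify(captioning), needs_review(captioning))
```

Real exam scenarios are wordier than these labels, so the value of the sketch is the order of the checks, not the exact strings.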
Traps That Separate Similar Workloads
OCR versus extraction: Optical character recognition (OCR) reads visible text. Content Understanding goes further and can return fields, tables, confidence scores, grounding, classifications, and summaries across documents, images, audio, and video. If the scenario says read one sign, OCR may be enough. If it says process applications, invoices, recordings, or videos into a schema, think Content Understanding.
Speech recognition versus speaker recognition: Speech recognition finds the words. Speaker recognition identifies or verifies the voice. A captioning app needs speech to text. A secure voice gate may need speaker verification, with privacy controls.
Vision analysis versus image generation: Vision interprets an existing visual. Image generation creates a new visual from a prompt. A product-photo description is analysis; a new product mockup is generation.
Chat versus agent: A chat client answers a prompt. An agent can choose tools and work through steps. If the scenario needs the AI to check inventory, create a ticket, or update a record, classify it as agentic.
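The chat-versus-agent distinction can be made concrete with a toy contrast. This is a minimal sketch with stand-in functions, not Foundry Agent Service code: `chat_reply`, `check_inventory`, `create_ticket`, and `run_agent` are all hypothetical, and the plan is hard-coded here, whereas a real agent would let the model choose which tool to call at each step.

```python
from typing import Callable


def chat_reply(prompt: str) -> str:
    """Chat: one prompt in, one text answer out, with no side effects."""
    return f"Draft answer to: {prompt}"


# Hypothetical tools; a real agent would call external systems here.
def check_inventory(item: str) -> dict:
    return {"item": item, "in_stock": True}


def create_ticket(summary: str) -> dict:
    return {"ticket": summary, "status": "open"}


TOOLS: dict[str, Callable[[str], dict]] = {
    "check_inventory": check_inventory,
    "create_ticket": create_ticket,
}


def run_agent(plan: list[tuple[str, str]]) -> list[dict]:
    """Agent: work through steps, calling one tool per step and collecting results."""
    results = []
    for tool_name, argument in plan:
        results.append(TOOLS[tool_name](argument))
    return results


print(chat_reply("Is the widget in stock?"))
print(run_agent([("check_inventory", "widget"), ("create_ticket", "Restock widget")]))
```

The exam signal is the same as in the code: if the scenario only needs the first function, it is chat; if it needs the tool table and the loop, classify it as agentic.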
For AI-901, write the workload name before the service name in your scratch notes. That single habit prevents most service-selection errors.
A city inspection team uploads photos, short videos, and inspector voice notes. The app must return a JSON record of likely code violations, confidence scores, and source references for a reviewer. Which workload and service fit best?
A retail assistant must answer a customer, check live inventory, reserve two items, and create a pickup task if stock is available. How should this be classified?