Vision, Multimodal, Speech, and Language Workloads

Key Takeaways

AI-103 groups computer vision (10-15%) and text analysis (10-15%) as separate skill areas, but exam scenarios blend them into agent, captioning, and safety workflows.
Use a multimodal model for visual question answering and captions; use Azure AI Vision OCR (the Read capability) only when the goal is machine-readable text.
Azure AI Speech covers speech-to-text, text-to-speech, and speech translation; pick real-time, fast, or batch transcription by latency and file size.
Azure AI Language and Azure Translator cover entity extraction, sentiment, PII detection, summarization, and text or document translation.
Text inside images, screenshots, and scanned pages is untrusted input that can carry indirect prompt injection, so apply prompt shields before an agent acts on it.

Last updated: June 2026

Vision, Multimodal, Speech, and Language Workloads

Quick Answer: On AI-103, do not treat vision and language as legacy standalone APIs. Microsoft tests whether you can pick the right Microsoft Foundry capability for a workload: a multimodal model for visual reasoning, Azure AI Vision optical character recognition (OCR) for reading text in images, Azure AI Content Understanding for structured multimodal extraction, Azure AI Speech for voice, Azure AI Language for text analytics, Azure Translator for translation, and Azure AI Content Safety for unsafe text or images.

The computer vision skill area (10-15% of the exam) and the text analysis skill area (10-15%) are scored separately, but the questions rarely stay in their lane. A single scenario can ask you to caption an image, transcribe a voice note, and redact a phone number. Score those parts independently in your head, then pick one architecture that satisfies all of them.

Vision and Multimodal Decisions

A vision question usually opens with the input type: image, video, screenshot, scanned document, or generated media. Then it states the required output. If the app needs a caption, alt text, or an answer grounded in what is visible, a multimodal model (or a Foundry visual-understanding workflow) is the starting point. If the output must locate visible items, watch for the words object, component, region, bounding box, or zone. If the clue is text printed on a label, whiteboard, receipt, or package, the first capability is OCR via the Read model.

Requirement clue	Strong answer	Why it wins
"Describe this screenshot" or "answer from this image"	Multimodal model	The model reasons over text plus pixels in one prompt.
"Read serial numbers in photos"	Azure AI Vision OCR / Read	The goal is machine-readable text, not a general caption.
"Extract fields from a form"	Document Intelligence or Content Understanding	OCR alone does not produce reliable, named business fields.
"Generate product mockups from a prompt"	Image-generation model with policy controls	The output is new media, not analysis of existing media.
"Block unsafe pictures or embedded attacks"	Content Safety plus prompt shields	Visual inputs can carry harmful content or hidden instructions.

Worked example. An accessibility team needs alt text for thousands of product photos plus a short detailed description for screen readers. The correct answer is a multimodal model producing concise and extended captions aligned to accessibility guidelines, not OCR (there may be no text) and not object detection (a list of boxes is not a sentence). The AI-103 blueprint lists exactly this: "generation of alt-text and extended image descriptions aligned to accessibility guidelines."

Image- and video-generation questions add prompts, reference media, inpainting, mask-based edits, and prompt-driven modifications. The common trap is forgetting controls. Generation must enforce safety policies, brand rules, watermarks or provenance when required, and human-review paths for sensitive use. Visual safety is broader than adult or violent content; the blueprint explicitly names watermarks, flagging prohibited symbols, upholding brand-usage requirements, and detecting potentially inappropriate content.

Speech and Language Decisions

Speech workloads are audio-first. Speech-to-text turns spoken input into transcripts for captions, call-center analytics, meeting summaries, or agent turns. Choose the transcription mode by latency and size:

Real-time transcription for live conversations and streaming audio where the user needs words as they are spoken.
Fast transcription for short recordings that need a quick, predictable synchronous result.
Batch transcription for large files or offline jobs where you submit audio and poll for results.

Text-to-speech produces neural-voice audio. Speech Synthesis Markup Language (SSML) controls pronunciation, pauses, rate, pitch, volume, and speaking style. Custom speech (acoustic and language adaptation) and custom neural voice are powerful, but a custom synthetic voice is a responsible-AI scenario: it requires Microsoft access approval, recorded speaker consent, and disclosure that the voice is synthetic. Speech translation turns spoken audio in one language into text or speech in another, and is distinct from text translation.

Language workloads are text-first. The table below maps the common Azure AI Language tasks the exam tests:

Task	What it returns	Typical scenario
Named entity recognition (NER)	People, organizations, dates, locations, products, custom labels	Tagging support tickets or contracts
Sentiment analysis and opinion mining	Positive/negative/neutral plus aspect-level opinions	Scoring product reviews
Personally identifiable information (PII) detection	Spans of sensitive data with categories, plus redaction	Masking phone numbers before storage
Key phrase extraction and summarization	Salient phrases, extractive or abstractive summaries	Condensing call transcripts
Language detection	Detected language and confidence	Routing multilingual chat

Azure Translator handles text and document translation (preserving format for Word, PDF, and similar files). Pair Language with Translator when a workflow must, for example, detect language, translate to English, then run NER and PII redaction.

Exam Selection Pattern

Use this order on every vision, speech, or language item:

Identify the input modality: image, video, audio, plain text, document, or mixed.
Identify the output: caption, transcript, fields, answer, sentiment, translation, spoken audio, or generated media.
Add safety: content filters, PII redaction, prompt shields, human review, and audit logs where the scenario is sensitive.
Decide specialized service vs multimodal model. Specialized services win for predictable extraction and policy tasks; multimodal models win when flexible reasoning over mixed inputs is the core requirement.

The most common wrong answer ignores safety entirely or reaches for image generation when the task is analysis. A strong answer names the service, the reason, and the control in one sentence: "Use a multimodal model for the screenshot, treat its text as untrusted, and run PII detection before storage."

Test Your Knowledge

A support agent must accept a customer screenshot, answer questions about what is visible, ignore any instructions embedded inside the screenshot, transcribe a short voice note, and redact personal data before storing the case summary. Which design best matches the workload?

Use a multimodal model for screenshot reasoning, treat OCR text from the image as untrusted input, use Azure AI Speech for transcription, and apply PII detection before storage

Use only Azure AI Search because the problem includes text and the final output is a summary

Use text-to-speech first, then send the audio output to object detection and skip PII handling

Use image generation because screenshots are images and generated images can be analyzed later

Test Your Knowledge

A team needs live captions during a webinar, a quick synchronous transcript of a 20-second voicemail, and overnight processing of thousands of archived recordings. Which Azure AI Speech transcription modes fit, in that order?

Batch, real-time, fast

Real-time, fast, batch

Fast, batch, real-time

Real-time, batch, fast

Up Next

Image and Video Generation, Editing, and Content Understanding Pipelines

Continue learning

Microsoft Azure AI Apps and Agents Developer Associate

Microsoft Azure AI App and Agent Developer (AI-103)

Vision, Multimodal, Speech, and Language Workloads

Key Takeaways

Vision, Multimodal, Speech, and Language Workloads

Vision and Multimodal Decisions

Speech and Language Decisions

Exam Selection Pattern

Microsoft Azure AI Apps and Agents Developer Associate

1AI-103 Blueprint, Microsoft Foundry, and Solution Planning

2Generative AI, Agents, and Retrieval-Augmented Generation

3Vision, Language, Information Extraction, and Final Review

Microsoft Azure AI App and Agent Developer (AI-103)

Vision, Multimodal, Speech, and Language Workloads

Key Takeaways

Vision, Multimodal, Speech, and Language Workloads

Vision and Multimodal Decisions

Speech and Language Decisions

Exam Selection Pattern