Vision, Multimodal, Speech, and Language Workloads
Key Takeaways
- AI-103 vision questions are Foundry-centered: expect multimodal reasoning, visual question answering, captions, image or video generation, OCR, and visual safety controls.
- Azure AI Vision OCR reads printed and handwritten text, but richer document structure usually belongs to Document Intelligence or Content Understanding.
- Speech-to-text, text-to-speech, speech translation, Voice Live, and custom speech models support voice agents and audio workflows.
- Azure AI Language and Translator cover entity extraction, sentiment, PII detection, summarization, question answering, and text or document translation.
- Treat text inside images, screenshots, PDFs, and retrieved media as untrusted input because it can carry indirect prompt injection.
Vision, Multimodal, Speech, and Language Workloads
Quick Answer: For AI-103, do not think of vision and language as isolated legacy APIs. Microsoft tests whether you can choose the right Foundry-era capability for a workload: a multimodal model for visual reasoning, Azure AI Vision OCR for reading text in images, Content Understanding for structured multimodal extraction, Azure AI Speech for voice, Azure AI Language for text analytics, Translator for translation, and Content Safety for unsafe text or images.
Vision and Multimodal Decisions
A computer vision question usually starts with the input type: image, video, screenshot, scanned document, or generated media. Then look for the required output. If the app needs a caption, alt text, or an answer grounded in visible evidence, a multimodal model or Foundry visual understanding workflow is the likely starting point. If the output must locate visible items, the clue is object, component, region, bounding box, or zone. If the clue is text printed on a label, whiteboard, receipt, or package, the first capability is optical character recognition (OCR).
| Requirement clue | Strong answer | Why it wins |
|---|---|---|
| "Describe this screenshot" or "answer from this image" | Multimodal model | The model reasons over text plus pixels in one prompt. |
| "Read serial numbers in photos" | Azure AI Vision OCR / Read | The goal is machine-readable text, not a general caption. |
| "Extract fields from a form" | Document Intelligence or Content Understanding | OCR alone does not produce reliable business fields. |
| "Generate product mockups" | Image generation model with policy controls | The output is new media, not analysis of existing media. |
| "Block unsafe pictures or embedded attacks" | Content Safety plus prompt-injection defenses | Visual inputs can carry harmful content or hidden instructions. |
Image and video generation questions often include text prompts, reference media, inpainting, mask-based edits, or prompt-driven modifications. The exam trap is to forget controls: generation should enforce safety policies, brand rules, watermarks or provenance when required, and review paths for sensitive use cases. Visual safety is broader than adult or violent content. It includes prohibited symbols, unsafe generated media, brand misuse, and prompt injection hidden in screenshots or scanned pages.
Speech and Language Decisions
Speech workloads are audio-first. Speech-to-text converts spoken input into transcripts for captions, call-center analytics, meeting summaries, or agent turns. Use real-time transcription for live interactions, fast transcription for short recordings that need predictable latency, and batch transcription for large files or offline processing.
Text-to-speech creates spoken output with neural voices; Speech Synthesis Markup Language (SSML) controls pronunciation, pauses, rate, pitch, volume, and speaking style. Custom speech and custom voice are powerful, but custom synthetic voice is a responsible-AI scenario with access, consent, and disclosure considerations.
Language workloads are text-first. Named entity recognition (NER) finds people, organizations, dates, locations, products, or custom labels. Sentiment analysis estimates tone and opinion in feedback or support text. Personally identifiable information (PII) detection finds and can help redact sensitive personal data in text, conversations, and document workflows. Translator handles text and document translation, while speech translation handles spoken audio.
Exam Selection Pattern
Use this quick selection order:
- Identify the input modality: image, video, audio, plain text, document, or mixed content.
- Identify the output: caption, text, fields, answer, sentiment, translation, spoken audio, or generated media.
- Add safety: content filters, PII redaction, prompt shields, human review, and audit logs where the scenario is sensitive.
- Decide whether a specialized service or a multimodal model is more appropriate. Specialized services win for predictable extraction and policy tasks; multimodal models win when flexible reasoning over mixed inputs is the core requirement.
A support agent must accept a customer screenshot, answer questions about what is visible, ignore any instructions embedded inside the screenshot, transcribe a short voice note, and redact personal data before storing the case summary. Which design best matches the workload?