3.2 Vision and Image Workloads

Key Takeaways

Azure AI Vision is the service to map to existing-image analysis, including OCR, captions, dense captions, tags, object detection, people detection, and smart crop where supported.
OCR reads visible printed or handwritten text; object detection locates individual objects; image classification or tagging describes the overall visual content.
A deployed multimodal model can interpret visual input in a prompt, such as answering questions about a screenshot, chart, or product photo.
Image generation is different from image analysis: it creates new visual output from prompts or image inputs instead of extracting facts from an existing image.
Vision implementations still need data privacy review, content safety, region and model availability checks, and clear user expectations when generated or inferred visuals are used.

Last updated: June 2026

Analyze Existing Images Or Create New Ones

Computer vision workloads begin with visual input. The exam then asks what the app must do with that input. Azure AI Vision analyzes existing images. A multimodal model can reason over visual input as part of a prompt. An image-generation model creates new images from instructions or editing inputs. Those are related, but they are not interchangeable.

AI-901 scenarios usually do not require deep model math. They test whether you can match the task to the right visual capability and explain how a lightweight Foundry app would consume it.

Vision Capability Map

Need	Capability	Typical output
Read text on signs, labels, forms, or screenshots	Optical character recognition (OCR)	Words, lines, bounding regions, confidence
Describe what an image shows	Image captions or dense captions	Human-readable descriptions
Label visual features in an image	Tags or image classification	Labels such as indoor, product, vehicle, or tool
Find items and where they appear	Object detection	Object labels with bounding boxes
Locate people in a scene	People detection	Bounding boxes and confidence
Crop around the important region	Smart crop	Area-of-interest coordinates
Ask a question about an uploaded image	Deployed multimodal model	Text answer based on image and prompt
Make a new visual from instructions	Image-generation model	Generated image output

OCR, Image Analysis, And Object Detection

OCR is the right answer when the goal is to read visible text. A warehouse app that reads serial numbers from device photos, a mobile app that copies whiteboard text, or a kiosk that reads a posted notice is using OCR. OCR can return text with location and confidence so the app can show where the words were found.

Image analysis goes beyond text. It can create captions, assign tags, identify objects, detect people, and choose a smart crop. Object detection is stronger than image classification when location matters. Classifying a whole image as warehouse, beach, or receipt is not the same as locating each forklift, pallet, or price tag with a bounding box.

Multimodal Prompting

AI-901 now includes interpreting visual input in prompts by using a deployed multimodal model. This is not the same as classic OCR or object detection. A multimodal model can combine the image with a user question, such as: What is wrong with this chart? Which product in this photo has the damaged corner? What steps are shown in this screenshot?

Use a multimodal model when the app needs reasoning over visual and text context together. Use Azure AI Vision when the app needs a stable visual feature such as OCR, tags, captions, or object boxes. A production app can also combine them: first extract text and objects, then pass selected context to a model for a grounded explanation.

Image Generation

Image generation creates visual output. The user might request concept art, a product mockup, an illustration, or an edited image. That is a generative AI workload, not an extraction workload. Foundry and Azure OpenAI image-generation documentation emphasize model deployment, prompt-based generation, API or playground use, and model availability differences. Treat exact model names and access rules as changeable; the exam concept is stable.

Generated images require responsible AI review. Apps should label generated content where appropriate, filter unsafe requests, avoid misuse of likeness or protected material, and account for brand, copyright, and safety policies. A safe image-generation workflow is not just a prompt box.

Lightweight Foundry Build Process

Use this process for vision scenarios:

Identify whether the app analyzes an existing visual or creates a new one.
For analysis, choose Azure AI Vision, OCR, or a multimodal model based on the needed output.
For generation, choose an approved image-generation model or tool and deploy it where required.
Test examples in a Foundry playground, Vision quickstart, or relevant studio surface.
Call the endpoint from the app using SDK or REST authentication.
Add content safety, privacy controls, and human review for high-impact decisions.

The fastest exam check is this: OCR reads text, object detection locates things, captions describe images, multimodal models answer visual questions, and image generation creates new visuals.

Test Your Knowledge

A retail app receives shelf photos and must return the location of each missing price label so an employee can inspect the exact area. Which capability best matches the location requirement?

Object detection or image analysis that returns bounding boxes.

Sentiment analysis of customer comments.

Text to speech with a neural voice.

A translation model for document batches.

Test Your Knowledge

A design team wants several new packaging concepts from written creative prompts, not an analysis of an existing package photo. Which workload should they use?

Image generation, because the desired output is a new visual asset.

OCR, because every image task starts by reading text.

Speaker recognition, because the prompt is written by a designer.

Named entity recognition, because packaging is a noun phrase.

Up Next

3.3 Content Understanding and Final Review

Continue learning

Microsoft Certified: Azure AI Fundamentals

Microsoft Certified: Azure AI Fundamentals (AI-901)

3.2 Vision and Image Workloads

Key Takeaways

Analyze Existing Images Or Create New Ones

Vision Capability Map

OCR, Image Analysis, And Object Detection

Multimodal Prompting

Image Generation

Lightweight Foundry Build Process

Microsoft Certified: Azure AI Fundamentals

1Chapter 1: AI-901 Format and Responsible AI

2Chapter 2: Microsoft Foundry, Models, and Agents

3Chapter 3: Azure AI Services, Vision, Language, and Extraction

4Chapter 4: AI-901 Scenario and Service Selection

5Chapter 5: Practice Labs, Common Traps, and Final Review

Microsoft Certified: Azure AI Fundamentals (AI-901)

3.2 Vision and Image Workloads

Key Takeaways

Analyze Existing Images Or Create New Ones

Vision Capability Map

OCR, Image Analysis, And Object Detection

Multimodal Prompting

Image Generation

Lightweight Foundry Build Process