3.2 Vision and Image Workloads
Key Takeaways
- Azure AI Vision is the service to map to existing-image analysis, including OCR, captions, dense captions, tags, object detection, people detection, and smart crop where supported.
- OCR reads visible printed or handwritten text; object detection locates individual objects; image classification or tagging describes the overall visual content.
- A deployed multimodal model can interpret visual input in a prompt, such as answering questions about a screenshot, chart, or product photo.
- Image generation is different from image analysis: it creates new visual output from prompts or image inputs instead of extracting facts from an existing image.
- Vision implementations still need data privacy review, content safety, region and model availability checks, and clear user expectations when generated or inferred visuals are used.
Analyze Existing Images Or Create New Ones
Computer vision workloads begin with visual input. The exam then asks what the app must do with that input. Azure AI Vision analyzes existing images. A multimodal model can reason over visual input as part of a prompt. An image-generation model creates new images from instructions or editing inputs. Those are related, but they are not interchangeable.
AI-901 scenarios usually do not require deep model math. They test whether you can match the task to the right visual capability and explain how a lightweight Foundry app would consume it.
Vision Capability Map
| Need | Capability | Typical output |
|---|---|---|
| Read text on signs, labels, forms, or screenshots | Optical character recognition (OCR) | Words, lines, bounding regions, confidence |
| Describe what an image shows | Image captions or dense captions | Human-readable descriptions |
| Label visual features in an image | Tags or image classification | Labels such as indoor, product, vehicle, or tool |
| Find items and where they appear | Object detection | Object labels with bounding boxes |
| Locate people in a scene | People detection | Bounding boxes and confidence |
| Crop around the important region | Smart crop | Area-of-interest coordinates |
| Ask a question about an uploaded image | Deployed multimodal model | Text answer based on image and prompt |
| Make a new visual from instructions | Image-generation model | Generated image output |
OCR, Image Analysis, And Object Detection
OCR is the right answer when the goal is to read visible text. A warehouse app that reads serial numbers from device photos, a mobile app that copies whiteboard text, or a kiosk that reads a posted notice is using OCR. OCR can return text with location and confidence so the app can show where the words were found.
Image analysis goes beyond text. It can create captions, assign tags, identify objects, detect people, and choose a smart crop. Object detection is stronger than image classification when location matters. Classifying a whole image as warehouse, beach, or receipt is not the same as locating each forklift, pallet, or price tag with a bounding box.
Multimodal Prompting
AI-901 now includes interpreting visual input in prompts by using a deployed multimodal model. This is not the same as classic OCR or object detection. A multimodal model can combine the image with a user question, such as: What is wrong with this chart? Which product in this photo has the damaged corner? What steps are shown in this screenshot?
Use a multimodal model when the app needs reasoning over visual and text context together. Use Azure AI Vision when the app needs a stable visual feature such as OCR, tags, captions, or object boxes. A production app can also combine them: first extract text and objects, then pass selected context to a model for a grounded explanation.
Image Generation
Image generation creates visual output. The user might request concept art, a product mockup, an illustration, or an edited image. That is a generative AI workload, not an extraction workload. Foundry and Azure OpenAI image-generation documentation emphasize model deployment, prompt-based generation, API or playground use, and model availability differences. Treat exact model names and access rules as changeable; the exam concept is stable.
Generated images require responsible AI review. Apps should label generated content where appropriate, filter unsafe requests, avoid misuse of likeness or protected material, and account for brand, copyright, and safety policies. A safe image-generation workflow is not just a prompt box.
Lightweight Foundry Build Process
Use this process for vision scenarios:
- Identify whether the app analyzes an existing visual or creates a new one.
- For analysis, choose Azure AI Vision, OCR, or a multimodal model based on the needed output.
- For generation, choose an approved image-generation model or tool and deploy it where required.
- Test examples in a Foundry playground, Vision quickstart, or relevant studio surface.
- Call the endpoint from the app using SDK or REST authentication.
- Add content safety, privacy controls, and human review for high-impact decisions.
The fastest exam check is this: OCR reads text, object detection locates things, captions describe images, multimodal models answer visual questions, and image generation creates new visuals.
A retail app receives shelf photos and must return the location of each missing price label so an employee can inspect the exact area. Which capability best matches the location requirement?
A design team wants several new packaging concepts from written creative prompts, not an analysis of an existing package photo. Which workload should they use?