3.2 Azure AI Vision Service
Key Takeaways
- Azure AI Vision (formerly Computer Vision) is the primary Azure service for image and video analysis — it provides pre-built models for common vision tasks.
- The Image Analysis 4.0 API supports image captioning, tagging, object detection, smart cropping, people detection, and background removal.
- Azure AI Vision OCR (Read API) extracts printed and handwritten text from images and documents with high accuracy.
- Spatial Analysis uses video streams to detect people and track their movements in physical spaces (retail analytics, occupancy monitoring).
- You access Azure AI Vision through the REST API or SDKs — no custom model training is needed for pre-built capabilities.
Quick Answer: Azure AI Vision is the primary Azure service for image and video analysis. It provides pre-built models for image captioning, tagging, object detection, OCR, spatial analysis, and more. No custom model training is needed — you send an image to the API and receive analysis results.
What Is Azure AI Vision?
Azure AI Vision (formerly known as Azure Computer Vision) is a cloud-based service that provides pre-built computer vision capabilities. You send images or video to the service, and it returns structured analysis results.
Key Capabilities
| Capability | Description | Example Output |
|---|---|---|
| Image captioning | Generate a natural language description of an image | "A dog playing fetch in a park" |
| Image tagging | Assign descriptive tags to an image | ["dog", "outdoor", "park", "grass", "playing"] |
| Object detection | Identify and locate objects with bounding boxes | Object: "dog" with bounding box [x=100, y=200, w=150, h=180] |
| Smart cropping | Automatically crop images around regions of interest | Focused crop around the main subject |
| People detection | Detect and locate people in images | Person locations with bounding boxes |
| Background removal | Separate foreground from background | Foreground mask or transparent background |
| OCR (Read) | Extract printed and handwritten text | "Invoice #12345, Date: March 2026" |
| Spatial analysis | Analyze video for people counting and movement | 15 people in zone A, average dwell time 3 min |
Image Analysis 4.0 API
The latest version of the Image Analysis API (4.0) is powered by Florence, a large-scale vision foundation model. Key improvements include:
- Better captioning — more natural and accurate image descriptions
- Dense captioning — descriptions for multiple regions in an image
- Image retrieval — search through images using text queries (vector search)
- Customization — add your own categories using few-shot learning (minimal training data)
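To make the "vector search" idea behind image retrieval concrete, here is a minimal sketch, assuming each image and the text query have already been turned into embedding vectors. The three-dimensional vectors and file names are invented for illustration; in practice the vectors would come from the service's embedding calls and be much longer.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def rank_images(query_vector, image_vectors):
    """Return image names sorted from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), name)
        for name, vector in image_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]

# Toy vectors, invented for illustration only (hypothetical data).
IMAGE_VECTORS = {
    "dog_in_park.jpg": [0.9, 0.1, 0.0],
    "city_street.jpg": [0.1, 0.9, 0.2],
    "beach_sunset.jpg": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of a text query such as "a dog outdoors".
QUERY_VECTOR = [0.8, 0.2, 0.1]

ranking = rank_images(QUERY_VECTOR, IMAGE_VECTORS)
```

The image whose vector points in nearly the same direction as the query vector ranks first, which is why a text query can find images that were never manually tagged.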
How to Use Image Analysis
1. Create an Azure AI Vision resource in the Azure portal
2. Send an image to the REST API or use the SDK
3. Specify which visual features you want (caption, tags, objects, etc.)
4. Receive structured JSON results
Request: POST /imageanalysis:analyze?features=caption,tags,objects
Image: [photo of a park scene]
Response:
{
  "caption": "A golden retriever playing fetch in a sunny park",
  "tags": ["dog", "golden retriever", "park", "grass", "ball", "outdoor"],
  "objects": [
    {"name": "dog", "confidence": 0.97, "boundingBox": {...}},
    {"name": "ball", "confidence": 0.89, "boundingBox": {...}}
  ]
}
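The request above can be sketched in Python with only the standard library. Treat the details as assumptions to verify against your own resource: the `/computervision/imageanalysis:analyze` path, the `api-version` value, and the `Ocp-Apim-Subscription-Key` header are what the 4.0 REST API is generally documented to use, and the endpoint and key are placeholders. `confident_objects` reads the simplified response shape shown above; the live response nests its results differently (for example, under fields such as `captionResult`), so adapt the parser to what your API version returns.

```python
import json
import urllib.parse
import urllib.request

API_VERSION = "2024-02-01"  # assumed version string; confirm for your resource

def build_analyze_url(endpoint, features, api_version=API_VERSION):
    """Build the analyze URL for the requested visual features."""
    query = urllib.parse.urlencode(
        {"api-version": api_version, "features": ",".join(features)},
        safe=",",  # keep the comma-separated feature list readable
    )
    return f"{endpoint.rstrip('/')}/computervision/imageanalysis:analyze?{query}"

def analyze_image_url(endpoint, key, image_url, features):
    """POST a publicly reachable image URL and return the parsed JSON reply."""
    request = urllib.request.Request(
        build_analyze_url(endpoint, features),
        data=json.dumps({"url": image_url}).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def confident_objects(result, threshold=0.9):
    """Object names at or above the threshold (simplified response shape)."""
    return [
        obj["name"]
        for obj in result.get("objects", [])
        if obj["confidence"] >= threshold
    ]

# Mirrors the simplified example response, with bounding boxes omitted.
SAMPLE_RESULT = {
    "caption": "A golden retriever playing fetch in a sunny park",
    "objects": [
        {"name": "dog", "confidence": 0.97},
        {"name": "ball", "confidence": 0.89},
    ],
}
```

Calling `analyze_image_url` needs a real endpoint and key, so it is defined here but not run. Filtering on confidence, as `confident_objects` does, is a common way to discard uncertain detections before acting on them.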
On the Exam: Know that Azure AI Vision provides PRE-BUILT capabilities — you do not need to train a model. Send an image, get results. If a question asks about training a custom image model, that is Azure AI Custom Vision (different service).
OCR with the Read API
The Read API is Azure AI Vision's OCR capability for extracting text from images and documents:
Supported Content
- Printed text in 164+ languages
- Handwritten text in English, Chinese, French, German, Italian, Japanese, Korean, Portuguese, Spanish
- Mixed content — images with both printed and handwritten text
- Document formats — JPEG, PNG, BMP, PDF, TIFF
Read API Process
1. Submit an image or document to the Read API
2. The service processes the request asynchronously, so the client polls for the result rather than receiving it in the initial response
3. Retrieve the results: extracted text with line and word positions, plus a confidence score for each extracted word
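The asynchronous submit-then-retrieve pattern can be sketched as a polling loop. This is a minimal illustration, not the SDK's implementation: `fetch_status` is a hypothetical callable that stands in for the HTTP GET against the operation URL, and the `analyzeResult` / `readResults` / `lines` / `words` layout follows the Read 3.2 result shape, which you should confirm for the API version you use. A fake service replaces the network so the example runs offline.

```python
import time

def poll_read_result(fetch_status, interval_seconds=1.0, max_attempts=30,
                     sleep=time.sleep):
    """Call fetch_status() until the operation finishes; return the payload."""
    for _ in range(max_attempts):
        payload = fetch_status()
        status = payload.get("status", "").lower()
        if status == "succeeded":
            return payload
        if status == "failed":
            raise RuntimeError("Read operation failed")
        sleep(interval_seconds)  # still notStarted or running; wait and retry
    raise TimeoutError("Read operation did not finish in time")

def extract_words(payload, min_confidence=0.0):
    """Flatten a Read-style result into (text, confidence) pairs."""
    words = []
    for page in payload.get("analyzeResult", {}).get("readResults", []):
        for line in page.get("lines", []):
            for word in line.get("words", []):
                if word["confidence"] >= min_confidence:
                    words.append((word["text"], word["confidence"]))
    return words

def make_fake_service():
    """Hypothetical stand-in for the service: two pending replies, then done."""
    responses = iter([
        {"status": "notStarted"},
        {"status": "running"},
        {"status": "succeeded", "analyzeResult": {"readResults": [{"lines": [{
            "text": "Invoice #12345",
            "words": [
                {"text": "Invoice", "confidence": 0.99},
                {"text": "#12345", "confidence": 0.87},
            ],
        }]}]}},
    ])
    return lambda: next(responses)

# Skip real sleeping so the demo finishes instantly.
result = poll_read_result(make_fake_service(), sleep=lambda seconds: None)
all_words = extract_words(result)
reliable_words = extract_words(result, min_confidence=0.9)
```

Per-word confidence scores let an application route low-confidence words, such as `#12345` here, to human review instead of trusting them blindly.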
Common OCR Use Cases
- Digitizing paper documents and forms
- Reading license plates from camera images
- Extracting data from receipts and invoices
- Converting handwritten notes to searchable text
- Indexing text in image-heavy documents
Spatial Analysis
Spatial Analysis uses video streams from cameras to understand how people move through physical spaces:
| Capability | Description | Use Case |
|---|---|---|
| People counting | Count people entering/exiting an area | Retail foot traffic analysis |
| Social distancing | Measure distance between people | Workplace safety compliance |
| Zone dwell time | Track how long people stay in areas | Retail store layout optimization |
| Queue monitoring | Count people in lines and estimate wait times | Customer service improvement |
| Movement tracking | Track paths people take through a space | Facility layout planning |
On the Exam: Spatial Analysis requires a camera feeding video to an Azure IoT Edge device, where the analysis runs as a container. It processes video locally (edge computing) and sends only aggregated analytics to the cloud, not individual faces or video frames.
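To show what "aggregated analytics" can mean in practice, here is a minimal sketch that turns zone enter/exit events into occupancy and average dwell time. The event format (`person_id`, `zone`, `event`, `t` in seconds) is hypothetical and invented for this example; it is not the service's actual output schema. `person_id` is an anonymous track number, not an identity.

```python
from collections import defaultdict

def summarize_zone_events(events):
    """Aggregate enter/exit events into per-zone occupancy and dwell time."""
    entered_at = {}                  # (zone, person_id) -> enter timestamp
    dwell_times = defaultdict(list)  # zone -> completed visit durations
    for event in sorted(events, key=lambda e: e["t"]):
        key = (event["zone"], event["person_id"])
        if event["event"] == "enter":
            entered_at[key] = event["t"]
        elif event["event"] == "exit" and key in entered_at:
            dwell_times[key[0]].append(event["t"] - entered_at.pop(key))

    zones = {zone for zone, _ in entered_at} | set(dwell_times)
    summary = {}
    for zone in sorted(zones):
        durations = dwell_times.get(zone, [])
        summary[zone] = {
            "currently_inside": sum(1 for z, _ in entered_at if z == zone),
            "completed_visits": len(durations),
            "avg_dwell_seconds": (
                sum(durations) / len(durations) if durations else None
            ),
        }
    return summary

# Hypothetical events for one zone; timestamps are seconds since start.
EVENTS = [
    {"person_id": 1, "zone": "A", "event": "enter", "t": 0},
    {"person_id": 2, "zone": "A", "event": "enter", "t": 30},
    {"person_id": 1, "zone": "A", "event": "exit", "t": 120},
    {"person_id": 2, "zone": "A", "event": "exit", "t": 270},
    {"person_id": 3, "zone": "A", "event": "enter", "t": 300},
]

summary = summarize_zone_events(EVENTS)
```

Only the summary (counts and averages) would need to leave the edge device, which matches the privacy point above: no frames or faces are part of the aggregated result.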
When to Use Azure AI Vision vs. Other Services
| Scenario | Service to Use |
|---|---|
| Analyze a single image for tags, captions, objects | Azure AI Vision |
| Extract text from a scanned document | Azure AI Vision (Read API) or Azure AI Document Intelligence |
| Train a custom image classifier with your own categories | Azure AI Custom Vision |
| Detect and verify human faces | Azure AI Face |
| Generate images from text descriptions | Azure OpenAI Service (DALL-E / GPT Image) |
| Analyze video for scene detection and transcription | Azure AI Video Indexer |
| Count people in a physical space using video | Azure AI Vision (Spatial Analysis) |
| Extract structured data from forms and invoices | Azure AI Document Intelligence |
Which Azure service provides pre-built image analysis capabilities including captioning, tagging, and object detection WITHOUT requiring custom model training?
A company wants to count the number of people entering and exiting their retail stores using existing security cameras. Which Azure AI Vision capability should they use?
Which Azure AI Vision capability would you use to extract text from a photograph of a handwritten note?
What is the difference between Azure AI Vision and Azure AI Custom Vision?