3.2 Azure AI Vision Service

Key Takeaways

  • Azure AI Vision (formerly Computer Vision) is the primary Azure service for image and video analysis — it provides pre-built models for common vision tasks.
  • The Image Analysis 4.0 API supports image captioning, tagging, object detection, smart cropping, people detection, and background removal.
  • Azure AI Vision OCR (Read API) extracts printed and handwritten text from images and documents with high accuracy.
  • Spatial Analysis uses video streams to detect people and track their movements in physical spaces (retail analytics, occupancy monitoring).
  • You access Azure AI Vision through the REST API or SDKs — no custom model training is needed for pre-built capabilities.
Last updated: March 2026

Quick Answer: Azure AI Vision is the primary Azure service for image and video analysis. It provides pre-built models for image captioning, tagging, object detection, OCR, spatial analysis, and more. No custom model training is needed — you send an image to the API and receive analysis results.

What Is Azure AI Vision?

Azure AI Vision (formerly known as Azure Computer Vision) is a cloud-based service that provides pre-built computer vision capabilities. You send images or video to the service, and it returns structured analysis results.

Key Capabilities

Capability | Description | Example Output
Image captioning | Generate a natural language description of an image | "A dog playing fetch in a park"
Image tagging | Assign descriptive tags to an image | ["dog", "outdoor", "park", "grass", "playing"]
Object detection | Identify and locate objects with bounding boxes | Object: "dog" at [100, 200, 150, 180]
Smart cropping | Automatically crop images around regions of interest | Focused crop around the main subject
People detection | Detect and locate people in images | Person locations with bounding boxes
Background removal | Separate foreground from background | Foreground mask or transparent background
OCR (Read) | Extract printed and handwritten text | "Invoice #12345, Date: March 2026"
Spatial analysis | Analyze video for people counting and movement | 15 people in zone A, average dwell time 3 min

Image Analysis 4.0 API

The latest version of the Image Analysis API (4.0) is powered by Florence, a large-scale vision foundation model. Key improvements include:

  • Better captioning — more natural and accurate image descriptions
  • Dense captioning — descriptions for multiple regions in an image
  • Image retrieval — search through images using text queries (vector search)
  • Customization — add your own categories using few-shot learning (minimal training data)
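
The image retrieval idea can be sketched in a few lines. The example below assumes that text queries and images have already been converted to embedding vectors in the same vector space, and that cosine similarity is the comparison metric. The toy 3-dimensional vectors, file names, and function names are hypothetical stand-ins for illustration, not output from the service.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)

def rank_images(query_vector, image_vectors):
    """Return (image_id, score) pairs sorted from most to least similar."""
    scored = [
        (image_id, cosine_similarity(query_vector, vector))
        for image_id, vector in image_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings; real embeddings have many more dimensions.
image_vectors = {
    "dog_park.jpg": [0.9, 0.1, 0.0],
    "city_night.jpg": [0.0, 0.2, 0.9],
    "beach_dog.jpg": [0.7, 0.6, 0.1],
}
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of "a dog outdoors"

ranking = rank_images(query_vector, image_vectors)
for image_id, score in ranking:
    print(f"{image_id}: {score:.3f}")
```

The point of the sketch is that the text query never has to match a tag or caption literally: images are ranked by how close their vectors sit to the query vector.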

How to Use Image Analysis

  1. Create an Azure AI Vision resource in the Azure portal
  2. Send an image to the REST API or use the SDK
  3. Specify which visual features you want (caption, tags, objects, etc.)
  4. Receive structured JSON results
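
Simplified example (a real request also carries an api-version parameter, and a real response nests each feature's result under its own key):

Request: POST /imageanalysis:analyze?features=caption,tags,objects
Image: [photo of a park scene]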

Response:
{
  "caption": "A golden retriever playing fetch in a sunny park",
  "tags": ["dog", "golden retriever", "park", "grass", "ball", "outdoor"],
  "objects": [
    {"name": "dog", "confidence": 0.97, "boundingBox": {...}},
    {"name": "ball", "confidence": 0.89, "boundingBox": {...}}
  ]
}
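
A minimal Python sketch of the four steps above, using only the standard library. It builds the HTTP request without sending it, then summarizes a response. Treat the details as assumptions to check against the current REST reference: the `/computervision/imageanalysis:analyze` path, the `2024-02-01` API version, the `Ocp-Apim-Subscription-Key` header, and the `captionResult` / `tagsResult` / `objectsResult` field names. The endpoint, key, image URL, and sample response values are hypothetical placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_VERSION = "2024-02-01"  # assumed API version; check the current docs

def build_analyze_request(endpoint, key, image_url, features):
    """Build (but do not send) an Image Analysis request for a public image URL."""
    query = urlencode(
        {"api-version": API_VERSION, "features": ",".join(features)},
        safe=",",  # keep the commas in the feature list readable
    )
    url = f"{endpoint.rstrip('/')}/computervision/imageanalysis:analyze?{query}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")

def summarize_result(result):
    """Flatten the nested response (assumed shape) into caption, tags, objects."""
    caption = result.get("captionResult", {}).get("text")
    tags = [tag["name"] for tag in result.get("tagsResult", {}).get("values", [])]
    objects = []
    for obj in result.get("objectsResult", {}).get("values", []):
        box = obj["boundingBox"]
        for tag in obj.get("tags", []):
            objects.append(
                (tag["name"], tag["confidence"], (box["x"], box["y"], box["w"], box["h"]))
            )
    return {"caption": caption, "tags": tags, "objects": objects}

# Hypothetical endpoint, key, and image URL.
request = build_analyze_request(
    "https://example-resource.cognitiveservices.azure.com/",
    "YOUR_KEY",
    "https://example.com/park.jpg",
    ["caption", "tags", "objects"],
)

# Hypothetical response in the assumed shape.
sample_response = {
    "captionResult": {"text": "a dog playing fetch in a park", "confidence": 0.91},
    "tagsResult": {"values": [{"name": "dog", "confidence": 0.99},
                              {"name": "grass", "confidence": 0.95}]},
    "objectsResult": {"values": [
        {"boundingBox": {"x": 100, "y": 200, "w": 150, "h": 180},
         "tags": [{"name": "dog", "confidence": 0.97}]},
    ]},
}
summary = summarize_result(sample_response)
print(request.full_url)
print(summary)
```

With a real resource, passing `request` to `urllib.request.urlopen` and decoding the body with `json.load` would produce the dictionary that `summarize_result` expects.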

On the Exam: Know that Azure AI Vision provides PRE-BUILT capabilities — you do not need to train a model. Send an image, get results. If a question asks about training a custom image model, that is Azure AI Custom Vision (different service).

OCR with the Read API

The Read API is Azure AI Vision's OCR capability. It extracts text from images and documents.

Supported Content

  • Printed text in 164+ languages
  • Handwritten text in English, Chinese, French, German, Italian, Japanese, Korean, Portuguese, Spanish
  • Mixed content — images with both printed and handwritten text
  • Document formats — JPEG, PNG, BMP, PDF, TIFF

Read API Process

  1. Submit an image or document to the Read API
  2. The service processes the image asynchronously
  3. Retrieve results with extracted text, line positions, and word positions
  4. Results include confidence scores for each extracted word
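
The submit-then-retrieve pattern in the steps above can be sketched as a polling loop. The example assumes the result payload carries a `status` field (`running`, `succeeded`, `failed`) and a nested `analyzeResult.readResults[].lines[].words[]` structure with per-word `confidence`, which is how the Read 3.2 REST response is laid out as far as I know. The function names are hypothetical, and the HTTP call is replaced by an injected `fetch_status` callable so the sketch runs offline.

```python
import time

def wait_for_read_result(fetch_status, interval_seconds=1.0, max_attempts=30,
                         sleep=time.sleep):
    """Poll until the Read operation succeeds, fails, or attempts run out."""
    for _ in range(max_attempts):
        payload = fetch_status()  # in real use: GET the operation URL
        status = payload.get("status")
        if status == "succeeded":
            return payload
        if status == "failed":
            raise RuntimeError("Read operation failed")
        sleep(interval_seconds)  # still queued or running; wait and retry
    raise TimeoutError("Read operation did not finish in time")

def words_below(payload, threshold):
    """Collect (text, confidence) for words under the confidence threshold."""
    low = []
    for page in payload["analyzeResult"]["readResults"]:
        for line in page["lines"]:
            for word in line["words"]:
                if word["confidence"] < threshold:
                    low.append((word["text"], word["confidence"]))
    return low

# Hypothetical sequence of responses: one "running", then the final result.
fake_responses = iter([
    {"status": "running"},
    {"status": "succeeded", "analyzeResult": {"readResults": [{"lines": [
        {"text": "Invoice #12345", "words": [
            {"text": "Invoice", "confidence": 0.99},
            {"text": "#12345", "confidence": 0.72},
        ]},
    ]}]}},
])

result = wait_for_read_result(lambda: next(fake_responses), sleep=lambda s: None)
needs_review = words_below(result, threshold=0.8)
print(needs_review)
```

Filtering on word confidence is one common way to route uncertain extractions to a human reviewer instead of trusting every word equally.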

Common OCR Use Cases

  • Digitizing paper documents and forms
  • Reading license plates from camera images
  • Extracting data from receipts and invoices
  • Converting handwritten notes to searchable text
  • Indexing text in image-heavy documents

Spatial Analysis

Spatial Analysis uses video streams from cameras to understand how people move through physical spaces:

Capability | Description | Use Case
People counting | Count people entering/exiting an area | Retail foot traffic analysis
Social distancing | Measure distance between people | Workplace safety compliance
Zone dwell time | Track how long people stay in areas | Retail store layout optimization
Queue monitoring | Count people in lines and estimate wait times | Customer service improvement
Movement tracking | Track paths people take through a space | Facility layout planning
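
To make people counting and zone dwell time concrete, here is a small sketch that turns enter/exit events into occupancy and average dwell time per zone. The event format (`track_id`, `zone`, `type`, `t` in seconds) is a hypothetical simplification invented for this example, not the schema Spatial Analysis emits.

```python
from collections import defaultdict

def summarize_zone_events(events):
    """Compute current occupancy and average dwell time (seconds) per zone."""
    entered_at = {}                   # (track_id, zone) -> entry time
    occupancy = defaultdict(int)      # zone -> people currently inside
    dwell_times = defaultdict(list)   # zone -> completed visit durations

    for event in sorted(events, key=lambda e: e["t"]):
        key = (event["track_id"], event["zone"])
        if event["type"] == "enter":
            entered_at[key] = event["t"]
            occupancy[event["zone"]] += 1
        elif event["type"] == "exit" and key in entered_at:
            # Only count exits that match a recorded entry.
            dwell_times[event["zone"]].append(event["t"] - entered_at.pop(key))
            occupancy[event["zone"]] -= 1

    summary = {}
    for zone in occupancy:
        visits = dwell_times[zone]
        summary[zone] = {
            "occupancy": occupancy[zone],
            "avg_dwell_seconds": sum(visits) / len(visits) if visits else None,
        }
    return summary

# Hypothetical events: three anonymous tracks pass through zone A.
events = [
    {"track_id": 1, "zone": "A", "type": "enter", "t": 0},
    {"track_id": 2, "zone": "A", "type": "enter", "t": 30},
    {"track_id": 3, "zone": "A", "type": "enter", "t": 60},
    {"track_id": 1, "zone": "A", "type": "exit", "t": 120},
    {"track_id": 2, "zone": "A", "type": "exit", "t": 270},
]

zone_summary = summarize_zone_events(events)
print(zone_summary)
```

In this toy data two visits finish (120 s and 240 s) while one person is still inside, so zone A ends with an occupancy of 1 and an average dwell time of 3 minutes.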

On the Exam: Spatial Analysis requires a camera connected to an Azure IoT Edge device. It processes video locally (edge computing) and sends only aggregated analytics to the cloud — not individual faces or video frames.

When to Use Azure AI Vision vs. Other Services

Scenario | Service to Use
Analyze a single image for tags, captions, objects | Azure AI Vision
Extract text from a scanned document | Azure AI Vision (Read API) or Azure AI Document Intelligence
Train a custom image classifier with your own categories | Azure AI Custom Vision
Detect and verify human faces | Azure AI Face
Generate images from text descriptions | Azure OpenAI Service (DALL-E / GPT Image)
Analyze video for scene detection and transcription | Azure AI Video Indexer
Count people in a physical space using video | Azure AI Vision (Spatial Analysis)
Extract structured data from forms and invoices | Azure AI Document Intelligence

Test Your Knowledge

Which Azure service provides pre-built image analysis capabilities including captioning, tagging, and object detection WITHOUT requiring custom model training?


A company wants to count the number of people entering and exiting their retail stores using existing security cameras. Which Azure AI Vision capability should they use?


Which Azure AI Vision capability would you use to extract text from a photograph of a handwritten note?


What is the difference between Azure AI Vision and Azure AI Custom Vision?
