3.1 Azure AI Vision — Image Analysis 4.0

Key Takeaways

Image Analysis 4.0 delivers captioning, dense captions, tagging, object detection, people detection, smart crops, and OCR in a single Analyze call powered by the Florence foundation model.
Visual features are passed as a list (Caption, DenseCaptions, Tags, Objects, People, SmartCrops, Read); only requested features are billed and returned.
Caption and DenseCaptions are GA only in a limited set of Azure regions; requesting Caption from an unsupported region returns an error, a frequent exam scenario.
Input accepts a public URL or binary bytes up to 20 MB, between 50x50 and 16000x16000 pixels, in JPEG, PNG, GIF, BMP, WEBP, ICO, TIFF, or MPO.
Custom image classification and object detection models train on as few as 2-5 images per label using Florence transfer learning via Vision Studio or the REST API.

Last updated: June 2026

Quick Answer: Image Analysis 4.0 uses the Florence foundation model to return captioning, dense captions, tags, objects, people, smart crops, and OCR from a single analyze call. You request features with the features (REST) or visual_features (SDK) parameter and read each one from its own result object. Caption and DenseCaptions are limited to specific regions.

What the Analyze API Returns

Image Analysis 4.0 is exposed at the imageanalysis:analyze endpoint and replaces the legacy Computer Vision 3.x analyze and describe operations. A single round trip can return up to seven analyses, each charged as a separate transaction.

Visual feature	What it produces	Result property
Caption	One human-readable sentence for the whole image	`caption.text`, `caption.confidence`
DenseCaptions	Up to 10 region captions, each with a bounding box	`dense_captions.list[]`
Tags	Content tags with confidence (no boxes)	`tags.list[]`
Objects	Object class + bounding box	`objects.list[].tags`, `.bounding_box`
People	People bounding boxes + confidence	`people.list[]`
SmartCrops	Crop coordinates per requested aspect ratio	`smart_crops.list[]`
Read	Printed/handwritten text (OCR)	`read.blocks[].lines[].words`

Region and Input Limits (high-value trap)

Caption and DenseCaptions are GA only in a subset of regions (East US, West US, West Europe, France Central, Korea Central, North Europe, Southeast Asia, West US 2, East Asia, Switzerland North, Sweden Central, and a few more). Tags, Objects, People, SmartCrops, and Read run in all Vision regions. If a question shows a 400/feature-unsupported error after adding Caption, the fix is to deploy the resource in a supported region — not to change the SDK.
Images, whether sent as a URL or as bytes, must be between 50x50 and 16000x16000 pixels and at most 20 MB.
Set gender_neutral_caption=True to replace gendered nouns ("man", "woman") with "person". The parameter defaults to False, so teams with Responsible AI requirements must enable it explicitly.
smart_crops_aspect_ratios accepts values from 0.75 to 1.8 (e.g., 0.9 for portrait thumbnails, 1.33 for 4:3). Omit it and the service picks the best single crop.
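The limits above are easy to check on the client before spending a transaction. The following is a minimal sketch, not part of the SDK: the helper name check_image_input is hypothetical, 20 MB is assumed to mean 20 * 1024 * 1024 bytes, and the bounds are assumed to be inclusive.

```python
MAX_IMAGE_BYTES = 20 * 1024 * 1024   # assumption: 20 MB counted in binary megabytes
MIN_SIDE, MAX_SIDE = 50, 16000       # pixel bounds per side (assumed inclusive)
MIN_RATIO, MAX_RATIO = 0.75, 1.8     # allowed smart-crop aspect ratios


def check_image_input(size_bytes, width, height, crop_ratios=()):
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    if size_bytes > MAX_IMAGE_BYTES:
        problems.append("file larger than 20 MB")
    if not (MIN_SIDE <= width <= MAX_SIDE and MIN_SIDE <= height <= MAX_SIDE):
        problems.append("dimensions outside 50x50 to 16000x16000")
    for ratio in crop_ratios:
        if not (MIN_RATIO <= ratio <= MAX_RATIO):
            problems.append(f"aspect ratio {ratio} outside 0.75 to 1.8")
    return problems


# A 3 MB, 1920x1080 image with 16:9 and 1:1 crops passes every check.
print(check_image_input(3 * 1024 * 1024, 1920, 1080, [1.78, 1.0]))
```

A local check like this does not replace the service's own validation; it only catches the obvious cases early.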

Calling the API

REST — features are a comma-separated query string and the body carries the image:

POST {endpoint}/computervision/imageanalysis:analyze?api-version=2024-02-01&features=caption,read,tags&gender-neutral-caption=true&language=en
Ocp-Apim-Subscription-Key: <key>
Content-Type: application/json

{ "url": "https://example.com/photo.jpg" }
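The same request can be assembled with Python's standard library, which makes the query-string shape easy to inspect. This is a minimal sketch: the endpoint host is a placeholder, the helper name build_analyze_request is made up for illustration, and the request is built but never sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_analyze_request(endpoint, key, image_url, features):
    """Build (but do not send) the POST request for imageanalysis:analyze."""
    query = urlencode({
        "api-version": "2024-02-01",
        "features": ",".join(features),      # comma-separated feature list
        "gender-neutral-caption": "true",
        "language": "en",
    }, safe=",")                             # keep the commas readable
    url = f"{endpoint.rstrip('/')}/computervision/imageanalysis:analyze?{query}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Ocp-Apim-Subscription-Key": key,
        "Content-Type": "application/json",
    }
    return Request(url, data=body, headers=headers, method="POST")


req = build_analyze_request(
    "https://example-resource.cognitiveservices.azure.com",  # hypothetical endpoint
    "<key>",
    "https://example.com/photo.jpg",
    ["caption", "read", "tags"],
)
print(req.get_method(), req.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out so the sketch runs without credentials.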

Python SDK — analyze_from_url (or analyze for bytes) takes a VisualFeatures list:

import os

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

# Assumed environment variable names; use whatever your deployment defines.
endpoint = os.environ["VISION_ENDPOINT"]
key = os.environ["VISION_KEY"]

client = ImageAnalysisClient(endpoint, AzureKeyCredential(key))
result = client.analyze_from_url(
    image_url="https://example.com/photo.jpg",
    visual_features=[VisualFeatures.CAPTION, VisualFeatures.TAGS,
                     VisualFeatures.OBJECTS, VisualFeatures.READ],
    gender_neutral_caption=True,
    language="en")

# Each requested feature comes back on its own attribute of the result.
print(result.caption.text, result.caption.confidence)
for t in result.tags.list:
    print(t.name, t.confidence)

Reading the JSON Response

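Each feature lands in its own object; missing features are simply absent. Watch the property names: the raw REST JSON uses camelCase keys such as captionResult and tagsResult.values, while the Python SDK exposes the same data as result.caption and result.tags.list. The exam routinely shows JSON and asks for the line that extracts a value.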

{
  "captionResult": { "text": "a person walking a dog in a park", "confidence": 0.8745 },
  "tagsResult": { "values": [ {"name":"outdoor","confidence":0.99} ] },
  "objectsResult": { "values": [ {"tags":[{"name":"dog","confidence":0.92}],
    "boundingBox":{"x":250,"y":300,"w":150,"h":120}} ] }
}
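Extracting values from that payload can be sketched in a few lines of Python. The function name summarize is hypothetical, and the readResult key is an assumption that follows the naming pattern of the keys shown above. Using .get with a default lets the code tolerate features that were not requested.

```python
import json

response_text = """
{
  "captionResult": {"text": "a person walking a dog in a park", "confidence": 0.8745},
  "tagsResult": {"values": [{"name": "outdoor", "confidence": 0.99}]},
  "objectsResult": {"values": [{"tags": [{"name": "dog", "confidence": 0.92}],
    "boundingBox": {"x": 250, "y": 300, "w": 150, "h": 120}}]}
}
"""


def summarize(payload):
    """Pull the commonly tested values out of an analyze response dict."""
    caption = payload.get("captionResult")            # None if not requested
    tags = payload.get("tagsResult", {}).get("values", [])
    objects = payload.get("objectsResult", {}).get("values", [])
    return {
        "caption": caption["text"] if caption else None,
        "tags": [t["name"] for t in tags],
        # Each object carries its own tag list plus a bounding box.
        "objects": [(o["tags"][0]["name"], o["boundingBox"]) for o in objects],
        "has_read": "readResult" in payload,          # assumed key name
    }


summary = summarize(json.loads(response_text))
print(summary["caption"])
print(summary["tags"], summary["objects"])
```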

Worked Example

A news site needs alt text plus 16:9 and 1:1 social thumbnails from one upload. Request the Caption and SmartCrops features (features=caption,smartCrops in REST), set smart_crops_aspect_ratios=[1.78, 1.0] (16:9 rounds to 1.78, inside the 0.75 to 1.8 range), and set gender_neutral_caption=True. Store caption.text as the alt text and each smart_crops.list[].bounding_box as a crop rectangle. One call covers both needs, with no separate cropping service.
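The arithmetic behind this scenario can be sketched as follows. The helper names and the sample crop values are hypothetical; the x/y/w/h keys mirror the bounding-box shape in the JSON response shown earlier, and the corner tuple is the (left, top, right, bottom) form that many imaging libraries expect for cropping.

```python
def ratio_from(width_units, height_units):
    """Convert a W:H aspect ratio into the decimal the API parameter takes."""
    return round(width_units / height_units, 2)


def to_corner_box(bbox):
    """Turn an x/y/w/h box into (left, top, right, bottom) pixel corners."""
    left, top = bbox["x"], bbox["y"]
    return (left, top, left + bbox["w"], top + bbox["h"])


ratios = [ratio_from(16, 9), ratio_from(1, 1)]
print(ratios)  # [1.78, 1.0]

# Hypothetical smart-crop results for a 1280x960 upload.
sample_crops = [
    {"aspectRatio": 1.78, "boundingBox": {"x": 0, "y": 120, "w": 1280, "h": 720}},
    {"aspectRatio": 1.0, "boundingBox": {"x": 280, "y": 0, "w": 720, "h": 720}},
]
crop_boxes = {c["aspectRatio"]: to_corner_box(c["boundingBox"]) for c in sample_crops}
print(crop_boxes)
```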

Tags, Objects, and People: Subtle Distinctions

Candidates lose points by conflating the three detection features, because their outputs overlap in everyday language but differ sharply in the response. Tags are image-level keywords with confidence and no spatial location — the model may tag an image "dog" even when the dog is tiny in a corner. Objects return the same class names but each comes with a bounding_box, so you know where the dog is. People is a specialized detector that returns only person bounding boxes with confidence and is the correct choice when you must count or locate humans specifically rather than enumerate every object class.

A common scenario asks how to count shoppers in a frame for analytics: the right feature is People, not Objects filtered to "person," because People is tuned for crowded scenes and partial occlusion.
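Counting shoppers from a People result might look like the sketch below. The peopleResult shape is assumed to follow the same values-list pattern as the other results, the sample detections are invented, and the 0.5 confidence cutoff is an illustrative choice rather than a documented default.

```python
def count_people(payload, min_confidence=0.5):
    """Count person detections at or above a confidence threshold."""
    people = payload.get("peopleResult", {}).get("values", [])
    return sum(1 for p in people if p["confidence"] >= min_confidence)


# Invented sample: two strong detections and one weak one.
sample = {
    "peopleResult": {
        "values": [
            {"boundingBox": {"x": 40, "y": 60, "w": 110, "h": 320}, "confidence": 0.94},
            {"boundingBox": {"x": 300, "y": 80, "w": 95, "h": 290}, "confidence": 0.81},
            {"boundingBox": {"x": 610, "y": 20, "w": 30, "h": 70}, "confidence": 0.12},
        ]
    }
}
print(count_people(sample))  # 2
```

Filtering on confidence matters here because every detection carries a score, and low-scoring boxes would otherwise inflate the count.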

Choosing Image Analysis vs. Other Vision Services

Image Analysis 4.0 is the prebuilt, no-training option. Reach for Custom Vision or custom Image Analysis models only when the standard tag and object taxonomy cannot name your domain concepts (for example, "cracked weld" or a specific SKU). Reach for Document Intelligence when the picture is really a document and you need fields rather than tags. The exam frequently frames this as a single-best-answer choice: if the requirement is "describe arbitrary photos and pull any visible text with zero training," the answer is Image Analysis 4.0 with Caption + Read, not a trained model.

Custom Image Analysis Models

Image Analysis 4.0 also trains custom classifiers and detectors using Florence transfer learning, so they need far less data than training from scratch.

Step	Detail
Dataset	Create in Vision Studio or via API; COCO-format annotations for detection
Minimum images	2 per label (absolute); 15+ recommended for reliable accuracy
Train	Service-managed; you set a training budget (hours)
Evaluate	Precision, recall, and mAP reported per label
Use	Pass `model_name="my-model"` to `analyze` instead of standard features

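# model_name selects the trained custom model; url is the image to score.
# Custom-model support varies by API and SDK version, so verify that yours accepts model_name.
result = client.analyze_from_url(image_url=url,
    visual_features=[VisualFeatures.TAGS], model_name="product-classifier")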
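The per-label precision and recall in the Evaluate step follow the standard definitions, sketched below with invented counts. The function name is hypothetical, and mAP is left out because it requires ranking detections across confidence thresholds.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) for one label from its confusion counts."""
    predicted = true_pos + false_pos      # everything the model labeled positive
    actual = true_pos + false_neg         # everything that really was positive
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


# Invented counts for a "cracked weld" label: 18 hits, 2 false alarms, 6 misses.
precision, recall = precision_recall(true_pos=18, false_pos=2, false_neg=6)
print(precision, recall)
```

Low recall on a label usually means it needs more training images, which is where the gap between the 2-image minimum and the 15+ recommendation shows up.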

On the Exam: Memorize which features are region-limited (Caption, DenseCaptions), the 20 MB / 16000px input ceiling, and that gender_neutral_caption is the Responsible AI knob. Custom-model questions hinge on the 2-image minimum and that a custom call replaces the visual-features list with model_name.

Test Your Knowledge

Which foundation model powers Azure AI Vision Image Analysis 4.0?

A. GPT-4 Vision
B. DALL-E 3
C. Florence
D. ResNet-50

Test Your Knowledge

A developer adds Caption to an Image Analysis call and starts receiving a feature-not-supported error, while Tags and Read still work. What is the most likely cause?

A. The image exceeds 20 MB
B. Caption requires a custom trained model
C. gender_neutral_caption must be set to true
D. The Vision resource is deployed in a region where Caption is not available

Test Your Knowledge

Which visual feature generates natural-language captions for multiple regions within a single image, each with its own bounding box?
