3.1 Azure AI Vision — Image Analysis 4.0

Key Takeaways

  • Image Analysis 4.0 delivers captioning, dense captions, tagging, object detection, people detection, smart crops, and OCR in a single Analyze call powered by the Florence foundation model.
  • Visual features are passed as a list (Caption, DenseCaptions, Tags, Objects, People, SmartCrops, Read); only requested features are billed and returned.
  • Caption and DenseCaptions are GA only in a limited set of Azure regions; requesting Caption from an unsupported region returns an error, a frequent exam scenario.
  • Input accepts a public URL or binary bytes up to 20 MB, between 50x50 and 16000x16000 pixels, in JPEG, PNG, GIF, BMP, WEBP, ICO, TIFF, or MPO.
  • Custom image classification and object detection models train on as few as 2-5 images per label using Florence transfer learning via Vision Studio or the REST API.
Last updated: June 2026

Quick Answer: Image Analysis 4.0 uses the Florence foundation model to return captioning, dense captions, tags, objects, people, smart crops, and OCR from a single analyze call. You request features with the features (REST) or visual_features (SDK) parameter and read each one from its own result object. Caption and DenseCaptions are limited to specific regions.

What the Analyze API Returns

Image Analysis 4.0 is exposed at the imageanalysis:analyze endpoint and replaces the legacy Computer Vision 3.x analyze and describe operations. A single round trip can return up to seven analyses, each charged as a separate transaction.

Visual feature | What it produces | Result property
Caption | One human-readable sentence for the whole image | caption.text, caption.confidence
DenseCaptions | Up to 10 region captions, each with a bounding box | dense_captions.list[]
Tags | Content tags with confidence (no boxes) | tags.list[]
Objects | Object class + bounding box | objects.list[].tags, .bounding_box
People | People bounding boxes + confidence | people.list[]
SmartCrops | Crop coordinates per requested aspect ratio | smart_crops.list[]
Read | Printed/handwritten text (OCR) | read.blocks[].lines[].words

Region and Input Limits (high-value trap)

  • Caption and DenseCaptions are GA only in a subset of regions (East US, West US, West Europe, France Central, Korea Central, North Europe, Southeast Asia, West US 2, East Asia, Switzerland North, Sweden Central, and a few more). Tags, Objects, People, SmartCrops, and Read run in all Vision regions. If a question shows a 400/feature-unsupported error after adding Caption, the fix is to deploy the resource in a supported region — not to change the SDK.
  • Image bytes must be 50x50 to 16000x16000 pixels and at most 20 MB.
  • Set gender_neutral_caption=True to replace gendered nouns ("man", "woman") with "person" — a Responsible AI default many teams require.
  • smart_crops_aspect_ratios accepts values from 0.75 to 1.8 (e.g., 0.9 for portrait thumbnails, 1.33 for 4:3). Omit it and the service picks the best single crop.
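
The limits above can be checked on the client before a transaction is spent. The following is a minimal sketch, assuming you already know the image's byte size and pixel dimensions. The helper names are hypothetical and not part of the Azure SDK, and treating 20 MB as 20 x 1024 x 1024 bytes is an assumption.

```python
# Hypothetical client-side checks; none of these names come from the Azure SDK.
MAX_BYTES = 20 * 1024 * 1024      # 20 MB ceiling (binary megabytes assumed)
MIN_SIDE, MAX_SIDE = 50, 16000    # allowed pixel range for each side
MIN_RATIO, MAX_RATIO = 0.75, 1.8  # allowed smart-crop aspect ratios


def check_image_limits(size_bytes, width, height):
    """Return a list of problems; an empty list means the image is within limits."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 20 MB")
    for name, side in (("width", width), ("height", height)):
        if not MIN_SIDE <= side <= MAX_SIDE:
            problems.append(f"{name} {side}px is outside {MIN_SIDE}-{MAX_SIDE}px")
    return problems


def check_aspect_ratios(ratios):
    """Return the requested ratios that fall outside the 0.75-1.8 range."""
    return [r for r in ratios if not MIN_RATIO <= r <= MAX_RATIO]
```

For example, check_aspect_ratios([0.9, 1.33, 2.0]) flags only 2.0, since 0.9 and 1.33 are both inside the documented range.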

Calling the API

REST — features are a comma-separated query string and the body carries the image:

POST {endpoint}/computervision/imageanalysis:analyze?api-version=2024-02-01&features=caption,read,tags&gender-neutral-caption=true&language=en
Ocp-Apim-Subscription-Key: <key>
Content-Type: application/json

{ "url": "https://example.com/photo.jpg" }
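
To make the query-string layout concrete, here is a small sketch that assembles the same URL with the Python standard library. The function name build_analyze_url and the example hostname are assumptions for illustration; the path, api-version, and parameter names are taken from the request above.

```python
from urllib.parse import urlencode


def build_analyze_url(endpoint, features, api_version="2024-02-01",
                      gender_neutral_caption=None, language=None):
    """Assemble the analyze URL; features is a list such as ["caption", "read"]."""
    params = {"api-version": api_version, "features": ",".join(features)}
    if gender_neutral_caption is not None:
        # The query string expects lowercase true/false
        params["gender-neutral-caption"] = str(gender_neutral_caption).lower()
    if language:
        params["language"] = language
    base = endpoint.rstrip("/") + "/computervision/imageanalysis:analyze"
    # safe="," keeps the comma-separated feature list unescaped
    return base + "?" + urlencode(params, safe=",")


# Hypothetical endpoint, shown only to illustrate the resulting URL
url = build_analyze_url("https://example.cognitiveservices.azure.com/",
                        ["caption", "read", "tags"],
                        gender_neutral_caption=True, language="en")
print(url)
```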

Python SDK: analyze_from_url (or analyze for bytes) takes a VisualFeatures list:

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(endpoint, AzureKeyCredential(key))
result = client.analyze_from_url(
    image_url="https://example.com/photo.jpg",
    visual_features=[VisualFeatures.CAPTION, VisualFeatures.TAGS,
                     VisualFeatures.OBJECTS, VisualFeatures.READ],
    gender_neutral_caption=True,
    language="en")
print(result.caption.text, result.caption.confidence)
for t in result.tags.list:
    print(t.name, t.confidence)

Reading the JSON Response

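Each requested feature lands in its own object, and features you did not request are simply absent. Watch the property names: the REST JSON uses captionResult, tagsResult, and objectsResult, each list held in a values array, while the Python SDK exposes the same data as result.caption, result.tags.list, and result.objects.list. The exam routinely shows JSON and asks for the line that extracts a value.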

{
  "captionResult": { "text": "a person walking a dog in a park", "confidence": 0.8745 },
  "tagsResult": { "values": [ {"name":"outdoor","confidence":0.99} ] },
  "objectsResult": { "values": [ {"tags":[{"name":"dog","confidence":0.92}],
    "boundingBox":{"x":250,"y":300,"w":150,"h":120}} ] }
}
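
As a sketch of the kind of extraction the exam asks about, the snippet below parses the sample response above with the standard json module. The variable names are illustrative, and checking for a readResult key is an assumption that follows the naming pattern of the other result objects.

```python
import json

# The sample response from above, embedded as a string for a self-contained example
response_text = """
{
  "captionResult": {"text": "a person walking a dog in a park", "confidence": 0.8745},
  "tagsResult": {"values": [{"name": "outdoor", "confidence": 0.99}]},
  "objectsResult": {"values": [{"tags": [{"name": "dog", "confidence": 0.92}],
    "boundingBox": {"x": 250, "y": 300, "w": 150, "h": 120}}]}
}
"""

data = json.loads(response_text)

# Features that were not requested are absent, so .get() with a default avoids KeyError
caption = data.get("captionResult", {}).get("text")
tag_names = [t["name"] for t in data.get("tagsResult", {}).get("values", [])]

# Each detected object carries its own tag list plus a bounding box
first_object = data["objectsResult"]["values"][0]
object_name = first_object["tags"][0]["name"]
box = first_object["boundingBox"]

# Read was not requested in this sample, so its key (assumed name) is missing
has_read = "readResult" in data

print(caption, tag_names, object_name, box, has_read)
```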

Worked Example

A news site needs alt text plus 16:9 and 1:1 social thumbnails from one upload. Request the Caption and SmartCrops features (features=caption,smartCrops in REST), set smart_crops_aspect_ratios=[1.78, 1.0] (16:9 is roughly 1.78) and gender_neutral_caption=True, then store caption.text as the alt text and each smart_crops.list[].bounding_box as a crop rectangle. One call covers both needs, with no separate cropping service.
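
The crop step of this example can be sketched as follows. The smart-crop payload is hypothetical data for a 1920x1080 upload, and its field names (aspectRatio, boundingBox, values) are an assumption modeled on the objectsResult shape shown earlier. The helper to_crop_box is illustrative; it produces the (left, top, right, bottom) tuple that imaging libraries such as Pillow accept for cropping.

```python
def to_crop_box(bounding_box):
    """Convert an {"x", "y", "w", "h"} box into (left, top, right, bottom) edges."""
    left, top = bounding_box["x"], bounding_box["y"]
    return (left, top, left + bounding_box["w"], top + bounding_box["h"])


# 16:9 expressed as the decimal ratio the service expects
ratio_16_9 = round(16 / 9, 2)

# Hypothetical smart-crop output for a 1920x1080 image (shape assumed, not confirmed)
smart_crops = {"values": [
    {"aspectRatio": 1.78, "boundingBox": {"x": 0, "y": 1, "w": 1920, "h": 1078}},
    {"aspectRatio": 1.0, "boundingBox": {"x": 420, "y": 0, "w": 1080, "h": 1080}},
]}

# Map each requested aspect ratio to a ready-to-use crop rectangle
crops = {c["aspectRatio"]: to_crop_box(c["boundingBox"])
         for c in smart_crops["values"]}
print(ratio_16_9, crops)
```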

Tags, Objects, and People: Subtle Distinctions

Candidates lose points by conflating the three detection features, because their outputs overlap in everyday language but differ sharply in the response. Tags are image-level keywords with confidence and no spatial location — the model may tag an image "dog" even when the dog is tiny in a corner. Objects return the same class names but each comes with a bounding_box, so you know where the dog is. People is a specialized detector that returns only person bounding boxes with confidence and is the correct choice when you must count or locate humans specifically rather than enumerate every object class.

A common scenario asks how to count shoppers in a frame for analytics: the right feature is People, not Objects filtered to "person," because People is tuned for crowded scenes and partial occlusion.
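
A minimal sketch of the shopper-counting scenario, assuming the People result arrives as a values array of boxes with a confidence score (a shape modeled on the other result objects in this section). The function name and the 0.5 threshold are hypothetical; tune the threshold for your own footage.

```python
def count_people(people_result, min_confidence=0.5):
    """Count person detections at or above a confidence threshold."""
    return sum(1 for person in people_result.get("values", [])
               if person["confidence"] >= min_confidence)


# Hypothetical detections: two confident boxes and one weak one
people_result = {"values": [
    {"boundingBox": {"x": 40, "y": 60, "w": 120, "h": 310}, "confidence": 0.95},
    {"boundingBox": {"x": 300, "y": 80, "w": 110, "h": 290}, "confidence": 0.81},
    {"boundingBox": {"x": 610, "y": 20, "w": 35, "h": 70}, "confidence": 0.12},
]}

shoppers = count_people(people_result)
print(shoppers)
```

Filtering on confidence matters here because a raw count of every returned box would include weak detections such as the third entry.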

Choosing Image Analysis vs. Other Vision Services

Image Analysis 4.0 is the prebuilt, no-training option. Reach for Custom Vision or custom Image Analysis models only when the standard tag and object taxonomy cannot name your domain concepts (for example, "cracked weld" or a specific SKU). Reach for Document Intelligence when the picture is really a document and you need fields rather than tags. The exam frequently frames this as a single-best-answer choice: if the requirement is "describe arbitrary photos and pull any visible text with zero training," the answer is Image Analysis 4.0 with Caption + Read, not a trained model.

Custom Image Analysis Models

Image Analysis 4.0 also trains custom classifiers and detectors using Florence transfer learning, so they need far less data than training from scratch.

Step | Detail
Dataset | Create in Vision Studio or via API; COCO-format annotations for detection
Minimum images | 2 per label (absolute); 15+ recommended for reliable accuracy
Train | Service-managed; you set a training budget (hours)
Evaluate | Precision, recall, and mAP reported per label
Use | Pass model_name="my-model" to analyze instead of standard features

result = client.analyze_from_url(image_url=url,
    visual_features=[VisualFeatures.TAGS], model_name="product-classifier")

On the Exam: Memorize which features are region-limited (Caption, DenseCaptions), the 20 MB / 16000px input ceiling, and that gender_neutral_caption is the Responsible AI knob. Custom-model questions hinge on the 2-image minimum and that a custom call replaces the visual-features list with model_name.

Test Your Knowledge

Which foundation model powers Azure AI Vision Image Analysis 4.0?


A developer adds Caption to an Image Analysis call and starts receiving a feature-not-supported error, while Tags and Read still work. What is the most likely cause?


Which visual feature generates natural-language captions for multiple regions within a single image, each with its own bounding box?


What is the minimum number of labeled images required per tag to train a custom Image Analysis 4.0 model?
