3.1 Azure AI Vision — Image Analysis 4.0
Key Takeaways
- Image Analysis 4.0 delivers captioning, dense captions, tagging, object detection, people detection, smart crops, and OCR in a single Analyze call powered by the Florence foundation model.
- Visual features are passed as a list (Caption, DenseCaptions, Tags, Objects, People, SmartCrops, Read); only requested features are billed and returned.
- Caption and DenseCaptions are GA only in a limited set of Azure regions; requesting Caption from an unsupported region returns an error, a frequent exam scenario.
- Input accepts a public URL or binary bytes up to 20 MB, between 50x50 and 16000x16000 pixels, in JPEG, PNG, GIF, BMP, WEBP, ICO, TIFF, or MPO.
- Custom image classification and object detection models train on as few as 2-5 images per label using Florence transfer learning via Vision Studio or the REST API.
Quick Answer: Image Analysis 4.0 uses the Florence foundation model to return captioning, dense captions, tags, objects, people, smart crops, and OCR from a single analyze call. You request features with the features (REST) or visual_features (SDK) parameter and read each one from its own result object. Caption and DenseCaptions are limited to specific regions.
What the Analyze API Returns
Image Analysis 4.0 is exposed at the imageanalysis:analyze endpoint and replaces the legacy Computer Vision 3.x analyze and describe operations. A single round trip can return up to seven analyses, each charged as a separate transaction.
| Visual feature | What it produces | Result property |
|---|---|---|
| Caption | One human-readable sentence for the whole image | caption.text, caption.confidence |
| DenseCaptions | Up to 10 region captions, each with a bounding box | dense_captions.list[] |
| Tags | Content tags with confidence (no boxes) | tags.list[] |
| Objects | Object class + bounding box | objects.list[].tags, .bounding_box |
| People | People bounding boxes + confidence | people.list[] |
| SmartCrops | Crop coordinates per requested aspect ratio | smart_crops.list[] |
| Read | Printed/handwritten text (OCR) | read.blocks[].lines[].words |
Region and Input Limits (high-value trap)
- Caption and DenseCaptions are GA only in a subset of regions (East US, West US, West Europe, France Central, Korea Central, North Europe, Southeast Asia, West US 2, East Asia, Switzerland North, Sweden Central, and a few more). Tags, Objects, People, SmartCrops, and Read run in all Vision regions. If a question shows a 400/feature-unsupported error after adding Caption, the fix is to deploy the resource in a supported region — not to change the SDK.
- Image bytes must be 50x50 to 16000x16000 pixels and at most 20 MB.
- Set gender_neutral_caption=True to replace gendered nouns ("man", "woman") with "person", a Responsible AI default many teams require.
- smart_crops_aspect_ratios accepts values from 0.75 to 1.8 (e.g., 0.9 for portrait thumbnails, 1.33 for 4:3). Omit it and the service picks the best single crop.
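The limits above are easy to check before an image is ever sent. The following is a minimal sketch, not part of any SDK: check_image_input is a hypothetical helper, and reading "20 MB" as 20 × 1024 × 1024 bytes is an assumption.

```python
# Limits taken from the bullets above; the byte interpretation of "20 MB" is an assumption.
MAX_BYTES = 20 * 1024 * 1024
MIN_SIDE, MAX_SIDE = 50, 16000      # allowed pixel range for each side
MIN_RATIO, MAX_RATIO = 0.75, 1.8    # allowed smart-crop aspect ratios


def check_image_input(width, height, size_bytes, aspect_ratios=()):
    """Return a list of problems; an empty list means the input is within the limits."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes} bytes; limit is {MAX_BYTES}")
    for name, side in (("width", width), ("height", height)):
        if not MIN_SIDE <= side <= MAX_SIDE:
            problems.append(f"{name} {side}px is outside {MIN_SIDE}-{MAX_SIDE}px")
    for ratio in aspect_ratios:
        if not MIN_RATIO <= ratio <= MAX_RATIO:
            problems.append(f"aspect ratio {ratio} is outside {MIN_RATIO}-{MAX_RATIO}")
    return problems


# A 1920x1080, ~2 MB image with 16:9 and 1:1 crops passes every check.
print(check_image_input(1920, 1080, 2_000_000, [1.78, 1.0]))
```

Validating locally does not replace the service's own checks; it only avoids a round trip for inputs that are certain to be rejected.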
Calling the API
REST — features are a comma-separated query string and the body carries the image:
POST {endpoint}/computervision/imageanalysis:analyze?api-version=2024-02-01&features=caption,read,tags&gender-neutral-caption=true&language=en
Ocp-Apim-Subscription-Key: <key>
Content-Type: application/json
{ "url": "https://example.com/photo.jpg" }
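The request line above can also be assembled programmatically. This is a small sketch using only the standard library; build_analyze_url is a hypothetical helper and the endpoint host shown is a placeholder, not a real resource.

```python
from urllib.parse import urlencode


def build_analyze_url(endpoint, features, gender_neutral=False, language="en",
                      api_version="2024-02-01"):
    """Build the imageanalysis:analyze URL with features as a comma-separated list."""
    params = {
        "api-version": api_version,
        "features": ",".join(features),
        "language": language,
    }
    if gender_neutral:
        params["gender-neutral-caption"] = "true"
    base = endpoint.rstrip("/") + "/computervision/imageanalysis:analyze"
    # safe="," keeps the commas in the features list readable instead of %2C.
    return base + "?" + urlencode(params, safe=",")


# Placeholder endpoint for illustration only.
url = build_analyze_url("https://my-vision.cognitiveservices.azure.com/",
                        ["caption", "read", "tags"], gender_neutral=True)
print(url)
```

The key still travels in the Ocp-Apim-Subscription-Key header and the image URL in the JSON body, exactly as in the raw request.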
Python SDK — analyze_from_url (or analyze for bytes) takes a VisualFeatures list:
import os

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

# Endpoint and key are read from environment variables (the names are illustrative).
endpoint = os.environ["VISION_ENDPOINT"]
key = os.environ["VISION_KEY"]
client = ImageAnalysisClient(endpoint, AzureKeyCredential(key))

# Request only the features you need; each comes back in its own result object.
result = client.analyze_from_url(
    image_url="https://example.com/photo.jpg",
    visual_features=[VisualFeatures.CAPTION, VisualFeatures.TAGS,
                     VisualFeatures.OBJECTS, VisualFeatures.READ],
    gender_neutral_caption=True,
    language="en",
)

print(result.caption.text, result.caption.confidence)
for t in result.tags.list:
    print(t.name, t.confidence)
Reading the JSON Response
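Each feature lands in its own object; missing features are simply absent. Watch the property names: the raw REST JSON uses captionResult, tagsResult.values, and objectsResult.values, while the Python SDK exposes the same data as caption, tags.list, and objects.list. The exam routinely shows JSON and asks for the line that extracts a value.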
{
  "captionResult": { "text": "a person walking a dog in a park", "confidence": 0.8745 },
  "tagsResult": { "values": [ { "name": "outdoor", "confidence": 0.99 } ] },
  "objectsResult": { "values": [ {
    "tags": [ { "name": "dog", "confidence": 0.92 } ],
    "boundingBox": { "x": 250, "y": 300, "w": 150, "h": 120 }
  } ] }
}
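To practice the "which line extracts the value" style of question, the response above can be walked with plain dictionary access. This is an illustrative sketch over the sample JSON only; summarize is a hypothetical helper, and .get() is used so that features that were not requested come back as None or an empty list.

```python
import json

# The sample response shown above, as raw JSON text.
sample = """
{
  "captionResult": {"text": "a person walking a dog in a park", "confidence": 0.8745},
  "tagsResult": {"values": [{"name": "outdoor", "confidence": 0.99}]},
  "objectsResult": {"values": [{"tags": [{"name": "dog", "confidence": 0.92}],
                                "boundingBox": {"x": 250, "y": 300, "w": 150, "h": 120}}]}
}
"""


def summarize(response):
    """Pull the caption, tag names, and labelled object boxes out of a response dict."""
    caption = response.get("captionResult", {}).get("text")
    tags = [t["name"] for t in response.get("tagsResult", {}).get("values", [])]
    objects = []
    for obj in response.get("objectsResult", {}).get("values", []):
        box = obj["boundingBox"]
        # An object carries a list of candidate labels; keep the most confident one.
        best = max(obj["tags"], key=lambda t: t["confidence"])
        objects.append((best["name"], box["x"], box["y"], box["w"], box["h"]))
    return {"caption": caption, "tags": tags, "objects": objects}


summary = summarize(json.loads(sample))
print(summary["caption"])
print(summary["tags"])
print(summary["objects"])
```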
Worked Example
A news site needs alt text plus 16:9 and 1:1 social thumbnails from one upload. Request features=caption,smartCrops, set smart_crops_aspect_ratios=[1.78, 1.0] (16:9 rounds to 1.78) and gender_neutral_caption=True, then store caption.text as alt text and each smart_crops.list[].bounding_box as a crop rectangle. One call, no separate cropping service.
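The arithmetic in this scenario can be sketched as follows. The crop values are made up for illustration, and the dictionary shape (aspectRatio plus an x/y/w/h boundingBox) is an assumption modelled on the objects JSON shown earlier; aspect_ratio and to_crop_box are hypothetical helpers.

```python
def aspect_ratio(width_units, height_units):
    """Convert a W:H ratio such as 16:9 into the decimal the API expects."""
    return round(width_units / height_units, 2)


def to_crop_box(bounding_box):
    """Turn an x/y/w/h box into a (left, top, right, bottom) tuple,
    the form most image libraries use for cropping."""
    left, top = bounding_box["x"], bounding_box["y"]
    return (left, top, left + bounding_box["w"], top + bounding_box["h"])


ratios = [aspect_ratio(16, 9), aspect_ratio(1, 1)]

# Made-up smart-crop results for a 1280x840 upload (illustrative values only).
crops = [
    {"aspectRatio": 1.78, "boundingBox": {"x": 0, "y": 60, "w": 1280, "h": 720}},
    {"aspectRatio": 1.0, "boundingBox": {"x": 280, "y": 0, "w": 840, "h": 840}},
]
boxes = {c["aspectRatio"]: to_crop_box(c["boundingBox"]) for c in crops}

print(ratios)
print(boxes)
```

Both ratios fall inside the 0.75 to 1.8 range, so the request is valid as specified.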
Tags, Objects, and People: Subtle Distinctions
Candidates lose points by conflating the three detection features, because their outputs overlap in everyday language but differ sharply in the response. Tags are image-level keywords with confidence and no spatial location — the model may tag an image "dog" even when the dog is tiny in a corner. Objects return the same class names but each comes with a bounding_box, so you know where the dog is. People is a specialized detector that returns only person bounding boxes with confidence and is the correct choice when you must count or locate humans specifically rather than enumerate every object class.
A common scenario asks how to count shoppers in a frame for analytics: the right feature is People, not Objects filtered to "person," because People is tuned for crowded scenes and partial occlusion.
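Because People returns a confidence with every box, a headcount usually means filtering by a threshold first. The sketch below uses made-up detections; the field names mirror the JSON shape shown earlier and the 0.5 cutoff is an assumption to tune per deployment, not a documented default.

```python
def count_people(detections, min_confidence=0.5):
    """Count person detections at or above the confidence threshold."""
    return sum(1 for d in detections if d["confidence"] >= min_confidence)


# Made-up People results: two clear detections and one low-confidence box.
detections = [
    {"boundingBox": {"x": 40, "y": 80, "w": 120, "h": 310}, "confidence": 0.94},
    {"boundingBox": {"x": 300, "y": 95, "w": 110, "h": 290}, "confidence": 0.81},
    {"boundingBox": {"x": 610, "y": 20, "w": 30, "h": 60}, "confidence": 0.12},
]

shoppers = count_people(detections)
print(shoppers)
```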
Choosing Image Analysis vs. Other Vision Services
Image Analysis 4.0 is the prebuilt, no-training option. Reach for Custom Vision or custom Image Analysis models only when the standard tag and object taxonomy cannot name your domain concepts (for example, "cracked weld" or a specific SKU). Reach for Document Intelligence when the picture is really a document and you need fields rather than tags. The exam frequently frames this as a single-best-answer choice: if the requirement is "describe arbitrary photos and pull any visible text with zero training," the answer is Image Analysis 4.0 with Caption + Read, not a trained model.
Custom Image Analysis Models
Image Analysis 4.0 also trains custom classifiers and detectors using Florence transfer learning, so they need far less data than training from scratch.
| Step | Detail |
|---|---|
| Dataset | Create in Vision Studio or via API; COCO-format annotations for detection |
| Minimum images | 2 per label (absolute); 15+ recommended for reliable accuracy |
| Train | Service-managed; you set a training budget (hours) |
| Evaluate | Precision, recall, and mAP reported per label |
| Use | Pass model_name="my-model" to analyze instead of standard features |
result = client.analyze_from_url(
    image_url=url,
    visual_features=[VisualFeatures.TAGS],
    model_name="product-classifier",
)
On the Exam: Memorize which features are region-limited (Caption, DenseCaptions), the 20 MB / 16000px input ceiling, and that gender_neutral_caption is the Responsible AI knob. Custom-model questions hinge on the 2-image minimum and that a custom call replaces the visual-features list with model_name.
Which foundation model powers Azure AI Vision Image Analysis 4.0?
A developer adds Caption to an Image Analysis call and starts receiving a feature-not-supported error, while Tags and Read still work. What is the most likely cause?
Which visual feature generates natural-language captions for multiple regions within a single image, each with its own bounding box?
What is the minimum number of labeled images required per tag to train a custom Image Analysis 4.0 model?