3.1 Azure AI Vision — Image Analysis 4.0

Key Takeaways

  • Image Analysis 4.0 is the latest version of the Image Analysis API, providing image captioning, tagging, object detection, smart cropping, people detection, and OCR in a single Analyze call.
  • The API uses a multimodal Florence foundation model that understands both images and text, providing richer and more accurate results than previous versions.
  • Visual features are specified as parameters in the API call: Caption, DenseCaptions, Tags, Objects, People, SmartCrops, Read (OCR).
  • Image Analysis 4.0 supports both URL-based and binary image input, with images up to 20 MB in size.
  • Custom models can be trained using Image Analysis 4.0 with your own labeled data for domain-specific classification and object detection.
Last updated: March 2026


Quick Answer: Image Analysis 4.0 uses the Florence foundation model to provide captioning, dense captions, tags, object detection, people detection, smart crops, and OCR in a single API call. Specify visual features as parameters to control which analyses are performed.

Image Analysis 4.0 Visual Features

| Feature | Description | Output |
| --- | --- | --- |
| Caption | Generate a single natural-language description of the image | "A person walking a dog in a park" |
| DenseCaptions | Generate captions for multiple regions within the image | Multiple captions with bounding boxes |
| Tags | Identify content tags with confidence scores | ["outdoor", "dog", "person", "park", "grass"] |
| Objects | Detect objects with bounding boxes and labels | Object name + bounding box coordinates |
| People | Detect people with bounding boxes | Person bounding box + confidence score |
| SmartCrops | Suggest crop regions for different aspect ratios | Crop coordinates for specified aspect ratios |
| Read | Extract printed and handwritten text (OCR) | Text lines with bounding polygons |

Using the Image Analysis SDK (Python)

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://my-vision.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>")
)

# Analyze an image with multiple features
result = client.analyze_from_url(
    image_url="https://example.com/photo.jpg",
    visual_features=[
        VisualFeatures.CAPTION,
        VisualFeatures.DENSE_CAPTIONS,
        VisualFeatures.TAGS,
        VisualFeatures.OBJECTS,
        VisualFeatures.PEOPLE,
        VisualFeatures.SMART_CROPS,
        VisualFeatures.READ
    ],
    gender_neutral_caption=True,  # Use gender-neutral language
    smart_crops_aspect_ratios=[0.9, 1.33],  # 9:10 and 4:3 aspect ratios
    language="en"
)

# Access caption
print(f"Caption: {result.caption.text}")
print(f"Confidence: {result.caption.confidence:.2f}")

# Access tags
for tag in result.tags.list:
    print(f"Tag: {tag.name} (confidence: {tag.confidence:.2f})")

# Access detected objects
for obj in result.objects.list:
    print(f"Object: {obj.tags[0].name}")
    print(f"  Bounding box: {obj.bounding_box}")

# Access OCR text
if result.read:
    for block in result.read.blocks:
        for line in block.lines:
            print(f"Text: {line.text}")
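
The remaining features follow the same access pattern. The sketch below mocks the result object with SimpleNamespace (attribute names assumed to mirror the SDK's DenseCaptionsResult, PeopleResult, and SmartCropsResult shapes) so the loops run without a service call:

```python
from types import SimpleNamespace

# Mock of the analysis result (assumed attribute names: dense_captions.list,
# people.list, smart_crops.list) so the access patterns run offline.
result = SimpleNamespace(
    dense_captions=SimpleNamespace(list=[
        SimpleNamespace(text="a dog on grass", confidence=0.91,
                        bounding_box=SimpleNamespace(x=250, y=300, w=150, h=120)),
    ]),
    people=SimpleNamespace(list=[
        SimpleNamespace(confidence=0.95,
                        bounding_box=SimpleNamespace(x=100, y=50, w=200, h=400)),
    ]),
    smart_crops=SimpleNamespace(list=[
        SimpleNamespace(aspect_ratio=1.33,
                        bounding_box=SimpleNamespace(x=0, y=40, w=640, h=480)),
    ]),
)

# Dense captions: one caption per detected region
for dc in result.dense_captions.list:
    print(f"Dense caption: {dc.text} (confidence: {dc.confidence:.2f})")

# People: bounding box + confidence only (no identity information)
for person in result.people.list:
    print(f"Person at {person.bounding_box} (confidence: {person.confidence:.2f})")

# Smart crops: one suggested crop per requested aspect ratio
for crop in result.smart_crops.list:
    print(f"Crop for aspect ratio {crop.aspect_ratio}: {crop.bounding_box}")
```

With a real client call, `result` comes back from `analyze_from_url` and these loops are unchanged.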

Using the REST API

POST https://my-vision.cognitiveservices.azure.com/computervision/imageanalysis:analyze?api-version=2024-02-01&features=caption,tags,objects

Headers:
    Ocp-Apim-Subscription-Key: <your-key>
    Content-Type: application/json

Body:
{
    "url": "https://example.com/photo.jpg"
}
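
The same endpoint also accepts binary input: send the raw image bytes with Content-Type: application/octet-stream instead of a JSON body. A minimal sketch (the build_analyze_request helper is hypothetical, and the commented-out requests call is illustrative only):

```python
ENDPOINT = "https://my-vision.cognitiveservices.azure.com"  # placeholder resource
KEY = "<your-key>"

def build_analyze_request(features):
    """Build the analyze URL and headers for binary (octet-stream) input."""
    url = (f"{ENDPOINT}/computervision/imageanalysis:analyze"
           f"?api-version=2024-02-01&features={','.join(features)}")
    headers = {
        "Ocp-Apim-Subscription-Key": KEY,
        # Binary input: raw bytes in the body rather than a JSON {"url": ...}
        "Content-Type": "application/octet-stream",
    }
    return url, headers

url, headers = build_analyze_request(["caption", "tags", "objects"])
print(url)

# Illustrative call with the requests library (not executed here):
# with open("photo.jpg", "rb") as f:
#     response = requests.post(url, headers=headers, data=f.read())
```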

Image Analysis Response Structure

{
    "captionResult": {
        "text": "A person walking a dog in a sunny park",
        "confidence": 0.8745
    },
    "tagsResult": {
        "values": [
            {"name": "outdoor", "confidence": 0.9912},
            {"name": "person", "confidence": 0.9834},
            {"name": "dog", "confidence": 0.9756},
            {"name": "park", "confidence": 0.8921}
        ]
    },
    "objectsResult": {
        "values": [
            {
                "tags": [{"name": "person", "confidence": 0.95}],
                "boundingBox": {"x": 100, "y": 50, "w": 200, "h": 400}
            },
            {
                "tags": [{"name": "dog", "confidence": 0.92}],
                "boundingBox": {"x": 250, "y": 300, "w": 150, "h": 120}
            }
        ]
    }
}
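
As a worked example of navigating this structure, the snippet below loads an abbreviated copy of the sample response as a Python dict and extracts the caption, the highest-confidence tag, and the first object's bounding box:

```python
# Abbreviated copy of the sample response above, limited to the fields used below
response = {
    "captionResult": {"text": "A person walking a dog in a sunny park",
                      "confidence": 0.8745},
    "tagsResult": {"values": [
        {"name": "outdoor", "confidence": 0.9912},
        {"name": "person", "confidence": 0.9834},
    ]},
    "objectsResult": {"values": [
        {"tags": [{"name": "person", "confidence": 0.95}],
         "boundingBox": {"x": 100, "y": 50, "w": 200, "h": 400}},
    ]},
}

# Each feature's result lives under its own top-level key
caption = response["captionResult"]["text"]
top_tag = max(response["tagsResult"]["values"],
              key=lambda t: t["confidence"])["name"]
first_box = response["objectsResult"]["values"][0]["boundingBox"]

print(caption)
print(top_tag)
print(first_box["w"], first_box["h"])
```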

On the Exam: Know the parameter names for visual features (Caption, DenseCaptions, Tags, Objects, People, SmartCrops, Read) and the response structure. Questions may show a JSON response and ask you to identify the correct code to extract specific values.

Custom Image Analysis Models

Image Analysis 4.0 supports training custom models for domain-specific tasks:

Custom Classification

  • Train a model to classify images into your custom categories
  • Provide labeled images for each category (minimum 2 images per class, recommended 15+)
  • Model training uses transfer learning from the Florence foundation model

Custom Object Detection

  • Train a model to detect and locate your custom objects
  • Provide labeled images with bounding box annotations
  • Outputs object locations with confidence scores

Training Workflow

  1. Create a training dataset in Azure AI Vision Studio or via API
  2. Upload and label images (classification labels or bounding boxes)
  3. Train the model (training is managed by the service)
  4. Evaluate model performance (precision, recall, mAP)
  5. Publish the model to an endpoint
  6. Call the endpoint with the custom model name
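
The metrics in step 4 can be illustrated with a toy calculation (hypothetical counts; in practice the service computes these for you during evaluation):

```python
# Toy precision/recall calculation, not the service's implementation.
# Hypothetical counts from a validation set:
tp, fp, fn = 18, 2, 4  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # fraction of predictions that were correct
recall = tp / (tp + fn)     # fraction of ground-truth items that were found

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For object detection, mAP extends this idea by averaging precision over recall levels and over classes.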

Using a Custom Model

result = client.analyze_from_url(
    image_url="https://example.com/product.jpg",
    visual_features=[VisualFeatures.TAGS],
    model_name="my-custom-product-classifier"
)

Test Your Knowledge

  1. Which foundation model powers Azure AI Vision Image Analysis 4.0?
  2. Which visual feature generates natural-language captions for multiple regions within an image?
  3. In the Image Analysis 4.0 response, where is the OCR text content found?
  4. What is the minimum number of labeled images required per class to train a custom Image Analysis classification model?