3.1 Azure AI Vision — Image Analysis 4.0

Key Takeaways

  • Image Analysis 4.0 is the latest version of the Image Analysis API, providing image captioning, tagging, object detection, smart cropping, people detection, and OCR in a single Analyze call.
  • The API uses a multimodal Florence foundation model that understands both images and text, providing richer and more accurate results than previous versions.
  • Visual features are specified as parameters in the API call: Caption, DenseCaptions, Tags, Objects, People, SmartCrops, Read (OCR).
  • Image Analysis 4.0 supports both URL-based and binary image input, with images up to 20 MB in size.
  • Custom models can be trained using Image Analysis 4.0 with your own labeled data for domain-specific classification and object detection.
Last updated: March 2026


Quick Answer: Image Analysis 4.0 uses the Florence foundation model to provide captioning, dense captions, tags, object detection, people detection, smart crops, and OCR in a single API call. Specify visual features as parameters to control which analyses are performed.

Image Analysis 4.0 Visual Features

| Feature | Description | Output |
| --- | --- | --- |
| Caption | Generate a single natural-language description of the image | "A person walking a dog in a park" |
| DenseCaptions | Generate captions for multiple regions within the image | Multiple captions with bounding boxes |
| Tags | Identify content tags with confidence scores | ["outdoor", "dog", "person", "park", "grass"] |
| Objects | Detect objects with bounding boxes and labels | Object name + bounding box coordinates |
| People | Detect people with bounding boxes | Person bounding box + confidence score |
| SmartCrops | Suggest crop regions for different aspect ratios | Crop coordinates for specified aspect ratios |
| Read | Extract printed and handwritten text (OCR) | Text lines with bounding polygons |

Using the Image Analysis SDK (Python)

from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://my-vision.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>")
)

# Analyze an image with multiple features
result = client.analyze_from_url(
    image_url="https://example.com/photo.jpg",
    visual_features=[
        VisualFeatures.CAPTION,
        VisualFeatures.DENSE_CAPTIONS,
        VisualFeatures.TAGS,
        VisualFeatures.OBJECTS,
        VisualFeatures.PEOPLE,
        VisualFeatures.SMART_CROPS,
        VisualFeatures.READ
    ],
    gender_neutral_caption=True,  # Use gender-neutral language
    smart_crops_aspect_ratios=[0.9, 1.33],  # 9:10 and 4:3 aspect ratios
    language="en"
)

# Access caption
print(f"Caption: {result.caption.text}")
print(f"Confidence: {result.caption.confidence:.2f}")

# Access tags
for tag in result.tags.list:
    print(f"Tag: {tag.name} (confidence: {tag.confidence:.2f})")

# Access detected objects
for obj in result.objects.list:
    print(f"Object: {obj.tags[0].name}")
    print(f"  Bounding box: {obj.bounding_box}")

# Access OCR text
if result.read:
    for block in result.read.blocks:
        for line in block.lines:
            print(f"Text: {line.text}")
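
The remaining features follow the same access pattern. The sketch below mocks the result object with SimpleNamespace (attribute names assumed to mirror the SDK's DenseCaptionsResult, PeopleResult, and SmartCropsResult shapes) so the loops run without a service call:

```python
from types import SimpleNamespace

# Mock of the analysis result (assumed attribute names: dense_captions.list,
# people.list, smart_crops.list) so the access patterns run offline.
result = SimpleNamespace(
    dense_captions=SimpleNamespace(list=[
        SimpleNamespace(text="a dog on grass", confidence=0.91,
                        bounding_box=SimpleNamespace(x=250, y=300, w=150, h=120)),
    ]),
    people=SimpleNamespace(list=[
        SimpleNamespace(confidence=0.95,
                        bounding_box=SimpleNamespace(x=100, y=50, w=200, h=400)),
    ]),
    smart_crops=SimpleNamespace(list=[
        SimpleNamespace(aspect_ratio=1.33,
                        bounding_box=SimpleNamespace(x=0, y=40, w=640, h=480)),
    ]),
)

# Dense captions: one caption per detected region
for dc in result.dense_captions.list:
    print(f"Dense caption: {dc.text} (confidence: {dc.confidence:.2f})")

# People: bounding box + confidence only (no identity information)
for person in result.people.list:
    print(f"Person at {person.bounding_box} (confidence: {person.confidence:.2f})")

# Smart crops: one suggested crop per requested aspect ratio
for crop in result.smart_crops.list:
    print(f"Crop for aspect ratio {crop.aspect_ratio}: {crop.bounding_box}")
```

With a real client call, `result` comes back from `analyze_from_url` and these loops are unchanged.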

Using the REST API

POST https://my-vision.cognitiveservices.azure.com/computervision/imageanalysis:analyze?api-version=2024-02-01&features=caption,tags,objects

Headers:
    Ocp-Apim-Subscription-Key: <your-key>
    Content-Type: application/json

Body:
{
    "url": "https://example.com/photo.jpg"
}
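
The same endpoint also accepts binary input: send the raw image bytes with Content-Type: application/octet-stream instead of a JSON body. A minimal sketch (the build_analyze_request helper is hypothetical, and the commented-out requests call is illustrative only):

```python
ENDPOINT = "https://my-vision.cognitiveservices.azure.com"  # placeholder resource
KEY = "<your-key>"

def build_analyze_request(features):
    """Build the analyze URL and headers for binary (octet-stream) input."""
    url = (f"{ENDPOINT}/computervision/imageanalysis:analyze"
           f"?api-version=2024-02-01&features={','.join(features)}")
    headers = {
        "Ocp-Apim-Subscription-Key": KEY,
        # Binary input: raw bytes in the body rather than a JSON {"url": ...}
        "Content-Type": "application/octet-stream",
    }
    return url, headers

url, headers = build_analyze_request(["caption", "tags", "objects"])
print(url)

# Illustrative call with the requests library (not executed here):
# with open("photo.jpg", "rb") as f:
#     response = requests.post(url, headers=headers, data=f.read())
```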

Image Analysis Response Structure

{
    "captionResult": {
        "text": "A person walking a dog in a sunny park",
        "confidence": 0.8745
    },
    "tagsResult": {
        "values": [
            {"name": "outdoor", "confidence": 0.9912},
            {"name": "person", "confidence": 0.9834},
            {"name": "dog", "confidence": 0.9756},
            {"name": "park", "confidence": 0.8921}
        ]
    },
    "objectsResult": {
        "values": [
            {
                "tags": [{"name": "person", "confidence": 0.95}],
                "boundingBox": {"x": 100, "y": 50, "w": 200, "h": 400}
            },
            {
                "tags": [{"name": "dog", "confidence": 0.92}],
                "boundingBox": {"x": 250, "y": 300, "w": 150, "h": 120}
            }
        ]
    }
}
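
As a worked example of navigating this structure, the snippet below loads an abbreviated copy of the sample response as a Python dict and extracts the caption, the highest-confidence tag, and the first object's bounding box:

```python
# Abbreviated copy of the sample response above, limited to the fields used below
response = {
    "captionResult": {"text": "A person walking a dog in a sunny park",
                      "confidence": 0.8745},
    "tagsResult": {"values": [
        {"name": "outdoor", "confidence": 0.9912},
        {"name": "person", "confidence": 0.9834},
    ]},
    "objectsResult": {"values": [
        {"tags": [{"name": "person", "confidence": 0.95}],
         "boundingBox": {"x": 100, "y": 50, "w": 200, "h": 400}},
    ]},
}

# Each feature's result lives under its own top-level key
caption = response["captionResult"]["text"]
top_tag = max(response["tagsResult"]["values"],
              key=lambda t: t["confidence"])["name"]
first_box = response["objectsResult"]["values"][0]["boundingBox"]

print(caption)
print(top_tag)
print(first_box["w"], first_box["h"])
```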

On the Exam: Know the parameter names for visual features (Caption, DenseCaptions, Tags, Objects, People, SmartCrops, Read) and the response structure. Questions may show a JSON response and ask you to identify the correct code to extract specific values.

Custom Image Analysis Models

Image Analysis 4.0 supports training custom models for domain-specific tasks:

Custom Classification

  • Train a model to classify images into your custom categories
  • Provide labeled images for each category (minimum 2 images per class, recommended 15+)
  • Model training uses transfer learning from the Florence foundation model

Custom Object Detection

  • Train a model to detect and locate your custom objects
  • Provide labeled images with bounding box annotations
  • Outputs object locations with confidence scores

Training Workflow

  1. Create a training dataset in Azure AI Vision Studio or via API
  2. Upload and label images (classification labels or bounding boxes)
  3. Train the model (training is managed by the service)
  4. Evaluate model performance (precision, recall, mAP)
  5. Publish the model to an endpoint
  6. Call the endpoint with the custom model name
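
The metrics in step 4 can be illustrated with a toy calculation (hypothetical counts; in practice the service computes these for you during evaluation):

```python
# Toy precision/recall calculation, not the service's implementation.
# Hypothetical counts from a validation set:
tp, fp, fn = 18, 2, 4  # true positives, false positives, false negatives

precision = tp / (tp + fp)  # fraction of predictions that were correct
recall = tp / (tp + fn)     # fraction of ground-truth items that were found

print(f"precision={precision:.2f}, recall={recall:.2f}")
```

For object detection, mAP extends this idea by averaging precision over recall levels and over classes.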

Using a Custom Model

result = client.analyze_from_url(
    image_url="https://example.com/product.jpg",
    visual_features=[VisualFeatures.TAGS],
    model_name="my-custom-product-classifier"
)

Test Your Knowledge

  1. Which foundation model powers Azure AI Vision Image Analysis 4.0?
  2. Which visual feature generates natural-language captions for multiple regions within an image?
  3. In the Image Analysis 4.0 response, where is the OCR text content found?
  4. What is the minimum number of labeled images required per class to train a custom Image Analysis classification model?