3.1 Azure AI Vision — Image Analysis 4.0
Key Takeaways
- Image Analysis 4.0 is the latest API that provides image captioning, tagging, object detection, smart cropping, people detection, and OCR in a single Analyze call.
- The API uses a multimodal Florence foundation model that understands both images and text, providing richer and more accurate results than previous versions.
- Visual features are specified as parameters in the API call: Caption, DenseCaptions, Tags, Objects, People, SmartCrops, Read (OCR).
- Image Analysis 4.0 supports both URL-based and binary image input, with images up to 20 MB in size.
- Custom models can be trained using Image Analysis 4.0 with your own labeled data for domain-specific classification and object detection.
Azure AI Vision — Image Analysis 4.0
Quick Answer: Image Analysis 4.0 uses the Florence foundation model to provide captioning, dense captions, tags, object detection, people detection, smart crops, and OCR in a single API call. Specify visual features as parameters to control which analyses are performed.
Image Analysis 4.0 Visual Features
| Feature | Description | Output |
|---|---|---|
| Caption | Generate a single natural-language description of the image | "A person walking a dog in a park" |
| DenseCaptions | Generate captions for multiple regions within the image | Multiple captions with bounding boxes |
| Tags | Identify content tags with confidence scores | ["outdoor", "dog", "person", "park", "grass"] |
| Objects | Detect objects with bounding boxes and labels | Object name + bounding box coordinates |
| People | Detect people with bounding boxes | Person bounding box + confidence score |
| SmartCrops | Suggest crop regions for different aspect ratios | Crop coordinates for specified aspect ratios |
| Read | Extract printed and handwritten text (OCR) | Text lines with bounding polygons |
Using the Image Analysis SDK (Python)
```python
from azure.ai.vision.imageanalysis import ImageAnalysisClient
from azure.ai.vision.imageanalysis.models import VisualFeatures
from azure.core.credentials import AzureKeyCredential

client = ImageAnalysisClient(
    endpoint="https://my-vision.cognitiveservices.azure.com/",
    credential=AzureKeyCredential("<your-key>")
)

# Analyze an image with multiple features
result = client.analyze_from_url(
    image_url="https://example.com/photo.jpg",
    visual_features=[
        VisualFeatures.CAPTION,
        VisualFeatures.DENSE_CAPTIONS,
        VisualFeatures.TAGS,
        VisualFeatures.OBJECTS,
        VisualFeatures.PEOPLE,
        VisualFeatures.SMART_CROPS,
        VisualFeatures.READ
    ],
    gender_neutral_caption=True,            # use gender-neutral language in captions
    smart_crops_aspect_ratios=[0.9, 1.33],  # 9:10 and 4:3 aspect ratios
    language="en"
)

# Access the caption
print(f"Caption: {result.caption.text}")
print(f"Confidence: {result.caption.confidence:.2f}")

# Access tags
for tag in result.tags.list:
    print(f"Tag: {tag.name} (confidence: {tag.confidence:.2f})")

# Access detected objects
for obj in result.objects.list:
    print(f"Object: {obj.tags[0].name}")
    print(f"  Bounding box: {obj.bounding_box}")

# Access OCR text
if result.read is not None:
    for block in result.read.blocks:
        for line in block.lines:
            print(f"Text: {line.text}")
```
Using the REST API
```http
POST https://my-vision.cognitiveservices.azure.com/computervision/imageanalysis:analyze?api-version=2024-02-01&features=caption,tags,objects
Ocp-Apim-Subscription-Key: <your-key>
Content-Type: application/json

{
  "url": "https://example.com/photo.jpg"
}
```
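From Python, the same request URL can be assembled with the standard library; a minimal sketch (the endpoint and key are placeholders, and the actual POST is left to your HTTP client of choice):

```python
from urllib.parse import urlencode

endpoint = "https://my-vision.cognitiveservices.azure.com"  # placeholder resource endpoint
params = {"api-version": "2024-02-01", "features": "caption,tags,objects"}

# safe="," keeps the comma-separated feature list readable in the query string
url = f"{endpoint}/computervision/imageanalysis:analyze?{urlencode(params, safe=',')}"
headers = {
    "Ocp-Apim-Subscription-Key": "<your-key>",
    "Content-Type": "application/json",
}
body = {"url": "https://example.com/photo.jpg"}
print(url)
```

With the `requests` library, the call itself would then be `requests.post(url, headers=headers, json=body)`.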
Image Analysis Response Structure
```json
{
  "captionResult": {
    "text": "A person walking a dog in a sunny park",
    "confidence": 0.8745
  },
  "tagsResult": {
    "values": [
      {"name": "outdoor", "confidence": 0.9912},
      {"name": "person", "confidence": 0.9834},
      {"name": "dog", "confidence": 0.9756},
      {"name": "park", "confidence": 0.8921}
    ]
  },
  "objectsResult": {
    "values": [
      {
        "tags": [{"name": "person", "confidence": 0.95}],
        "boundingBox": {"x": 100, "y": 50, "w": 200, "h": 400}
      },
      {
        "tags": [{"name": "dog", "confidence": 0.92}],
        "boundingBox": {"x": 250, "y": 300, "w": 150, "h": 120}
      }
    ]
  }
}
```
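Extracting values from a payload shaped like this is plain dictionary indexing; a short sketch using the sample values above (the confidence threshold of 0.95 is arbitrary):

```python
# Sample response, matching the structure shown above
response = {
    "captionResult": {"text": "A person walking a dog in a sunny park", "confidence": 0.8745},
    "tagsResult": {"values": [
        {"name": "outdoor", "confidence": 0.9912},
        {"name": "person", "confidence": 0.9834},
        {"name": "dog", "confidence": 0.9756},
        {"name": "park", "confidence": 0.8921},
    ]},
    "objectsResult": {"values": [
        {"tags": [{"name": "person", "confidence": 0.95}],
         "boundingBox": {"x": 100, "y": 50, "w": 200, "h": 400}},
    ]},
}

caption = response["captionResult"]["text"]

# Keep only high-confidence tags
strong_tags = [t["name"] for t in response["tagsResult"]["values"] if t["confidence"] >= 0.95]

# Top label and bounding box of the first detected object
first_object = response["objectsResult"]["values"][0]
label = first_object["tags"][0]["name"]
box = first_object["boundingBox"]

print(caption)       # A person walking a dog in a sunny park
print(strong_tags)   # ['outdoor', 'person', 'dog']
print(label, box["x"], box["y"])  # person 100 50
```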
On the Exam: Know the parameter names for visual features (Caption, DenseCaptions, Tags, Objects, People, SmartCrops, Read) and the response structure. Questions may show a JSON response and ask you to identify the correct code to extract specific values.
Custom Image Analysis Models
Image Analysis 4.0 supports training custom models for domain-specific tasks:
Custom Classification
- Train a model to classify images into your custom categories
- Provide labeled images for each category (minimum 2 images per class, recommended 15+)
- Model training uses transfer learning from the Florence foundation model
Custom Object Detection
- Train a model to detect and locate your custom objects
- Provide labeled images with bounding box annotations
- Outputs object locations with confidence scores
Training Workflow
1. Create a training dataset in Azure AI Vision Studio or via API
2. Upload and label images (classification labels or bounding boxes)
3. Train the model (training is managed by the service)
4. Evaluate model performance (precision, recall, mAP)
5. Publish the model to an endpoint
6. Call the endpoint with the custom model name
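The evaluation step reports precision, recall, and mAP; as a refresher, precision and recall derive directly from true positive, false positive, and false negative counts (the numbers below are illustrative):

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# e.g. 45 correct detections, 5 false positives, 15 missed objects
p, r = precision_recall(45, 5, 15)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.90 recall=0.75
```

mAP extends this idea by averaging precision over recall levels and over object classes.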
Using a Custom Model
```python
result = client.analyze_from_url(
    image_url="https://example.com/product.jpg",
    visual_features=[VisualFeatures.TAGS],
    model_name="my-custom-product-classifier"
)
```
Review Questions
- Which foundation model powers Azure AI Vision Image Analysis 4.0?
- Which visual feature generates natural-language captions for multiple regions within an image?
- In the Image Analysis 4.0 response, where is the OCR text content found?
- What is the minimum number of labeled images required per class to train a custom Image Analysis classification model?