3.1 Computer Vision Concepts and Tasks
Key Takeaways
- Computer vision enables AI to interpret and extract information from images and videos — it is how machines "see."
- Image classification assigns a single label to an entire image (e.g., "cat" or "dog"); it answers "What is in this image?"
- Object detection identifies AND locates objects within an image using bounding boxes — it answers "Where are objects in this image?"
- Semantic segmentation classifies every pixel in an image, creating a detailed map of what is where — used in autonomous driving.
- OCR (Optical Character Recognition) extracts text from images, converting visual text into machine-readable text.
Quick Answer: Computer vision enables machines to interpret images and videos. The key tasks are image classification (label an entire image), object detection (find and locate objects with bounding boxes), semantic segmentation (classify every pixel), OCR (extract text from images), and facial analysis (detect and analyze faces). Each task answers a different question about the visual content.
What Is Computer Vision?
Computer vision is a field of AI that enables computers to interpret and extract meaningful information from visual data — images and videos. It is how machines "see" and understand the visual world.
Computer vision powers applications you use every day:
- Unlocking your phone with your face
- Depositing checks by photographing them
- Self-driving cars navigating roads
- Security cameras detecting intruders
- Medical imaging detecting tumors
- Barcode and QR code scanning
Core Computer Vision Tasks
Understanding the differences between these tasks is critical for the AI-900:
1. Image Classification
What it does: Assigns a single label to an entire image.
Question it answers: "What is in this image?"
Output: A class label with a confidence score.
| Input Image | Predicted Label | Confidence |
|---|---|---|
| Photo of a golden retriever | "Dog" | 95.2% |
| Photo of a tabby cat | "Cat" | 98.7% |
| Chest X-ray | "Pneumonia" | 87.3% |
| Satellite image | "Urban" | 91.5% |
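The table above shows the essential shape of a classification result: one label plus one confidence score for the whole image. A minimal sketch of that output (illustrative data structures, not a real Azure SDK type):

```python
from dataclasses import dataclass

@dataclass
class ClassificationResult:
    label: str
    confidence: float  # between 0.0 and 1.0

def top_prediction(scores: dict) -> ClassificationResult:
    """Pick the single highest-scoring label for the entire image."""
    label = max(scores, key=scores.get)
    return ClassificationResult(label, scores[label])

# Hypothetical scores a model might return for the golden retriever photo
scores = {"Dog": 0.952, "Cat": 0.031, "Urban": 0.017}
print(top_prediction(scores))
```

Note that the model scores every class it knows, but classification reports only one label for the whole image; there is no location information.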
Use cases:
- Medical image screening (disease vs. healthy)
- Quality inspection in manufacturing (defective vs. acceptable)
- Content moderation (appropriate vs. inappropriate)
- Wildlife monitoring (species identification)
- Plant disease detection (healthy vs. infected)
2. Object Detection
What it does: Identifies AND locates multiple objects within an image using bounding boxes.
Question it answers: "What objects are in this image, and where are they?"
Output: A list of detected objects, each with a class label, confidence score, and bounding box coordinates (x, y, width, height).
| Input Image | Detected Objects |
|---|---|
| Street scene | Car (0.97) at [120, 200, 300, 180], Person (0.93) at [450, 150, 80, 200], Dog (0.85) at [600, 350, 60, 50] |
| Grocery shelf | Apple (0.99) at [50, 100, 80, 80], Banana (0.96) at [200, 100, 120, 40] |
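Because detection returns a list of located objects rather than one label, you can do things classification cannot, such as counting instances per class. A sketch using the street-scene results from the table (the tuple format is illustrative, not an Azure API shape):

```python
from collections import Counter

# One entry per detected object: (label, confidence, (x, y, width, height))
detections = [
    ("Car", 0.97, (120, 200, 300, 180)),
    ("Person", 0.93, (450, 150, 80, 200)),
    ("Dog", 0.85, (600, 350, 60, 50)),
]

def count_objects(dets, min_confidence=0.5):
    """Count detected objects per class, discarding low-confidence hits."""
    return Counter(label for label, conf, _box in dets if conf >= min_confidence)

print(count_objects(detections))
```

This is exactly the pattern behind retail shelf-counting scenarios: detect, filter by confidence, then tally by label.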
Use cases:
- Autonomous vehicles (detect pedestrians, cars, signs)
- Retail analytics (count products on shelves)
- Security and surveillance (detect people, vehicles)
- Medical imaging (locate tumors, lesions)
- Wildlife monitoring (count and locate animals)
3. Semantic Segmentation
What it does: Classifies every pixel in an image, creating a detailed map of the scene.
Question it answers: "What class does every pixel in this image belong to?"
Output: A pixel-by-pixel classification map where each pixel is assigned a category.
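A segmentation map is just a 2D array the same size as the image, holding a class ID per pixel. A toy sketch (the 4x4 "image" and class IDs are invented for illustration):

```python
import numpy as np

# Per-pixel class IDs for a tiny 4x4 image: 0 = road, 1 = sidewalk, 2 = car
segmentation_map = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 1],
])

CLASSES = {0: "road", 1: "sidewalk", 2: "car"}

# Area covered by each class, as a fraction of all pixels
ids, counts = np.unique(segmentation_map, return_counts=True)
for class_id, count in zip(ids, counts):
    print(f"{CLASSES[class_id]}: {count / segmentation_map.size:.0%}")
```

Every pixel gets exactly one class, which is why segmentation yields precise boundaries (e.g., a tumor outline) where detection would only give a rectangle.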
Use cases:
- Autonomous driving (road, sidewalk, car, pedestrian, building — for every pixel)
- Medical imaging (precise tumor boundaries)
- Land use mapping (forest, water, urban, agricultural — from satellite images)
- Augmented reality (separate foreground from background)
4. OCR (Optical Character Recognition)
What it does: Extracts text from images, converting visual text into machine-readable text.
Question it answers: "What text is in this image?"
Output: Extracted text with position information.
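OCR results typically arrive as lines of text, each with its position, which lets you reassemble them in reading order. A sketch with invented receipt data (the dict layout is illustrative, not a specific service's response format):

```python
# Each OCR line carries its text plus a bounding box (x, y, width, height)
ocr_lines = [
    {"text": "TOTAL: $42.17", "box": (40, 300, 180, 20)},
    {"text": "Grocery Mart", "box": (60, 20, 140, 24)},
    {"text": "2x Apples  $3.00", "box": (40, 120, 200, 20)},
]

def to_plain_text(lines):
    """Reassemble machine-readable text in top-to-bottom reading order."""
    ordered = sorted(lines, key=lambda ln: ln["box"][1])  # sort by y position
    return "\n".join(ln["text"] for ln in ordered)

print(to_plain_text(ocr_lines))
```

The position information is what makes receipt and form digitization possible: knowing where "TOTAL" sits on the page matters as much as reading it.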
Use cases:
- Reading documents (contracts, forms, letters)
- License plate recognition
- Receipt and invoice digitization
- Sign and label reading
- Handwriting recognition
- Business card scanning
5. Facial Detection and Analysis
What it does: Detects human faces in images and optionally analyzes attributes.
Question it answers: "Are there faces in this image, and what can you tell about them?"
Output: Face locations and optional attributes.
| Attribute | Description |
|---|---|
| Face location | Bounding box around each face |
| Age | Estimated age |
| Emotion | Happiness, sadness, anger, surprise, fear, contempt, disgust, neutral |
| Head pose | Rotation angles (pitch, roll, yaw) |
| Glasses | Whether the person wears glasses |
| Facial landmarks | Key points (eyes, nose, mouth, jaw) |
Important: Microsoft has retired the Face API capabilities that infer emotional state, gender, and age, citing responsible AI concerns. The current Face API focuses on face detection, verification, and identification, not emotion or demographic inference.
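In line with that note, a current face-detection result is essentially locations plus geometric landmarks. A sketch of what such output might look like (invented data and field names, not the Face API's actual response schema):

```python
from dataclasses import dataclass, field

@dataclass
class DetectedFace:
    box: tuple                       # (x, y, width, height) around the face
    landmarks: dict = field(default_factory=dict)  # key points, e.g. pupils

faces = [
    DetectedFace(box=(210, 90, 64, 64),
                 landmarks={"pupil_left": (228, 110), "pupil_right": (252, 110)}),
    DetectedFace(box=(400, 120, 58, 58)),
]

# "Are there faces in this image?" — detection answers with locations
print(f"Found {len(faces)} face(s)")
for f in faces:
    x, y, w, h = f.box
    print(f"  face at ({x}, {y}), size {w}x{h}")
```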
Comparison Table: Key Computer Vision Tasks
| Task | Input | Output | Question Answered |
|---|---|---|---|
| Image Classification | Whole image | Single label + confidence | "What is in this image?" |
| Object Detection | Whole image | Multiple objects + bounding boxes | "Where are objects in this image?" |
| Semantic Segmentation | Whole image | Per-pixel classification map | "What class is every pixel in this image?" |
| OCR | Image with text | Extracted text + positions | "What text is in this image?" |
| Facial Detection | Image with faces | Face locations + attributes | "Where are faces and what do they look like?" |
On the Exam: The most common question type is: "A company needs to [scenario]. Which computer vision task should they use?" Focus on the output: single label = classification, located objects = detection, pixel map = segmentation, extracted text = OCR, face analysis = facial detection.
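That output-focused heuristic can be written down directly as a lookup, which makes a handy self-test (the phrasing of the keys is taken from this section, not from any exam or API):

```python
# Map the output a scenario needs to the task that produces it
TASK_FOR_OUTPUT = {
    "single label for the whole image": "Image Classification",
    "objects located with bounding boxes": "Object Detection",
    "class for every pixel": "Semantic Segmentation",
    "text extracted from the image": "OCR",
    "face locations and attributes": "Facial Detection",
}

def pick_task(needed_output: str) -> str:
    """Return the computer vision task matching the required output."""
    return TASK_FOR_OUTPUT.get(needed_output, "re-read the scenario")

print(pick_task("objects located with bounding boxes"))
```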
Practice Questions
A retail company wants to count the number of each product type on store shelves by identifying and locating products in photographs. Which computer vision task should they use?
Which computer vision task classifies every pixel in an image?
A bank wants to automatically read account numbers and amounts from paper checks that customers photograph with their phones. Which computer vision task is needed?
What is the key difference between image classification and object detection?
Match each computer vision task to its primary output: