3.1 Computer Vision Concepts and Tasks

Key Takeaways

  • Computer vision enables AI to interpret and extract information from images and videos — it is how machines "see."
  • Image classification assigns a single label to an entire image (e.g., "cat" or "dog"); it answers "What is in this image?"
  • Object detection identifies AND locates objects within an image using bounding boxes — it answers "Where are objects in this image?"
  • Semantic segmentation classifies every pixel in an image, creating a detailed map of what is where — used in autonomous driving.
  • OCR (Optical Character Recognition) extracts text from images, converting visual text into machine-readable text.
Last updated: March 2026


Quick Answer: Computer vision enables machines to interpret images and videos. The key tasks are image classification (label an entire image), object detection (find and locate objects with bounding boxes), semantic segmentation (classify every pixel), OCR (extract text from images), and facial analysis (detect and analyze faces). Each task answers a different question about the visual content.

What Is Computer Vision?

Computer vision is a field of AI that enables computers to interpret and extract meaningful information from visual data — images and videos. It is how machines "see" and understand the visual world.

Computer vision powers applications you use every day:

  • Unlocking your phone with your face
  • Depositing checks by photographing them
  • Self-driving cars navigating roads
  • Security cameras detecting intruders
  • Medical imaging detecting tumors
  • Barcode and QR code scanning

Core Computer Vision Tasks

Understanding the differences between these tasks is critical for the AI-900:

1. Image Classification

What it does: Assigns a single label to an entire image.

Question it answers: "What is in this image?"

Output: A class label with a confidence score.

| Input Image | Predicted Label | Confidence |
|---|---|---|
| Photo of a golden retriever | "Dog" | 95.2% |
| Photo of a tabby cat | "Cat" | 98.7% |
| Chest X-ray | "Pneumonia" | 87.3% |
| Satellite image | "Urban" | 91.5% |

Use cases:

  • Medical image screening (disease vs. healthy)
  • Quality inspection in manufacturing (defective vs. acceptable)
  • Content moderation (appropriate vs. inappropriate)
  • Wildlife monitoring (species identification)
  • Plant disease detection (healthy vs. infected)
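The "single label + confidence" output above can be sketched in a few lines. This is a toy example with made-up class names and scores, not any specific service's response format; a real classifier produces the score dictionary from an image.

```python
# Minimal sketch: turning raw classifier scores into the single
# label-plus-confidence output that image classification returns.
# Class names and scores are invented for illustration.

def classify(scores: dict[str, float]) -> tuple[str, float]:
    """Return the highest-scoring label and its confidence."""
    label = max(scores, key=scores.get)
    return label, scores[label]

raw_scores = {"Dog": 0.952, "Cat": 0.031, "Fox": 0.017}
label, confidence = classify(raw_scores)
print(f"{label} ({confidence:.1%})")  # Dog (95.2%)
```

Note that the whole image gets exactly one label, no matter how many objects it contains; that is the key contrast with object detection below.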

2. Object Detection

What it does: Identifies AND locates multiple objects within an image using bounding boxes.

Question it answers: "What objects are in this image, and where are they?"

Output: A list of detected objects, each with a class label, confidence score, and bounding box coordinates (x, y, width, height).

| Input Image | Detected Objects |
|---|---|
| Street scene | Car (0.97) at [120, 200, 300, 180]; Person (0.93) at [450, 150, 80, 200]; Dog (0.85) at [600, 350, 60, 50] |
| Grocery shelf | Apple (0.99) at [50, 100, 80, 80]; Banana (0.96) at [200, 100, 120, 40] |

Use cases:

  • Autonomous vehicles (detect pedestrians, cars, signs)
  • Retail analytics (count products on shelves)
  • Security and surveillance (detect people, vehicles)
  • Medical imaging (locate tumors, lesions)
  • Wildlife monitoring (count and locate animals)
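The retail use case above (counting products on shelves) is a typical way detection output is consumed. The sketch below uses a hypothetical list of detections in the label/confidence/bounding-box shape described earlier, not any specific service's schema, and filters out low-confidence hits before counting per class:

```python
# Hypothetical detection results: each entry has a class label, a
# confidence score, and a bounding box as (x, y, width, height).
from collections import Counter

detections = [
    {"label": "Car",    "confidence": 0.97, "box": (120, 200, 300, 180)},
    {"label": "Person", "confidence": 0.93, "box": (450, 150, 80, 200)},
    {"label": "Dog",    "confidence": 0.85, "box": (600, 350, 60, 50)},
    {"label": "Dog",    "confidence": 0.40, "box": (10, 10, 40, 30)},
]

def count_objects(dets, threshold=0.5):
    """Count detected objects per class, ignoring low-confidence hits."""
    return Counter(d["label"] for d in dets if d["confidence"] >= threshold)

print(count_objects(detections))
# Counter({'Car': 1, 'Person': 1, 'Dog': 1})
```

The confidence threshold is a common design knob: raising it trades missed objects for fewer false positives.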

3. Semantic Segmentation

What it does: Classifies every pixel in an image, creating a detailed map of the scene.

Question it answers: "What class does every pixel in this image belong to?"

Output: A pixel-by-pixel classification map where each pixel is assigned a category.

Use cases:

  • Autonomous driving (road, sidewalk, car, pedestrian, building — for every pixel)
  • Medical imaging (precise tumor boundaries)
  • Land use mapping (forest, water, urban, agricultural — from satellite images)
  • Augmented reality (separate foreground from background)

4. OCR (Optical Character Recognition)

What it does: Extracts text from images, converting visual text into machine-readable text.

Question it answers: "What text is in this image?"

Output: Extracted text with position information.

Use cases:

  • Reading documents (contracts, forms, letters)
  • License plate recognition
  • Receipt and invoice digitization
  • Sign and label reading
  • Handwriting recognition
  • Business card scanning
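Because OCR returns text *with position information*, a common post-processing step is reconstructing reading order. The sketch below sorts hypothetical word/position pairs top-to-bottom, then left-to-right; the coordinates and line-grouping rule are invented for illustration, though real OCR services return similarly positioned words:

```python
# OCR output pairs extracted text with position. Reconstruct reading
# order from (x, y) word positions (coordinates are made up).

words = [
    {"text": "PAY",     "x": 10, "y": 12},
    {"text": "$120.50", "x": 80, "y": 52},
    {"text": "TOTAL:",  "x": 10, "y": 50},
]

def reading_order(words, line_tolerance=10):
    """Sort words top-to-bottom, then left-to-right within a line.

    Words whose y falls in the same `line_tolerance`-sized band are
    treated as one line."""
    return sorted(words, key=lambda w: (w["y"] // line_tolerance, w["x"]))

print(" ".join(w["text"] for w in reading_order(words)))
# PAY TOTAL: $120.50
```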

5. Facial Detection and Analysis

What it does: Detects human faces in images and optionally analyzes attributes.

Question it answers: "Are there faces in this image, and what can you tell about them?"

Output: Face locations and optional attributes.

| Attribute | Description |
|---|---|
| Face location | Bounding box around each face |
| Age | Estimated age |
| Emotion | Happiness, sadness, anger, surprise, fear, contempt, disgust, neutral |
| Head pose | Rotation angles (pitch, roll, yaw) |
| Glasses | Whether the person wears glasses |
| Facial landmarks | Key points (eyes, nose, mouth, jaw) |

Important: Citing responsible AI concerns, Microsoft has retired the Face API capabilities that infer emotional state, gender, and age. The current Face API focuses on face detection, verification, and identification — not emotion or demographic inference.
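In line with that retirement, face results are often consumed as bounding boxes only. This sketch, with invented coordinates, shows a typical sanity filter on detected faces before any downstream verification step:

```python
# Hypothetical face-detection results: bounding boxes as
# (x, y, width, height), reflecting the current detection-focused
# Face API rather than attribute inference. Values are invented.

faces = [
    {"box": (100, 80, 60, 60)},
    {"box": (400, 90, 12, 12)},  # tiny; likely too small to verify
]

def filter_small_faces(faces, min_side=20):
    """Keep faces whose bounding box is large enough to be usable."""
    return [f for f in faces if min(f["box"][2], f["box"][3]) >= min_side]

print(len(filter_small_faces(faces)))  # 1
```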

Comparison Table: Key Computer Vision Tasks

| Task | Input | Output | Question Answered |
|---|---|---|---|
| Image Classification | Whole image | Single label + confidence | "What is in this image?" |
| Object Detection | Whole image | Multiple objects + bounding boxes | "Where are objects in this image?" |
| Semantic Segmentation | Whole image | Per-pixel classification map | "What class is every pixel in this image?" |
| OCR | Image with text | Extracted text + positions | "What text is in this image?" |
| Facial Detection | Image with faces | Face locations + attributes | "Where are faces and what do they look like?" |

On the Exam: The most common question type is: "A company needs to [scenario]. Which computer vision task should they use?" Focus on the output: single label = classification, located objects = detection, pixel map = segmentation, extracted text = OCR, face analysis = facial detection.

Test Your Knowledge

A retail company wants to count the number of each product type on store shelves by identifying and locating products in photographs. Which computer vision task should they use?

Test Your Knowledge

Which computer vision task classifies every pixel in an image?

Test Your Knowledge

A bank wants to automatically read account numbers and amounts from paper checks that customers photograph with their phones. Which computer vision task is needed?

Test Your Knowledge

What is the key difference between image classification and object detection?

Test Your Knowledge: Matching

Match each computer vision task to its primary output:

  1. Image classification
  2. Object detection
  3. Semantic segmentation
  4. OCR
  5. Facial detection