3.1 Computer Vision Concepts and Tasks
Key Takeaways
- Computer vision enables AI to interpret and extract information from images and videos — it is how machines "see."
- Image classification assigns a single label to an entire image (e.g., "cat" or "dog"); it answers "What is in this image?"
- Object detection identifies AND locates objects within an image using bounding boxes — it answers "Where are objects in this image?"
- Semantic segmentation classifies every pixel in an image, creating a detailed map of what is where — used in autonomous driving.
- OCR (Optical Character Recognition) extracts text from images, converting visual text into machine-readable text.
Quick Answer: Computer vision enables machines to interpret images and videos. The key tasks are image classification (label an entire image), object detection (find and locate objects with bounding boxes), semantic segmentation (classify every pixel), OCR (extract text from images), and facial analysis (detect and analyze faces). Each task answers a different question about the visual content.
What Is Computer Vision?
Computer vision is a field of AI that enables computers to interpret and extract meaningful information from visual data — images and videos. It is how machines "see" and understand the visual world.
Computer vision powers applications you use every day:
- Unlocking your phone with your face
- Depositing checks by photographing them
- Self-driving cars navigating roads
- Security cameras detecting intruders
- Medical imaging detecting tumors
- Barcode and QR code scanning
Core Computer Vision Tasks
Understanding the differences between these tasks is critical for the AI-900:
1. Image Classification
What it does: Assigns a single label to an entire image.
Question it answers: "What is in this image?"
Output: A class label with a confidence score.
| Input Image | Predicted Label | Confidence |
|---|---|---|
| Photo of a golden retriever | "Dog" | 95.2% |
| Photo of a tabby cat | "Cat" | 98.7% |
| Chest X-ray | "Pneumonia" | 87.3% |
| Satellite image | "Urban" | 91.5% |
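The table above shows the essential shape of a classification result: one label plus one confidence score for the whole image. A minimal sketch of that output (illustrative data structures, not a real Azure SDK type):

```python
from dataclasses import dataclass

@dataclass
class ClassificationResult:
    label: str
    confidence: float  # between 0.0 and 1.0

def top_prediction(scores: dict) -> ClassificationResult:
    """Pick the single highest-scoring label for the entire image."""
    label = max(scores, key=scores.get)
    return ClassificationResult(label, scores[label])

# Hypothetical scores a model might return for the golden retriever photo
scores = {"Dog": 0.952, "Cat": 0.031, "Urban": 0.017}
print(top_prediction(scores))
```

Note that the model scores every class it knows, but classification reports only one label for the whole image; there is no location information.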
Use cases:
- Medical image screening (disease vs. healthy)
- Quality inspection in manufacturing (defective vs. acceptable)
- Content moderation (appropriate vs. inappropriate)
- Wildlife monitoring (species identification)
- Plant disease detection (healthy vs. infected)
2. Object Detection
What it does: Identifies AND locates multiple objects within an image using bounding boxes.
Question it answers: "What objects are in this image, and where are they?"
Output: A list of detected objects, each with a class label, confidence score, and bounding box coordinates (x, y, width, height).
| Input Image | Detected Objects |
|---|---|
| Street scene | Car (0.97) at [120, 200, 300, 180], Person (0.93) at [450, 150, 80, 200], Dog (0.85) at [600, 350, 60, 50] |
| Grocery shelf | Apple (0.99) at [50, 100, 80, 80], Banana (0.96) at [200, 100, 120, 40] |
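Because detection returns a list of located objects rather than one label, you can do things classification cannot, such as counting instances per class. A sketch using the street-scene results from the table (the tuple format is illustrative, not an Azure API shape):

```python
from collections import Counter

# One entry per detected object: (label, confidence, (x, y, width, height))
detections = [
    ("Car", 0.97, (120, 200, 300, 180)),
    ("Person", 0.93, (450, 150, 80, 200)),
    ("Dog", 0.85, (600, 350, 60, 50)),
]

def count_objects(dets, min_confidence=0.5):
    """Count detected objects per class, discarding low-confidence hits."""
    return Counter(label for label, conf, _box in dets if conf >= min_confidence)

print(count_objects(detections))
```

This is exactly the pattern behind retail shelf-counting scenarios: detect, filter by confidence, then tally by label.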
Use cases:
- Autonomous vehicles (detect pedestrians, cars, signs)
- Retail analytics (count products on shelves)
- Security and surveillance (detect people, vehicles)
- Medical imaging (locate tumors, lesions)
- Wildlife monitoring (count and locate animals)
3. Semantic Segmentation
What it does: Classifies every pixel in an image, creating a detailed map of the scene.
Question it answers: "What class does every pixel in this image belong to?"
Output: A pixel-by-pixel classification map where each pixel is assigned a category.
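A segmentation map is just a 2D array the same size as the image, holding a class ID per pixel. A toy sketch (the 4x4 "image" and class IDs are invented for illustration):

```python
import numpy as np

# Per-pixel class IDs for a tiny 4x4 image: 0 = road, 1 = sidewalk, 2 = car
segmentation_map = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 1],
])

CLASSES = {0: "road", 1: "sidewalk", 2: "car"}

# Area covered by each class, as a fraction of all pixels
ids, counts = np.unique(segmentation_map, return_counts=True)
for class_id, count in zip(ids, counts):
    print(f"{CLASSES[class_id]}: {count / segmentation_map.size:.0%}")
```

Every pixel gets exactly one class, which is why segmentation yields precise boundaries (e.g., a tumor outline) where detection would only give a rectangle.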
Use cases:
- Autonomous driving (road, sidewalk, car, pedestrian, building — for every pixel)
- Medical imaging (precise tumor boundaries)
- Land use mapping (forest, water, urban, agricultural — from satellite images)
- Augmented reality (separate foreground from background)
4. OCR (Optical Character Recognition)
What it does: Extracts text from images, converting visual text into machine-readable text.
Question it answers: "What text is in this image?"
Output: Extracted text with position information.
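OCR results typically arrive as lines of text, each with its position, which lets you reassemble them in reading order. A sketch with invented receipt data (the dict layout is illustrative, not a specific service's response format):

```python
# Each OCR line carries its text plus a bounding box (x, y, width, height)
ocr_lines = [
    {"text": "TOTAL: $42.17", "box": (40, 300, 180, 20)},
    {"text": "Grocery Mart", "box": (60, 20, 140, 24)},
    {"text": "2x Apples  $3.00", "box": (40, 120, 200, 20)},
]

def to_plain_text(lines):
    """Reassemble machine-readable text in top-to-bottom reading order."""
    ordered = sorted(lines, key=lambda ln: ln["box"][1])  # sort by y position
    return "\n".join(ln["text"] for ln in ordered)

print(to_plain_text(ocr_lines))
```

The position information is what makes receipt and form digitization possible: knowing where "TOTAL" sits on the page matters as much as reading it.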
Use cases:
- Reading documents (contracts, forms, letters)
- License plate recognition
- Receipt and invoice digitization
- Sign and label reading
- Handwriting recognition
- Business card scanning
5. Facial Detection and Analysis
What it does: Detects human faces in images and optionally analyzes attributes.
Question it answers: "Are there faces in this image, and what can you tell about them?"
Output: Face locations and optional attributes.
| Attribute | Description |
|---|---|
| Face location | Bounding box around each face |
| Age | Estimated age |
| Emotion | Happiness, sadness, anger, surprise, fear, contempt, disgust, neutral |
| Head pose | Rotation angles (pitch, roll, yaw) |
| Glasses | Whether the person wears glasses |
| Facial landmarks | Key points (eyes, nose, mouth, jaw) |
Important: Microsoft has retired the Face API capabilities that infer emotional state, gender, and age, citing responsible AI concerns. The current Face API focuses on face detection, verification, and identification, not emotion or demographic inference.
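In line with that note, a current face-detection result is essentially locations plus geometric landmarks. A sketch of what such output might look like (invented data and field names, not the Face API's actual response schema):

```python
from dataclasses import dataclass, field

@dataclass
class DetectedFace:
    box: tuple                       # (x, y, width, height) around the face
    landmarks: dict = field(default_factory=dict)  # key points, e.g. pupils

faces = [
    DetectedFace(box=(210, 90, 64, 64),
                 landmarks={"pupil_left": (228, 110), "pupil_right": (252, 110)}),
    DetectedFace(box=(400, 120, 58, 58)),
]

# "Are there faces in this image?" — detection answers with locations
print(f"Found {len(faces)} face(s)")
for f in faces:
    x, y, w, h = f.box
    print(f"  face at ({x}, {y}), size {w}x{h}")
```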
Comparison Table: Key Computer Vision Tasks
| Task | Input | Output | Question Answered |
|---|---|---|---|
| Image Classification | Whole image | Single label + confidence | "What is in this image?" |
| Object Detection | Whole image | Multiple objects + bounding boxes | "Where are objects in this image?" |
| Semantic Segmentation | Whole image | Per-pixel classification map | "What class is every pixel in this image?" |
| OCR | Image with text | Extracted text + positions | "What text is in this image?" |
| Facial Detection | Image with faces | Face locations + attributes | "Where are faces and what do they look like?" |
On the Exam: The most common question type is: "A company needs to [scenario]. Which computer vision task should they use?" Focus on the output: single label = classification, located objects = detection, pixel map = segmentation, extracted text = OCR, face analysis = facial detection.
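That output-focused heuristic can be written down directly as a lookup, which makes a handy self-test (the phrasing of the keys is taken from this section, not from any exam or API):

```python
# Map the output a scenario needs to the task that produces it
TASK_FOR_OUTPUT = {
    "single label for the whole image": "Image Classification",
    "objects located with bounding boxes": "Object Detection",
    "class for every pixel": "Semantic Segmentation",
    "text extracted from the image": "OCR",
    "face locations and attributes": "Facial Detection",
}

def pick_task(needed_output: str) -> str:
    """Return the computer vision task matching the required output."""
    return TASK_FOR_OUTPUT.get(needed_output, "re-read the scenario")

print(pick_task("objects located with bounding boxes"))
```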
Practice Questions
A retail company wants to count the number of each product type on store shelves by identifying and locating products in photographs. Which computer vision task should they use?
Which computer vision task classifies every pixel in an image?
A bank wants to automatically read account numbers and amounts from paper checks that customers photograph with their phones. Which computer vision task is needed?
What is the key difference between image classification and object detection?
Match each computer vision task to its primary output: