3.5 Video Analysis and Spatial Analysis

Key Takeaways

Azure AI Video Indexer extracts audio insights (transcription, translation, speaker ID, sentiment, topics) and visual insights (faces, scenes, shots, OCR, objects, brands) from uploaded video.
Video Indexer is asynchronous: upload returns a video id, then you poll the Index endpoint until processing completes before reading insights JSON.
Spatial Analysis runs as a Docker container on Azure IoT Edge, processing camera RTSP streams locally and sending only aggregated events to the cloud.
Spatial Analysis operations include PersonCount, PersonCrossingLine, PersonCrossingPolygon, PersonDistance, and PersonZoneDwellTime, configured with zones or lines in normalized 0-1 coordinates.
Spatial Analysis is privacy-by-design: it represents people as bounding boxes, performs no facial recognition, and discards frames after processing.

Last updated: June 2026

Quick Answer: Video Indexer mines uploaded video for transcripts, faces, scenes, topics, brands, and on-screen text — asynchronously. Spatial Analysis analyzes live camera feeds at the edge to count people, detect line/zone crossings, measure distance, and track dwell time. Spatial Analysis runs as an IoT Edge container and does no facial recognition.

Azure AI Video Indexer

Video Indexer is a multi-model pipeline that produces a single rich insights JSON. Insights split into audio and visual.

Audio insights

Insight	Description
Transcription	Speech-to-text of all spoken content
Translation	Translate the transcript to other languages
Speaker identification	Distinguish individual speakers
Sentiment	Positive/neutral/negative tone over time
Audio effects	Clapping, silence, crowd noise
Topics & keywords	Discussion topics from the transcript

Visual insights

Insight	Description
Face detection	Locate (and optionally name) faces over time
Scene & shot detection	Segment boundaries and camera cuts
OCR	Text burned into frames (signs, captions)
Object & brand detection	Recognize objects and logos
Thumbnail extraction	Representative keyframes
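
The audio and visual insights in both tables arrive in the same insights JSON. As a minimal sketch of pulling a few of them out, assuming the videos[0].insights layout used later in this section: the section names ocr, keywords, and faces, the name field on faces, and the sample document itself are assumptions for illustration and should be checked against a real response.

```python
from typing import Any


def summarize_insights(index: dict[str, Any]) -> dict[str, list[str]]:
    """Collect a few audio and visual insights from an index response."""
    insights = index["videos"][0]["insights"]

    def texts(key: str, field: str) -> list[str]:
        # A section that is missing from the response is treated as empty.
        return [item[field] for item in insights.get(key, []) if field in item]

    return {
        "transcript": texts("transcript", "text"),  # audio: speech-to-text
        "keywords": texts("keywords", "text"),      # audio: derived from transcript
        "ocr": texts("ocr", "text"),                # visual: on-screen text
        "faces": texts("faces", "name"),            # visual: detected faces
    }


# Made-up sample standing in for a real Index response.
sample = {
    "state": "Processed",
    "videos": [{
        "insights": {
            "transcript": [{"text": "Welcome to the keynote."}, {"text": "Let's begin."}],
            "keywords": [{"text": "keynote"}],
            "ocr": [{"text": "Contoso Summit"}],
            "faces": [{"name": "Unknown #1"}],
        }
    }],
}

summary = summarize_insights(sample)
for kind, values in summary.items():
    print(f"{kind}: {values}")
```

Treating missing sections as empty keeps the helper usable on videos with no speech or no detected faces.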

Asynchronous workflow and access tokens

Video Indexer is not synchronous. You first obtain an access token from the Authorization API (Account, User, Project, or Video scope; tokens expire after about an hour). You then upload, receive a video id, and poll the Index endpoint until state == "Processed" before reading insights.

import time
import requests

# location, account_id, and token are assumed to be set already
base = f"https://api.videoindexer.ai/{location}/Accounts/{account_id}/Videos"
auth = {"accessToken": token}
with open("video.mp4", "rb") as f:  # upload; the response carries the video id
    up = requests.post(base, params={**auth, "name": "demo", "privacy": "Private"},
                       files={"file": f})
video_id = up.json()["id"]

while True:  # poll the Index endpoint until indexing finishes
    insights = requests.get(f"{base}/{video_id}/Index", params=auth).json()
    if insights["state"] in ("Processed", "Failed"):
        break
    time.sleep(30)

for seg in insights["videos"][0]["insights"].get("transcript", []):
    print(seg["text"])
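
The polling step in the flow above is easier to reason about when it is bounded. The following is a sketch of one way to do that, not part of any Video Indexer SDK: wait_for_index is a hypothetical helper, fetch stands in for the GET on the Index endpoint, and treating "Failed" as a terminal state is an assumption. A fake fetch is used so the example runs without network access.

```python
import time
from typing import Callable


def wait_for_index(fetch: Callable[[], dict],
                   interval: float = 30.0,
                   timeout: float = 3600.0,
                   sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call fetch() until state is 'Processed'; raise on failure or timeout."""
    waited = 0.0
    while True:
        index = fetch()
        state = index.get("state")
        if state == "Processed":
            return index
        if state == "Failed":  # assumed terminal failure state
            raise RuntimeError("indexing failed")
        if waited >= timeout:
            raise TimeoutError(f"still '{state}' after {waited:.0f} seconds")
        sleep(interval)
        waited += interval


# Fake responses standing in for successive calls to the Index endpoint.
responses = iter([
    {"state": "Uploaded"},
    {"state": "Processing"},
    {"state": "Processed", "videos": []},
])
result = wait_for_index(lambda: next(responses), interval=0.0, sleep=lambda s: None)
print(result["state"])
```

Injecting fetch and sleep keeps the loop testable and makes it easy to swap in a real request later.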

On the Exam: Video Indexer = batch/async insight mining of recorded files; expect a token + upload + poll flow. Do not confuse it with Spatial Analysis, which is real-time on live streams.

Spatial Analysis

Spatial Analysis is part of Azure AI Vision and understands how people move through a physical space using existing security cameras. It is delivered as a Docker container you deploy to an Azure IoT Edge device with a capable GPU.

The five Person operations

Operation	What it detects	Example use
PersonCount	People currently in a zone	Live store occupancy vs. a cap
PersonCrossingLine	A person crossing a virtual line	Entry/exit footfall counting
PersonCrossingPolygon	Entering/leaving a polygon zone	Restricted-area intrusion alerts
PersonDistance	Distance between people	Social-distancing compliance
PersonZoneDwellTime	How long a person stays in a zone	Queue wait time, dwell analytics
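
The operations in the table reduce to familiar geometry on tracked positions. The sketch below illustrates the idea behind PersonCrossingLine (a segment-crossing test) and PersonCount (a point-in-polygon test). It is a simplified illustration under assumed inputs, not the container's actual algorithm: each person is reduced to a single (x, y) footprint point, and all function names are hypothetical.

```python
Point = tuple[float, float]


def _cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); its sign tells which side b is on."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def crossed_line(prev: Point, curr: Point, start: Point, end: Point) -> bool:
    """True if the move prev -> curr strictly crosses the segment start -> end."""
    return (_cross(start, end, prev) * _cross(start, end, curr) < 0
            and _cross(prev, curr, start) * _cross(prev, curr, end) < 0)


def in_zone(point: Point, polygon: list[Point]) -> bool:
    """Ray-casting point-in-polygon test."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # this edge spans the horizontal ray at y
            x_at_y = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_at_y:
                inside = not inside
    return inside


def person_count(footprints: list[Point], polygon: list[Point]) -> int:
    """Number of footprints currently inside the zone."""
    return sum(in_zone(p, polygon) for p in footprints)


zone = [(0.2, 0.2), (0.8, 0.2), (0.8, 0.8), (0.2, 0.8)]
footprints = [(0.5, 0.5), (0.9, 0.5), (0.3, 0.7)]
print(person_count(footprints, zone))  # 2

# A footprint moving from above to below a horizontal entrance line.
print(crossed_line((0.5, 0.4), (0.5, 0.6), (0.1, 0.5), (0.9, 0.5)))  # True
```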

Zones and lines are defined in normalized 0-to-1 coordinates (relative to frame size), so configs survive a resolution change. The focus setting (e.g., footprint vs. center) controls which point on the person's bounding box triggers the event, and the threshold setting tunes detection confidence.
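
To see why normalized coordinates survive a resolution change, the sketch below scales the same line to two frame sizes. The to_pixels helper is hypothetical and the two resolutions are arbitrary examples; the line endpoints match the entrance line used in this section's configuration.

```python
def to_pixels(point: dict[str, float], width: int, height: int) -> tuple[int, int]:
    """Scale a normalized {x, y} point to pixel coordinates for a frame size."""
    if not (0.0 <= point["x"] <= 1.0 and 0.0 <= point["y"] <= 1.0):
        raise ValueError("normalized coordinates must lie in the range 0-1")
    return round(point["x"] * width), round(point["y"] * height)


line = {"start": {"x": 0.1, "y": 0.5}, "end": {"x": 0.9, "y": 0.5}}

# The same definition lands on the same relative position at either resolution.
for width, height in [(1280, 720), (1920, 1080)]:
    start = to_pixels(line["start"], width, height)
    end = to_pixels(line["end"], width, height)
    print(f"{width}x{height}: {start} -> {end}")
```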

{
  "type": "cognitiveservices.vision.spatialanalysis-personcrossingline",
  "input": { "source": { "type": "rtsp", "uri": "rtsp://cam:554/stream" } },
  "parameters": {
    "lines": [{ "name": "entrance",
      "start": {"x": 0.1, "y": 0.5}, "end": {"x": 0.9, "y": 0.5} }],
    "threshold": 16, "focus": "footprint"
  }
}
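
A configuration like the one above can be generated and checked before deployment. This is a sketch that mirrors the simplified shape shown in this section; the schema accepted by a deployed container may nest these fields differently, and line_crossing_config is a hypothetical helper, not part of any Azure SDK.

```python
import json


def line_crossing_config(rtsp_uri: str, name: str,
                         start: dict[str, float], end: dict[str, float],
                         threshold: float = 16, focus: str = "footprint") -> dict:
    """Build a PersonCrossingLine config, rejecting non-normalized coordinates."""
    for label, point in (("start", start), ("end", end)):
        for axis in ("x", "y"):
            if not 0.0 <= point[axis] <= 1.0:
                raise ValueError(f"{label}.{axis}={point[axis]} is outside 0-1")
    return {
        "type": "cognitiveservices.vision.spatialanalysis-personcrossingline",
        "input": {"source": {"type": "rtsp", "uri": rtsp_uri}},
        "parameters": {
            "lines": [{"name": name, "start": start, "end": end}],
            "threshold": threshold,
            "focus": focus,
        },
    }


config = line_crossing_config(
    "rtsp://cam:554/stream", "entrance",
    start={"x": 0.1, "y": 0.5}, end={"x": 0.9, "y": 0.5},
)
print(json.dumps(config, indent=2))
```

Validating the 0-1 range up front catches the common mistake of pasting pixel coordinates into a normalized field.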

Deployment architecture

[IP Camera (RTSP)] -> [Azure IoT Edge device]
    -> [Spatial Analysis container]
        -> processes frames LOCALLY (no upload)
        -> emits aggregated events
            -> [Azure IoT Hub] -> [Stream Analytics] -> [Dashboard / Power BI]
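
The last stage of the pipeline above aggregates events for a dashboard. As a rough stand-in for what a windowed Stream Analytics query would compute, the sketch below buckets line-crossing events into fixed 60-second windows. The event shape (a timestamp in seconds plus a line name) is a simplified assumption, not the container's real output schema.

```python
from collections import Counter


def crossings_per_window(events: list[dict], window_seconds: int = 60) -> dict[int, int]:
    """Count events per fixed-size window, keyed by window start time."""
    counts: Counter[int] = Counter()
    for event in events:
        window_start = int(event["timestamp"] // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# Hypothetical aggregated events as they might arrive from IoT Hub.
events = [
    {"timestamp": 5, "line": "entrance"},
    {"timestamp": 42, "line": "entrance"},
    {"timestamp": 61, "line": "entrance"},
    {"timestamp": 130, "line": "entrance"},
]
print(crossings_per_window(events))  # {0: 2, 60: 1, 120: 1}
```

Only these small event records cross the network; the frames that produced them stay on the edge device.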

Privacy by design

No facial recognition — people are anonymous bounding boxes, never identified.
No frame storage — frames are processed in memory and discarded.
Edge processing — raw video never leaves the device; only counts/events reach the cloud.
Scoped zones — monitor only specific areas, not the full field of view.

Choosing Between the Two

Need	Service
Transcript, topics, faces from a recorded MP4	Video Indexer
Real-time people counting / line crossings	Spatial Analysis
Process on-prem with no video leaving site	Spatial Analysis (IoT Edge)
Searchable index of a media library	Video Indexer

Worked Example

A retailer must enforce a live occupancy cap and measure checkout queue waits without sending customer video to the cloud. Deploy the Spatial Analysis container to an IoT Edge gateway, then run PersonCount on the store-floor zone for occupancy and PersonZoneDwellTime on a polygon around the tills for wait time. Only the aggregated numbers flow to IoT Hub, which satisfies the privacy requirements because no faces are recognized and no frames are stored.
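
The cloud side of this scenario only has to work with numbers. Here is a minimal sketch of the two checks, assuming a cap of 50 and a simplified event shape with a dwell_seconds field; both are hypothetical values chosen for illustration.

```python
from statistics import mean


def occupancy_alert(person_count: int, cap: int) -> bool:
    """True when the live count from PersonCount exceeds the cap."""
    return person_count > cap


def average_wait_seconds(dwell_events: list[dict]) -> float:
    """Mean dwell time across completed visits to the till zone."""
    return mean(event["dwell_seconds"] for event in dwell_events)


# Hypothetical aggregated values received from IoT Hub.
floor_counts = [48, 50, 53]
till_dwells = [
    {"track": 1, "dwell_seconds": 90.0},
    {"track": 2, "dwell_seconds": 150.0},
    {"track": 3, "dwell_seconds": 120.0},
]

alerts = [occupancy_alert(count, cap=50) for count in floor_counts]
print(alerts)                             # [False, False, True]
print(average_wait_seconds(till_dwells))  # 120.0
```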

Video Indexer Deployment Modes and Customization

Video Indexer comes in two flavors the exam may contrast. The classic / connected account runs fully in Azure and stores media in a Microsoft-managed account. Azure Resource Manager (ARM) accounts integrate with your own Azure Storage and support managed identity and customer-managed keys for stricter governance — pick ARM when a requirement mentions bringing your own storage or enterprise security controls. Video Indexer is also customizable: you can train a person model to name recurring faces, a brand model to track brands relevant to your business, and a language model to bias the transcript toward domain vocabulary.

When a scenario asks how to make transcripts recognize industry jargon, the answer is a custom language model, not switching services.
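
Custom models are applied at upload time rather than by switching services. The sketch below builds an upload URL that references them. The query parameter names linguisticModelId and personModelId are stated from memory of the Upload Video API and should be verified against the current reference; the model ids and the upload_url helper are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode


def upload_url(location: str, account_id: str, token: str, name: str,
               language_model_id: Optional[str] = None,
               person_model_id: Optional[str] = None) -> str:
    """Build an upload URL, adding custom model ids only when provided."""
    params = {"name": name, "privacy": "Private", "accessToken": token}
    if language_model_id:
        params["linguisticModelId"] = language_model_id  # assumed parameter name
    if person_model_id:
        params["personModelId"] = person_model_id        # assumed parameter name
    base = f"https://api.videoindexer.ai/{location}/Accounts/{account_id}/Videos"
    return f"{base}?{urlencode(params)}"


url = upload_url("trial", "ACCOUNT_ID", "TOKEN", "talk", language_model_id="lm-123")
print(url)
```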

Real-Time vs. Batch: The Core Distinction

The single most important decision in this section is matching the workload to the service. Video Indexer is batch: it ingests recorded files and produces a searchable index minutes later, so it suits media libraries, compliance review, and content discovery. Spatial Analysis is real-time: it consumes a live RTSP feed and emits events as people move, so it suits occupancy enforcement, safety alerts, and queue management. A scenario that says "alert security the moment someone enters a restricted zone" demands Spatial Analysis with PersonCrossingPolygon, because Video Indexer's after-the-fact indexing cannot fire a live alert.

On the Exam: Spatial Analysis runs on IoT Edge, not the cloud, performs no facial recognition, and uses normalized 0-1 zone/line coordinates. Dwell-time questions map to PersonZoneDwellTime; entry/exit counting maps to PersonCrossingLine; restricted-area intrusion maps to PersonCrossingPolygon. Video Indexer = batch insight mining; custom language model adds domain vocabulary to transcripts.

Test Your Knowledge

Where does Azure AI Vision Spatial Analysis process video streams?

In the Azure cloud region of the resource

On an Azure IoT Edge device at the edge, in a Docker container

On the IP camera firmware itself

Inside Azure AI Foundry

Test Your Knowledge

Which Spatial Analysis operation measures how long customers wait in a checkout queue?

PersonCount

PersonCrossingLine

PersonZoneDwellTime

PersonDistance

Test Your Knowledge

A team must build a searchable index of recorded conference talks, including transcripts, speakers, topics, and on-screen text. Which service should they use?

Spatial Analysis on IoT Edge

Custom Vision object detection

The Face identification API

Azure AI Video Indexer

Test Your Knowledge

Why is Azure AI Vision Spatial Analysis considered privacy-by-design?

It encrypts every video frame and stores it for audit

It identifies each person by face but anonymizes the logs

It represents people as bounding boxes, performs no facial recognition, and discards frames after processing

It uploads all video to the cloud where access is restricted
