3.5 Video Analysis and Spatial Analysis

Key Takeaways

  • Azure AI Video Indexer extracts audio insights (transcription, translation, speaker ID, sentiment, topics) and visual insights (faces, scenes, shots, OCR, objects, brands) from uploaded video.
  • Video Indexer is asynchronous: upload returns a video id, then you poll the Index endpoint until processing completes before reading insights JSON.
  • Spatial Analysis runs as a Docker container on Azure IoT Edge, processing camera RTSP streams locally and sending only aggregated events to the cloud.
  • Spatial Analysis operations include PersonCount, PersonCrossingLine, PersonCrossingPolygon, PersonDistance, and PersonZoneDwellTime, configured with zones or lines in normalized 0-1 coordinates.
  • Spatial Analysis is privacy-by-design: it represents people as bounding boxes, performs no facial recognition, and discards frames after processing.
Last updated: June 2026

Quick Answer: Video Indexer mines uploaded video for transcripts, faces, scenes, topics, brands, and on-screen text — asynchronously. Spatial Analysis analyzes live camera feeds at the edge to count people, detect line/zone crossings, measure distance, and track dwell time. Spatial Analysis runs as an IoT Edge container and does no facial recognition.

Azure AI Video Indexer

Video Indexer is a multi-model pipeline that produces a single rich insights JSON. Insights split into audio and visual.

Audio insights

Insight | Description
Transcription | Speech-to-text of all spoken content
Translation | Translate the transcript to other languages
Speaker identification | Distinguish individual speakers
Sentiment | Positive/neutral/negative tone over time
Audio effects | Clapping, silence, crowd noise
Topics & keywords | Discussion topics from the transcript

Visual insights

Insight | Description
Face detection | Locate (and optionally name) faces over time
Scene & shot detection | Segment boundaries and camera cuts
OCR | Text burned into frames (signs, captions)
Object & brand detection | Recognize objects and logos
Thumbnail extraction | Representative keyframes
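
To make the audio/visual split concrete, here is a minimal sketch of pulling a few fields out of an insights document. The `transcript` path matches the access pattern used elsewhere in this section; the `keywords`, `ocr`, and `faces` field names and the sample document are assumptions for illustration, so check them against a real index response before relying on them.

```python
from typing import Any


def summarize_insights(index: dict[str, Any]) -> dict[str, list[str]]:
    """Collect a few audio and visual insight fields from an index document."""
    insights = index["videos"][0]["insights"]
    return {
        # Audio-derived insights
        "transcript": [seg["text"] for seg in insights.get("transcript", [])],
        "keywords": [kw["text"] for kw in insights.get("keywords", [])],
        # Visual insights (field names assumed for this sketch)
        "ocr": [item["text"] for item in insights.get("ocr", [])],
        "faces": [face["name"] for face in insights.get("faces", [])],
    }


# Hand-written sample, not real service output.
sample_index = {
    "state": "Processed",
    "videos": [{
        "insights": {
            "transcript": [{"text": "Welcome to the demo."}, {"text": "Let's begin."}],
            "keywords": [{"text": "demo"}],
            "ocr": [{"text": "EXIT"}],
            "faces": [{"name": "Unknown #1"}],
        }
    }],
}

summary = summarize_insights(sample_index)
print(summary["transcript"])
```

Using `.get(..., [])` keeps the helper tolerant of videos where a given insight type was not produced.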

Asynchronous workflow and access tokens

Video Indexer is not synchronous. You first obtain an access token from the Authorization API (Account, User, Project, or Video scope; tokens expire after about an hour). You then upload, receive a video id, and poll the Index endpoint until state == "Processed" before reading insights.

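import time
import requests

# location, account_id, and token (from the Authorization API) are assumed to be defined.
base = f"https://api.videoindexer.ai/{location}/Accounts/{account_id}/Videos"
with open("video.mp4", "rb") as f:  # upload; the response carries the new video id
    resp = requests.post(base, files={"file": f}, params={"name": "demo", "privacy": "Private", "accessToken": token})
resp.raise_for_status()
video_id = resp.json()["id"]

while True:  # poll the Index endpoint until indexing reaches a terminal state
    index = requests.get(f"{base}/{video_id}/Index", params={"accessToken": token}).json()
    if index["state"] in ("Processed", "Failed"):
        break
    time.sleep(30)

for seg in index["videos"][0]["insights"].get("transcript", []):
    print(seg["text"])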
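
For anything beyond a demo, the polling step usually wants a timeout and explicit failure handling. The following is a sketch of one way to structure that, not part of any Video Indexer SDK: `wait_for_index` and its parameters are hypothetical names, the HTTP call is injected as a function so the loop can be exercised without network access, and the `"Failed"` state value is an assumption to verify against the API reference.

```python
import time
from typing import Any, Callable


def wait_for_index(
    fetch_index: Callable[[], dict[str, Any]],
    interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict[str, Any]:
    """Call fetch_index repeatedly until the index reports state 'Processed'."""
    deadline = time.monotonic() + timeout_s
    while True:
        index = fetch_index()
        state = index.get("state")
        if state == "Processed":
            return index
        if state == "Failed":  # assumed terminal failure state
            raise RuntimeError("indexing job reported failure")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still in state {state!r} after {timeout_s} seconds")
        sleep(interval_s)


# Stand-in for the real GET .../Index call: three canned responses.
_responses = iter([
    {"state": "Uploaded"},
    {"state": "Processing"},
    {"state": "Processed", "videos": []},
])
result = wait_for_index(lambda: next(_responses), sleep=lambda _s: None)
print(result["state"])  # Processed
```

In real use, `fetch_index` would wrap the `requests.get(...).json()` call against the Index endpoint, and the default `sleep` would be left in place.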

On the Exam: Video Indexer = batch/async insight mining of recorded files; expect a token + upload + poll flow. Do not confuse it with Spatial Analysis, which is real-time on live streams.

Spatial Analysis

Spatial Analysis is part of Azure AI Vision and understands how people move through a physical space using existing security cameras. It is delivered as a Docker container you deploy to an Azure IoT Edge device with a capable GPU.

The five Person operations

Operation | What it detects | Example use
PersonCount | People currently in a zone | Live store occupancy vs. a cap
PersonCrossingLine | A person crossing a virtual line | Entry/exit footfall counting
PersonCrossingPolygon | Entering/leaving a polygon zone | Restricted-area intrusion alerts
PersonDistance | Distance between people | Social-distancing compliance
PersonZoneDwellTime | How long a person stays in a zone | Queue wait time, dwell analytics
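
As a conceptual illustration of what "crossing a virtual line" means geometrically, the sketch below tests whether a tracked point's movement between two frames intersects a line segment. This is a textbook segment-intersection check written for this guide, not the algorithm the Spatial Analysis container uses; the function names are hypothetical.

```python
Point = tuple[float, float]


def _side(a: Point, b: Point, p: Point) -> float:
    """Signed area test: positive if p is left of a->b, negative if right."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def crossed_line(start: Point, end: Point, prev: Point, curr: Point) -> bool:
    """True if the movement prev->curr strictly crosses the segment start-end."""
    d1 = _side(start, end, prev)
    d2 = _side(start, end, curr)
    d3 = _side(prev, curr, start)
    d4 = _side(prev, curr, end)
    # The endpoints of each segment must lie on opposite sides of the other.
    return d1 * d2 < 0 and d3 * d4 < 0


# A horizontal line across the frame, in normalized coordinates.
line_start, line_end = (0.1, 0.5), (0.9, 0.5)

walks_through = crossed_line(line_start, line_end, (0.5, 0.4), (0.5, 0.6))
walks_alongside = crossed_line(line_start, line_end, (0.2, 0.4), (0.8, 0.4))
print(walks_through, walks_alongside)  # True False
```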

Zones and lines are defined in normalized 0-to-1 coordinates (relative to frame size), so configs survive a resolution change. focus (e.g., footprint vs. center) controls which body point triggers the event, and threshold tunes detection confidence.

{
  "type": "cognitiveservices.vision.spatialanalysis-personcrossingline",
  "input": { "source": { "type": "rtsp", "uri": "rtsp://cam:554/stream" } },
  "parameters": {
    "lines": [{ "name": "entrance",
      "start": {"x": 0.1, "y": 0.5}, "end": {"x": 0.9, "y": 0.5} }],
    "threshold": 16, "focus": "footprint"
  }
}
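
The resolution-independence claim can be checked with a little arithmetic. This sketch converts pixel coordinates to normalized ones and shows that the same physical line yields the same entry at two frame sizes. The helper names are hypothetical, and the dictionary shape simply mirrors the simplified `lines` entry shown above.

```python
def normalize(point_px: tuple[int, int], frame_size: tuple[int, int]) -> dict[str, float]:
    """Convert a pixel coordinate to normalized 0-1 coordinates."""
    (x, y), (width, height) = point_px, frame_size
    return {"x": round(x / width, 4), "y": round(y / height, 4)}


def line_entry(name: str, start_px: tuple[int, int], end_px: tuple[int, int],
               frame_size: tuple[int, int]) -> dict:
    """Build one 'lines' entry from pixel endpoints measured on a given frame size."""
    return {
        "name": name,
        "start": normalize(start_px, frame_size),
        "end": normalize(end_px, frame_size),
    }


# The same doorway line, measured on a 1920x1080 frame and on a 1280x720 frame.
hd = line_entry("entrance", (192, 540), (1728, 540), (1920, 1080))
sd = line_entry("entrance", (128, 360), (1152, 360), (1280, 720))
print(hd == sd)  # True
```

Because both results are identical, a camera stream can change resolution without the line definition needing to be redrawn.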

Deployment architecture

[IP Camera (RTSP)] -> [Azure IoT Edge device]
    -> [Spatial Analysis container]
        -> processes frames LOCALLY (no upload)
        -> emits aggregated events
            -> [Azure IoT Hub] -> [Stream Analytics] -> [Dashboard / Power BI]

Privacy by design

  • No facial recognition — people are anonymous bounding boxes, never identified.
  • No frame storage — frames are processed in memory and discarded.
  • Edge processing — raw video never leaves the device; only counts/events reach the cloud.
  • Scoped zones — monitor only specific areas, not the full field of view.

Choosing Between the Two

Need | Service
Transcript, topics, faces from a recorded MP4 | Video Indexer
Real-time people counting / line crossings | Spatial Analysis
Process on-prem with no video leaving site | Spatial Analysis (IoT Edge)
Searchable index of a media library | Video Indexer

Worked Example

A retailer must enforce a live occupancy cap and measure checkout queue waits without sending customer video to the cloud. Deploy the Spatial Analysis container to an IoT Edge gateway, run PersonCount on the store-floor zone for occupancy and PersonZoneDwellTime on a polygon around the tills for wait time. Only the aggregated numbers flow to IoT Hub, satisfying privacy requirements because no faces are recognized and no frames are stored.
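
The cloud side of this scenario only ever sees numbers, which can be sketched as follows. The event dictionaries here are a deliberately simplified, hypothetical shape and not the container's actual output schema; the cap value and function names are likewise made up for illustration.

```python
from statistics import mean

OCCUPANCY_CAP = 50  # hypothetical store limit


def occupancy_exceeded(person_count: int, cap: int = OCCUPANCY_CAP) -> bool:
    """True when the latest zone count is above the allowed cap."""
    return person_count > cap


def average_wait_seconds(dwell_events: list[dict]) -> float:
    """Mean dwell time across events; 0.0 when there are no events."""
    durations = [event["dwell_seconds"] for event in dwell_events]
    return float(mean(durations)) if durations else 0.0


# Simplified stand-ins for aggregated events arriving from the edge device.
count_events = [
    {"zone": "store-floor", "count": 48},
    {"zone": "store-floor", "count": 53},
]
dwell_events = [
    {"zone": "tills", "dwell_seconds": 90},
    {"zone": "tills", "dwell_seconds": 150},
]

alerts = [occupancy_exceeded(event["count"]) for event in count_events]
print(alerts, average_wait_seconds(dwell_events))
```

Nothing in these records identifies a person or contains image data, which is the property the privacy requirement depends on.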

Video Indexer Deployment Modes and Customization

Video Indexer comes in two flavors the exam may contrast. The classic / connected account runs fully in Azure and stores media in a Microsoft-managed account. Azure Resource Manager (ARM) accounts integrate with your own Azure Storage and support managed identity and customer-managed keys for stricter governance — pick ARM when a requirement mentions bringing your own storage or enterprise security controls. Video Indexer is also customizable: you can train a person model to name recurring faces, a brand model to track brands relevant to your business, and a language model to bias the transcript toward domain vocabulary.

When a scenario asks how to make transcripts recognize industry jargon, the answer is a custom language model, not switching services.

Real-Time vs. Batch: The Core Distinction

The single most important decision in this section is matching the workload to the service. Video Indexer is batch: it ingests recorded files and produces a searchable index minutes later, so it suits media libraries, compliance review, and content discovery. Spatial Analysis is real-time: it consumes a live RTSP feed and emits events as people move, so it suits occupancy enforcement, safety alerts, and queue management. A scenario that says "alert security the moment someone enters a restricted zone" demands Spatial Analysis with PersonCrossingPolygon, because Video Indexer's after-the-fact indexing cannot fire a live alert.

On the Exam: Spatial Analysis runs on IoT Edge, not the cloud, performs no facial recognition, and uses normalized 0-1 zone/line coordinates. Dwell-time questions map to PersonZoneDwellTime; entry/exit counting maps to PersonCrossingLine; restricted-area intrusion maps to PersonCrossingPolygon. Video Indexer = batch insight mining; custom language model adds domain vocabulary to transcripts.

Test Your Knowledge

Where does Azure AI Vision Spatial Analysis process video streams?


Which Spatial Analysis operation measures how long customers wait in a checkout queue?


A team must build a searchable index of recorded conference talks, including transcripts, speakers, topics, and on-screen text. Which service should they use?


Why is Azure AI Vision Spatial Analysis considered privacy-by-design?
