3.5 Video Analysis and Spatial Analysis
Key Takeaways
- Azure AI Video Indexer extracts audio insights (transcription, translation, speaker ID, sentiment, topics) and visual insights (faces, scenes, shots, OCR, objects, brands) from uploaded video.
- Video Indexer is asynchronous: upload returns a video id, then you poll the Index endpoint until processing completes before reading insights JSON.
- Spatial Analysis runs as a Docker container on Azure IoT Edge, processing camera RTSP streams locally and sending only aggregated events to the cloud.
- Spatial Analysis operations include PersonCount, PersonCrossingLine, PersonCrossingPolygon, PersonDistance, and PersonZoneDwellTime, configured with zones or lines in normalized 0-1 coordinates.
- Spatial Analysis is privacy-by-design: it represents people as bounding boxes, performs no facial recognition, and discards frames after processing.
Quick Answer: Video Indexer asynchronously mines uploaded video for transcripts, faces, scenes, topics, brands, and on-screen text. Spatial Analysis processes live camera feeds at the edge to count people, detect line/zone crossings, measure distance, and track dwell time. It runs as an IoT Edge container and performs no facial recognition.
Azure AI Video Indexer
Video Indexer is a multi-model pipeline that produces a single rich insights JSON. Insights split into audio and visual.
Audio insights
| Insight | Description |
|---|---|
| Transcription | Speech-to-text of all spoken content |
| Translation | Translate the transcript to other languages |
| Speaker identification | Distinguish individual speakers |
| Sentiment | Positive/neutral/negative tone over time |
| Audio effects | Clapping, silence, crowd noise |
| Topics & keywords | Discussion topics from the transcript |
Visual insights
| Insight | Description |
|---|---|
| Face detection | Locate (and optionally name) faces over time |
| Scene & shot detection | Segment boundaries and camera cuts |
| OCR | Text burned into frames (signs, captions) |
| Object & brand detection | Recognize objects and logos |
| Thumbnail extraction | Representative keyframes |
Asynchronous workflow and access tokens
Video Indexer is not synchronous. You first obtain an access token from the Authorization API (Account, User, Project, or Video scope; tokens expire after about an hour). You then upload, receive a video id, and poll the Index endpoint until state == "Processed" before reading insights.
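import time
import requests

# location, account_id, and token are assumed to be set by the caller
base = f"https://api.videoindexer.ai/{location}/Accounts/{account_id}/Videos"

# Upload the file; the response carries the new video id
with open("video.mp4", "rb") as f:
    resp = requests.post(base, files={"file": f},
                         params={"name": "demo", "privacy": "Private", "accessToken": token})
resp.raise_for_status()
video_id = resp.json()["id"]

# Poll the Index endpoint until indexing finishes (or fails)
while True:
    insights = requests.get(f"{base}/{video_id}/Index", params={"accessToken": token}).json()
    if insights["state"] in ("Processed", "Failed"):
        break
    time.sleep(30)  # indexing takes a while, so poll sparingly

if insights["state"] == "Processed":
    for seg in insights["videos"][0]["insights"]["transcript"]:
        print(seg["text"])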
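Once the Index response reports Processed, the insights JSON can be walked with plain dictionary access. The sketch below is a minimal illustration that runs offline against a hand-written sample payload. Only the videos[0].insights.transcript[].text path comes from the example above; the other field names (speakerId, instances, start, end, ocr) and the helper names transcript_lines and on_screen_text are assumptions about the payload shape and should be checked against a real response.

```python
def transcript_lines(index_response):
    """Yield (start, end, speaker_id, text) for every transcript instance."""
    for video in index_response.get("videos", []):
        insights = video.get("insights", {})
        for seg in insights.get("transcript", []):
            # Assumed shape: each segment lists one or more time instances
            for inst in seg.get("instances", []):
                yield (inst.get("start"), inst.get("end"),
                       seg.get("speakerId"), seg.get("text", ""))


def on_screen_text(index_response):
    """Collect OCR strings (assumed to live under insights['ocr'])."""
    found = []
    for video in index_response.get("videos", []):
        for item in video.get("insights", {}).get("ocr", []):
            found.append(item.get("text", ""))
    return found


# Hand-written sample, NOT real service output
sample = {
    "state": "Processed",
    "videos": [{
        "insights": {
            "transcript": [
                {"id": 1, "text": "Welcome to the keynote.", "speakerId": 1,
                 "instances": [{"start": "0:00:01.2", "end": "0:00:03.9"}]},
                {"id": 2, "text": "Thanks for having me.", "speakerId": 2,
                 "instances": [{"start": "0:00:04.1", "end": "0:00:05.5"}]},
            ],
            "ocr": [
                {"id": 1, "text": "Contoso Summit",
                 "instances": [{"start": "0:00:00", "end": "0:00:06"}]},
            ],
        }
    }],
}

lines = list(transcript_lines(sample))
for start, end, speaker, text in lines:
    print(f"[{start} - {end}] speaker {speaker}: {text}")
print("On-screen text:", on_screen_text(sample))
```

Using .get() with defaults keeps the walk tolerant of videos that have no speech or no on-screen text, where a whole insight section may be absent.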
On the Exam: Video Indexer = batch/async insight mining of recorded files; expect a token + upload + poll flow. Do not confuse it with Spatial Analysis, which is real-time on live streams.
Spatial Analysis
Spatial Analysis is part of Azure AI Vision and understands how people move through a physical space using existing security cameras. It is delivered as a Docker container you deploy to an Azure IoT Edge device with a capable GPU.
The five Person operations
| Operation | What it detects | Example use |
|---|---|---|
| PersonCount | People currently in a zone | Live store occupancy vs. a cap |
| PersonCrossingLine | A person crossing a virtual line | Entry/exit footfall counting |
| PersonCrossingPolygon | Entering/leaving a polygon zone | Restricted-area intrusion alerts |
| PersonDistance | Distance between people | Social-distancing compliance |
| PersonZoneDwellTime | How long a person stays in a zone | Queue wait time, dwell analytics |
Zones and lines are defined in normalized 0-to-1 coordinates (relative to frame size), so configs survive a resolution change. The focus parameter (e.g., footprint vs. center) controls which point on a person's bounding box triggers the event, and threshold tunes detection confidence.
{
  "type": "cognitiveservices.vision.spatialanalysis-personcrossingline",
  "input": { "source": { "type": "rtsp", "uri": "rtsp://cam:554/stream" } },
  "parameters": {
    "lines": [{ "name": "entrance",
                "start": {"x": 0.1, "y": 0.5}, "end": {"x": 0.9, "y": 0.5} }],
    "threshold": 16, "focus": "footprint"
  }
}
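To see why normalized coordinates survive a resolution change, it can help to convert a line drawn in pixels into the 0-to-1 form used in the config above and then map it back onto a different frame size. This is a minimal sketch: the helper names to_normalized and to_pixels and the two frame sizes are illustrative assumptions, not part of any Spatial Analysis tooling.

```python
def to_normalized(x_px, y_px, width, height):
    """Convert a pixel position to 0-1 coordinates relative to frame size."""
    if not (0 <= x_px <= width and 0 <= y_px <= height):
        raise ValueError("point lies outside the frame")
    return {"x": round(x_px / width, 4), "y": round(y_px / height, 4)}


def to_pixels(point, width, height):
    """Map a normalized point back onto a frame of the given size."""
    return (round(point["x"] * width), round(point["y"] * height))


# Line drawn on a 1920x1080 frame (hypothetical camera resolution)
line = {
    "name": "entrance",
    "start": to_normalized(192, 540, 1920, 1080),
    "end": to_normalized(1728, 540, 1920, 1080),
}
print(line)

# The same config still spans the doorway after a switch to 1280x720
print(to_pixels(line["start"], 1280, 720), to_pixels(line["end"], 1280, 720))
```

The stored line is unchanged by the camera swap. Only the mapping back to pixels depends on the frame size.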
Deployment architecture
[IP Camera (RTSP)] -> [Azure IoT Edge device]
                        -> [Spatial Analysis container]
                             -> processes frames LOCALLY (no upload)
                             -> emits aggregated events
                        -> [Azure IoT Hub] -> [Stream Analytics] -> [Dashboard / Power BI]
Privacy by design
- No facial recognition — people are anonymous bounding boxes, never identified.
- No frame storage — frames are processed in memory and discarded.
- Edge processing — raw video never leaves the device; only counts/events reach the cloud.
- Scoped zones — monitor only specific areas, not the full field of view.
Choosing Between the Two
| Need | Service |
|---|---|
| Transcript, topics, faces from a recorded MP4 | Video Indexer |
| Real-time people counting / line crossings | Spatial Analysis |
| Process on-prem with no video leaving site | Spatial Analysis (IoT Edge) |
| Searchable index of a media library | Video Indexer |
Worked Example
A retailer must enforce a live occupancy cap and measure checkout queue waits without sending customer video to the cloud. Deploy the Spatial Analysis container to an IoT Edge gateway, run PersonCount on the store-floor zone for occupancy and PersonZoneDwellTime on a polygon around the tills for wait time. Only the aggregated numbers flow to IoT Hub, satisfying privacy requirements because no faces are recognized and no frames are stored.
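The cloud side of this scenario only ever sees small event messages, which a few lines of code can turn into the two numbers the retailer needs. The sketch below assumes a deliberately simplified, flattened event shape (type, zone, personCount, dwellTimeMs) and a hypothetical cap of 50. The container's real event payloads are richer and nested, so treat the field names and the summarize helper as placeholders.

```python
from statistics import mean

OCCUPANCY_CAP = 50  # hypothetical store limit


def summarize(events, cap=OCCUPANCY_CAP):
    """Reduce a batch of simplified events to occupancy and queue-wait figures."""
    occupancy = None
    dwell_ms = []
    for e in events:
        if e["type"] == "personCountEvent" and e["zone"] == "store-floor":
            occupancy = e["personCount"]        # latest count wins
        elif e["type"] == "personZoneDwellTimeEvent" and e["zone"] == "tills":
            dwell_ms.append(e["dwellTimeMs"])   # one entry per person leaving the zone
    return {
        "occupancy": occupancy,
        "over_cap": occupancy is not None and occupancy > cap,
        "avg_queue_wait_s": mean(dwell_ms) / 1000 if dwell_ms else None,
    }


# Illustrative events only; no video or identity data is involved
events = [
    {"type": "personCountEvent", "zone": "store-floor", "personCount": 48},
    {"type": "personZoneDwellTimeEvent", "zone": "tills", "dwellTimeMs": 90000},
    {"type": "personCountEvent", "zone": "store-floor", "personCount": 52},
    {"type": "personZoneDwellTimeEvent", "zone": "tills", "dwellTimeMs": 150000},
]

summary = summarize(events)
print(summary)
```

Everything the function consumes is a count or a duration, which is the point of the design: the privacy requirement is met before any data reaches the cloud.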
Video Indexer Deployment Modes and Customization
Video Indexer comes in two flavors the exam may contrast. The classic / connected account runs fully in Azure and stores media in a Microsoft-managed account. Azure Resource Manager (ARM) accounts integrate with your own Azure Storage and support managed identity and customer-managed keys for stricter governance — pick ARM when a requirement mentions bringing your own storage or enterprise security controls. Video Indexer is also customizable: you can train a person model to name recurring faces, a brand model to track brands relevant to your business, and a language model to bias the transcript toward domain vocabulary.
When a scenario asks how to make transcripts recognize industry jargon, the answer is a custom language model, not switching services.
Real-Time vs. Batch: The Core Distinction
The single most important decision in this section is matching the workload to the service. Video Indexer is batch: it ingests recorded files and produces a searchable index minutes later, so it suits media libraries, compliance review, and content discovery. Spatial Analysis is real-time: it consumes a live RTSP feed and emits events as people move, so it suits occupancy enforcement, safety alerts, and queue management. A scenario that says "alert security the moment someone enters a restricted zone" demands Spatial Analysis with PersonCrossingPolygon, because Video Indexer's after-the-fact indexing cannot fire a live alert.
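To make the restricted-zone scenario concrete, the following conceptual sketch shows how enter and exit events can be derived from a tracked position and a polygon in normalized coordinates. It uses a standard ray-casting point-in-polygon test and is not a description of how the Spatial Analysis container is implemented. The zone, the track, and the function names are all hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True when (x, y) lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal ray at height y matter
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses the ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def zone_events(track, polygon):
    """Emit (time, 'enter' | 'exit') whenever the tracked point changes side."""
    events = []
    was_inside = None
    for t, point in track:
        now_inside = point_in_polygon(point, polygon)
        if was_inside is not None and now_inside != was_inside:
            events.append((t, "enter" if now_inside else "exit"))
        was_inside = now_inside
    return events


# Hypothetical restricted area and one person's footprint over five frames
restricted = [(0.6, 0.2), (0.9, 0.2), (0.9, 0.8), (0.6, 0.8)]
track = [(0, (0.3, 0.5)), (1, (0.5, 0.5)), (2, (0.7, 0.5)),
         (3, (0.8, 0.5)), (4, (0.5, 0.5))]

events = zone_events(track, restricted)
print(events)
```

An event fires at the frame where the state flips, which is what allows a live alert. A batch index built after the recording ends has no equivalent moment at which to raise one.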
On the Exam: Spatial Analysis runs on IoT Edge, not the cloud, performs no facial recognition, and uses normalized 0-1 zone/line coordinates. Dwell-time questions map to PersonZoneDwellTime; entry/exit counting maps to PersonCrossingLine; restricted-area intrusion maps to PersonCrossingPolygon. Video Indexer = batch insight mining; custom language model adds domain vocabulary to transcripts.
Where does Azure AI Vision Spatial Analysis process video streams?
Which Spatial Analysis operation measures how long customers wait in a checkout queue?
A team must build a searchable index of recorded conference talks, including transcripts, speakers, topics, and on-screen text. Which service should they use?
Why is Azure AI Vision Spatial Analysis considered privacy-by-design?