3.5 Video Analysis and Spatial Analysis

Key Takeaways

  • Azure AI Video Indexer extracts insights from video including transcripts, faces, scenes, topics, brands, emotions, and visual text.
  • Video Indexer supports audio insights (transcription, translation, speaker identification) and visual insights (face detection, scene detection, OCR).
  • Spatial Analysis uses Azure AI Vision to analyze the movement of people in real-time video feeds from cameras: counting people, detecting occupancy, and tracking movement.
  • Spatial Analysis runs as a Docker container on Azure IoT Edge devices for real-time, low-latency processing at the edge.
  • Spatial Analysis operations include PersonCount, PersonCrossingLine, PersonCrossingPolygon, PersonDistance, and PersonZoneDwellTime.
Last updated: March 2026


Quick Answer: Video Indexer extracts rich insights from videos (transcripts, faces, scenes, topics). Spatial Analysis analyzes real-time video feeds from cameras to count people, detect occupancy, and track movement patterns. Spatial Analysis runs as an edge container on IoT Edge.

Azure AI Video Indexer

Video Indexer analyzes video and audio content to extract structured insights:

Audio Insights

| Insight | Description |
| --- | --- |
| Transcription | Speech-to-text for all spoken content |
| Translation | Translate transcripts into other languages |
| Speaker identification | Identify and distinguish individual speakers |
| Sentiment analysis | Detect the emotional tone of spoken content |
| Audio effects | Detect clapping, silence, and crowd noise |
| Topic detection | Extract discussion topics from the transcript |

Visual Insights

| Insight | Description |
| --- | --- |
| Face detection | Detect and identify faces throughout the video |
| Scene detection | Identify scene changes and segment boundaries |
| Shot detection | Detect camera shots and transitions |
| OCR | Extract text visible in video frames (signs, captions) |
| Object detection | Identify objects in video frames |
| Brand detection | Detect brand logos and mentions |
| Thumbnail extraction | Generate representative thumbnails |

Video Indexer API Usage

import requests

# Placeholders: set location, account_id, and access_token first.
# The access token comes from the Video Indexer "Get Account Access Token" API.

# Upload a video for indexing
upload_url = (
    f"https://api.videoindexer.ai/"
    f"{location}/Accounts/{account_id}/Videos"
    f"?name=my-video&privacy=Private"
    f"&accessToken={access_token}"
)
with open("video.mp4", "rb") as video_file:
    response = requests.post(upload_url, files={"file": video_file})
video_id = response.json()["id"]

# Get video insights (indexing is asynchronous; the index is complete
# once its "state" field is "Processed")
insights_url = (
    f"https://api.videoindexer.ai/"
    f"{location}/Accounts/{account_id}/Videos/{video_id}/Index"
    f"?accessToken={access_token}"
)
insights = requests.get(insights_url).json()

# Access the transcript; each entry identifies its speaker by speakerId
for line in insights["videos"][0]["insights"]["transcript"]:
    print(f"[Speaker {line['speakerId']}]: {line['text']}")
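Because indexing is asynchronous, the index endpoint can return a partial result while processing is still under way. A minimal polling sketch (the fetch function is injected so the loop can be exercised without calling the service; `Processed` and `Failed` are the terminal values of the index's `state` field):

```python
import time

def wait_for_index(fetch_index, poll_seconds=10, max_polls=60):
    """Poll until the video index reaches a terminal state.

    fetch_index: a zero-argument callable returning the index JSON as a
    dict, e.g. lambda: requests.get(insights_url).json().
    """
    for _ in range(max_polls):
        index = fetch_index()
        state = index.get("state")
        if state == "Processed":
            return index          # indexing finished; insights are complete
        if state == "Failed":
            raise RuntimeError("Video indexing failed")
        time.sleep(poll_seconds)  # still Uploaded/Processing; wait and retry
    raise TimeoutError("Video index not ready after polling")
```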

Spatial Analysis

Spatial Analysis uses computer vision to analyze real-time video streams and understand how people move through physical spaces.

Spatial Analysis Operations

| Operation | Description | Use Case |
| --- | --- | --- |
| PersonCount | Count people in a defined zone | Store occupancy limits |
| PersonCrossingLine | Detect when people cross a virtual line | Entry/exit counting |
| PersonCrossingPolygon | Detect when people enter/exit a polygon zone | Restricted area monitoring |
| PersonDistance | Measure distance between people | Social distancing compliance |
| PersonZoneDwellTime | Measure how long people stay in a zone | Queue wait time analysis |
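For intuition, PersonDistance boils down to pairwise distance checks between detected people. A rough offline sketch of that idea (the `close_pairs` helper and its ground-plane coordinates are illustrative, not the container's actual API or output format):

```python
import math

def close_pairs(footprints, min_distance):
    """Return index pairs of people standing closer than min_distance.

    footprints: list of (x, y) ground positions, e.g. in metres after
    camera calibration has mapped detections to the floor plane.
    """
    pairs = []
    for i in range(len(footprints)):
        for j in range(i + 1, len(footprints)):
            xi, yi = footprints[i]
            xj, yj = footprints[j]
            if math.hypot(xi - xj, yi - yj) < min_distance:
                pairs.append((i, j))
    return pairs

# Three people; the first two are 1 m apart, the third stands well away
people = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
print(close_pairs(people, min_distance=1.5))  # [(0, 1)]
```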

Deployment Architecture

[Camera(s)] → [Azure IoT Edge Device]
                └── [Spatial Analysis Container]
                    ├── Process video frames locally
                    ├── Detect people and track movement
                    └── Send aggregated events to cloud
                        └── [Azure IoT Hub] → [Azure Stream Analytics] → [Dashboard]
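Downstream, the cloud side never sees video, only small JSON events. A sketch of folding such events into a running occupancy figure (the field names in `raw` are hypothetical, not the container's exact event schema):

```python
import json

# A hypothetical aggregated event as it might arrive at IoT Hub from the
# Spatial Analysis container (illustrative fields, not the real schema).
raw = json.dumps({
    "operation": "personCrossingLine",
    "line": "entrance-line",
    "direction": "in",
    "count": 3,
    "timestamp": "2026-03-01T09:30:00Z",
})

def update_occupancy(occupancy, event_json):
    """Fold one line-crossing event into a running occupancy count."""
    event = json.loads(event_json)
    delta = event["count"] if event["direction"] == "in" else -event["count"]
    return occupancy + delta

print(update_occupancy(0, raw))  # 3
```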

Configuration Example (JSON)

{
    "version": 1,
    "type": "cognitiveservices.vision.spatialanalysis-personcrossingline",
    "input": {
        "source": {
            "type": "rtsp",
            "uri": "rtsp://camera-ip:554/stream"
        }
    },
    "parameters": {
        "lines": [
            {
                "name": "entrance-line",
                "start": {"x": 0.1, "y": 0.5},
                "end": {"x": 0.9, "y": 0.5}
            }
        ],
        "threshold": 16,
        "focus": "footprint"
    }
}
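Note that the line endpoints above are normalized to the frame (0.0 to 1.0), so the same configuration works regardless of camera resolution. A small sketch, assuming a hypothetical 1920x1080 frame, maps them back to pixels:

```python
def line_to_pixels(line, frame_width, frame_height):
    """Map a line with normalized (0.0-1.0) endpoints to pixel coordinates."""
    def to_px(point):
        return (round(point["x"] * frame_width), round(point["y"] * frame_height))
    return to_px(line["start"]), to_px(line["end"])

# The entrance line from the configuration above
entrance = {
    "name": "entrance-line",
    "start": {"x": 0.1, "y": 0.5},
    "end": {"x": 0.9, "y": 0.5},
}
print(line_to_pixels(entrance, 1920, 1080))  # ((192, 540), (1728, 540))
```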

Privacy and Responsible AI

Spatial Analysis is designed with privacy in mind:

  • No facial recognition: People are represented as bounding boxes, not identified by face
  • No image storage: Video frames are processed in memory and immediately discarded
  • Edge processing: Video stays on the local device — only aggregated counts/events are sent to the cloud
  • Configurable zones: Only monitor specific areas, not the entire camera view

On the Exam: Know that Spatial Analysis runs on IoT Edge (not in the cloud), processes video locally for privacy, and does NOT perform facial recognition. Questions may test these privacy-by-design features.

Test Your Knowledge

Where does Azure AI Vision Spatial Analysis process video streams?

Test Your Knowledge

Which Spatial Analysis operation would you use to measure how long customers wait in a checkout line?

Test Your Knowledge

Which of the following insights can Azure AI Video Indexer extract? (Select the best answer)
