4.3 Azure AI Speech Service
Key Takeaways
- Azure AI Speech provides speech-to-text (STT), text-to-speech (TTS), speech translation, and speaker recognition capabilities.
- Speech-to-text converts spoken audio into written text — supporting real-time transcription and batch transcription of audio files.
- Text-to-speech generates natural-sounding audio from text using neural voices.
- Speech translation translates spoken audio directly from one language to another — either as text output or synthesized speech.
- Speaker recognition identifies who is speaking based on their voice — supporting both verification (is this the claimed person?) and identification (who is this?).
Azure AI Speech Service
Quick Answer: Azure AI Speech provides four core capabilities: speech-to-text (convert audio to text), text-to-speech (convert text to audio), speech translation (translate spoken language), and speaker recognition (identify who is speaking). All capabilities are available via REST API and SDKs.
What Is Azure AI Speech?
Azure AI Speech is a cloud-based service that provides a comprehensive set of speech-related AI capabilities. It handles the audio side of NLP — converting between speech and text, translating spoken language, and identifying speakers.
Core Capabilities
1. Speech-to-Text (STT)
What it does: Converts spoken audio into written text (transcription).
| Feature | Description |
|---|---|
| Real-time transcription | Transcribe audio as it is spoken (live captioning, voice commands) |
| Batch transcription | Transcribe pre-recorded audio files (meeting recordings, podcasts) |
| Custom speech | Train on your vocabulary and acoustic environment for better accuracy |
| Multi-language | Supports 100+ languages and dialects |
| Speaker diarization | Identify different speakers in a conversation ("Speaker 1:", "Speaker 2:") |
Use cases:
- Live meeting captioning and transcription
- Voice-controlled applications and devices
- Call center call transcription for quality analysis
- Accessibility — captioning for deaf or hard-of-hearing users
- Dictation applications
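In practice these features are accessed through the Speech SDK or REST API. As a minimal illustration, the sketch below builds a request for the short-audio speech-to-text REST endpoint; the region, key, and content type shown are placeholders and assumptions, not values from this section.

```python
# Sketch: building a request for the Azure AI Speech short-audio
# speech-to-text REST endpoint. Region and key are placeholders.

def build_stt_request(region: str, key: str, language: str = "en-US"):
    """Return the URL and headers for a short-audio transcription call."""
    url = (
        f"https://{region}.stt.speech.microsoft.com/speech/recognition/"
        f"conversation/cognitiveservices/v1?language={language}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # your Speech resource key
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return url, headers

url, headers = build_stt_request("westus2", "<your-key>")
```

The audio file would then be POSTed to `url` with those headers; the JSON response contains the recognized text.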
2. Text-to-Speech (TTS)
What it does: Converts written text into natural-sounding speech audio.
| Feature | Description |
|---|---|
| Neural voices | Over 400 natural-sounding voices across 140+ languages |
| Custom neural voice | Create a unique voice for your brand (requires ethical review) |
| SSML support | Fine-tune pronunciation, pitch, rate, and emphasis using Speech Synthesis Markup Language |
| Visemes | Synchronized mouth movements for animated characters |
Use cases:
- Screen readers and accessibility tools for visually impaired users
- Audiobook generation from text
- Interactive Voice Response (IVR) systems
- Virtual assistants and chatbots with voice
- E-learning narration
- Public announcement systems
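The SSML support mentioned above can be sketched as a small helper that wraps text in a `<speak>` document with a voice and prosody settings. The voice name `en-US-JennyNeural` is an illustrative example; available voices vary by region.

```python
# Sketch: composing an SSML document to fine-tune text-to-speech output.
# The voice name below is an example neural voice, not a requirement.

def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               rate: str = "medium", pitch: str = "default") -> str:
    """Wrap text in SSML with a voice and prosody (rate/pitch) settings."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<prosody rate="{rate}" pitch="{pitch}">{text}</prosody>'
        "</voice></speak>"
    )

ssml = build_ssml("Welcome to the help desk.", rate="slow")
```

The resulting string would be sent to the synthesis endpoint (or passed to the SDK) in place of plain text.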
3. Speech Translation
What it does: Translates spoken audio from one language to another in real time.
| Feature | Description |
|---|---|
| Speech-to-text translation | Spoken language → translated text |
| Speech-to-speech translation | Spoken language → translated spoken audio |
| Multi-language | 30+ languages for speech translation |
| Real-time | Near-instant translation for live conversations |
How it works: The service chains three steps:
1. Recognize the spoken input (speech-to-text)
2. Translate the recognized text into the target language
3. Synthesize the translated text as audio (text-to-speech), for speech-to-speech translation only
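The recognize-translate-synthesize chain can be sketched as a simple pipeline. Each service call below is a stub with canned output, so only the flow of data is real; the example phrases are invented.

```python
# Sketch of the speech-to-speech translation pipeline. Each step is a
# stub standing in for the actual Azure AI Speech service call.

def recognize(audio: bytes) -> str:
    """Step 1: speech-to-text (stubbed)."""
    return "hello, where is the station?"

def translate(text: str, target: str) -> str:
    """Step 2: translate the text to the target language (stubbed)."""
    return {"fr": "bonjour, où est la gare ?"}.get(target, text)

def synthesize(text: str) -> bytes:
    """Step 3: text-to-speech (stubbed placeholder for audio bytes)."""
    return text.encode("utf-8")

def speech_to_speech(audio: bytes, target: str) -> bytes:
    return synthesize(translate(recognize(audio), target))

audio_out = speech_to_speech(b"<audio>", "fr")
```

For speech-to-text translation, the pipeline would stop after step 2 and return the translated text instead of synthesized audio.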
Use cases:
- Real-time translation in international meetings
- Travel assistance (speak in your language, output in local language)
- Multilingual customer service
- Cross-language collaboration tools
4. Speaker Recognition
What it does: Identifies or verifies a person based on their unique voice characteristics.
| Mode | Description | Matching |
|---|---|---|
| Speaker verification | Confirm a claimed identity | "Is this person who they say they are?" (1:1) |
| Speaker identification | Identify an unknown speaker | "Who is this person?" (1:many) |
Use cases:
- Voice-based authentication (banking, security)
- Meeting transcription (identify who said what)
- Call center caller identification
- Smart speaker personalization (recognize family members)
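The 1:1 versus 1:many distinction can be illustrated with toy voice embeddings and cosine similarity. Real speaker recognition uses learned voiceprints from enrollment audio; the vectors and threshold here are invented for illustration.

```python
# Sketch: verification (1:1) vs. identification (1:many) over toy
# "voiceprint" vectors. Values and threshold are illustrative only.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def verify(sample, enrolled, threshold=0.8):
    """1:1 -- does the sample match the claimed speaker's voiceprint?"""
    return cosine(sample, enrolled) >= threshold

def identify(sample, profiles):
    """1:many -- which enrolled speaker best matches the sample?"""
    return max(profiles, key=lambda name: cosine(sample, profiles[name]))

profiles = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.25]
print(verify(sample, profiles["alice"]))  # True
print(identify(sample, profiles))         # alice
```

Verification answers a yes/no question against one claimed profile, while identification searches all enrolled profiles for the best match, which is why the two modes scale so differently.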
Custom Speech Models
Azure AI Speech allows customization for better accuracy in specific environments:
| Customization | What It Does | When to Use |
|---|---|---|
| Custom speech-to-text | Train on your domain vocabulary and acoustic conditions | Medical, legal, or technical terminology |
| Custom neural voice | Create a unique branded voice | Brand identity, specific character voices |
| Pronunciation assessment | Evaluate speech pronunciation accuracy | Language learning applications |
Azure AI Speech vs. Azure AI Language
| Aspect | Azure AI Speech | Azure AI Language |
|---|---|---|
| Input | Audio/speech | Text |
| Primary tasks | STT, TTS, speech translation | Sentiment, NER, key phrases, CLU |
| Focus | Converting between audio and text | Understanding text meaning |
| Used together | Converts spoken input to text, and the response text back to audio | Analyzes the transcribed text and determines the response |
On the Exam: Know that Speech handles audio-to-text and text-to-audio conversions, while Language handles text analysis. They are often used together: Speech converts audio to text, Language analyzes the text, and Speech converts the response back to audio.
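The Speech → Language → Speech round trip described above can be sketched with stubs standing in for each service call. The intent name and canned reply are invented for illustration.

```python
# Sketch of the combined Speech + Language flow. Each function is a
# stub for the corresponding Azure service call.

def speech_to_text(audio: bytes) -> str:
    return "what time do you close?"       # Speech: STT (stubbed)

def analyze(text: str) -> str:
    return "AskClosingTime"                # Language: CLU intent (stubbed)

def respond(intent: str) -> str:
    return {"AskClosingTime": "We close at 9 PM."}.get(intent, "Sorry?")

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")            # Speech: TTS (stubbed)

reply_audio = text_to_speech(respond(analyze(speech_to_text(b"<audio>"))))
```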
Review Questions
- Which Azure AI Speech capability converts spoken audio into written text?
- A company wants to create an accessibility feature that reads web page content aloud to visually impaired users. Which Azure AI Speech capability should they use?
- A banking app wants to verify a customer's identity by comparing their voice to a stored voice sample. Which Azure AI Speech capability is needed?
- How does speech translation work in Azure AI Speech?