4.3 Azure AI Speech Service
Key Takeaways
- Azure AI Speech provides speech-to-text (STT), text-to-speech (TTS), speech translation, and speaker recognition capabilities.
- Speech-to-text converts spoken audio into written text — supporting real-time transcription and batch transcription of audio files.
- Text-to-speech generates natural-sounding audio from text using neural voices.
- Speech translation translates spoken audio directly from one language to another — either as text output or synthesized speech.
- Speaker recognition identifies who is speaking based on their voice — supporting both verification (is this the claimed person?) and identification (who is this?).
Azure AI Speech Service
Quick Answer: Azure AI Speech provides four core capabilities: speech-to-text (convert audio to text), text-to-speech (convert text to audio), speech translation (translate spoken language), and speaker recognition (identify who is speaking). All capabilities are available via REST API and SDKs.
What Is Azure AI Speech?
Azure AI Speech is a cloud-based service that provides a comprehensive set of speech-related AI capabilities. It handles the audio side of NLP — converting between speech and text, translating spoken language, and identifying speakers.
Core Capabilities
1. Speech-to-Text (STT)
What it does: Converts spoken audio into written text (transcription).
| Feature | Description |
|---|---|
| Real-time transcription | Transcribe audio as it is spoken (live captioning, voice commands) |
| Batch transcription | Transcribe pre-recorded audio files (meeting recordings, podcasts) |
| Custom speech | Train on your vocabulary and acoustic environment for better accuracy |
| Multi-language | Supports 100+ languages and dialects |
| Speaker diarization | Identify different speakers in a conversation ("Speaker 1:", "Speaker 2:") |
Use cases:
- Live meeting captioning and transcription
- Voice-controlled applications and devices
- Call center call transcription for quality analysis
- Accessibility — captioning for deaf or hard-of-hearing users
- Dictation applications
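In practice these features are accessed through the Speech SDK or REST API. As a minimal illustration, the sketch below builds a request for the short-audio speech-to-text REST endpoint; the region, key, and content type shown are placeholders and assumptions, not values from this section.

```python
# Sketch: building a request for the Azure AI Speech short-audio
# speech-to-text REST endpoint. Region and key are placeholders.

def build_stt_request(region: str, key: str, language: str = "en-US"):
    """Return the URL and headers for a short-audio transcription call."""
    url = (
        f"https://{region}.stt.speech.microsoft.com/speech/recognition/"
        f"conversation/cognitiveservices/v1?language={language}"
    )
    headers = {
        "Ocp-Apim-Subscription-Key": key,  # your Speech resource key
        "Content-Type": "audio/wav; codecs=audio/pcm; samplerate=16000",
        "Accept": "application/json",
    }
    return url, headers

url, headers = build_stt_request("westus2", "<your-key>")
```

The audio file would then be POSTed to `url` with those headers; the JSON response contains the recognized text.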
2. Text-to-Speech (TTS)
What it does: Converts written text into natural-sounding speech audio.
| Feature | Description |
|---|---|
| Neural voices | Over 400 natural-sounding voices across 140+ languages |
| Custom neural voice | Create a unique voice for your brand (requires ethical review) |
| SSML support | Fine-tune pronunciation, pitch, rate, and emphasis using Speech Synthesis Markup Language |
| Visemes | Synchronized mouth movements for animated characters |
Use cases:
- Screen readers and accessibility tools for visually impaired users
- Audiobook generation from text
- Interactive Voice Response (IVR) systems
- Virtual assistants and chatbots with voice
- E-learning narration
- Public announcement systems
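The SSML support mentioned above can be sketched as a small helper that wraps text in a `<speak>` document with a voice and prosody settings. The voice name `en-US-JennyNeural` is an illustrative example; available voices vary by region.

```python
# Sketch: composing an SSML document to fine-tune text-to-speech output.
# The voice name below is an example neural voice, not a requirement.

def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               rate: str = "medium", pitch: str = "default") -> str:
    """Wrap text in SSML with a voice and prosody (rate/pitch) settings."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<prosody rate="{rate}" pitch="{pitch}">{text}</prosody>'
        "</voice></speak>"
    )

ssml = build_ssml("Welcome to the help desk.", rate="slow")
```

The resulting string would be sent to the synthesis endpoint (or passed to the SDK) in place of plain text.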
3. Speech Translation
What it does: Translates spoken audio from one language to another in real time.
| Feature | Description |
|---|---|
| Speech-to-text translation | Spoken language → translated text |
| Speech-to-speech translation | Spoken language → translated spoken audio |
| Multi-language | 30+ languages for speech translation |
| Real-time | Near-instant translation for live conversations |
How it works: The service chains three steps:
1. Recognize the spoken input (speech-to-text)
2. Translate the recognized text into the target language
3. Synthesize the translated text as audio (text-to-speech), for speech-to-speech translation only
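The recognize-translate-synthesize chain can be sketched as a simple pipeline. Each service call below is a stub with canned output, so only the flow of data is real; the example phrases are invented.

```python
# Sketch of the speech-to-speech translation pipeline. Each step is a
# stub standing in for the actual Azure AI Speech service call.

def recognize(audio: bytes) -> str:
    """Step 1: speech-to-text (stubbed)."""
    return "hello, where is the station?"

def translate(text: str, target: str) -> str:
    """Step 2: translate the text to the target language (stubbed)."""
    return {"fr": "bonjour, où est la gare ?"}.get(target, text)

def synthesize(text: str) -> bytes:
    """Step 3: text-to-speech (stubbed placeholder for audio bytes)."""
    return text.encode("utf-8")

def speech_to_speech(audio: bytes, target: str) -> bytes:
    return synthesize(translate(recognize(audio), target))

audio_out = speech_to_speech(b"<audio>", "fr")
```

For speech-to-text translation, the pipeline would stop after step 2 and return the translated text instead of synthesized audio.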
Use cases:
- Real-time translation in international meetings
- Travel assistance (speak in your language, output in local language)
- Multilingual customer service
- Cross-language collaboration tools
4. Speaker Recognition
What it does: Identifies or verifies a person based on their unique voice characteristics.
| Mode | Description | Matching |
|---|---|---|
| Speaker verification | Confirm a claimed identity | "Is this person who they say they are?" (1:1) |
| Speaker identification | Identify an unknown speaker | "Who is this person?" (1:many) |
Use cases:
- Voice-based authentication (banking, security)
- Meeting transcription (identify who said what)
- Call center caller identification
- Smart speaker personalization (recognize family members)
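The 1:1 versus 1:many distinction can be illustrated with toy voice embeddings and cosine similarity. Real speaker recognition uses learned voiceprints from enrollment audio; the vectors and threshold here are invented for illustration.

```python
# Sketch: verification (1:1) vs. identification (1:many) over toy
# "voiceprint" vectors. Values and threshold are illustrative only.
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def verify(sample, enrolled, threshold=0.8):
    """1:1 -- does the sample match the claimed speaker's voiceprint?"""
    return cosine(sample, enrolled) >= threshold

def identify(sample, profiles):
    """1:many -- which enrolled speaker best matches the sample?"""
    return max(profiles, key=lambda name: cosine(sample, profiles[name]))

profiles = {"alice": [0.9, 0.1, 0.2], "bob": [0.1, 0.8, 0.5]}
sample = [0.85, 0.15, 0.25]
print(verify(sample, profiles["alice"]))  # True
print(identify(sample, profiles))         # alice
```

Verification answers a yes/no question against one claimed profile, while identification searches all enrolled profiles for the best match, which is why the two modes scale so differently.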
Custom Speech Models
Azure AI Speech allows customization for better accuracy in specific environments:
| Customization | What It Does | When to Use |
|---|---|---|
| Custom speech-to-text | Train on your domain vocabulary and acoustic conditions | Medical, legal, or technical terminology |
| Custom neural voice | Create a unique branded voice | Brand identity, specific character voices |
| Pronunciation assessment | Evaluate speech pronunciation accuracy | Language learning applications |
Azure AI Speech vs. Azure AI Language
| Aspect | Azure AI Speech | Azure AI Language |
|---|---|---|
| Input | Audio/speech | Text |
| Primary tasks | STT, TTS, speech translation | Sentiment, NER, key phrases, CLU |
| Focus | Converting between audio and text | Understanding text meaning |
| Used together | Converts spoken input to text, and the response text back to audio | Analyzes the transcribed text and determines the response |
On the Exam: Know that Speech handles audio-to-text and text-to-audio conversions, while Language handles text analysis. They are often used together: Speech converts audio to text, Language analyzes the text, and Speech converts the response back to audio.
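The Speech → Language → Speech round trip described above can be sketched with stubs standing in for each service call. The intent name and canned reply are invented for illustration.

```python
# Sketch of the combined Speech + Language flow. Each function is a
# stub for the corresponding Azure service call.

def speech_to_text(audio: bytes) -> str:
    return "what time do you close?"       # Speech: STT (stubbed)

def analyze(text: str) -> str:
    return "AskClosingTime"                # Language: CLU intent (stubbed)

def respond(intent: str) -> str:
    return {"AskClosingTime": "We close at 9 PM."}.get(intent, "Sorry?")

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")            # Speech: TTS (stubbed)

reply_audio = text_to_speech(respond(analyze(speech_to_text(b"<audio>"))))
```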
Review Questions
- Which Azure AI Speech capability converts spoken audio into written text?
- A company wants to create an accessibility feature that reads web page content aloud to visually impaired users. Which Azure AI Speech capability should they use?
- A banking app wants to verify a customer's identity by comparing their voice to a stored voice sample. Which Azure AI Speech capability is needed?
- How does speech translation work in Azure AI Speech?