4.4 Azure AI Speech Services
Key Takeaways
- Azure AI Speech provides speech-to-text (STT), text-to-speech (TTS), speech translation, speaker recognition, and intent recognition.
- Speech-to-text supports real-time transcription, batch transcription, and custom speech models for domain-specific vocabulary.
- Text-to-speech uses neural voices that sound nearly human, with support for SSML (Speech Synthesis Markup Language) for fine-grained control.
- Custom Neural Voice allows creating a unique branded voice using your own voice recordings (requires Microsoft approval).
- Speech translation supports real-time speech-to-text and speech-to-speech translation across 100+ languages.
Quick Answer: Azure AI Speech provides STT (real-time and batch), TTS (neural voices with SSML), speech translation (100+ languages), speaker recognition, and custom speech models. Use the Speech SDK (Python/C#) for integration.
Speech-to-Text (STT)
Real-Time Transcription
```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="<your-key>",
    region="eastus"
)
speech_config.speech_recognition_language = "en-US"

# From microphone (AudioConfig lives in the speechsdk.audio submodule)
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

# Single-utterance recognition (stops at the first pause)
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Canceled: {cancellation.reason}")
```
Continuous Recognition
```python
import time

# For long-running audio (meetings, lectures)
def recognized_handler(evt):
    print(f"Recognized: {evt.result.text}")   # Final result for each phrase

def recognizing_handler(evt):
    print(f"Partial: {evt.result.text}")      # Intermediate results

recognizer.recognized.connect(recognized_handler)
recognizer.recognizing.connect(recognizing_handler)

recognizer.start_continuous_recognition()
# Audio is processed as it streams in; keep the process alive, e.g.:
time.sleep(30)
recognizer.stop_continuous_recognition()
```
Audio Input Sources
| Source | AudioConfig Method | Use Case |
|---|---|---|
| Microphone | use_default_microphone=True | Real-time voice input |
| WAV file | filename="audio.wav" | Processing recorded audio |
| Audio stream | stream=push_stream | Streaming from custom source |
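When transcribing from a WAV file, a mismatched audio format is a common source of empty results. As a sanity check before uploading, you can inspect the file with Python's standard-library `wave` module; the helper below is an illustrative sketch (the function name and the "speech-friendly" criteria of 16-bit mono PCM at 8 or 16 kHz are assumptions based on typical speech-audio requirements, not an official validator).

```python
import wave

def check_wav_format(path: str) -> dict:
    """Inspect a WAV file's format before sending it for transcription.

    16-bit mono PCM at 8 or 16 kHz is a typical format for speech audio.
    """
    with wave.open(path, "rb") as wav:
        info = {
            "channels": wav.getnchannels(),            # 1 = mono
            "sample_width_bytes": wav.getsampwidth(),  # 2 = 16-bit samples
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    info["speech_friendly"] = (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2
        and info["sample_rate_hz"] in (8000, 16000)
    )
    return info
```

If the check fails, re-encode the file (for example with ffmpeg) before passing it to `AudioConfig(filename=...)`.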
Batch Transcription
For processing large volumes of audio files:
1. Upload audio files to Azure Blob Storage
2. Submit a batch transcription job via the REST API
3. Retrieve the results, written to a storage container as JSON

Batch transcription supports word-level timestamps and speaker diarization.
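As a sketch of step 2, the request body for a batch transcription job can be built as plain JSON. This assumes the v3.1 `speechtotext` REST API; the field names shown (`contentContainerUrl`, `wordLevelTimestampsEnabled`, `diarizationEnabled`) follow that version's schema, and the SAS URL and job name are placeholders.

```python
import json

def build_batch_transcription_request(container_sas_url: str,
                                      locale: str = "en-US",
                                      name: str = "my-batch-job") -> str:
    """Build the JSON body for a batch transcription job (v3.1 REST API).

    The job is submitted with a POST to
    https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions
    """
    body = {
        "contentContainerUrl": container_sas_url,  # SAS URL to the audio container
        "locale": locale,
        "displayName": name,
        "properties": {
            "wordLevelTimestampsEnabled": True,  # word-level timestamps in results
            "diarizationEnabled": True,          # speaker diarization
        },
    }
    return json.dumps(body)
```

In practice you would POST this body with your resource key in the `Ocp-Apim-Subscription-Key` header, then poll the returned job URL until the transcription files appear in the results container.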
Text-to-Speech (TTS)
Basic TTS
```python
speech_config = speechsdk.SpeechConfig(
    subscription="<your-key>",
    region="eastus"
)

# Select a neural voice
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

result = synthesizer.speak_text("Hello! Welcome to our AI-powered application.")

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully")
```
SSML (Speech Synthesis Markup Language)
SSML provides fine-grained control over speech output:
```python
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts"
       xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">
      Welcome to our store!
    </mstts:express-as>
    <break time="500ms"/>
    <prosody rate="slow" pitch="+5%">
      Today's special offers include twenty percent off all items.
    </prosody>
    <say-as interpret-as="telephone">+1-555-123-4567</say-as>
  </voice>
</speak>
"""

result = synthesizer.speak_ssml(ssml)
```
Key SSML Elements
| Element | Purpose | Example |
|---|---|---|
| <voice> | Select a specific voice | name="en-US-JennyNeural" |
| <prosody> | Control rate, pitch, volume | rate="slow" pitch="+10%" |
| <break> | Insert a pause | time="500ms" or strength="strong" |
| <say-as> | Specify pronunciation type | interpret-as="telephone" |
| <mstts:express-as> | Set speaking style | style="cheerful" or "sad" |
| <phoneme> | Specify exact pronunciation | IPA or SAPI phoneme notation |
| <emphasis> | Stress a word | level="strong" |
On the Exam: SSML questions are common. Know the key elements: <prosody> for rate/pitch/volume, <break> for pauses, <say-as> for pronunciation guidance, and <mstts:express-as> for speaking styles.
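A good way to internalize these elements is to assemble SSML programmatically. The helper below is a minimal sketch (the function name and parameter defaults are invented for illustration); it wraps text in `<prosody>`, optionally prepends a `<break>`, and escapes XML-special characters so user text cannot break the markup.

```python
from xml.sax.saxutils import escape

def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               rate: str = "medium", pitch: str = "+0%",
               pause_ms: int = 0) -> str:
    """Assemble a minimal SSML document: optional pause, then prosody-wrapped text."""
    pause = f'<break time="{pause_ms}ms"/>' if pause_ms else ""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'{pause}'
        f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
        '</voice>'
        '</speak>'
    )
```

The resulting string is what you would pass to `speak_ssml()`; note that `&`, `<`, and `>` in the spoken text must be XML-escaped, which `escape()` handles.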
Speech Translation
```python
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="<your-key>",
    region="eastus"
)

# Set source and target languages
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("fr")
translation_config.add_target_language("de")
translation_config.add_target_language("es")

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config
)

result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Original: {result.text}")
    for lang, translation in result.translations.items():
        print(f"  {lang}: {translation}")
```
Speaker Recognition
| Feature | Description | Use Case |
|---|---|---|
| Speaker verification | 1:1 — Is this the claimed speaker? | Voice authentication |
| Speaker identification | 1:N — Which enrolled speaker is this? | Meeting transcription |
| Text-dependent | Requires a specific passphrase | Secure authentication |
| Text-independent | Any speech content | Flexible identification |
Custom Speech Models
Custom Speech lets you train a model on your own domain-specific data to improve transcription accuracy beyond the base model.
When to Use Custom Speech
- Industry jargon not in the base model vocabulary
- Product names, acronyms, or technical terms
- Noisy environments with specific acoustic characteristics
- Accented speech from your target audience
Training Data Types
| Data Type | Description |
|---|---|
| Language data | Text transcripts for vocabulary adaptation |
| Audio + transcripts | Paired audio/transcript files for acoustic adaptation |
| Pronunciation data | Custom pronunciation mappings for specific words |
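To make the table concrete, the snippet below writes tiny sample files in the two text-based formats: language (related text) data is plain UTF-8 text with one utterance per line, and pronunciation data is tab-separated "display form, spoken form" pairs. The file names and the sample terms are illustrative assumptions, not required names.

```python
import os

def write_custom_speech_data(out_dir: str = ".") -> None:
    """Write sample Custom Speech training files.

    - Language data: plain text, one utterance per line (UTF-8).
    - Pronunciation data: "display form<TAB>spoken form", one pair per line.
    """
    with open(os.path.join(out_dir, "related_text.txt"), "w", encoding="utf-8") as f:
        f.write("please refill my lisinopril prescription\n")
        f.write("the contoso widget pro ships next quarter\n")

    with open(os.path.join(out_dir, "pronunciations.txt"), "w", encoding="utf-8") as f:
        f.write("C3PO\tsee three pee oh\n")   # acronym spoken letter by letter
        f.write("IoT\teye oh tee\n")
```

Files like these are uploaded as datasets in the Speech Studio (or via the REST API) when training a custom model.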
Review Questions
1. Which SSML element controls the speaking rate, pitch, and volume of synthesized speech?
2. What is the difference between recognize_once() and start_continuous_recognition() in the Speech SDK?
3. An AI application needs to convert spoken English into written French and German text simultaneously. Which service should you use?
4. Which SSML element specifies how to pronounce content like telephone numbers, dates, or addresses?