4.4 Azure AI Speech Services

Key Takeaways

  • Azure AI Speech provides speech-to-text (STT), text-to-speech (TTS), speech translation, speaker recognition, and intent recognition.
  • Speech-to-text supports real-time transcription, batch transcription, and custom speech models for domain-specific vocabulary.
  • Text-to-speech uses neural voices that sound nearly human, with support for SSML (Speech Synthesis Markup Language) for fine-grained control.
  • Custom Neural Voice allows creating a unique branded voice using your own voice recordings (requires Microsoft approval).
  • Speech translation supports real-time speech-to-text translation and speech-to-speech translation across 100+ languages.
Last updated: March 2026

Azure AI Speech Services

Quick Answer: Azure AI Speech provides STT (real-time and batch), TTS (neural voices with SSML), speech translation (100+ languages), speaker recognition, and custom speech models. Use the Speech SDK (Python/C#) for integration.

Speech-to-Text (STT)

Real-Time Transcription

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="<your-key>",
    region="eastus"
)
speech_config.speech_recognition_language = "en-US"

# From microphone
audio_config = speechsdk.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

# Single utterance recognition
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Canceled: {cancellation.reason}")

Continuous Recognition

# For long-running audio (meetings, lectures); reuses the recognizer above
import time

def recognized_handler(evt):
    print(f"Recognized: {evt.result.text}")  # Final text for each utterance

def recognizing_handler(evt):
    print(f"Partial: {evt.result.text}")  # Intermediate (hypothesis) results

recognizer.recognized.connect(recognized_handler)
recognizer.recognizing.connect(recognizing_handler)

recognizer.start_continuous_recognition()
time.sleep(30)  # Audio is processed as it streams; keep the process alive
recognizer.stop_continuous_recognition()

Audio Input Sources

| Source | AudioConfig Method | Use Case |
| --- | --- | --- |
| Microphone | use_default_microphone=True | Real-time voice input |
| WAV file | filename="audio.wav" | Processing recorded audio |
| Audio stream | stream=push_stream | Streaming from custom source |
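Whatever the source, the Speech service handles mono, 16-bit PCM audio at 8 or 16 kHz best. A small stdlib-only helper (the function name is illustrative) can sanity-check a WAV file before handing it to the recognizer:

```python
import wave

def check_wav_for_speech(path_or_file):
    """Warn if a WAV file deviates from the format the Speech service
    handles best: mono, 16-bit PCM, 8 or 16 kHz."""
    warnings = []
    with wave.open(path_or_file, "rb") as wav:
        if wav.getnchannels() != 1:
            warnings.append(f"expected mono, got {wav.getnchannels()} channels")
        if wav.getsampwidth() != 2:
            warnings.append(f"expected 16-bit samples, got {wav.getsampwidth() * 8}-bit")
        if wav.getframerate() not in (8000, 16000):
            warnings.append(f"expected 8 or 16 kHz, got {wav.getframerate()} Hz")
    return warnings
```

An empty list means the file matches the recommended format; otherwise consider resampling before passing it to AudioConfig(filename=...).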

Batch Transcription

For processing large volumes of audio files:

  • Upload audio files to Azure Blob Storage
  • Submit a batch transcription job via REST API
  • Results are written to a storage container as JSON
  • Supports word-level timestamps and speaker diarization
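The submission itself is a plain REST call. A minimal sketch of assembling the request (the v3.1 endpoint shape and the helper name are assumptions to verify against the current batch transcription API docs):

```python
def build_batch_transcription_request(region, content_urls, locale="en-US",
                                      display_name="batch job"):
    """Build the URL and JSON body for a Speech batch transcription job.
    content_urls: SAS URLs of audio blobs in Azure Blob Storage."""
    url = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions"
    body = {
        "contentUrls": list(content_urls),
        "locale": locale,
        "displayName": display_name,
        "properties": {
            "wordLevelTimestampsEnabled": True,  # word-level timestamps
            "diarizationEnabled": True,          # speaker diarization
        },
    }
    return url, body

# POST `body` to `url` with an Ocp-Apim-Subscription-Key header,
# then poll the returned transcription resource until it completes.
```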

Text-to-Speech (TTS)

Basic TTS

speech_config = speechsdk.SpeechConfig(
    subscription="<your-key>",
    region="eastus"
)

# Select a neural voice
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

result = synthesizer.speak_text("Hello! Welcome to our AI-powered application.")

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully")

SSML (Speech Synthesis Markup Language)

SSML provides fine-grained control over speech output:

ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts"
       xml:lang="en-US">
    <voice name="en-US-JennyNeural">
        <mstts:express-as style="cheerful">
            Welcome to our store!
        </mstts:express-as>
        <break time="500ms"/>
        <prosody rate="slow" pitch="+5%">
            Today's special offers include twenty percent off all items.
        </prosody>
        <say-as interpret-as="telephone">+1-555-123-4567</say-as>
    </voice>
</speak>
"""

result = synthesizer.speak_ssml(ssml)

Key SSML Elements

| Element | Purpose | Example |
| --- | --- | --- |
| <voice> | Select a specific voice | name="en-US-JennyNeural" |
| <prosody> | Control rate, pitch, volume | rate="slow" pitch="+10%" |
| <break> | Insert a pause | time="500ms" or strength="strong" |
| <say-as> | Specify pronunciation type | interpret-as="telephone" |
| <mstts:express-as> | Set speaking style | style="cheerful" or "sad" |
| <phoneme> | Specify exact pronunciation | IPA or SAPI phoneme notation |
| <emphasis> | Stress a word | level="strong" |

On the Exam: SSML questions are common. Know the key elements: <prosody> for rate/pitch/volume, <break> for pauses, <say-as> for pronunciation guidance, and <mstts:express-as> for speaking styles.
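Because these elements compose as ordinary XML, a small helper can assemble SSML from plain text instead of hand-writing the markup each time. A sketch (the helper name and parameters are illustrative, not part of the SDK):

```python
def build_ssml(text, voice="en-US-JennyNeural", rate=None, pitch=None,
               break_after_ms=None):
    """Wrap `text` in a minimal SSML document.
    rate/pitch map to a <prosody> element; break_after_ms adds a <break>."""
    inner = text
    if rate or pitch:
        attrs = ""
        if rate:
            attrs += f' rate="{rate}"'
        if pitch:
            attrs += f' pitch="{pitch}"'
        inner = f"<prosody{attrs}>{inner}</prosody>"
    if break_after_ms:
        inner += f'<break time="{break_after_ms}ms"/>'
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">{inner}</voice>'
        "</speak>"
    )
```

The returned string can be passed directly to synthesizer.speak_ssml(...).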

Speech Translation

translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="<your-key>",
    region="eastus"
)

# Set source and target languages
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("fr")
translation_config.add_target_language("de")
translation_config.add_target_language("es")

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config
)

result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Original: {result.text}")
    for lang, translation in result.translations.items():
        print(f"  {lang}: {translation}")

Speaker Recognition

| Feature | Description | Use Case |
| --- | --- | --- |
| Speaker verification | 1:1 — Is this the claimed speaker? | Voice authentication |
| Speaker identification | 1:N — Which enrolled speaker is this? | Meeting transcription |
| Text-dependent | Requires a specific passphrase | Secure authentication |
| Text-independent | Any speech content | Flexible identification |

Custom Speech Models

Custom Speech trains a model on your domain-specific data to improve transcription accuracy.

When to Use Custom Speech

  • Industry jargon not in the base model vocabulary
  • Product names, acronyms, or technical terms
  • Noisy environments with specific acoustic characteristics
  • Accented speech from your target audience

Training Data Types

| Data Type | Description |
| --- | --- |
| Language data | Text transcripts for vocabulary adaptation |
| Audio + transcripts | Paired audio/transcript files for acoustic adaptation |
| Pronunciation data | Custom pronunciation mappings for specific words |

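Pronunciation data is a plain text file with one mapping per line: the display form, a tab, then the spoken form. A sketch of generating such a file (the tab-separated layout is the documented Custom Speech convention, but confirm details against the current data-format docs; the helper name is illustrative):

```python
def build_pronunciation_file(mappings):
    """Render pronunciation mappings as 'display-form<TAB>spoken-form'
    lines, e.g. {"C3PO": "see three pee oh"}."""
    lines = []
    for display, spoken in mappings.items():
        if "\t" in display or "\t" in spoken:
            raise ValueError("tab characters are not allowed in entries")
        lines.append(f"{display}\t{spoken}")
    return "\n".join(lines) + "\n"
```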
Test Your Knowledge

  • Which SSML element controls the speaking rate, pitch, and volume of synthesized speech?
  • What is the difference between recognize_once() and start_continuous_recognition() in the Speech SDK?
  • An AI application needs to convert spoken English into written French and German text simultaneously. Which service should you use?
  • Which SSML element specifies how to pronounce content like telephone numbers, dates, or addresses?
C
D