4.4 Azure AI Speech Services

Key Takeaways

  • Azure AI Speech provides speech-to-text (real-time and batch), text-to-speech with neural voices, speech translation, speaker recognition, and pronunciation assessment.
  • Real-time STT uses recognize_once() for a single utterance and start_continuous_recognition() with event handlers for streaming audio.
  • SSML controls TTS output: <prosody> for rate/pitch/volume, <break> for pauses, <say-as> for interpretation, <mstts:express-as> for speaking styles, and <phoneme> for exact pronunciation.
  • Custom Speech adapts the base model with language data, audio+transcripts, or pronunciation files; Custom Neural Voice requires Microsoft approval (limited access).
  • Speech translation does speech-to-text and speech-to-speech translation in one pipeline by setting a recognition language and adding target languages.
Last updated: June 2026

Quick Answer: Azure AI Speech offers speech-to-text (STT), text-to-speech (TTS) with neural voices, speech translation (real-time speech-to-text and speech-to-speech), speaker recognition, and pronunciation assessment. You integrate via the Speech SDK (Python, C#, JavaScript) using a SpeechConfig with a key and region — region is mandatory because Speech routes to regional endpoints.

Speech-to-Text Modes

| Mode | Method / mechanism | Use case |
| --- | --- | --- |
| Single utterance | recognize_once() | short command, push-to-talk |
| Continuous | start_continuous_recognition() + events | meetings, dictation, live captions |
| Batch | REST job over Blob audio | bulk recorded files, async |

Continuous recognition fires recognizing (partial, interim) and recognized (final) events; you must call stop_continuous_recognition() to end it. Batch transcription submits a job over audio in Blob Storage and supports word-level timestamps and speaker diarization in the JSON output.

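import azure.cognitiveservices.speech as speechsdk

# Single-utterance recognition from the default microphone.
cfg = speechsdk.SpeechConfig(subscription="<key>", region="eastus")
cfg.speech_recognition_language = "en-US"
audio = speechsdk.audio.AudioConfig(use_default_microphone=True)
rec = speechsdk.SpeechRecognizer(speech_config=cfg, audio_config=audio)

result = rec.recognize_once()  # returns after the first utterance
if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(result.text)
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    details = result.cancellation_details
    # reason says why it stopped; error_details carries the auth/quota message
    print(details.reason, details.error_details)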
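
For the continuous mode described above, the event wiring can be sketched as follows. This is a minimal illustration, not a reference implementation: TranscriptCollector, transcribe_until_stopped, and the max_seconds parameter are hypothetical names introduced here, and the SDK import is deferred so the collector can be exercised without the SDK installed.

```python
import threading


class TranscriptCollector:
    """Collects interim and final text from continuous-recognition events."""

    def __init__(self):
        self.interim = ""        # latest partial hypothesis
        self.final_lines = []    # finalized utterances, in arrival order

    def on_recognizing(self, evt):
        # Partial result: overwritten as the service refines its guess.
        self.interim = evt.result.text

    def on_recognized(self, evt):
        # Final result for one utterance; empty text (no match) is skipped.
        if evt.result.text:
            self.final_lines.append(evt.result.text)
        self.interim = ""


def transcribe_until_stopped(key, region, language="en-US", max_seconds=30):
    """Run continuous recognition on the default microphone (hypothetical helper)."""
    import azure.cognitiveservices.speech as speechsdk  # requires the Speech SDK

    cfg = speechsdk.SpeechConfig(subscription=key, region=region)
    cfg.speech_recognition_language = language
    rec = speechsdk.SpeechRecognizer(
        speech_config=cfg,
        audio_config=speechsdk.audio.AudioConfig(use_default_microphone=True),
    )
    collector = TranscriptCollector()
    done = threading.Event()

    rec.recognizing.connect(collector.on_recognizing)    # interim results
    rec.recognized.connect(collector.on_recognized)      # final results
    rec.session_stopped.connect(lambda evt: done.set())
    rec.canceled.connect(lambda evt: done.set())

    rec.start_continuous_recognition()
    try:
        done.wait(timeout=max_seconds)   # stop on session end, cancel, or timeout
    finally:
        rec.stop_continuous_recognition()  # continuous mode never ends on its own
    return collector.final_lines
```

The try/finally reflects the point made earlier: recognition keeps running until stop_continuous_recognition() is called, so the stop call is placed where it runs even if the wait is interrupted.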

Text-to-Speech and SSML

TTS uses neural voices named <locale>-<Name>Neural, e.g. en-US-JennyNeural. Speech Synthesis Markup Language (SSML) gives fine control:

| Element | Controls | Example |
| --- | --- | --- |
| <voice> | which voice | name="en-US-JennyNeural" |
| <prosody> | rate, pitch, volume | rate="slow" pitch="+10%" |
| <break> | pauses | time="500ms" / strength="strong" |
| <say-as> | interpret content | interpret-as="telephone" |
| <mstts:express-as> | speaking style | style="cheerful" |
| <phoneme> | exact pronunciation | IPA or SAPI alphabet |
| <emphasis> | stress | level="strong" |

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">Welcome!</mstts:express-as>
    <break time="500ms"/>
    <prosody rate="slow" pitch="+5%">Twenty percent off today.</prosody>
    <say-as interpret-as="telephone">+15551234567</say-as>
  </voice>
</speak>
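
When the spoken text comes from variables, building the SSML with an XML library avoids broken markup from characters such as & or <. The sketch below reproduces the document above with Python's standard xml.etree.ElementTree. The function name build_ssml and its parameters are illustrative assumptions, and the root element declares the default SSML namespace, which the Speech documentation lists as required to the best of my knowledge.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"   # default SSML namespace
MSTTS_NS = "https://www.w3.org/2001/mstts"        # Microsoft extensions, as in the example above
XML_NS = "http://www.w3.org/XML/1998/namespace"   # predefined namespace for xml:lang

# Serialize SSML elements without a prefix and Microsoft extensions as mstts:*.
ET.register_namespace("", SSML_NS)
ET.register_namespace("mstts", MSTTS_NS)


def q(namespace, tag):
    """Return the '{namespace}tag' form that ElementTree uses for qualified names."""
    return f"{{{namespace}}}{tag}"


def build_ssml(voice, styled_text, style, slow_text, phone, pause_ms=500, lang="en-US"):
    """Build a styled greeting, a pause, a slowed sentence, and a phone number."""
    speak = ET.Element(q(SSML_NS, "speak"), {"version": "1.0", q(XML_NS, "lang"): lang})
    voice_el = ET.SubElement(speak, q(SSML_NS, "voice"), {"name": voice})

    express = ET.SubElement(voice_el, q(MSTTS_NS, "express-as"), {"style": style})
    express.text = styled_text

    ET.SubElement(voice_el, q(SSML_NS, "break"), {"time": f"{pause_ms}ms"})

    prosody = ET.SubElement(voice_el, q(SSML_NS, "prosody"), {"rate": "slow", "pitch": "+5%"})
    prosody.text = slow_text

    say_as = ET.SubElement(voice_el, q(SSML_NS, "say-as"), {"interpret-as": "telephone"})
    say_as.text = phone

    # Text content is XML-escaped during serialization.
    return ET.tostring(speak, encoding="unicode")


ssml = build_ssml("en-US-JennyNeural", "Welcome!", "cheerful",
                  "Twenty percent off today.", "+15551234567")
```

The resulting string would typically be passed to a SpeechSynthesizer through its SSML speaking method (speak_ssml_async in the Python SDK) rather than the plain-text one.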

On the Exam: Match the element to the need: adjust speed/pitch → <prosody>; insert a pause → <break>; read digits as a phone number/date → <say-as>; sound happy/sad → <mstts:express-as>; force a pronunciation → <phoneme>.

Speech Translation

Use SpeechTranslationConfig: set speech_recognition_language once and call add_target_language() per target. One pipeline yields recognized source text plus translations to several languages simultaneously — more efficient than chaining STT then Translator.

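import azure.cognitiveservices.speech as speechsdk

tc = speechsdk.translation.SpeechTranslationConfig(subscription="<key>", region="eastus")
tc.speech_recognition_language = "en-US"   # one source language
for lang in ("fr", "de", "es"):            # one call per target language
    tc.add_target_language(lang)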
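
To show what the single pipeline returns, the sketch below runs one utterance through a TranslationRecognizer and reads the per-language results. The helper names translate_once and format_translations are hypothetical. The SDK import is deferred so the formatting helper can run on its own, and the code assumes result.translations behaves like a mapping from language code to translated text.

```python
def format_translations(source_text, translations):
    """Render the recognized text and each translation as 'lang: text' lines."""
    lines = [f"source: {source_text}"]
    lines += [f"{lang}: {text}" for lang, text in sorted(translations.items())]
    return lines


def translate_once(key, region, source="en-US", targets=("fr", "de", "es")):
    """Recognize one utterance and return (source_text, {language: translation})."""
    import azure.cognitiveservices.speech as speechsdk  # requires the Speech SDK

    tc = speechsdk.translation.SpeechTranslationConfig(subscription=key, region=region)
    tc.speech_recognition_language = source
    for lang in targets:
        tc.add_target_language(lang)

    rec = speechsdk.translation.TranslationRecognizer(
        translation_config=tc,
        audio_config=speechsdk.audio.AudioConfig(use_default_microphone=True),
    )
    result = rec.recognize_once()
    if result.reason != speechsdk.ResultReason.TranslatedSpeech:
        # NoMatch or Canceled: surface the reason instead of returning empty data.
        raise RuntimeError(f"translation did not complete: {result.reason}")
    return result.text, dict(result.translations)


# Example with stand-in data (no service call):
for line in format_translations("Hello", {"fr": "Bonjour", "de": "Hallo"}):
    print(line)
```

One recognize_once() call yields the source transcript and every requested translation together, which is the efficiency argument against chaining a separate STT call and Translator call.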

Speaker Recognition and Pronunciation Assessment

| Capability | What it does |
| --- | --- |
| Speaker verification (1:1) | confirms a claimed identity (voice login) |
| Speaker identification (1:N) | finds which enrolled speaker is talking |
| Text-dependent | requires a fixed passphrase |
| Text-independent | any speech content |
| Pronunciation assessment | scores accuracy, fluency, completeness, prosody for language learning |

Custom Speech vs. Custom Neural Voice

Custom Speech improves recognition accuracy for jargon, product names, accents, or noisy audio. Training data types: language data (text for vocabulary), audio + transcripts (acoustic adaptation), and pronunciation files. Custom Neural Voice (CNV) creates a unique synthetic voice from your recordings and is limited access — it requires a Microsoft application and approval to prevent misuse.

Common Trap: "Improve transcription of medical terms" = Custom Speech, not Custom Neural Voice (which is for building a branded TTS voice). And whenever a scenario says "convert spoken English into written French and German together," choose Speech Translation, not separate STT plus Translator calls.

Authentication, Regions, and the Audio Pipeline

Every Speech SDK scenario starts with a SpeechConfig, and the exam expects you to know that it takes a key and a region, or alternatively a short-lived authorization token together with the region. Unlike the Language and Translator services, the Speech endpoint is region-bound, so a config built with the wrong region simply cannot reach your resource. When a recognition result returns the Canceled reason, the right debugging move is to read cancellation_details, whose error code distinguishes an authentication failure from quota throttling or a bad audio format.
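
The debugging step above can be sketched as a small triage helper. It relies only on the error_code and error_details attributes of cancellation_details, and it keys its hints on the enum member name. The member names listed are the CancellationErrorCode values as I understand them, so treat them as assumptions to check against the SDK reference. The hint wording and the name explain_cancellation are this sketch's own.

```python
# Hints keyed by CancellationErrorCode member name (names assumed, hints illustrative).
HINTS = {
    "AuthenticationFailure": "check the key or token, and that the region matches the resource",
    "Forbidden": "the credential is valid but not allowed to use this resource or feature",
    "TooManyRequests": "quota or rate limit reached; back off and retry",
    "BadRequest": "check request parameters and the audio format",
    "ConnectionFailure": "check network access to the regional endpoint",
}


def explain_cancellation(details):
    """Turn cancellation details into a one-line, human-readable diagnosis."""
    # Enum members expose .name; fall back to str() for anything else.
    code = getattr(details.error_code, "name", str(details.error_code))
    hint = HINTS.get(code, "no specific hint; read error_details")
    message = f"{code}: {hint}"
    if details.error_details:
        message += f" [{details.error_details}]"
    return message
```

With the SDK, the argument would be result.cancellation_details, read only when result.reason is ResultReason.Canceled.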

Audio input is configured separately from the recognizer through an AudioConfig. The three input paths are the default microphone, a WAV file by filename, and a push or pull stream for custom sources such as telephony or a browser. The base recognizer expects 16 kHz or 8 kHz, 16-bit, mono PCM WAV; feeding it stereo MP3 is a classic cause of NoMatch results. For very long or high-volume audio, batch transcription is preferred over continuous recognition because it runs server-side over files already in Blob Storage and can emit diarization and timestamps without holding an open connection.
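
Because a mismatched file tends to fail quietly, it can help to check the WAV header before handing the file to an AudioConfig. The sketch below uses Python's standard wave module to test the format stated above (8 kHz or 16 kHz, 16-bit, mono PCM). The function name wav_problems is an illustrative assumption, and the check covers only the header, not the audio content.

```python
import wave

SUPPORTED_RATES = (8000, 16000)  # Hz, per the format described above


def wav_problems(source):
    """Return reasons the audio does not match 8/16 kHz, 16-bit, mono PCM.

    `source` is a filename or a binary file object. An empty list means the
    header looks compatible.
    """
    try:
        with wave.open(source, "rb") as wav:
            channels = wav.getnchannels()
            sample_width = wav.getsampwidth()   # bytes per sample
            rate = wav.getframerate()
    except (wave.Error, EOFError) as exc:
        # Raised for MP3 and other input that is not a PCM RIFF/WAVE file.
        return [f"not a readable PCM WAV file: {exc}"]

    problems = []
    if channels != 1:
        problems.append(f"expected mono, found {channels} channels")
    if sample_width != 2:
        problems.append(f"expected 16-bit samples, found {sample_width * 8}-bit")
    if rate not in SUPPORTED_RATES:
        problems.append(f"expected 8000 or 16000 Hz, found {rate} Hz")
    return problems
```

A caller might run this first and only build the file-based AudioConfig when the returned list is empty, converting the audio otherwise.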

Mapping Speech Capabilities to Requirements

Under time pressure, map the verb in the question to the service. "Transcribe" or "caption" points to speech-to-text; "read aloud" or "voice response" points to text-to-speech; "speak in our brand's unique voice" is the limited-access Custom Neural Voice; "recognize who is speaking" is speaker recognition, split into one-to-one verification for login and one-to-many identification for meetings; and "score how well a learner pronounces words" is pronunciation assessment. When the requirement combines listening and translating, speech translation collapses the pipeline into one call.

Memorizing these verb-to-feature mappings, plus the SSML element table, covers the overwhelming majority of Speech questions on the exam. Remember too that speech translation can emit synthesized audio in the target language, not only text, when you set a synthesis voice, which is the distinction between speech-to-text translation and speech-to-speech translation that the exam sometimes tests directly.

Test Your Knowledge

Which SSML element adjusts the speaking rate, pitch, and volume of synthesized speech?


An application transcribes a multi-hour conference livestream and must surface interim results as people speak. Which approach should it use?


A hospital wants its dictation system to correctly transcribe unusual drug names that the base model mishears. Which Speech feature addresses this?
