4.4 Azure AI Speech Services
Key Takeaways
- Azure AI Speech provides speech-to-text (STT), text-to-speech (TTS), speech translation, speaker recognition, and intent recognition.
- Speech-to-text supports real-time transcription, batch transcription, and custom speech models for domain-specific vocabulary.
- Text-to-speech uses neural voices that sound nearly human, with support for SSML (Speech Synthesis Markup Language) for fine-grained control.
- Custom Neural Voice allows creating a unique branded voice using your own voice recordings (requires Microsoft approval).
- Speech translation supports real-time speech-to-text and speech-to-speech translation across 100+ languages.
Quick Answer: Azure AI Speech provides STT (real-time and batch), TTS (neural voices with SSML), speech translation (100+ languages), speaker recognition, and custom speech models. Use the Speech SDK (Python/C#) for integration.
Speech-to-Text (STT)
Real-Time Transcription
```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription="<your-key>",
    region="eastus"
)
speech_config.speech_recognition_language = "en-US"

# From microphone (AudioConfig lives in the speechsdk.audio submodule)
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)

recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

# Single-utterance recognition (stops at the first pause)
result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.RecognizedSpeech:
    print(f"Recognized: {result.text}")
elif result.reason == speechsdk.ResultReason.NoMatch:
    print("No speech could be recognized")
elif result.reason == speechsdk.ResultReason.Canceled:
    cancellation = result.cancellation_details
    print(f"Canceled: {cancellation.reason}")
```
Continuous Recognition
```python
import time

# For long-running audio (meetings, lectures)
def recognized_handler(evt):
    print(f"Recognized: {evt.result.text}")   # Final result for each phrase

def recognizing_handler(evt):
    print(f"Partial: {evt.result.text}")      # Intermediate results

recognizer.recognized.connect(recognized_handler)
recognizer.recognizing.connect(recognizing_handler)

recognizer.start_continuous_recognition()
# Audio is processed as it streams in; keep the process alive, e.g.:
time.sleep(30)
recognizer.stop_continuous_recognition()
```
Audio Input Sources
| Source | AudioConfig Method | Use Case |
|---|---|---|
| Microphone | use_default_microphone=True | Real-time voice input |
| WAV file | filename="audio.wav" | Processing recorded audio |
| Audio stream | stream=push_stream | Streaming from custom source |
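When transcribing from a WAV file, a mismatched audio format is a common source of empty results. As a sanity check before uploading, you can inspect the file with Python's standard-library `wave` module; the helper below is an illustrative sketch (the function name and the "speech-friendly" criteria of 16-bit mono PCM at 8 or 16 kHz are assumptions based on typical speech-audio requirements, not an official validator).

```python
import wave

def check_wav_format(path: str) -> dict:
    """Inspect a WAV file's format before sending it for transcription.

    16-bit mono PCM at 8 or 16 kHz is a typical format for speech audio.
    """
    with wave.open(path, "rb") as wav:
        info = {
            "channels": wav.getnchannels(),            # 1 = mono
            "sample_width_bytes": wav.getsampwidth(),  # 2 = 16-bit samples
            "sample_rate_hz": wav.getframerate(),
            "duration_s": wav.getnframes() / wav.getframerate(),
        }
    info["speech_friendly"] = (
        info["channels"] == 1
        and info["sample_width_bytes"] == 2
        and info["sample_rate_hz"] in (8000, 16000)
    )
    return info
```

If the check fails, re-encode the file (for example with ffmpeg) before passing it to `AudioConfig(filename=...)`.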
Batch Transcription
For processing large volumes of audio files:
1. Upload audio files to Azure Blob Storage
2. Submit a batch transcription job via the REST API
3. Retrieve the results, written to a storage container as JSON

Batch transcription supports word-level timestamps and speaker diarization.
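As a sketch of step 2, the request body for a batch transcription job can be built as plain JSON. This assumes the v3.1 `speechtotext` REST API; the field names shown (`contentContainerUrl`, `wordLevelTimestampsEnabled`, `diarizationEnabled`) follow that version's schema, and the SAS URL and job name are placeholders.

```python
import json

def build_batch_transcription_request(container_sas_url: str,
                                      locale: str = "en-US",
                                      name: str = "my-batch-job") -> str:
    """Build the JSON body for a batch transcription job (v3.1 REST API).

    The job is submitted with a POST to
    https://<region>.api.cognitive.microsoft.com/speechtotext/v3.1/transcriptions
    """
    body = {
        "contentContainerUrl": container_sas_url,  # SAS URL to the audio container
        "locale": locale,
        "displayName": name,
        "properties": {
            "wordLevelTimestampsEnabled": True,  # word-level timestamps in results
            "diarizationEnabled": True,          # speaker diarization
        },
    }
    return json.dumps(body)
```

In practice you would POST this body with your resource key in the `Ocp-Apim-Subscription-Key` header, then poll the returned job URL until the transcription files appear in the results container.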
Text-to-Speech (TTS)
Basic TTS
```python
speech_config = speechsdk.SpeechConfig(
    subscription="<your-key>",
    region="eastus"
)

# Select a neural voice
speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)

result = synthesizer.speak_text("Hello! Welcome to our AI-powered application.")

if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Speech synthesized successfully")
```
SSML (Speech Synthesis Markup Language)
SSML provides fine-grained control over speech output:
```python
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:mstts="https://www.w3.org/2001/mstts"
       xml:lang="en-US">
  <voice name="en-US-JennyNeural">
    <mstts:express-as style="cheerful">
      Welcome to our store!
    </mstts:express-as>
    <break time="500ms"/>
    <prosody rate="slow" pitch="+5%">
      Today's special offers include twenty percent off all items.
    </prosody>
    <say-as interpret-as="telephone">+1-555-123-4567</say-as>
  </voice>
</speak>
"""

result = synthesizer.speak_ssml(ssml)
```
Key SSML Elements
| Element | Purpose | Example |
|---|---|---|
| <voice> | Select a specific voice | name="en-US-JennyNeural" |
| <prosody> | Control rate, pitch, volume | rate="slow" pitch="+10%" |
| <break> | Insert a pause | time="500ms" or strength="strong" |
| <say-as> | Specify pronunciation type | interpret-as="telephone" |
| <mstts:express-as> | Set speaking style | style="cheerful" or "sad" |
| <phoneme> | Specify exact pronunciation | IPA or SAPI phoneme notation |
| <emphasis> | Stress a word | level="strong" |
On the Exam: SSML questions are common. Know the key elements: <prosody> for rate/pitch/volume, <break> for pauses, <say-as> for pronunciation guidance, and <mstts:express-as> for speaking styles.
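A good way to internalize these elements is to assemble SSML programmatically. The helper below is a minimal sketch (the function name and parameter defaults are invented for illustration); it wraps text in `<prosody>`, optionally prepends a `<break>`, and escapes XML-special characters so user text cannot break the markup.

```python
from xml.sax.saxutils import escape

def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               rate: str = "medium", pitch: str = "+0%",
               pause_ms: int = 0) -> str:
    """Assemble a minimal SSML document: optional pause, then prosody-wrapped text."""
    pause = f'<break time="{pause_ms}ms"/>' if pause_ms else ""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'{pause}'
        f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'
        '</voice>'
        '</speak>'
    )
```

The resulting string is what you would pass to `speak_ssml()`; note that `&`, `<`, and `>` in the spoken text must be XML-escaped, which `escape()` handles.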
Speech Translation
```python
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription="<your-key>",
    region="eastus"
)

# Set source and target languages
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("fr")
translation_config.add_target_language("de")
translation_config.add_target_language("es")

recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config
)

result = recognizer.recognize_once()

if result.reason == speechsdk.ResultReason.TranslatedSpeech:
    print(f"Original: {result.text}")
    for lang, translation in result.translations.items():
        print(f"  {lang}: {translation}")
```
Speaker Recognition
| Feature | Description | Use Case |
|---|---|---|
| Speaker verification | 1:1 — Is this the claimed speaker? | Voice authentication |
| Speaker identification | 1:N — Which enrolled speaker is this? | Meeting transcription |
| Text-dependent | Requires a specific passphrase | Secure authentication |
| Text-independent | Any speech content | Flexible identification |
Custom Speech Models
Custom Speech lets you train a model on your own domain-specific data to improve transcription accuracy beyond the base model.
When to Use Custom Speech
- Industry jargon not in the base model vocabulary
- Product names, acronyms, or technical terms
- Noisy environments with specific acoustic characteristics
- Accented speech from your target audience
Training Data Types
| Data Type | Description |
|---|---|
| Language data | Text transcripts for vocabulary adaptation |
| Audio + transcripts | Paired audio/transcript files for acoustic adaptation |
| Pronunciation data | Custom pronunciation mappings for specific words |
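To make the table concrete, the snippet below writes tiny sample files in the two text-based formats: language (related text) data is plain UTF-8 text with one utterance per line, and pronunciation data is tab-separated "display form, spoken form" pairs. The file names and the sample terms are illustrative assumptions, not required names.

```python
import os

def write_custom_speech_data(out_dir: str = ".") -> None:
    """Write sample Custom Speech training files.

    - Language data: plain text, one utterance per line (UTF-8).
    - Pronunciation data: "display form<TAB>spoken form", one pair per line.
    """
    with open(os.path.join(out_dir, "related_text.txt"), "w", encoding="utf-8") as f:
        f.write("please refill my lisinopril prescription\n")
        f.write("the contoso widget pro ships next quarter\n")

    with open(os.path.join(out_dir, "pronunciations.txt"), "w", encoding="utf-8") as f:
        f.write("C3PO\tsee three pee oh\n")   # acronym spoken letter by letter
        f.write("IoT\teye oh tee\n")
```

Files like these are uploaded as datasets in the Speech Studio (or via the REST API) when training a custom model.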
Review Questions
1. Which SSML element controls the speaking rate, pitch, and volume of synthesized speech?
2. What is the difference between recognize_once() and start_continuous_recognition() in the Speech SDK?
3. An AI application needs to convert spoken English into written French and German text simultaneously. Which service should you use?
4. Which SSML element specifies how to pronounce content like telephone numbers, dates, or addresses?