3.1 Language and Speech Services
Key Takeaways
- Azure AI Language is the service to map to written text analysis, including language detection, named entity recognition, PII detection, key phrase extraction, sentiment analysis, and summarization.
- Azure AI Speech is the service to map to audio input or audio output, including speech to text, text to speech, speech translation, language identification, pronunciation assessment, and speaker-related scenarios.
- Azure Translator is the service-specific answer for text and document translation, while speech translation belongs under Azure AI Speech.
- For AI-901 implementation scenarios, recognize the Foundry path: test in a portal or studio surface, create or use a Foundry resource, then call the service through an SDK, REST API, or tool.
- A spoken-prompt solution may use Azure AI Speech or a deployed multimodal model, depending on whether the task is transcription or synthesis, or direct audio-aware reasoning.
Service Choice Starts With The Input
AI-901 does not expect you to train a language model from scratch. It expects you to recognize when a ready-made Azure AI service is a better fit than a generic chat model. Start every scenario by asking what the user gives the app and what the app must return.
If the input is written text, think Azure AI Language or Azure Translator. If the input or output is audio, think Azure AI Speech. If the request is an open-ended prompt that includes audio or needs a conversational model response, a deployed multimodal model in Foundry may also appear in the answer, but the service distinction still matters.
Language, Translation, And Speech Map
| Scenario signal | Best Azure capability | What it returns |
|---|---|---|
| Find people, places, organizations, dates, or PII in text | Azure AI Language NER or PII detection | Typed entities and offsets |
| Pull main concepts from emails or tickets | Azure AI Language key phrase extraction | Important terms and phrases |
| Score reviews as positive, neutral, negative, or mixed | Azure AI Language sentiment analysis | Sentiment labels and scores |
| Shorten a document or conversation transcript | Azure AI Language summarization | Extractive or abstractive summary |
| Translate written content between languages | Azure Translator | Target-language text or translated document |
| Turn audio into a transcript or caption stream | Azure AI Speech speech to text | Text transcript, often with timing |
| Read a generated response aloud | Azure AI Speech text to speech | Synthesized audio |
| Translate spoken audio in real time | Azure AI Speech speech translation | Translated text and optionally speech |
| Verify or identify a person by voice traits | Azure AI Speech speaker recognition | Speaker identity or verification result |
Azure AI Language
Azure AI Language is a cloud service for natural language processing. Core capabilities include language detection, named entity recognition, PII detection, custom named entity recognition, and text analytics for health. Microsoft also lists established capabilities such as key phrase extraction, sentiment analysis, custom text classification, conversational language understanding, question answering, and summarization. The exam still names key phrase extraction, entity detection, sentiment analysis, and summarization explicitly, so study those even if a documentation page labels some as legacy or established.
A practical text analysis app usually follows this flow:
- Create or select an Azure AI Language or Foundry resource.
- Decide whether a prebuilt feature is enough or a custom project is needed.
- Test sample text in Microsoft Foundry or Language Studio where available.
- Call the REST API or client library from a lightweight app.
- Protect sensitive input and output, especially when detecting PII or health information.
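The "call the REST API" step in the flow above can be sketched with the standard library alone. This is a minimal sketch, not a definitive client: the endpoint host, the placeholder key, and the `api_version` default are assumptions for illustration, and the helper name `build_analyze_text_request` is hypothetical. The request is built but not sent, so no Azure resource is needed to run it.

```python
import json
import urllib.request


def build_analyze_text_request(endpoint, key, kind, text,
                               language="en", api_version="2023-04-01"):
    """Build (but do not send) a POST request for one text analysis task.

    `kind` selects the feature, for example "PiiEntityRecognition",
    "EntityRecognition", "KeyPhraseExtraction", or "SentimentAnalysis".
    The api_version default is an assumption; check the current docs.
    """
    url = f"{endpoint.rstrip('/')}/language/:analyze-text?api-version={api_version}"
    body = {
        "kind": kind,
        "analysisInput": {
            "documents": [{"id": "1", "language": language, "text": text}]
        },
    }
    headers = {
        "Content-Type": "application/json",
        "Ocp-Apim-Subscription-Key": key,  # keep real keys out of source code
    }
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


request = build_analyze_text_request(
    "https://example-resource.cognitiveservices.azure.com",  # hypothetical endpoint
    "<your-key>",                                            # placeholder, not a real key
    "PiiEntityRecognition",
    "Call me at 555-0100 about my account.",
)
print(request.get_method(), request.full_url)
# Sending would be urllib.request.urlopen(request) with a real endpoint and key.
```

Changing only the `kind` value switches between prebuilt features, which mirrors how the exam treats them as capabilities of one service.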
Use prebuilt features when the categories are standard. Use custom named entity recognition or custom text classification when your business needs labels that the prebuilt model does not understand, such as internal product codes, contract clauses, or support-ticket categories.
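One way to reason about the prebuilt-versus-custom choice is to compare the labels the business needs against the categories a prebuilt model already returns. The sketch below assumes a small illustrative subset of prebuilt entity categories, not the full list, and the helper name and example labels are hypothetical.

```python
# Illustrative subset of prebuilt entity categories (an assumption, not the full list).
PREBUILT_CATEGORIES = {
    "Person", "Location", "Organization", "DateTime", "Email", "PhoneNumber",
}


def labels_needing_custom_model(required_labels):
    """Return the required labels that the illustrative prebuilt set does not cover."""
    return sorted(set(required_labels) - PREBUILT_CATEGORIES)


missing = labels_needing_custom_model(
    ["Person", "DateTime", "InternalProductCode", "ContractClause"]
)
print(missing)  # ['ContractClause', 'InternalProductCode']
```

An empty result suggests a prebuilt feature is enough; any leftover labels point toward custom named entity recognition or custom text classification.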
Azure AI Speech
Azure AI Speech covers speech to text, text to speech, speech translation, language identification, pronunciation assessment, and related voice scenarios. Exam wording often signals the answer through the direction of conversion. Speech to text means audio becomes written words. Text to speech means written words become spoken audio. Speech translation means spoken language is translated, with text or synthesized speech as the output.
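The direction-of-conversion rule can be written as a tiny lookup. This is only a study aid under simple assumptions: the function name and the `"audio"` and `"text"` labels are hypothetical, and it returns capability names as plain strings rather than calling any service.

```python
def speech_capability(source, target, translate=False):
    """Map the direction of conversion to an Azure AI Speech capability name."""
    if source == "audio" and translate:
        # Spoken input translated to text or synthesized speech.
        return "speech translation"
    if source == "audio" and target == "text":
        return "speech to text"
    if source == "text" and target == "audio":
        return "text to speech"
    # Text in and text out is not a Speech scenario.
    raise ValueError("No audio direction: consider Azure AI Language or Azure Translator")


print(speech_capability("audio", "text"))                   # speech to text
print(speech_capability("text", "audio"))                   # text to speech
print(speech_capability("audio", "audio", translate=True))  # speech translation
```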
Speech also has implementation choices. Real-time transcription fits live captions, voice commands, and interactive meetings. Batch transcription fits a backlog of recordings. Text to speech uses neural voices, and its output can be adjusted with Speech Synthesis Markup Language (SSML) for pronunciation, pitch, rate, and volume. Containers or sovereign-cloud options may appear in production discussions, but AI-901 usually only needs the basic service fit.
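To make the markup adjustment concrete, the sketch below builds a small Speech Synthesis Markup Language document that sets rate, pitch, and volume with a `prosody` element, then checks that it is well-formed XML. The voice name and the default values are assumptions chosen for illustration, and `build_ssml` is a hypothetical helper; nothing is sent to the service.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(text, voice="en-US-JennyNeural", lang="en-US",
               rate="slow", pitch="high", volume="loud"):
    """Return an SSML string; the voice name and defaults are example values."""
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang={quoteattr(lang)}>'
        f'<voice name={quoteattr(voice)}>'
        f'<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)} '
        f'volume={quoteattr(volume)}>'
        f'{escape(text)}'  # escape &, <, > so the markup stays well-formed
        '</prosody></voice></speak>'
    )


ssml = build_ssml("Your order has shipped & will arrive Friday.")
root = ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed
prosody = root.find(f".//{{{SSML_NS}}}prosody")
print(prosody.attrib)
```

A string like this is what a text to speech request would carry in place of plain text when the default voice settings are not enough.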
Exam Process
Use this decision process in scenarios:
- Is the source written text, audio, or both?
- Does the app need analysis, translation, transcription, synthesis, or speaker identity?
- Is the desired output a label, extracted value, summary, translation, transcript, or audio file?
- Does a prebuilt service solve it, or does the business need custom labels or a multimodal model?
- After choosing, connect it through Foundry, Speech Studio, a REST API, or an SDK and apply responsible AI controls.
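The questions above can be rehearsed as a small decision function. This is a study sketch under simplifying assumptions, not an official mapping: the `Scenario` fields, the task labels, and the returned strings are hypothetical names chosen to mirror the list.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    source: str                       # "text" or "audio"
    task: str                         # "analysis", "translation", "transcription",
                                      # "synthesis", or "speaker identity"
    needs_custom_labels: bool = False


def recommend(s):
    """Walk the decision questions and return a capability to start from."""
    if s.task == "transcription":
        return "Azure AI Speech: speech to text"
    if s.task == "synthesis":
        return "Azure AI Speech: text to speech"
    if s.task == "speaker identity":
        return "Azure AI Speech: speaker recognition"
    if s.task == "translation":
        # Spoken source goes to Speech; written source goes to Translator.
        if s.source == "audio":
            return "Azure AI Speech: speech translation"
        return "Azure Translator"
    if s.task == "analysis" and s.source == "text":
        if s.needs_custom_labels:
            return "Azure AI Language: custom feature"
        return "Azure AI Language: prebuilt feature"
    return "Review the scenario: a multimodal model in Foundry may fit"


print(recommend(Scenario("text", "analysis")))
print(recommend(Scenario("audio", "translation")))
```

The fallback branch is deliberate: when a scenario does not reduce to one conversion or analysis task, the process points back to the multimodal option rather than forcing a service.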
The common trap is confusing similar-sounding terms. Speech recognition transcribes what was said; speaker recognition determines who said it. Translation is not summarization. Entity recognition extracts typed items from text; it does not read text from an image. Keep the data direction clear and the service choice becomes straightforward.
A support team stores chat transcripts and wants an app to mask customer account numbers, identify product names, and produce a short case recap. Which Azure service is the best starting point?
A travel kiosk must listen to a spoken question in one language and play back the answer in another language. Which capability combination is most relevant?