1.3 Content Safety and Governance
Key Takeaways
- Azure AI Content Safety and Microsoft Foundry guardrails classify harmful content such as hate, sexual content, violence, and self-harm across severity levels for text and images.
- Foundry guardrails can inspect user input, final output, and, for Foundry Agent Service preview scenarios, tool calls and tool responses.
- Prompt Shields address direct user prompt attacks and indirect attacks hidden in documents or other grounded content.
- Optional controls such as groundedness, protected material, PII detection, blocklists, and task adherence help manage risks beyond the four core harm categories.
- Governance requires testing, assignment to deployments or agents, monitoring, incident review, and human ownership; filters alone do not make an AI solution responsible.
From Principles To Controls
Responsible AI principles tell you what should be protected. Content safety and governance tell you how a Foundry solution starts enforcing those protections. On AI-901, expect scenario questions that ask which control belongs in a generative app, a user-upload workflow, or an agent that can call tools.
Microsoft Foundry guardrails are named collections of controls. A control defines the risk to detect, where to inspect the interaction, and what action to take. Azure AI Content Safety supplies classification models that help flag harmful content, while Foundry applies those controls to model deployments and, in preview, to agents built in Foundry Agent Service.
Core Harm Categories
| Risk category | What it covers in exam terms | Common scenario |
|---|---|---|
| Hate and fairness | Attacks, discriminatory language, harassment, or identity-based abuse | Moderating comments before publishing |
| Sexual | Adult sexual content, exploitation, explicit sexual material, or related abuse | Filtering image uploads in a community app |
| Violence | Threats, weapons, extremist content, graphic harm, or instructions for injury | Blocking unsafe generated instructions |
| Self-harm | Suicide, self-injury, eating-disorder harm, or encouragement of self-harm | Detecting crisis language in user text |
| Task adherence | Agent behavior that drifts from user instructions or task objectives | Checking whether an agent's tool use matches the user's intent |
Microsoft's Foundry severity documentation describes safe, low, medium, and high levels. The practical exam point is that a threshold controls what gets flagged or blocked. A stricter policy catches more borderline content but can create false positives; a looser policy reduces blocking but may miss risky material. Content at the safe level may still appear in annotations but is not the target of blocking.
Input, Output, And Agent Intervention Points
For basic model calls, the two most important checkpoints are user input and output. User input is the prompt or request sent to the model. Output is the completion returned to the user. A support chatbot, for example, should screen both the customer's message and the generated answer.
Agents add more risk because they can call tools. Foundry guardrails support preview intervention points for tool calls and tool responses in Foundry Agent Service. That matters when an agent can search documents, call APIs, or trigger actions. A harmful or manipulated tool result should not silently become the agent's final answer.
Controls Beyond Basic Harm Filtering
Foundry and Azure OpenAI safety features include more than the four core harm categories:
- Prompt Shields for user prompt attacks detect attempts to override system instructions, change the assistant's role, or bypass safety rules.
- Prompt Shields for indirect attacks detect malicious instructions embedded in documents, emails, webpages, or other external content the model uses for grounding.
- Groundedness detection helps flag answers that are not supported by the source materials provided to the model.
- Protected material detection helps identify known text or code that a model might reproduce too closely.
- PII detection helps identify personally identifiable information in generated content.
- Blocklists let teams add custom terms or patterns for their application context.
These controls map directly to AI-901 scenarios. A model producing unsupported policy answers points to groundedness. A document with hidden instructions points to indirect attack protection. A public content app that must block violent or sexual images points to content safety moderation.
Governance Process For A Foundry App
Use this process when reasoning through exam cases:
- Define the allowed use. State what the app or agent should and should not do.
- Identify risks. Include harmful content, prompt attacks, PII, protected material, hallucination, and misuse of tools.
- Choose intervention points. Inspect user input, output, and agent tool activity when applicable.
- Set actions. Decide whether each control should annotate, block, or annotate and block.
- Assign controls. Apply the guardrail or content filter to the model deployment or agent that needs it.
- Test in a non-production path. Use the playground, adversarial prompts, edge cases, and expected false-positive examples.
- Monitor and review. Log incidents, tune thresholds, and keep a human owner accountable for policy changes.
Exam Framing
Do not treat content filters as magic. Microsoft documentation notes that application design and API configuration affect filtering behavior, and preview features can change. A responsible architecture combines guardrails with good prompts, least-privilege access, grounding, clear user disclosures, human review, and monitoring.
The strongest AI-901 answer usually names the specific risk and the specific control. "Use responsible AI" is too vague. "Apply a Foundry guardrail that checks user input and output for violence and self-harm, then test it in the playground before production" is the kind of concrete reasoning the exam rewards.
A Foundry agent answers questions from uploaded supplier documents. One document contains hidden instructions telling the agent to ignore its policy and send confidential customer records to an outside address. Which safety control is most relevant?
A RAG-based helpdesk app gives an answer that sounds confident but is not supported by the policy documents retrieved for the user. Which optional guardrail capability is intended to help flag this risk?