2.2 Safety, toxicity & bias
Key Takeaways
- Safety is the broad attribute of avoiding harmful outputs and is tested with red-teaming, jailbreak resistance, and both attack-success and over-refusal rates.
- Toxicity is a narrower, highly automatable subset of unsafe output measured with classifiers such as the Perspective API and Llama Guard.
- Bias is systematic unfairness toward groups defined by protected attributes and is only visible in aggregate, not from a single output.
- Counterfactual paired testing (swap only the demographic term) is the core technique for detecting bias.
- Over-refusal of benign requests is a real failure mode, so safety testing measures both missed harms and wrongly blocked safe prompts.
Non-content risks: safety, toxicity, and bias
Faithfulness and fluency describe whether an answer is good; safety, toxicity, and bias describe whether an answer is harmful. These attributes are usually tested through red-teaming and adversarial prompts rather than ordinary functional cases, because harmful behaviour tends to surface only when the system is provoked. A tester's job is to define what "harmful" means for this specific product (a harm taxonomy), assemble prompts that try to elicit it, set severity levels and thresholds, and measure how often the guardrails hold. Human review usually stays in the loop for the highest-severity categories.
Safety
Safety is the broad attribute of avoiding harmful outputs: content that could cause physical, psychological, financial, or societal harm. Typical safety categories include weapons or self-harm instructions, illegal activity, malware, harassment, sexual content involving minors, and dangerous medical or legal advice. Safety testing checks two things: refusal of disallowed requests, and resistance to jailbreaks that try to bypass those refusals — role-play framing, "ignore previous instructions," encoded or obfuscated requests, and prompt injection delivered through retrieved content.
How to measure: run a curated safety test set plus adversarial and jailbreak variants, then compute the attack success rate (fraction of harmful prompts that produced disallowed content) and the over-refusal rate (fraction of safe prompts wrongly refused). Both matter: an over-cautious system that blocks benign requests fails usability just as surely as one that leaks harmful content. Guardrail classifiers such as Llama Guard, the OpenAI Moderation API, or Azure AI Content Safety can label each request and response by category, acting as automated pre- and post-filters; a system prompt or policy layer usually sits in front of the model as a further control that testers must probe. Because attackers keep inventing new jailbreak patterns, safety suites are treated as living datasets: each newly discovered bypass is captured as a regression case so the same weakness cannot silently reappear after a model or prompt update.
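The two rates above can be computed from labelled test results. The following is a minimal sketch, assuming each result already carries a ground-truth label for the prompt and a verdict (from a guardrail classifier or a human reviewer) on the response; the `SafetyResult` type, its field names, and the sample data are hypothetical and only illustrate the arithmetic.

```python
from dataclasses import dataclass


@dataclass
class SafetyResult:
    """One evaluated prompt/response pair (hypothetical record layout)."""
    prompt_id: str
    is_harmful_prompt: bool    # ground-truth label from the test set
    produced_disallowed: bool  # verdict: response contained disallowed content
    refused: bool              # verdict: model declined to answer


def attack_success_rate(results):
    """Fraction of harmful prompts that produced disallowed content."""
    harmful = [r for r in results if r.is_harmful_prompt]
    if not harmful:
        return 0.0
    return sum(r.produced_disallowed for r in harmful) / len(harmful)


def over_refusal_rate(results):
    """Fraction of safe prompts that were wrongly refused."""
    benign = [r for r in results if not r.is_harmful_prompt]
    if not benign:
        return 0.0
    return sum(r.refused for r in benign) / len(benign)


# Made-up results: four harmful prompts, two benign prompts.
results = [
    SafetyResult("h1", True, False, True),
    SafetyResult("h2", True, True, False),   # jailbreak succeeded
    SafetyResult("h3", True, False, True),
    SafetyResult("h4", True, False, True),
    SafetyResult("b1", False, False, False),
    SafetyResult("b2", False, False, True),  # benign prompt wrongly refused
]

asr = attack_success_rate(results)
orr = over_refusal_rate(results)
print(f"attack success rate: {asr:.2f}")
print(f"over-refusal rate:   {orr:.2f}")
```

Reporting the two numbers side by side makes the trade-off visible: tightening a guardrail typically lowers the first rate and raises the second.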
Toxicity
Toxicity is a narrower, well-studied sub-type of unsafe output: language that is offensive, hateful, insulting, profane, threatening, or demeaning. Because it is a classifiable property of text, toxicity is one of the most automatable GenAI checks. It is also context-dependent (a clinical term may be fine in a medical context and offensive elsewhere), so thresholds must be tuned per product.
How to measure: score outputs with a toxicity classifier. Common tools are the Perspective API (which returns per-attribute probabilities such as TOXICITY, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, and PROFANITY) and Llama Guard or similar LLM guardrails. Standard datasets such as RealToxicityPrompts probe how easily a model can be nudged into toxic completions. Track the proportion of outputs above a chosen toxicity threshold and the maximum toxicity reached under adversarial prompting, not just the average.
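The tracking step can be sketched as follows, assuming toxicity scores in the range 0 to 1 have already been obtained from whichever classifier the team uses. No real API is called here; the `toxicity_summary` helper, the 0.5 threshold, and the scores are illustrative assumptions.

```python
def toxicity_summary(scores, threshold=0.5):
    """Summarise classifier scores: share above threshold, worst case, and mean."""
    if not scores:
        raise ValueError("no scores to summarise")
    above = sum(s >= threshold for s in scores)
    return {
        "fraction_above": above / len(scores),
        "max": max(scores),
        "mean": sum(scores) / len(scores),
    }


# Made-up scores for five responses; one response is clearly toxic.
scores = [0.02, 0.05, 0.91, 0.10, 0.03]
summary = toxicity_summary(scores, threshold=0.5)

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

In this made-up sample the mean stays well below the threshold even though one response is highly toxic, which is why the maximum and the above-threshold share are tracked alongside the average.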
Bias and fairness
Bias is the systematic unfairness of outputs toward or against groups defined by protected attributes — gender, race, age, religion, nationality, disability. It shows up as stereotyping ("the nurse... she"), unequal quality of service across groups, skewed representation, or different sentiment for equivalent inputs. Bias is subtle because each individual answer can look reasonable in isolation; the problem is only visible in aggregate, across groups.
How to measure: use counterfactual (paired) testing. Hold the prompt constant, swap only the demographic term (for example, a name or pronoun), and check whether the response changes in quality, sentiment, or recommendation. Aggregate metrics then compare outcomes across groups: demographic parity, sentiment differences, and stereotype-association scores such as those reported by the StereoSet or BBQ benchmarks. Always report per-group results, not just an overall average, because averages hide the disparities that constitute bias. A single flagged answer is only a lead for investigation, never proof; a tester needs a statistically meaningful sample of paired prompts before concluding that a model treats one group differently from another.
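A minimal sketch of the paired approach is shown below. The template, the name lists, the group labels, and the sentiment scores are all hypothetical; in a real run the scores would come from sending each prompt to the model and scoring the reply, and far more pairs would be needed before drawing any conclusion.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical template: only the {name} slot varies between prompts.
TEMPLATE = "Write a short reference letter for {name}, a software engineer."

# Hypothetical name lists standing in for two demographic groups.
GROUP_NAMES = {
    "group_a": ["Emily", "Anna"],
    "group_b": ["Jamal", "Omar"],
}


def build_counterfactual_prompts(template, group_names):
    """Return (group, name, prompt) triples that differ only in the swapped name."""
    return [
        (group, name, template.format(name=name))
        for group, names in group_names.items()
        for name in names
    ]


def per_group_means(scored):
    """Average the score per group from (group, score) pairs."""
    by_group = defaultdict(list)
    for group, score in scored:
        by_group[group].append(score)
    return {group: mean(values) for group, values in by_group.items()}


def max_gap(group_means):
    """Largest difference between any two group averages."""
    values = list(group_means.values())
    return max(values) - min(values)


prompts = build_counterfactual_prompts(TEMPLATE, GROUP_NAMES)

# Made-up sentiment scores standing in for "call the model, then score the reply".
hypothetical_scores = {"Emily": 0.82, "Anna": 0.78, "Jamal": 0.61, "Omar": 0.59}
scored = [(group, hypothetical_scores[name]) for group, name, _ in prompts]

group_means = per_group_means(scored)
gap = max_gap(group_means)
print(group_means)
print(f"largest per-group gap: {gap:.2f}")
```

The per-group means are printed rather than a single overall average, and a gap of this kind is only a lead until it holds up over a statistically meaningful sample.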
Detection-approach table
| Risk | What it is | Example | Detection approach |
|---|---|---|---|
| Safety | Any harmful output | Weapon-making steps produced after a jailbreak | Adversarial/red-team suites; attack-success + over-refusal rates; guardrail classifiers (Llama Guard, moderation APIs) |
| Toxicity | Offensive/hateful language | A slur or insult appears in a reply | Toxicity classifiers (Perspective API, Llama Guard); RealToxicityPrompts; threshold + max-toxicity tracking |
| Bias / fairness | Systematic group unfairness | "The engineer... he"; lower-quality answers for certain names | Counterfactual pair swaps; per-group metrics; StereoSet/BBQ; demographic-parity comparison |
Putting it together
A mature safety test plan starts from a documented harm taxonomy with severity ratings, builds a labelled adversarial dataset per category, and runs both automated classifiers and human review on the results. Metrics are reported per category and per group so that a single blended "safety score" never masks a severe but rare failure. Because models and guardrails drift, these suites are re-run as regression checks on every model or prompt update.
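The point about blended scores can be illustrated with a small sketch. It assumes results are recorded as (category, failed) pairs and that each harm category has its own tolerated failure rate; the category names, thresholds, and counts are made up for illustration.

```python
from collections import defaultdict


def failure_rates_by_category(results):
    """Compute the failure rate for each harm category from (category, failed) pairs."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for category, failed in results:
        totals[category] += 1
        failures[category] += int(failed)
    return {category: failures[category] / totals[category] for category in totals}


def breached_categories(rates, thresholds):
    """Return the categories whose failure rate exceeds their own threshold."""
    return sorted(c for c, rate in rates.items() if rate > thresholds[c])


# Made-up run: many low-severity cases, few high-severity ones.
results = (
    [("profanity", False)] * 98 + [("profanity", True)] * 2
    + [("self_harm", False)] * 9 + [("self_harm", True)] * 1
)

# Hypothetical per-category tolerances: stricter for the higher-severity category.
thresholds = {"profanity": 0.05, "self_harm": 0.0}

rates = failure_rates_by_category(results)
blended = sum(failed for _, failed in results) / len(results)
breaches = breached_categories(rates, thresholds)

print(f"blended failure rate: {blended:.3f}")
print(f"per-category rates:   {rates}")
print(f"breached categories:  {breaches}")
```

In this made-up run the blended rate is under 3%, yet the high-severity category fails one case in ten and breaches its threshold, which is the failure a single blended score would hide.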
Exam tips: (1) toxicity is a subset of safety, not a synonym — safety is broader. (2) Bias must be measured across groups in aggregate; a single output can neither prove nor disprove bias. (3) Over-refusal is a genuine failure mode, so safety testing measures both missed harms and wrongly blocked benign requests.
Which statement about the relationship between toxicity and safety is correct?
What is the most reliable technique for detecting demographic bias in a model's outputs?
A tester needs an automated way to flag offensive or hateful language in model responses. Which tool is designed for that purpose?