Which statement about the relationship between toxicity and safety is correct?

Toxicity is a narrower subset of safety and is highly automatable via classifiers. The section states that toxicity is a narrower, well-studied sub-type of unsafe output and one of the most automatable checks (via classifiers like the Perspective API and Llama Guard), whereas safety is the broader attribute of avoiding all harmful outputs.

What is the most reliable technique for detecting demographic bias in a model's outputs?

Holding the prompt constant, swapping only the demographic term, and comparing outcomes across groups. The section explains that bias is only visible in aggregate, so counterfactual/paired testing (swap only the demographic term and compare across groups) is the core technique. A single output cannot establish bias, and the other options measure unrelated attributes.

A tester needs an automated way to flag offensive or hateful language in model responses. Which tool is designed for that purpose?

Perspective API. As described in the section, the Perspective API is a toxicity classifier returning per-attribute probabilities (TOXICITY, INSULT, PROFANITY, etc.). RAGAS scores faithfulness, latency benchmarks measure speed, and redaction pipelines address privacy — none of them detect toxic language.

Safety, toxicity & bias — Free Study Guide 2026

Non-content risks: safety, toxicity, and bias

Faithfulness and fluency describe whether an answer is good; safety, toxicity, and bias describe whether an answer is harmful. These attributes are usually tested through red-teaming and adversarial prompts rather than ordinary functional cases, because harmful behaviour tends to surface only when the system is provoked. A tester's job is to define what "harmful" means for this specific product (a harm taxonomy), assemble prompts that try to elicit it, set severity levels and thresholds, and measure how often the guardrails hold. Human review usually stays in the loop for the highest-severity categories.

Safety

Safety is the broad attribute of avoiding harmful outputs: content that could cause physical, psychological, financial, or societal harm. Typical safety categories include weapons or self-harm instructions, illegal activity, malware, harassment, sexual content involving minors, and dangerous medical or legal advice. Safety testing checks two things: refusal of disallowed requests, and resistance to jailbreaks that try to bypass those refusals — role-play framing, "ignore previous instructions," encoded or obfuscated requests, and prompt injection delivered through retrieved content.

How to measure: run a curated safety test set plus adversarial and jailbreak variants, then compute the attack success rate (fraction of harmful prompts that produced disallowed content) and the over-refusal rate (safe prompts wrongly refused). Both matter — an over-cautious system that blocks benign requests fails usability just as surely as one that leaks harmful content. Guardrail classifiers such as Llama Guard or the OpenAI/Azure moderation endpoints can label each request and response by category as an automated pre- and post-filter, and a system prompt or policy layer usually sits in front of the model as a further control that testers must probe. Because attackers keep inventing new jailbreak patterns, safety suites are treated as living datasets: each newly discovered bypass is captured as a regression case so the same weakness cannot silently reappear after a model or prompt update.

Toxicity

Toxicity is a narrower, well-studied sub-type of unsafe output: language that is offensive, hateful, insulting, profane, threatening, or demeaning. Because it is a classifiable property of text, toxicity is one of the most automatable GenAI checks — but note that toxicity is context-dependent (a clinical term may be fine in a medical context and offensive elsewhere), so thresholds must be tuned per product.

How to measure: score outputs with a toxicity classifier. Common tools are the Perspective API (which returns per-attribute probabilities such as TOXICITY, SEVERE_TOXICITY, IDENTITY_ATTACK, INSULT, and PROFANITY) and Llama Guard or similar LLM guardrails. Standard datasets such as RealToxicityPrompts probe how easily a model can be nudged into toxic completions. Track the proportion of outputs above a chosen toxicity threshold and the maximum toxicity reached under adversarial prompting, not just the average.

Bias and fairness

Bias is the systematic unfairness of outputs toward or against groups defined by protected attributes — gender, race, age, religion, nationality, disability. It shows up as stereotyping ("the nurse... she"), unequal quality of service across groups, skewed representation, or different sentiment for equivalent inputs. Bias is subtle because each individual answer can look reasonable in isolation; the problem is only visible in aggregate, across groups.

How to measure: use counterfactual / paired testing — hold the prompt constant and swap only the demographic term (for example, swap names or pronouns) and check whether the response changes in quality, sentiment, or recommendation. Aggregate metrics then compare outcomes across groups (demographic parity, sentiment differences, and stereotype-association scores such as those in the StereoSet or BBQ benchmarks). Always report per-group results, not just an overall average, because averages hide the disparities that constitute bias. A single flagged answer is only a lead for investigation, never proof; a tester needs a statistically meaningful sample of paired prompts before concluding that a model treats one group differently from another.

Detection-approach table

Risk	What it is	Example	Detection approach
Safety	Any harmful output	Weapon-making steps produced after a jailbreak	Adversarial/red-team suites; attack-success + over-refusal rates; guardrail classifiers (Llama Guard, moderation APIs)
Toxicity	Offensive/hateful language	A slur or insult appears in a reply	Toxicity classifiers (Perspective API, Llama Guard); RealToxicityPrompts; threshold + max-toxicity tracking
Bias / fairness	Systematic group unfairness	"The engineer... he"; lower-quality answers for certain names	Counterfactual pair swaps; per-group metrics; StereoSet/BBQ; demographic-parity comparison

Putting it together

A mature safety test plan starts from a documented harm taxonomy with severity ratings, builds a labelled adversarial dataset per category, and runs both automated classifiers and human review on the results. Metrics are reported per category and per group so that a single blended "safety score" never masks a severe but rare failure. Because models and guardrails drift, these suites are re-run as regression checks on every model or prompt update.

Exam tips: (1) toxicity is a subset of safety, not a synonym — safety is broader. (2) Bias must be measured across groups in aggregate; a single output can neither prove nor disprove bias. (3) Over-refusal is a genuine failure mode, so safety testing measures both missed harms and wrongly blocked benign requests.

ISTQB Certified Tester — Testing with Generative AI

ISTQB Generative AI Testing Specialist (CT-GenAI)

2.2 Safety, toxicity & bias

Key Takeaways

Non-content risks: safety, toxicity, and bias

Safety

Toxicity

Bias and fairness

Detection-approach table

Putting it together

ISTQB Certified Tester — Testing with Generative AI

1GenAI Foundations for Testers

2Quality Attributes for GenAI

3Test Design for Non-Determinism

4GenAI Risks & Mitigation

5Test Infrastructure & Tooling

6Organizational Adoption

ISTQB Generative AI Testing Specialist (CT-GenAI)

2.2 Safety, toxicity & bias

Key Takeaways

Non-content risks: safety, toxicity, and bias

Safety

Toxicity

Bias and fairness

Detection-approach table

Putting it together