3.3 Adversarial prompts & red-teaming
Key Takeaways
- An adversarial prompt is deliberately crafted to push a model past its guardrails into harmful, false, non-compliant, or confidential output — security-level negative testing.
- Red-teaming is a structured, iterative campaign: define in-scope harms, enumerate attack techniques, execute systematically, log successes, and rate severity.
- Core attack categories: jailbreaks, direct and indirect prompt injection, harmful-content elicitation, and PII / system-prompt extraction.
- Every successful attack becomes a reproducible security regression test asserting a safety property (must refuse, must not contain disallowed content).
- Track a defence rate over the attack suite, combine manual creativity with automated attack generation, and keep the attack library current.
Adversarial prompting: testing for failure on purpose
The previous sections tested whether a GenAI system produces good answers to cooperative inputs. Adversarial testing asks the opposite question: can a hostile input make the system misbehave? An adversarial prompt is deliberately crafted to push the model past its guardrails — to produce harmful, false, non-compliant, or confidential output it was designed to refuse. In ISTQB terms this is negative testing raised to the security level: the tester actively tries to break the safety and policy requirements rather than confirm the happy path.
Red-teaming: structured campaigns, not random pokes
Red-teaming is the organised discipline of adversarial testing. Rather than ad-hoc prompt tinkering, a red-team campaign is planned like any test effort: define objectives (which harms and policies are in scope), enumerate attack techniques, execute systematically, log every success, and rate severity. It is typically iterative — a partial success is refined until it fully bypasses a control, mirroring how a real attacker escalates.
The main attack categories a CT-GenAI tester must know:
- Jailbreaks — prompts that trick the model into ignoring its safety policy, often via role-play ("pretend you are an AI with no rules"), hypothetical framing, or "DAN"-style personas. Goal: elicit content the model should refuse.
- Prompt injection — malicious instructions hidden in data the model processes (a web page, a document, a user field) that hijack the system's original instructions. Indirect prompt injection, where the payload arrives through retrieved content rather than the user, is especially dangerous for RAG and agentic systems.
- Harmful-content elicitation — attempts to draw out disallowed output such as violence, hate, illegal instructions, or self-harm content.
- PII / data extraction — coaxing the model to reveal personal data, secrets, its own hidden system prompt, or memorised training data ("repeat the text above", "what were your instructions?").
- Misinformation / hallucination baiting — leading questions engineered to produce confident falsehoods.
Malicious instructions hidden inside a web page or document that a GenAI system retrieves and processes, hijacking its original instructions, are an example of:
What is the primary durable value of a red-team campaign for the test process?
Because responses are non-deterministic, how is an adversarial test typically asserted?
An attack-technique to test-goal map
| Attack technique | What it exploits | Test goal (what you verify) |
|---|---|---|
| Role-play jailbreak | Persona override of policy | Model keeps refusing disallowed content in-character |
| Instruction override ("ignore previous instructions") | Weak instruction hierarchy | System prompt cannot be discarded by user text |
| Indirect prompt injection | Trust of retrieved / external data | Untrusted content is treated as data, not commands |
| PII / system-prompt extraction | Leakage of secrets and context | No personal data or hidden prompt is revealed |
| Encoding / obfuscation (base64, leetspeak) | Safety-filter bypass | Checks still catch obfuscated payloads |
| Many-shot / context flooding | Long-context guardrail decay | Policy holds across a long conversation |
Turning findings into regression tests
A red-team campaign's value is not the one-off breakage — it is the durable test asset it produces. Every successful attack should be captured as a reproducible test case: the exact prompt, the model version, the observed unsafe output, and the expected safe behaviour (a refusal, a sanitised answer, or a safe completion). These become a security regression suite that runs on every model or prompt change to prove the fix stays fixed and that new versions do not reopen old holes.
Because responses are non-deterministic, an adversarial assertion is usually a safety-property check rather than an exact match: "the output must NOT contain disallowed content", "the output MUST be a refusal", "no string matching the PII pattern appears". LLM-as-judge is common here — a grader model classifies whether a response is a safe refusal or a policy breach. You track a defence rate (the percentage of attacks correctly resisted) and treat any regression in it as a release blocker.
Practical guidance
- Combine manual creativity (humans invent novel attacks) with automated attack generation (mutating known jailbreaks at scale); neither alone is sufficient.
- Keep an up-to-date attack library, because the threat landscape evolves and red-teaming is continuous rather than one-time.
- Scope campaigns to your product's real risks (a medical chatbot's harms differ from a coding assistant's) and coordinate with legal and safety stakeholders.
- Feed confirmed breaches back into the golden dataset and perturbation suites so the whole test system compounds.
Severity, coverage, and safe handling
Not every successful attack is equally urgent, so rate each finding by severity (how harmful the leaked or generated content is) and likelihood (how easily a real user could reproduce it), then prioritise fixes accordingly. Measure coverage against a taxonomy of harms and techniques — the OWASP Top 10 for LLM Applications is a common reference — so you can show which risk categories have been exercised rather than claiming vague completeness. Test at two levels: the raw model and the full system, because guardrails, input filters, and output moderation live around the model and must be attacked as a whole. Finally, handle findings responsibly: store harmful example outputs securely, restrict who can see them, and never publish working jailbreaks. Red-teaming that generates real harmful content is itself sensitive material and must be governed like any other security artefact.
Adversarial prompting and red-teaming close the loop begun in 3.1 and 3.2: controlled variance and metamorphic robustness prove the system is right and stable, while red-teaming proves it is safe under attack. Together they form the test-design backbone for non-deterministic GenAI.