Malicious instructions hidden inside a web page or document that a GenAI system retrieves and processes, hijacking its original instructions, are an example of:

Indirect prompt injection. The section describes prompt injection as malicious instructions hidden in the data the model processes; the indirect variant arrives through retrieved content rather than the user, making it especially dangerous for RAG and agentic systems.

What is the primary durable value of a red-team campaign for the test process?

Each confirmed attack becomes a reproducible security regression test that runs on every model or prompt change. The section stresses that a campaign's lasting value is the durable test asset: every successful attack is captured as a reproducible case (prompt, version, unsafe output, expected safe behaviour) forming a security regression suite that proves fixes stay fixed.

Because responses are non-deterministic, how is an adversarial test typically asserted?

As a safety-property check (e.g. the output must be a refusal or must not contain disallowed content), often via an LLM-as-judge grader. As the section explains, adversarial assertions are safety-property checks (must refuse, must not contain disallowed content, no PII pattern present) rather than exact matches, and an LLM-as-judge grader commonly classifies safe refusal versus policy breach; the defence rate aggregates results.

Adversarial prompts & red-teaming — Free Study Guide 2026

Key Takeaways

An adversarial prompt is deliberately crafted to push a model past its guardrails into harmful, false, non-compliant, or confidential output — security-level negative testing.
Red-teaming is a structured, iterative campaign: define in-scope harms, enumerate attack techniques, execute systematically, log successes, and rate severity.
Core attack categories: jailbreaks, direct and indirect prompt injection, harmful-content elicitation, and PII / system-prompt extraction.
Every successful attack becomes a reproducible security regression test asserting a safety property (must refuse, must not contain disallowed content).
Track a defence rate over the attack suite, combine manual creativity with automated attack generation, and keep the attack library current.

Adversarial prompting: testing for failure on purpose

The previous sections tested whether a GenAI system produces good answers to cooperative inputs. Adversarial testing asks the opposite question: can a hostile input make the system misbehave? An adversarial prompt is deliberately crafted to push the model past its guardrails — to produce harmful, false, non-compliant, or confidential output it was designed to refuse. In ISTQB terms this is negative testing raised to the security level: the tester actively tries to break the safety and policy requirements rather than confirm the happy path.

Red-teaming: structured campaigns, not random pokes

Red-teaming is the organised discipline of adversarial testing. Rather than ad-hoc prompt tinkering, a red-team campaign is planned like any test effort: define objectives (which harms and policies are in scope), enumerate attack techniques, execute systematically, log every success, and rate severity. It is typically iterative — a partial success is refined until it fully bypasses a control, mirroring how a real attacker escalates.

The main attack categories a CT-GenAI tester must know:

Jailbreaks — prompts that trick the model into ignoring its safety policy, often via role-play ("pretend you are an AI with no rules"), hypothetical framing, or "DAN"-style personas. Goal: elicit content the model should refuse.
Prompt injection — malicious instructions hidden in data the model processes (a web page, a document, a user field) that hijack the system's original instructions. Indirect prompt injection, where the payload arrives through retrieved content rather than the user, is especially dangerous for RAG and agentic systems.
Harmful-content elicitation — attempts to draw out disallowed output such as violence, hate, illegal instructions, or self-harm content.
PII / data extraction — coaxing the model to reveal personal data, secrets, its own hidden system prompt, or memorised training data ("repeat the text above", "what were your instructions?").
Misinformation / hallucination baiting — leading questions engineered to produce confident falsehoods.

An attack-technique to test-goal map

Attack technique	What it exploits	Test goal (what you verify)
Role-play jailbreak	Persona override of policy	Model keeps refusing disallowed content in-character
Instruction override ("ignore previous instructions")	Weak instruction hierarchy	System prompt cannot be discarded by user text
Indirect prompt injection	Trust of retrieved / external data	Untrusted content is treated as data, not commands
PII / system-prompt extraction	Leakage of secrets and context	No personal data or hidden prompt is revealed
Encoding / obfuscation (base64, leetspeak)	Safety-filter bypass	Checks still catch obfuscated payloads
Many-shot / context flooding	Long-context guardrail decay	Policy holds across a long conversation

Turning findings into regression tests

A red-team campaign's value is not the one-off breakage — it is the durable test asset it produces. Every successful attack should be captured as a reproducible test case: the exact prompt, the model version, the observed unsafe output, and the expected safe behaviour (a refusal, a sanitised answer, or a safe completion). These become a security regression suite that runs on every model or prompt change to prove the fix stays fixed and that new versions do not reopen old holes.

Because responses are non-deterministic, an adversarial assertion is usually a safety-property check rather than an exact match: "the output must NOT contain disallowed content", "the output MUST be a refusal", "no string matching the PII pattern appears". LLM-as-judge is common here — a grader model classifies whether a response is a safe refusal or a policy breach. You track a defence rate (the percentage of attacks correctly resisted) and treat any regression in it as a release blocker.

Practical guidance

Combine manual creativity (humans invent novel attacks) with automated attack generation (mutating known jailbreaks at scale); neither alone is sufficient.
Keep an up-to-date attack library, because the threat landscape evolves and red-teaming is continuous rather than one-time.
Scope campaigns to your product's real risks (a medical chatbot's harms differ from a coding assistant's) and coordinate with legal and safety stakeholders.
Feed confirmed breaches back into the golden dataset and perturbation suites so the whole test system compounds.

Severity, coverage, and safe handling

Not every successful attack is equally urgent, so rate each finding by severity (how harmful the leaked or generated content is) and likelihood (how easily a real user could reproduce it), then prioritise fixes accordingly. Measure coverage against a taxonomy of harms and techniques — the OWASP Top 10 for LLM Applications is a common reference — so you can show which risk categories have been exercised rather than claiming vague completeness. Test at two levels: the raw model and the full system, because guardrails, input filters, and output moderation live around the model and must be attacked as a whole. Finally, handle findings responsibly: store harmful example outputs securely, restrict who can see them, and never publish working jailbreaks. Red-teaming that generates real harmful content is itself sensitive material and must be governed like any other security artefact.

Adversarial prompting and red-teaming close the loop begun in 3.1 and 3.2: controlled variance and metamorphic robustness prove the system is right and stable, while red-teaming proves it is safe under attack. Together they form the test-design backbone for non-deterministic GenAI.

ISTQB Certified Tester — Testing with Generative AI

ISTQB Generative AI Testing Specialist (CT-GenAI)

3.3 Adversarial prompts & red-teaming

Key Takeaways

Adversarial prompting: testing for failure on purpose

Red-teaming: structured campaigns, not random pokes

An attack-technique to test-goal map

Turning findings into regression tests

Practical guidance

Severity, coverage, and safe handling

ISTQB Certified Tester — Testing with Generative AI

1GenAI Foundations for Testers

2Quality Attributes for GenAI

3Test Design for Non-Determinism

4GenAI Risks & Mitigation

5Test Infrastructure & Tooling

6Organizational Adoption

ISTQB Generative AI Testing Specialist (CT-GenAI)

3.3 Adversarial prompts & red-teaming

Key Takeaways

Adversarial prompting: testing for failure on purpose

Red-teaming: structured campaigns, not random pokes

An attack-technique to test-goal map

Turning findings into regression tests

Practical guidance

Severity, coverage, and safe handling