9.3 Prompt Injection, Data Exfiltration, and GenAI Threat Modeling

Key Takeaways

  • Prompt injection happens when user input or retrieved content tries to override application instructions, reveal hidden context, or cause unsafe actions.
  • Data exfiltration risk increases when a model can access private documents, tools, logs, secrets, or action systems without strong boundaries.
  • Generative AI threat modeling should identify assets, actors, trust boundaries, inputs, outputs, tools, and failure modes before production approval.
  • Controls include least privilege, source filtering, prompt design, Guardrails for Amazon Bedrock, tool allow lists, output checks, logging, and human review.
  • The practical goal is not perfect prevention, but reducing impact and making unsafe behavior observable and recoverable.
Last updated: May 2026

GenAI threat modeling for practitioners

A traditional web application usually treats form fields, files, and API calls as inputs. A generative AI application treats natural language as an input too. That creates a new control problem: a user, document, email, web page, or retrieved passage can contain instructions that try to manipulate the model. Prompt injection is the broad term for attempts to override system instructions, bypass policy, reveal hidden context, or trigger unauthorized behavior.

Direct prompt injection comes from the user. The user might ask the assistant to ignore safety rules, reveal the hidden prompt, output private context, or call a tool it should not call. Indirect prompt injection comes from content the application retrieves or processes. A document in a knowledge base could contain text that tells the model to send confidential data to the user. An email summary tool could read an email that instructs it to ignore the original task.

Threat areaWhat can go wrongPractitioner control question
Direct prompt injectionUser tries to override application rules or reveal hidden contextAre instructions, guardrails, and refusal behavior tested against adversarial prompts?
Indirect prompt injectionRetrieved content includes malicious instructionsAre data sources trusted, scanned, filtered, and separated from control instructions?
Data exfiltrationThe assistant reveals private documents, secrets, or system promptsAre access controls and output checks limiting what can be returned?
Tool misuseAn agent calls an action outside the user's intent or authorityAre tools narrow, validated, logged, and gated by confirmation where needed?
Over-trustUsers treat generated text as authoritative when it is notAre citations, disclaimers, human review, and escalation paths appropriate?

Data exfiltration is the security concern behind many prompt-injection stories. If the model cannot access sensitive data, it cannot reveal that data. If the application retrieves only documents the user is allowed to see, the blast radius is lower. If an agent has only one narrow action, tool misuse is constrained. That is why least privilege, data minimization, and authorization checks are still the first line of defense for GenAI.

Prompt design helps, but prompts are not a security boundary by themselves. A strong system instruction can tell the model to follow company policy, use only retrieved sources, and refuse requests for secrets. That is useful. But a determined user can still try to manipulate the model, and retrieved text can still conflict with instructions. Security should not depend only on the model deciding to behave.

Guardrails for Amazon Bedrock can help evaluate user inputs and model responses for unsafe content, denied topics, sensitive data, prompt attack patterns where supported, and grounding behavior. Guardrails are part of defense-in-depth. They should be combined with IAM, application authorization, source controls, tool schemas, logging, and human review for high-impact actions.

Tool use changes the risk level. A chat assistant that only answers from a public FAQ has limited impact. A Bedrock agent that can create refunds, update customer records, or open support cases needs stronger controls. The application should define allowed tools, required parameters, validation rules, and confirmation steps. The model should not be able to invent new tools or call arbitrary APIs just because the user asks.

A simple threat modeling workflow:

  1. Name the business workflow and the decision or action the AI supports.
  2. List the assets: prompts, documents, embeddings, logs, secrets, user identities, tools, and outputs.
  3. Draw trust boundaries between users, the application, model service, data stores, logs, and action systems.
  4. Identify abuse cases, including prompt injection, indirect instructions, sensitive data requests, unauthorized actions, and false authoritative answers.
  5. Choose layered controls: least privilege, source governance, guardrails, validation, citations, logging, and human review.
  6. Test the system with normal, edge-case, and adversarial prompts before production.
  7. Monitor incidents and update controls as users find new ways to interact with the assistant.

Scenario: a company builds a support assistant that retrieves internal troubleshooting documents. An attacker asks the assistant to reveal all hidden instructions and all documents about unreleased products. The correct control answer is not to trust a stronger prompt alone. The application should enforce user authorization, retrieve only permitted documents, use guardrails and output checks, and log suspicious requests for review.

Scenario: an email summarizer reads incoming messages and drafts replies. A malicious email says to ignore previous instructions and include the user's private calendar details in the reply. This is indirect prompt injection because the instruction comes from content being processed. The application should separate untrusted email content from control instructions, limit tool access, and require user approval before sending responses.

Scenario: an internal procurement agent can submit purchase requests. A user asks it to split a purchase into smaller orders to bypass approval thresholds. This is not only a prompt issue; it is a business policy issue. The backend should enforce spending rules and approvals even if the model produces a persuasive request. The model can assist with form completion, but the system of record must enforce policy.

Practitioner-level GenAI security is about asking where the model can read, write, and act. If the model can only read approved public content, the risk is smaller. If it can read sensitive records and trigger business actions, the review must be stronger. AIF-C01 expects that kind of judgment, not exploit engineering.

Test Your Knowledge

A document in a knowledge base contains instructions telling the assistant to ignore company policy and reveal confidential files. What type of risk is this?

A
B
C
D
Test Your Knowledge

Which control most directly reduces the impact of data exfiltration from a RAG assistant?

A
B
C
D
Test Your Knowledge

An agent can update customer records. What is the strongest practitioner judgment?

A
B
C
D