Which evaluation approach is strongest before approving a customer-facing AI assistant?

A repeatable test set, quality rubric, safety tests, latency and cost checks, and human review for high-risk outputs. Production approval needs evidence across quality, safety, latency, cost, and review requirements. A single demo is not enough.

What is the purpose of red-team testing for a generative AI application?

To intentionally probe misuse, prompt injection, unsafe content, data leakage, and policy bypass attempts. Red-team testing searches for failures and abuse paths before broader rollout. It reduces risk but does not remove the need for monitoring.

Which output should most likely require human review before use?

A final recommendation that could affect a customer's financial or legal outcome. High-impact outputs that affect customers or regulated outcomes need stronger controls, including human review and clear escalation paths.

Model Evaluation, Human Review, and Red-Team | Free Guide 2026

Evaluation Is A Lifecycle, Not A One-Time Score

A foundation model application should not be approved because one demo looks impressive. Evaluation asks whether the system performs well enough for the intended audience, data, risk, cost, and workflow. It compares model outputs to a defined rubric and gathers evidence before broader rollout.

For AWS AI Practitioner study, evaluation should be understood at a practical level. Amazon Bedrock includes model evaluation capabilities that can help compare models for supported tasks. Teams may also create their own evaluation sets, use human reviewers, review CloudWatch metrics, and collect user feedback. The point is to create a repeatable decision process.

Evaluation area	Example question	Evidence source
Task quality	Did the output answer the user need?	Human rubric, labeled examples, acceptance tests.
Grounding	Did the answer use approved context only?	Source review, RAG logs, citation checks.
Safety	Did it avoid harmful, biased, or restricted content?	Guardrail tests, red-team prompts, policy review.
Reliability	Does it behave consistently across cases?	Regression test set and repeated runs.
Latency	Is response time acceptable for the workflow?	Application metrics and user testing.
Cost	Is token and retrieval cost sustainable?	Usage reports, budgets, cost allocation tags.

Evaluation starts with a test set. A support assistant might need common questions, rare issues, angry customer messages, missing data, restricted policy cases, and prompt injection attempts. A document summarizer might need short, long, messy, and sensitive documents. A sales drafting tool might need multiple customer segments and compliance language.

A rubric makes review less subjective. Instead of asking whether the answer is good, ask whether it is accurate, complete, grounded, concise, policy-compliant, and actionable. Each category can be scored by human reviewers. The team should also define automatic failure conditions, such as exposing personal data, inventing policy, giving unsafe advice, or taking an action without approval.

Human review is not only for training data. It is a production control. Amazon Augmented AI can support human review workflows for ML predictions, and organizations can also build approval steps into their applications. Human review is especially important when outputs affect customers, regulated processes, safety, finance, employment, medical, legal, or security decisions.

Human review checklist:

Define which outputs require review before use.
Train reviewers on the rubric and escalation rules.
Capture reviewer feedback in a structured way.
Separate harmless edits from serious safety or grounding failures.
Use feedback to update prompts, retrieval sources, guardrails, or workflow limits.
Monitor reviewer workload so review does not become a hidden bottleneck.

Red-team testing is adversarial testing. It tries to make the system fail before real users or attackers do. For a generative AI application, red-team prompts may request secrets, ask the model to ignore instructions, inject malicious instructions through retrieved documents, demand restricted content, or try to manipulate an agent into taking unauthorized actions.

Prompt injection is a key risk. A user or document may contain instructions such as ignore the previous rules or reveal hidden system instructions. The practitioner does not need to implement security controls, but should recognize that prompt injection is a governance and design issue. The response may include strict instruction hierarchy, input filtering, retrieval controls, guardrails, least privilege, and human approval for sensitive actions.

Bias and fairness also belong in evaluation. If an AI workflow influences recommendations, triage, prioritization, or customer treatment, the team should test whether outputs differ unfairly across groups. SageMaker Clarify is an AWS service associated with bias detection and explainability in ML workflows. At practitioner level, know when fairness review is needed and when a business owner should be involved.

Production feedback closes the loop. Users should be able to flag poor responses. Logs and metrics should show failure patterns, latency spikes, cost changes, and safety events. CloudWatch can help monitor operational metrics, while CloudTrail can support audit visibility for AWS API activity. Evaluation should be repeated after prompt, model, retrieval, guardrail, or policy changes.

A strong approval packet is practical: use case, audience, model choice, data sources, prompt template, guardrails, evaluation results, known limitations, human review path, cost estimate, monitoring plan, and rollback plan. This is the kind of evidence a non-builder stakeholder should ask for before approving production AI.

AWS AI Practitioner Study Guide

5.5 Model Evaluation, Human Review, and Red-Team Feedback

Key Takeaways

Evaluation Is A Lifecycle, Not A One-Time Score

AWS AI Practitioner Study Guide

1Chapter 1: AIF-C01 Orientation and Official Source Control

2Chapter 2: AI/ML Foundations and Use-Case Fit

3Chapter 3: ML Lifecycle, Metrics, and Practitioner MLOps

4Chapter 4: Generative AI Foundations and Inference Concepts

5Chapter 5: Prompting, Model Selection, Customization, and Evaluation

6Chapter 6: Amazon Bedrock, RAG, Agents, and Guardrails

7Chapter 7: AWS Managed AI/ML Services and SageMaker Map

8Chapter 8: Responsible AI, Human Review, and Safety

9Chapter 9: Security, Compliance, Governance, and Cost Controls

10Chapter 10: Integrated AWS AI Business Scenario Labs

11Chapter 11: Final Review, Exam Readiness, and Recertification

5.5 Model Evaluation, Human Review, and Red-Team Feedback

Key Takeaways

Evaluation Is A Lifecycle, Not A One-Time Score