5.5 Model Evaluation, Human Review, and Red-Team Feedback
Key Takeaways
- Model evaluation compares candidate outputs against business quality, safety, cost, latency, and reliability criteria before production approval.
- Human review is essential when model outputs affect customers, regulated decisions, safety, financial outcomes, or brand risk.
- Red-team testing probes misuse, prompt injection, data leakage, unsafe content, hallucination, and policy bypass attempts.
- Evaluation is continuous because model versions, prompts, retrieval sources, user behavior, and business policies change over time.
Evaluation Is A Lifecycle, Not A One-Time Score
A foundation model application should not be approved because one demo looks impressive. Evaluation asks whether the system performs well enough for the intended audience, data, risk, cost, and workflow. It compares model outputs to a defined rubric and gathers evidence before broader rollout.
For AWS AI Practitioner study, evaluation should be understood at a practical level. Amazon Bedrock includes model evaluation capabilities that can help compare models for supported tasks. Teams may also create their own evaluation sets, use human reviewers, review CloudWatch metrics, and collect user feedback. The point is to create a repeatable decision process.
| Evaluation area | Example question | Evidence source |
|---|---|---|
| Task quality | Did the output answer the user need? | Human rubric, labeled examples, acceptance tests. |
| Grounding | Did the answer use approved context only? | Source review, RAG logs, citation checks. |
| Safety | Did it avoid harmful, biased, or restricted content? | Guardrail tests, red-team prompts, policy review. |
| Reliability | Does it behave consistently across cases? | Regression test set and repeated runs. |
| Latency | Is response time acceptable for the workflow? | Application metrics and user testing. |
| Cost | Is token and retrieval cost sustainable? | Usage reports, budgets, cost allocation tags. |
Evaluation starts with a test set. A support assistant might need common questions, rare issues, angry customer messages, missing data, restricted policy cases, and prompt injection attempts. A document summarizer might need short, long, messy, and sensitive documents. A sales drafting tool might need multiple customer segments and compliance language.
A rubric makes review less subjective. Instead of asking whether the answer is good, ask whether it is accurate, complete, grounded, concise, policy-compliant, and actionable. Each category can be scored by human reviewers. The team should also define automatic failure conditions, such as exposing personal data, inventing policy, giving unsafe advice, or taking an action without approval.
Human review is not only for training data. It is a production control. Amazon Augmented AI can support human review workflows for ML predictions, and organizations can also build approval steps into their applications. Human review is especially important when outputs affect customers, regulated processes, safety, finance, employment, medical, legal, or security decisions.
Human review checklist:
- Define which outputs require review before use.
- Train reviewers on the rubric and escalation rules.
- Capture reviewer feedback in a structured way.
- Separate harmless edits from serious safety or grounding failures.
- Use feedback to update prompts, retrieval sources, guardrails, or workflow limits.
- Monitor reviewer workload so review does not become a hidden bottleneck.
Red-team testing is adversarial testing. It tries to make the system fail before real users or attackers do. For a generative AI application, red-team prompts may request secrets, ask the model to ignore instructions, inject malicious instructions through retrieved documents, demand restricted content, or try to manipulate an agent into taking unauthorized actions.
Prompt injection is a key risk. A user or document may contain instructions such as ignore the previous rules or reveal hidden system instructions. The practitioner does not need to implement security controls, but should recognize that prompt injection is a governance and design issue. The response may include strict instruction hierarchy, input filtering, retrieval controls, guardrails, least privilege, and human approval for sensitive actions.
Bias and fairness also belong in evaluation. If an AI workflow influences recommendations, triage, prioritization, or customer treatment, the team should test whether outputs differ unfairly across groups. SageMaker Clarify is an AWS service associated with bias detection and explainability in ML workflows. At practitioner level, know when fairness review is needed and when a business owner should be involved.
Production feedback closes the loop. Users should be able to flag poor responses. Logs and metrics should show failure patterns, latency spikes, cost changes, and safety events. CloudWatch can help monitor operational metrics, while CloudTrail can support audit visibility for AWS API activity. Evaluation should be repeated after prompt, model, retrieval, guardrail, or policy changes.
A strong approval packet is practical: use case, audience, model choice, data sources, prompt template, guardrails, evaluation results, known limitations, human review path, cost estimate, monitoring plan, and rollback plan. This is the kind of evidence a non-builder stakeholder should ask for before approving production AI.
Which evaluation approach is strongest before approving a customer-facing AI assistant?
What is the purpose of red-team testing for a generative AI application?
Which output should most likely require human review before use?