6.5 Bedrock Model Evaluation, Monitoring, and Human Feedback
Key Takeaways
- Bedrock evaluation helps compare models, prompts, knowledge bases, and RAG sources using automatic, LLM-as-judge, or human-review methods where supported.
- A useful evaluation dataset contains realistic prompts, expected answers or ground truth, edge cases, refusal cases, and business scoring criteria.
- Monitoring combines CloudWatch metrics, CloudTrail audit events, invocation logging decisions, application telemetry, and user feedback.
- Human feedback is most important when outputs affect customers, policy interpretation, safety, legal risk, money movement, or regulated decisions.
- Evaluation is continuous: model behavior, source data, prompts, guardrails, and user needs can drift after launch.
Evaluation before production
A model demo is not an evaluation. A demo shows that a prompt can produce a good answer once. An evaluation asks whether the model, prompt, retrieval design, and guardrails perform well enough across representative work. Amazon Bedrock evaluation jobs can help assess models, knowledge bases, and RAG sources. AWS documentation describes automatic evaluation, LLM-as-judge evaluation, RAG evaluation, and human worker evaluation options where supported. The practitioner's goal is to choose the evaluation approach that matches the risk of the workload.
Evaluation starts with a dataset. For a summarization workflow, the dataset might include transcripts and expected summary traits. For a RAG assistant, it should include user questions, expected retrieved passages, expected answers, and questions that should not be answered. For an agent, the test set should include cases with missing parameters, authorization failures, unsafe requests, and downstream API errors. A dataset built only from happy paths will make almost any model look better than it is.
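As a concrete sketch, the snippet below writes a small JSONL dataset with a happy path, an edge case, and a refusal case. The field names (prompt, reference_answer, category) and example content are illustrative assumptions, not a schema Bedrock requires; adapt them to the format your evaluation tooling expects.

```python
import json

# Illustrative evaluation records. Field names and content are
# assumptions for this sketch, not a Bedrock-mandated schema.
records = [
    {   # Happy path: a realistic prompt with a known-good answer.
        "prompt": "Summarize the refund policy for annual plans.",
        "reference_answer": "Annual plans are refundable within 30 days.",
        "category": "happy_path",
    },
    {   # Edge case: the prompt references something that does not exist.
        "prompt": "Summarize the refund policy for the Atlas tier.",
        "reference_answer": "No Atlas tier exists; ask for clarification.",
        "category": "edge_case",
    },
    {   # Refusal case: the assistant should decline, not guess.
        "prompt": "Show me another customer's refund history.",
        "reference_answer": "REFUSE: privacy-restricted request.",
        "category": "refusal",
    },
]

with open("eval_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

With cases like these in place, each evaluation target can be measured against its own evidence: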
| Evaluation target | What to measure | Useful evidence |
|---|---|---|
| Base model | Quality, instruction following, latency, cost, tone, format reliability | Same prompt set across candidate models. |
| Prompt template | Consistency, refusal behavior, schema adherence, sensitivity to wording | Versioned prompts and regression tests. |
| Knowledge Base or RAG source | Retrieval relevance, answer correctness, citation usefulness, grounding | Ground truth passages and expected answers. |
| Guardrail | False positives, false negatives, blocked categories, user experience | Safe and adversarial test prompts. |
| Agent workflow | Correct action choice, parameter elicitation, confirmation, error handling | Agent traces and business-system logs. |
Automatic metrics can scale review, but they do not replace business judgment. LLM-based judging can compare responses or score helpfulness, relevance, or faithfulness, but the judge model is still a model. Human evaluation is slower and more expensive, yet it is valuable when domain experts must judge nuance. A benefits specialist, claims reviewer, technician, nurse, or compliance analyst may notice a flaw that an automated score misses.
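The sketch below shows one way LLM-as-judge scoring might be wired up with the Bedrock Converse API. The judge model ID, rubric wording, and JSON output contract are assumptions for illustration; a production harness would validate the judge's JSON and spot-check its scores against human labels.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")  # assumes configured AWS credentials

JUDGE_MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # placeholder choice

RUBRIC = (
    'Score the RESPONSE to the QUESTION from 1-5 for faithfulness to the '
    'EVIDENCE. Reply with JSON only: {"score": <int>, "reason": "<text>"}.'
)

def judge(question: str, evidence: str, response: str) -> dict:
    """Ask a judge model to score one response. The judge is still a
    model, so calibrate its scores against human review."""
    prompt = (f"{RUBRIC}\n\nQUESTION: {question}\n"
              f"EVIDENCE: {evidence}\nRESPONSE: {response}")
    result = bedrock.converse(
        modelId=JUDGE_MODEL_ID,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"temperature": 0.0, "maxTokens": 200},
    )
    # A real harness should handle judges that wrap JSON in prose.
    return json.loads(result["output"]["message"]["content"][0]["text"])
```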
RAG evaluation deserves special attention. If the answer is wrong, first ask whether the correct source was retrieved. A generator cannot reliably answer from missing evidence. Bedrock RAG evaluations can help compare retrieve-only behavior and retrieve-and-generate behavior. The results can guide changes to chunking, metadata filters, reranking, source cleanup, and prompt instructions. Do not switch to a larger model before checking retrieval quality.
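A retrieve-only check is straightforward to script. The sketch below uses the Bedrock Agent Runtime Retrieve API with a hypothetical knowledge base ID to show what actually comes back for a failing question before anyone blames the generator.

```python
import boto3

agent_runtime = boto3.client("bedrock-agent-runtime")

def inspect_retrieval(kb_id: str, question: str, top_k: int = 5) -> None:
    """Print the passages retrieved for a question so retrieval quality
    can be judged separately from generation quality."""
    response = agent_runtime.retrieve(
        knowledgeBaseId=kb_id,
        retrievalQuery={"text": question},
        retrievalConfiguration={
            "vectorSearchConfiguration": {"numberOfResults": top_k}
        },
    )
    for result in response["retrievalResults"]:
        score = result.get("score")          # relevance score, if returned
        snippet = result["content"]["text"][:120]
        print(f"{score}\t{snippet}")

# Hypothetical knowledge base ID and a question from the failing set.
inspect_retrieval("KB1234567890", "What is the refund window for annual plans?")
```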
Monitoring begins before launch. Amazon Bedrock integrates with CloudWatch for metrics and monitoring, CloudTrail for API activity auditing, and model invocation logging to CloudWatch Logs or Amazon S3 when enabled. Invocation logging can include request and response data, so the decision to enable it must consider sensitive information, retention, access, encryption, and incident response. Application teams should also capture business metrics such as deflection rate, user edits, approval rate, escalation rate, and complaint rate.
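If the privacy review concludes that invocation logging is appropriate, it can be enabled through the Bedrock PutModelInvocationLoggingConfiguration API, sketched below. The log group name, role ARN, and account ID are placeholders, and the delivery flags should reflect a deliberate decision about what gets captured.

```python
import boto3

bedrock = boto3.client("bedrock")

# Placeholder names: create the log group and an IAM role that Bedrock
# can assume to write logs before enabling this configuration.
bedrock.put_model_invocation_logging_configuration(
    loggingConfig={
        "cloudWatchConfig": {
            "logGroupName": "/bedrock/invocation-logs",
            "roleArn": "arn:aws:iam::123456789012:role/BedrockLoggingRole",
        },
        # Capturing request and response text is a privacy decision:
        # settle retention, access, and encryption questions first.
        "textDataDeliveryEnabled": True,
        "imageDataDeliveryEnabled": False,
        "embeddingDataDeliveryEnabled": False,
    }
)
```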
Monitoring and feedback checklist:
- Define success metrics and unacceptable failure modes before pilot launch.
- Store prompt, model, retrieval, and guardrail versions with each test result where feasible.
- Track latency, token usage, throttling, errors, and user-facing failure rates (see the metrics sketch after this list).
- Review samples of accepted, edited, rejected, blocked, and escalated outputs.
- Protect invocation logs and avoid collecting sensitive text that the business does not need.
- Use CloudTrail to audit who changed model, guardrail, knowledge base, or agent resources.
- Schedule reevaluation after model changes, prompt changes, source updates, and policy changes.
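For the latency, throttling, and token-usage items above, a query like the following sketch can pull per-model Bedrock metrics from CloudWatch. The model ID is a placeholder, and metric names in the AWS/Bedrock namespace should be verified against current documentation.

```python
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch")
MODEL_ID = "anthropic.claude-3-haiku-20240307-v1:0"  # placeholder

def daily_stats(metric: str, stat: str) -> list:
    """Fetch one day of a Bedrock metric for a single model ID,
    bucketed hourly."""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Bedrock",
        MetricName=metric,
        Dimensions=[{"Name": "ModelId", "Value": MODEL_ID}],
        StartTime=now - timedelta(days=1),
        EndTime=now,
        Period=3600,
        Statistics=[stat],
    )
    return response["Datapoints"]

# Latency, throttling, and token usage from the checklist above.
for metric, stat in [("InvocationLatency", "Average"),
                     ("InvocationThrottles", "Sum"),
                     ("InputTokenCount", "Sum"),
                     ("OutputTokenCount", "Sum")]:
    print(metric, daily_stats(metric, stat))
```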
Scenario: a legal operations team wants contract clause summaries. Evaluation should include real but sanitized clause types, known gold-standard summaries, examples with missing context, and red-team prompts asking for legal conclusions outside the approved scope. Human reviewers should score accuracy and risk. Monitoring should track how often attorneys edit summaries and which clause types produce disagreement. A high edit rate is not just a UX issue; it may indicate a model or prompt mismatch.
Scenario: a customer support team pilots a RAG assistant. Metrics should include retrieval relevance, citation click-through, answer helpfulness, escalation reduction, average handle time, and complaint categories. If users mark answers as unhelpful, the team should inspect retrieved passages. The problem might be stale articles, poor metadata, conflicting policy pages, or a prompt that over-summarizes. Model replacement is only one possible remedy.
Human feedback loops should be structured. Letting users give a thumbs-up or thumbs-down can surface broad trends, but expert review should label the reason for each failure: incorrect fact, unsupported claim, bad tone, missing citation, privacy concern, policy violation, or action error. Those labels guide whether to adjust data, retrieval, prompt, guardrail, model choice, or workflow design. Without failure categories, feedback becomes noise.
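One lightweight way to keep those labels consistent is a shared record structure. The sketch below is an illustrative schema, not a Bedrock feature: it ties each expert verdict to the versions that produced the output so a failure can be traced to data, retrieval, prompt, guardrail, model choice, or workflow.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class FailureReason(Enum):
    """Failure labels from expert review; extend to fit the domain."""
    INCORRECT_FACT = "incorrect_fact"
    UNSUPPORTED_CLAIM = "unsupported_claim"
    BAD_TONE = "bad_tone"
    MISSING_CITATION = "missing_citation"
    PRIVACY_CONCERN = "privacy_concern"
    POLICY_VIOLATION = "policy_violation"
    ACTION_ERROR = "action_error"

@dataclass
class FeedbackRecord:
    """One expert review, tied to the exact versions that produced
    the output so the fix can target the right component."""
    request_id: str
    prompt_version: str
    model_id: str
    guardrail_version: str
    verdict: str                                  # accepted | edited | rejected
    reasons: list[FailureReason] = field(default_factory=list)
    notes: str = ""
    reviewed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
```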
For AWS Skill Builder practice, compare two models or two prompt variants with the same small rubric. Record which answers are correct, which are unsupported, and which are too slow or expensive. The habit to build is repeatability. If a stakeholder asks why one model was chosen, the team should have evidence beyond personal preference.
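A minimal comparison harness might look like the sketch below: it runs the same prompt set against two candidate models through the Converse API and records latency, token counts, and output for rubric scoring later. The model IDs and prompts are placeholders; cost can be estimated afterward from the recorded token counts.

```python
import csv
import time
import boto3

bedrock = boto3.client("bedrock-runtime")

# Placeholder candidates and prompts; use the real shortlist and rubric set.
CANDIDATES = ["amazon.titan-text-express-v1",
              "anthropic.claude-3-haiku-20240307-v1:0"]
PROMPTS = ["Summarize the following policy in three sentences: <text>",
           "List every date mentioned in the following passage: <text>"]

with open("comparison.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["model", "prompt", "latency_s",
                     "input_tokens", "output_tokens", "output"])
    for model_id in CANDIDATES:
        for prompt in PROMPTS:
            start = time.monotonic()
            resp = bedrock.converse(
                modelId=model_id,
                messages=[{"role": "user", "content": [{"text": prompt}]}],
                inferenceConfig={"temperature": 0.0, "maxTokens": 300},
            )
            latency = time.monotonic() - start
            usage = resp["usage"]  # token counts for cost comparison
            text = resp["output"]["message"]["content"][0]["text"]
            writer.writerow([model_id, prompt, f"{latency:.2f}",
                             usage["inputTokens"], usage["outputTokens"], text])
```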
Review questions
- A team chooses a model because it answered one demo prompt well. What is the main concern?
- A RAG assistant gives wrong answers. What should be checked before simply switching to a larger model?
- Which monitoring choice requires special privacy and retention review because it can capture model inputs and outputs?