9.4 Logging, Monitoring, CloudTrail, CloudWatch, and Config
Key Takeaways
- CloudTrail records AWS API activity and helps answer who did what, when, from where, and against which AWS resource.
- CloudWatch supports operational visibility through metrics, logs, dashboards, alarms, and service-specific signals such as errors, throttles, and latency where available.
- AWS Config records resource configuration history and evaluates resources against rules, which helps with drift detection and compliance monitoring.
- AI logging must balance troubleshooting value with privacy risk because prompts, responses, retrieved context, and traces can contain sensitive data.
- A practitioner should ask which events are logged, who reviews them, where alerts go, how long logs are retained, and whether logs support audit and incident response.
Observability as a governance control
An AI application should leave enough evidence to answer basic operational and security questions. Who changed model access? Which application role invoked a model? Did an agent call a fulfillment function? Are requests failing because of throttling? Did a bucket policy drift from the approved standard? Logging and monitoring make those questions answerable. Without them, governance becomes a policy document with no feedback loop.
AWS CloudTrail records AWS API activity for an account. It helps teams investigate who made a call, what action was requested, when it happened, where it came from, and which resource was involved. For AI scenarios, CloudTrail can help track administrative and service API activity such as changes to Bedrock resources, IAM policies, S3 access, Lambda actions, or SageMaker AI resources. It is not a replacement for application-level logs, but it is a core audit source.
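The CloudTrail record format makes those audit questions directly answerable. The sketch below parses a trimmed, hypothetical record to pull out who, what, when, and where; the field names follow the CloudTrail event record format, but the values (account, user, IP) are invented for illustration:

```python
import json

# A trimmed CloudTrail record. Field names follow the CloudTrail event
# record format; the values are hypothetical.
record = json.loads("""
{
  "eventTime": "2024-05-01T12:34:56Z",
  "eventSource": "bedrock.amazonaws.com",
  "eventName": "PutModelInvocationLoggingConfiguration",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.10",
  "userIdentity": {"type": "IAMUser", "arn": "arn:aws:iam::111122223333:user/alice"}
}
""")

def summarize(event: dict) -> str:
    """Answer the core audit questions: who, what, when, from where."""
    who = event.get("userIdentity", {}).get("arn", "unknown")
    return (f"{who} called {event['eventName']} on {event['eventSource']} "
            f"at {event['eventTime']} from {event['sourceIPAddress']}")

print(summarize(record))
```

In practice, teams query these records through CloudTrail event history, Athena, or CloudWatch Logs rather than parsing them by hand, but the who/what/when/where fields are the same.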
Amazon CloudWatch provides metrics, logs, dashboards, and alarms. An AI team might use CloudWatch metrics to watch invocation counts, latency, errors, throttles, and downstream service health where the service publishes those signals. Application logs can capture request IDs, user IDs, guardrail outcomes, retrieval IDs, and tool execution status. Alarms can notify teams when errors spike, latency exceeds a threshold, or usage patterns look abnormal.
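Alarm behavior can be reasoned about with plain threshold logic. The sketch below mimics a CloudWatch alarm that only fires after consecutive breaching periods, which avoids paging on a single noisy datapoint; the datapoints and thresholds are hypothetical:

```python
# Hypothetical per-minute datapoints: (invocations, errors, p95 latency in ms).
datapoints = [(120, 1, 850), (130, 2, 900), (125, 30, 2400), (118, 28, 2600)]

ERROR_RATE_THRESHOLD = 0.05   # breach if more than 5% of invocations fail
LATENCY_THRESHOLD_MS = 2000   # breach if p95 latency exceeds 2 seconds
EVALUATION_PERIODS = 2        # require consecutive breaching periods

def breaching(point) -> bool:
    invocations, errors, p95 = point
    return errors / invocations > ERROR_RATE_THRESHOLD or p95 > LATENCY_THRESHOLD_MS

# Fires only when the last N periods all breach, similar to a CloudWatch
# alarm configured with multiple evaluation periods.
in_alarm = all(breaching(p) for p in datapoints[-EVALUATION_PERIODS:])
print("ALARM" if in_alarm else "OK")
```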
AWS Config records configuration changes for supported AWS resources and can evaluate them against rules. This is useful when governance depends on configuration state. For example, Config can help detect whether S3 buckets meet encryption expectations, whether public access settings changed, or whether resource configurations drift from approved baselines. It does not decide whether a model answer was correct; it helps monitor the AWS resource environment around the AI workload.
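At its core, a Config rule compares recorded resource configuration against an approved baseline. The illustrative check below returns a Config-style compliance verdict; the configuration keys and baseline values are simplified stand-ins, not the actual Config resource schema:

```python
# Approved baseline for S3 buckets holding AI logs (illustrative values).
BASELINE = {"encryption": "aws:kms", "block_public_access": True}

def evaluate(configuration: dict) -> str:
    """Return a Config-style verdict for one recorded resource configuration."""
    drifted = [key for key, expected in BASELINE.items()
               if configuration.get(key) != expected]
    return "COMPLIANT" if not drifted else "NON_COMPLIANT: " + ", ".join(drifted)

print(evaluate({"encryption": "aws:kms", "block_public_access": True}))
print(evaluate({"encryption": "AES256", "block_public_access": False}))
```

Config evaluates this kind of check continuously as configurations change, which is what turns a written baseline into drift detection.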
| Monitoring source | Best at answering | AI governance example |
|---|---|---|
| CloudTrail | Who called which AWS API and when? | Investigate who changed model access, IAM policies, or logging settings. |
| CloudWatch Metrics | Is the workload healthy and within expected behavior? | Watch invocation volume, errors, latency, throttles, and function failures. |
| CloudWatch Logs | What happened inside the application or service flow? | Review request IDs, retrieval status, guardrail outcomes, or agent tool calls. |
| AWS Config | Did resource configuration change or drift from policy? | Detect storage, network, or encryption configuration changes. |
| Service-specific reports | What did the AI service or security service detect? | Review Macie findings, Guardrails activity, or model invocation logs where enabled. |
The hardest logging decision in AI is content sensitivity. Prompts and responses are valuable for debugging hallucinations, policy failures, prompt injection attempts, and user complaints. They can also contain confidential information, personal data, secrets, or regulated content. A team should decide whether to log full content, metadata only, redacted content, or sampled content. The answer may differ between sandbox and production.
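One way to make that decision enforceable is to route every log write through a single function that applies the chosen policy. A minimal sketch, assuming three policies (full, redacted, metadata-only) and using email redaction as a stand-in for real PII detection:

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")  # placeholder for real PII detection

def log_entry(prompt: str, response: str, mode: str) -> dict:
    """Build one log record under a content policy: full, redacted, or metadata."""
    entry = {
        # Hash and sizes support correlation and debugging without storing content.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest()[:12],
        "prompt_chars": len(prompt),
        "response_chars": len(response),
    }
    if mode == "full":
        entry.update(prompt=prompt, response=response)
    elif mode == "redacted":
        entry.update(prompt=EMAIL.sub("[EMAIL]", prompt),
                     response=EMAIL.sub("[EMAIL]", response))
    return entry  # "metadata" mode logs only hashes and sizes

print(log_entry("Contact alice@example.com about my claim", "Done.", "redacted"))
```

A sandbox environment might run in full mode while production runs redacted or metadata-only, which matches the idea that the answer can differ by environment.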
Model invocation logs, application traces, and agent traces should be treated as sensitive if they include prompt text, retrieved passages, intermediate reasoning traces, tool inputs, or response content. The destinations that store them need encryption, IAM controls, retention rules, and access review. A support engineer may need operational metadata, but not every engineer needs full customer transcripts or legal document excerpts.
Monitoring should include both technical and business signals. Technical signals include failed invocations, throttling, Lambda errors, vector store failures, timeouts, and access denied errors. Business signals include user escalation rates, blocked guardrail events, low-quality feedback, high-risk output categories, and unexpected action attempts. A practitioner should ask how the team will notice that the AI solution is not behaving as intended.
Incident response benefits from correlation IDs. If a user reports a harmful answer, the team should be able to trace the user session, model invocation, retrieved sources, guardrail result, tool call, and output if policy allows. If an agent created an incorrect ticket, the team should know which tool was called, with which parameters, by which application role, and whether the downstream system accepted the action.
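Tracing like this only works if every component writes the same ID into its records. A minimal sketch, with hypothetical log records joined on a shared `correlation_id`:

```python
# Hypothetical log records from different components of one AI application,
# each carrying the correlation ID assigned when the request entered the system.
records = [
    {"correlation_id": "req-42", "component": "app", "event": "user_prompt_received"},
    {"correlation_id": "req-42", "component": "retrieval", "event": "retrieved 3 chunks"},
    {"correlation_id": "req-42", "component": "guardrail", "event": "outcome=INTERVENED"},
    {"correlation_id": "req-17", "component": "app", "event": "user_prompt_received"},
]

def trace(correlation_id: str) -> list[str]:
    """Reconstruct one request's path across components, in log order."""
    return [f"{r['component']}: {r['event']}"
            for r in records if r["correlation_id"] == correlation_id]

for step in trace("req-42"):
    print(step)
```

In a real deployment the join would run as a query over CloudWatch Logs or a log analytics store, but the prerequisite is the same: the ID must be propagated to every hop.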
Monitoring workflow for an AI release:
- Identify the questions logs must answer for audit, support, privacy, and incident response.
- Enable CloudTrail and protect trails according to organizational policy.
- Decide which application and model invocation details should be logged.
- Send operational metrics and errors to CloudWatch dashboards and alarms.
- Use AWS Config or equivalent controls to monitor configuration drift for key resources.
- Protect logs with encryption, IAM, retention, and review processes.
- Test an incident scenario before launch and confirm the team can reconstruct the event.
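The alarms step in that workflow can be written down as a CloudWatch alarm definition. The parameters below follow the shape accepted by the CloudWatch `put_metric_alarm` API; the `AWS/Bedrock` namespace and `InvocationThrottles` metric reflect Bedrock's published metrics but should be verified against your account, and the SNS topic ARN is hypothetical:

```python
# Alarm definition for invocation throttling. Keys match the parameter names
# of CloudWatch put_metric_alarm; all resource identifiers are hypothetical.
throttle_alarm = {
    "AlarmName": "ai-app-invocation-throttles",
    "Namespace": "AWS/Bedrock",
    "MetricName": "InvocationThrottles",
    "Statistic": "Sum",
    "Period": 60,                      # one-minute windows
    "EvaluationPeriods": 3,            # three consecutive breaching periods
    "Threshold": 10,
    "ComparisonOperator": "GreaterThanThreshold",
    "AlarmActions": ["arn:aws:sns:us-east-1:111122223333:ai-ops-alerts"],
}

# With credentials configured, this definition would be applied as:
#   boto3.client("cloudwatch").put_metric_alarm(**throttle_alarm)
```

Keeping alarm definitions in code like this also lets Config-style drift checks apply to the monitoring itself.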
Scenario: a Bedrock agent starts returning access denied errors. CloudWatch logs might show failed Lambda calls or application errors. CloudTrail can show whether an IAM policy changed. AWS Config can show whether a resource configuration drifted. Together, these services help narrow the cause without guessing from the model response alone.
Scenario: a public assistant suddenly receives many prompt injection attempts. Application logs and Guardrails events can show blocked prompts or unusual input patterns. CloudWatch alarms can alert on volume spikes. CloudTrail can help confirm whether permissions were changed during the same period. The response may involve tightening guardrails, reviewing source documents, and checking whether any sensitive output was returned.
For AWS Skill Builder practice, do not treat CloudTrail, CloudWatch, and Config as interchangeable. CloudTrail is about API activity. CloudWatch is about metrics, logs, alarms, and operations. Config is about resource configuration history and compliance checks. Knowing that boundary is enough for many practitioner scenarios.
Review Questions
- A security analyst asks who changed the IAM policy that allowed broader model invocation. Which AWS service is the best starting point?
- An AI application has rising latency and throttling errors. Which service family is most relevant for metrics, logs, dashboards, and alarms?
- A governance team wants to detect whether resource configurations drift from approved encryption and public-access settings. Which service is the best fit?