6.3 CloudWatch, CloudTrail, and AWS Config
Key Takeaways
- Amazon CloudWatch monitors performance metrics, collects logs, sets alarms, and triggers automated actions — it answers 'how are my resources performing?'
- AWS CloudTrail records management and data API calls — it is the audit trail answering 'who did what, when, and from where?'
- AWS Config records resource configuration over time and evaluates rules — it answers 'is this resource configured correctly and compliant?'
- EC2 memory and disk-space utilization are NOT default CloudWatch metrics; you must install the CloudWatch agent to publish them.
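- CloudWatch metrics are retained on a tiered schedule: sub-minute (high-resolution) data points for 3 hours, 1-minute data points for 15 days, 5-minute data points for 63 days, and 1-hour data points for 455 days (15 months). Alarm actions include SNS notifications, Auto Scaling, EC2 stop/terminate/reboot/recover, and Lambda.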
Amazon CloudWatch — Performance and Observability
Quick Answer: CloudWatch = metrics + logs + alarms ("how is it performing?"). CloudTrail = API call history ("who did what?"). AWS Config = configuration history + compliance rules ("is it configured correctly?"). The exam loves one-line scenarios that hinge on exactly which of these three answers the question.
CloudWatch ingests metrics (time-series numbers), stores logs, and fires alarms that drive automation. Knowing which metrics exist by default is a recurring exam point.
| Component | What it does |
|---|---|
| Metrics | Store time-series data points such as CPUUtilization and NetworkIn/NetworkOut, plus custom metrics published via the PutMetricData API |
| Logs | Centralize application and service logs into log groups/streams |
| Alarms | Watch a metric/expression and act when a threshold is breached |
| Dashboards | Visualize metrics and logs across Regions and accounts |
| Logs Insights | Purpose-built query language over log data |
| Synthetics | Canaries that probe URLs/APIs on a schedule |
| Contributor Insights | Identify top-N contributors to a spike |
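To make the "custom metrics published via the API" row concrete, here is a minimal sketch of how a data point for `PutMetricData` might be assembled. The metric name `OrdersProcessed`, the `Service` dimension, and the `ExampleApp/Checkout` namespace are hypothetical; the `publish` helper assumes boto3 is installed and AWS credentials are configured, and it is never called here.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None, timestamp=None):
    """Build one entry for the MetricData list of a PutMetricData request."""
    return {
        "MetricName": name,
        "Dimensions": [
            {"Name": key, "Value": val} for key, val in (dimensions or {}).items()
        ],
        "Timestamp": timestamp or datetime.now(timezone.utc),
        "Value": float(value),
        "Unit": unit,
    }


def publish(namespace, data):
    """Send data points to CloudWatch (assumes boto3 and credentials exist)."""
    import boto3  # imported lazily so the builder works without the SDK

    boto3.client("cloudwatch").put_metric_data(Namespace=namespace, MetricData=data)


# Hypothetical business metric for an order-processing service.
datum = build_metric_datum("OrdersProcessed", 42, dimensions={"Service": "checkout"})
# publish("ExampleApp/Checkout", [datum])  # uncomment to send for real
```

Separating the builder from the network call keeps the payload easy to inspect and unit test before anything is sent.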
Default vs. agent-required metrics
| Service | Default metrics | Requires CloudWatch agent |
|---|---|---|
| EC2 | CPUUtilization, NetworkIn/NetworkOut, DiskReadOps (instance store volumes only), StatusCheckFailed | Memory utilization, disk free space, swap |
| RDS | CPUUtilization, FreeableMemory, DatabaseConnections, ReadIOPS/WriteIOPS | (comprehensive by default) |
| Lambda | Invocations, Duration, Errors, Throttles | Custom business metrics via SDK |
| ALB | RequestCount, TargetResponseTime, HTTPCode_Target_5XX_Count | (comprehensive by default) |
Critical trap: A scenario says "alarm when EC2 memory exceeds 90%." Memory is NOT a default metric, because the hypervisor cannot see inside the guest operating system. Install the CloudWatch agent on the instance to publish it, then alarm on the resulting custom metric.
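Once the agent is publishing memory data, the alarm itself is an ordinary metric alarm. The sketch below builds parameters for `put_metric_alarm` under several assumptions: the agent uses its default `CWAgent` namespace, it runs on Linux (where the metric is named `mem_used_percent`), and it has been configured to attach an `InstanceId` dimension. The instance ID and SNS topic ARN are placeholders.

```python
def memory_alarm_params(instance_id, sns_topic_arn, threshold_percent=90.0):
    """Parameters for an alarm on the agent-published memory metric."""
    return {
        "AlarmName": f"high-memory-{instance_id}",
        "Namespace": "CWAgent",              # assumed: agent default namespace
        "MetricName": "mem_used_percent",    # assumed: Linux agent metric name
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                       # seconds per evaluation period
        "EvaluationPeriods": 2,              # two consecutive breaching periods
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params):
    """Create the alarm (assumes boto3 and credentials exist)."""
    import boto3  # lazy import keeps the builder usable without the SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder identifiers, for illustration only.
params = memory_alarm_params(
    "i-0123456789abcdef0",
    "arn:aws:sns:us-east-1:111122223333:ops-alerts",
)
```

If the agent is configured with a different namespace or extra dimensions, the alarm must match them exactly, otherwise it will sit in INSUFFICIENT_DATA.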
Alarms and automated response
An alarm has three states: OK (the metric is within the threshold), ALARM (the threshold is breached), and INSUFFICIENT_DATA (the alarm has just started or not enough data is available to decide). Alarm actions include sending an SNS notification, triggering Auto Scaling, performing an EC2 action (stop, terminate, reboot, or recover), or invoking Lambda. Recover migrates the instance to healthy hardware while preserving its instance ID, private IP addresses, and Elastic IP addresses. Metric filters turn log patterns (such as counting ERROR lines) into metrics you can alarm on, and subscription filters stream logs in near real time to Lambda, Kinesis Data Streams, Amazon Data Firehose, or OpenSearch Service.
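The three alarm states can be illustrated with a deliberately simplified model. This sketch is not CloudWatch's actual evaluation logic: it assumes a greater-than comparison, requires every one of the last `evaluation_periods` data points to breach, and treats any missing point (`None`) as insufficient data, whereas the real service also supports "M out of N" evaluation and configurable missing-data handling.

```python
def alarm_state(datapoints, threshold, evaluation_periods):
    """Return OK, ALARM, or INSUFFICIENT_DATA for the most recent periods.

    Simplified model: assumes evaluation_periods >= 1 and a
    greater-than-threshold comparison.
    """
    recent = datapoints[-evaluation_periods:]
    if len(recent) < evaluation_periods or any(p is None for p in recent):
        return "INSUFFICIENT_DATA"
    if all(p > threshold for p in recent):
        return "ALARM"
    return "OK"


# CPU percentages per period, oldest first (made-up values).
state_breaching = alarm_state([50, 95, 97], threshold=90, evaluation_periods=2)
state_recovered = alarm_state([95, 50], threshold=90, evaluation_periods=2)
state_too_few = alarm_state([95], threshold=90, evaluation_periods=2)
```

Requiring several consecutive breaching periods is the usual way to avoid paging on a single momentary spike.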
CloudTrail and AWS Config — Audit and Compliance
AWS CloudTrail (the audit log)
CloudTrail records each API call with the identity (IAM principal), action, timestamp, source IP, and request/response details. By default, management events are viewable for 90 days in Event history at no cost; to retain them longer or analyze them at scale, create a trail that delivers logs to S3.
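The "who did what, when, and from where" fields map directly onto keys in a CloudTrail event record. The following sketch pulls them out of a single record; the record itself is an abbreviated, made-up example (real records carry many more fields), and the `summarize` helper is hypothetical.

```python
import json

# Abbreviated, fabricated record for illustration only.
SAMPLE_RECORD = json.loads("""
{
  "eventTime": "2024-01-09T02:00:13Z",
  "eventSource": "s3.amazonaws.com",
  "eventName": "DeleteBucket",
  "awsRegion": "us-east-1",
  "sourceIPAddress": "203.0.113.10",
  "userIdentity": {
    "type": "IAMUser",
    "arn": "arn:aws:iam::111122223333:user/example-user"
  }
}
""")


def summarize(record):
    """Reduce one CloudTrail record to who / what / when / where."""
    identity = record.get("userIdentity", {})
    return {
        "who": identity.get("arn", identity.get("type", "unknown")),
        "what": f'{record["eventSource"]}:{record["eventName"]}',
        "when": record["eventTime"],
        "where": record.get("sourceIPAddress", "unknown"),
    }


summary = summarize(SAMPLE_RECORD)
```

Some identities, such as AWS service principals, do not carry an `arn`, which is why the helper falls back to the identity `type`.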
| Event type | Examples | Notes |
|---|---|---|
| Management events | RunInstances, DeleteBucket, console sign-in | 90-day history free; trail for long-term |
| Data events | S3 GetObject/PutObject, Lambda Invoke | High volume; charged; off by default |
| Insights events | Unusual call-rate anomalies | Charged; flags spikes like mass deletes |
A single multi-Region trail captures events from every Region, and an organization trail captures all accounts in AWS Organizations. Log file integrity validation uses SHA-256 hashing and digitally signed digest files to prove logs were not altered after delivery, which is vital for forensic and audit defensibility.
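The idea behind integrity validation can be sketched in a few lines: a hash recorded at delivery time is compared with a hash recomputed later. This is only the core concept under simplified assumptions. Real validation also verifies the signatures on the digest files and the chain between them, which the AWS CLI's `aws cloudtrail validate-logs` command handles. The function names and sample bytes are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def is_unmodified(log_bytes: bytes, recorded_hash: str) -> bool:
    """True if the log content still matches the hash recorded at delivery."""
    return sha256_hex(log_bytes) == recorded_hash.lower()


# Stand-in for a delivered log file and its recorded hash.
original = b'{"Records": []}'
recorded = sha256_hex(original)

# Even a one-byte change produces a different hash.
tampered = original + b" "
intact_ok = is_unmodified(original, recorded)
tampered_ok = is_unmodified(tampered, recorded)
```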
AWS Config (the configuration recorder)
Config continuously records resource configuration items and builds a per-resource timeline you can rewind. Config rules evaluate desired state and mark resources compliant or not; remediation can auto-fix via SSM Automation; an aggregator rolls up many accounts/Regions.
| Rule | What it enforces |
|---|---|
| encrypted-volumes | Attached EBS volumes are encrypted |
| s3-bucket-versioning-enabled | Buckets have versioning on |
| rds-instance-public-access-check | RDS is not publicly reachable |
| iam-root-access-key-check | Root has no access keys |
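Conceptually, a rule like `encrypted-volumes` is a function from a recorded configuration item to a compliance verdict. The sketch below imitates that logic locally. It assumes the configuration item for an `AWS::EC2::Volume` exposes a boolean `configuration.encrypted` field, and the items shown are trimmed, made-up examples rather than real recorder output.

```python
def evaluate_volume_encryption(configuration_item):
    """Return a Config-style compliance verdict for one configuration item."""
    if configuration_item.get("resourceType") != "AWS::EC2::Volume":
        return "NOT_APPLICABLE"
    # Assumed field layout: configuration.encrypted is a boolean.
    encrypted = configuration_item.get("configuration", {}).get("encrypted", False)
    return "COMPLIANT" if encrypted else "NON_COMPLIANT"


# Trimmed, fabricated configuration items.
items = [
    {"resourceType": "AWS::EC2::Volume", "resourceId": "vol-aaa",
     "configuration": {"encrypted": True}},
    {"resourceType": "AWS::EC2::Volume", "resourceId": "vol-bbb",
     "configuration": {"encrypted": False}},
    {"resourceType": "AWS::S3::Bucket", "resourceId": "example-bucket",
     "configuration": {}},
]

verdicts = {item["resourceId"]: evaluate_volume_encryption(item) for item in items}
```

In a real deployment the managed rule does this evaluation for you; a custom rule would wrap similar logic in a Lambda function and report the verdict back to Config.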
The three pillars side by side
| Dimension | CloudWatch | CloudTrail | AWS Config |
|---|---|---|---|
| Question | How is it performing? | Who did what? | Is it configured correctly? |
| Data | Metrics, logs, alarms | API call records | Config snapshots + history |
| Typical use | Alarm on CPU/memory | Investigate an incident | Enforce/auto-remediate compliance |
On the Exam: "Who terminated the instance at 2 a.m.?" → CloudTrail. "Alert when CPU > 80%" → CloudWatch. "Continuously verify every bucket has versioning and fix violations" → AWS Config rule + SSM remediation.
How the three pillars work together
The services are complementary, and strong architectures use all three. Imagine an unexpected production outage: CloudWatch alarms first detect the symptom (latency spiked, 5XX errors climbed), CloudTrail then reveals the cause (an engineer modified a security group at 1:58 a.m.), and AWS Config shows the exact before-and-after configuration on its resource timeline and flags that the change violated a compliance rule.
A common multi-account design centralizes all three: an organization CloudTrail writes to a dedicated logging account's S3 bucket, a CloudWatch cross-account dashboard aggregates metrics, and a Config aggregator rolls up compliance — giving security teams one authoritative view.
EventBridge and automated remediation
CloudWatch Events evolved into Amazon EventBridge, which reacts to events (including CloudTrail-recorded API calls and Config rule changes) and routes them to targets like Lambda, SNS, or Step Functions. This is the glue for event-driven remediation: when Config flags a public S3 bucket, an EventBridge rule can invoke a Lambda that blocks public access automatically. Distinguish the roles on the exam — CloudWatch alarms watch numeric metric thresholds, while EventBridge rules match event patterns and content.
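The difference between matching event content and watching a numeric threshold is easier to see with a pattern in hand. Below is an event pattern for Config compliance changes together with a toy matcher. The matcher is a hypothetical helper that implements only a small subset of EventBridge semantics (exact value matching and nested objects, with no prefix, numeric, or `anything-but` operators), and the sample event is abbreviated.

```python
# Pattern: match Config rule evaluations that became NON_COMPLIANT.
PATTERN = {
    "source": ["aws.config"],
    "detail-type": ["Config Rules Compliance Change"],
    "detail": {"newEvaluationResult": {"complianceType": ["NON_COMPLIANT"]}},
}


def matches(pattern, event):
    """Simplified matcher: lists are allowed values, dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Abbreviated, fabricated event.
EVENT = {
    "source": "aws.config",
    "detail-type": "Config Rules Compliance Change",
    "detail": {
        "configRuleName": "s3-bucket-public-read-prohibited",
        "resourceType": "AWS::S3::Bucket",
        "newEvaluationResult": {"complianceType": "NON_COMPLIANT"},
    },
}

should_remediate = matches(PATTERN, EVENT)
```

Fields present in the event but absent from the pattern, such as `configRuleName` here, are ignored, so a pattern only needs to name the fields it cares about.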
A "when an unauthorized API call happens, trigger a workflow" requirement points to EventBridge reacting to CloudTrail events, not a CloudWatch alarm. Remember also that CloudWatch Logs log groups never expire by default; until you set a retention policy, storage cost grows silently. This is a frequently tested cost-optimization detail.
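Setting retention is a one-call fix. The sketch below picks a retention period and builds the parameters for `put_retention_policy`. The API accepts only specific day counts; the tuple here lists commonly documented values and may not be exhaustive, so treat it as an assumption and check the current API reference. The log group name is a placeholder.

```python
# Assumed subset of accepted retentionInDays values (may not be exhaustive).
RETENTION_CHOICES = (1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180,
                     365, 400, 545, 731, 1827, 3653)


def retention_params(log_group_name, desired_days):
    """Round desired_days up to the nearest accepted retention value."""
    for days in RETENTION_CHOICES:
        if days >= desired_days:
            return {"logGroupName": log_group_name, "retentionInDays": days}
    raise ValueError(f"no retention option covers {desired_days} days")


def apply_retention(params):
    """Apply the policy (assumes boto3 and credentials exist)."""
    import boto3  # lazy import keeps the helper above usable without the SDK

    boto3.client("logs").put_retention_policy(**params)


# Placeholder log group; 45 days rounds up to the next accepted value.
policy = retention_params("/example/app", 45)
```

Rounding up rather than down errs on the side of keeping logs slightly longer than requested, which is usually the safer default for audit requirements.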
A security team must determine which IAM principal deleted an S3 bucket last Tuesday and from what source IP. Which service provides this?
An operations team needs an alarm when EC2 memory utilization exceeds 90%. What must they do first?
A company must continuously verify that all EBS volumes are encrypted and automatically remediate any that are not. Which approach fits best?