4.1 Observability: CloudWatch & X-Ray

Key Takeaways

  • CloudWatch Logs stores text in log groups (one per app/function) and log streams (one per source); Logs Insights queries them with a purpose-built query language.
  • Embedded Metric Format (EMF) lets Lambda emit custom metrics inside structured logs with no extra PutMetricData call; CloudWatch extracts them automatically.
  • X-Ray annotations are indexed key-value pairs you can filter and search (max 50 per trace); metadata is not indexed and is for context only.
  • Enabling X-Ray active tracing on Lambda records the invocation and function segments, but tracing downstream AWS SDK calls and recording custom annotations or subsegments still requires instrumenting the code with the X-Ray SDK.
  • X-Ray sampling rules control which requests are traced (default: 1 request/second reservoir plus 5% of the rest) to cap cost and overhead.
Last updated: June 2026

Detect, Investigate, Trace

Observability is the heart of the 18% Troubleshooting and Optimization domain. Every scenario maps to one of three jobs: detect and react (Amazon CloudWatch metrics plus alarms), investigate root cause (CloudWatch Logs plus Logs Insights), or trace a single request across services (AWS X-Ray). Read the stem for the verb: "alert when," "find why," or "see which downstream call is slow."

CloudWatch metrics, custom metrics & EMF

A metric is a time-series of data points. AWS services publish them automatically. For AWS Lambda you get Invocations, Errors, Throttles, Duration, ConcurrentExecutions, and DeadLetterErrors with no setup. For app-specific values you publish custom metrics with the PutMetricData API. Standard metrics have 1-minute granularity; high-resolution metrics support down to 1-second granularity by setting StorageResolution: 1.
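As a rough sketch of the PutMetricData request described above, the following builds the arguments for one data point. The namespace Checkout/App, the metric name CacheMisses, and the dimension value are hypothetical. The parameter names follow the boto3 put_metric_data call, which appears only in a comment so the snippet runs without AWS credentials.

```python
from datetime import datetime, timezone


def build_put_metric_data(value: float, high_resolution: bool = False) -> dict:
    """Build keyword arguments for CloudWatch PutMetricData (one data point)."""
    return {
        "Namespace": "Checkout/App",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "CacheMisses",  # hypothetical metric name
                "Dimensions": [{"Name": "Environment", "Value": "prod"}],
                "Timestamp": datetime.now(timezone.utc),
                "Value": value,
                "Unit": "Count",
                # 1 = high-resolution (1-second), 60 = standard (1-minute)
                "StorageResolution": 1 if high_resolution else 60,
            }
        ],
    }


kwargs = build_put_metric_data(3, high_resolution=True)
# With credentials configured, the call would be:
#   import boto3
#   boto3.client("cloudwatch").put_metric_data(**kwargs)
print(kwargs["MetricData"][0]["StorageResolution"])
```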

Metrics carry up to 30 dimensions (name/value pairs such as FunctionName or Environment=prod). Each unique combination of dimensions is a separate metric, so high-cardinality dimensions multiply cost fast — a trap the exam tests indirectly.
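To make the cardinality point concrete, here is a small arithmetic sketch. The dimension names and value counts are invented for illustration, and the totals are a worst case that assumes every combination of values actually occurs.

```python
from math import prod

# Hypothetical number of distinct values per dimension.
dimension_values = {"FunctionName": 20, "Environment": 3, "CustomerId": 5000}

# Unique metrics without the high-cardinality CustomerId dimension.
low = prod(v for k, v in dimension_values.items() if k != "CustomerId")

# Unique metrics once CustomerId is added as a dimension.
high = prod(dimension_values.values())

print(low, high)  # 60 300000
```

Adding one per-customer dimension turns 60 possible metrics into 300,000 in this example, which is why identifiers such as customer or request IDs usually belong in log fields rather than in dimensions.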

From Lambda the efficient pattern is Embedded Metric Format (EMF): write a structured JSON log line and CloudWatch automatically extracts the embedded metric. No second API call, no agent, no added latency. EMF is the recommended way to emit business counters (orders processed, cache misses) from serverless code.
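A minimal sketch of an EMF log line, assuming a Python Lambda function whose standard output goes to CloudWatch Logs. The namespace Checkout/App, the function name checkout, and the helper emf_log_line are hypothetical; the _aws envelope follows the published EMF layout.

```python
import json
import time


def emf_log_line(orders: int, function_name: str) -> str:
    """Return one JSON log line that CloudWatch can parse as an EMF record."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # epoch milliseconds
            "CloudWatchMetrics": [
                {
                    "Namespace": "Checkout/App",  # hypothetical namespace
                    # Each inner list names the dimension keys for one metric.
                    "Dimensions": [["FunctionName"]],
                    "Metrics": [{"Name": "OrdersProcessed", "Unit": "Count"}],
                }
            ],
        },
        # Top-level members supply the dimension and metric values.
        "FunctionName": function_name,
        "OrdersProcessed": orders,
    }
    return json.dumps(record)


def handler(event, context):
    # Printing is enough: no PutMetricData call is made from the function.
    print(emf_log_line(orders=1, function_name="checkout"))
    return {"statusCode": 200}
```

The metric value lives in the same JSON object as any other log fields, so the line stays queryable in Logs Insights as well.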

CloudWatch Logs & Logs Insights

| Concept | What it is |
| --- | --- |
| Log group | Container, usually one per app or function (e.g. /aws/lambda/my-fn); retention is set here |
| Log stream | Sequence of events from one source within a group |
| Metric filter | Turns a matching log pattern into a metric you can alarm on |
| Subscription filter | Streams matching events to Kinesis, Firehose, or Lambda in real time |
| Logs Insights | Query language (fields, filter, stats, parse, sort) across many groups |
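A metric filter as described here could be defined roughly as follows. The filter name, metric name, and namespace are hypothetical; the keys follow the boto3 put_metric_filter call, shown only in a comment so the snippet runs offline.

```python
def build_metric_filter(log_group: str) -> dict:
    """Build keyword arguments that turn ERROR log lines into a count metric."""
    return {
        "logGroupName": log_group,
        "filterName": "error-count",  # hypothetical filter name
        "filterPattern": "ERROR",  # matches events containing the term ERROR
        "metricTransformations": [
            {
                "metricName": "ErrorCount",  # hypothetical metric name
                "metricNamespace": "Checkout/App",  # hypothetical namespace
                "metricValue": "1",  # add 1 per matching event
                "defaultValue": 0.0,  # reported when nothing matches
            }
        ],
    }


filter_kwargs = build_metric_filter("/aws/lambda/my-fn")
# With credentials configured, the call would be:
#   import boto3
#   boto3.client("logs").put_metric_filter(**filter_kwargs)
print(filter_kwargs["metricTransformations"][0]["metricName"])
```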

A quick Logs Insights query to count errors by minute: filter @message like /ERROR/ | stats count() by bin(1m). Note that Lambda needs logs:CreateLogGroup, logs:CreateLogStream, and logs:PutLogEvents on its execution role; a missing-logs scenario is usually a permissions gap, not a code bug.
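The query above can also be run programmatically. This sketch only builds the request, assuming the boto3 CloudWatch Logs client; the log group name and the one-hour window are illustrative choices.

```python
import time

QUERY = "filter @message like /ERROR/ | stats count() by bin(1m)"


def build_start_query(log_group: str, minutes_back: int = 60) -> dict:
    """Build keyword arguments for a Logs Insights StartQuery request."""
    end = int(time.time())
    return {
        "logGroupName": log_group,
        "startTime": end - minutes_back * 60,  # epoch seconds
        "endTime": end,
        "queryString": QUERY,
    }


query_kwargs = build_start_query("/aws/lambda/my-fn")
# With credentials configured, the flow would be:
#   import boto3
#   logs = boto3.client("logs")
#   query_id = logs.start_query(**query_kwargs)["queryId"]
#   results = logs.get_query_results(queryId=query_id)  # poll until complete
print(query_kwargs["queryString"])
```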

Alarms & Lambda Insights

A CloudWatch alarm evaluates one metric (or a metric math expression) against a threshold over a number of evaluation periods, then fires an action: an SNS notification, an Auto Scaling step, or an EC2 action. Lambda Insights is an enhanced-monitoring layer that reports system-level metrics (CPU, memory, network, cold-start init duration) per function, also via EMF.

AWS X-Ray

X-Ray records a trace as a request flows through services and assembles a service map (graph of nodes and edges). Each service emits a segment; downstream calls (DynamoDB, HTTP, SQS) become subsegments with granular timing, so you can pinpoint the slow hop.

| Field | Indexed? | Use |
| --- | --- | --- |
| Annotation | Yes (max 50 per trace) | Filter/search traces by business value (e.g. customerTier) |
| Metadata | No | Store rich context (objects, lists) for reading, not filtering |
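To illustrate the difference, here is a hedged sketch of how the two fields sit inside a subsegment document. The helper make_subsegment, the paymentGateway name, and the sample values are hypothetical. In real code the X-Ray SDK builds and sends this document for you.

```python
import json
import secrets
import time


def make_subsegment(name: str, annotations: dict, metadata: dict) -> dict:
    """Build a simplified X-Ray subsegment document as a plain dict."""
    for key, value in annotations.items():
        # Annotations are indexed, so values must be simple scalars.
        if not isinstance(value, (str, int, float, bool)):
            raise TypeError(f"annotation {key!r} must be a string, number, or boolean")
    now = time.time()
    return {
        "id": secrets.token_hex(8),  # 16 hexadecimal characters
        "name": name,
        "start_time": now,
        "end_time": now + 0.25,  # illustrative duration only
        "annotations": annotations,  # indexed: usable in filter expressions
        "metadata": {"default": metadata},  # not indexed: any JSON value
    }


doc = make_subsegment(
    "paymentGateway",
    annotations={"customerTier": "premium"},
    metadata={"cart": {"items": [1, 2, 3], "currency": "USD"}},
)
print(json.dumps(doc["annotations"]))
```

With the value stored as an annotation, a console filter expression such as annotation.customerTier = "premium" can find the trace; the cart object under metadata is visible when reading the trace but cannot be searched.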

Sampling rules decide which requests are traced, which caps cost and overhead; the default rule is a reservoir of 1 trace per second plus 5% of additional requests. To instrument Lambda, turn on active tracing (a config toggle plus the AWSXRayDaemonWriteAccess policy on the execution role); this records the Lambda service and function segments. To trace downstream AWS SDK or HTTP calls, or to add custom subsegments and annotations, you must still instrument your code with the X-Ray SDK (or the OpenTelemetry ADOT layer). Enabling active tracing alone never produces a custom paymentGateway subsegment.
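The default sampling behaviour (1 trace per second, then 5% of the rest) can be approximated with a short simulation. This is a simplified model for intuition, not the X-Ray implementation; the DefaultSampler class and its parameters are hypothetical names.

```python
import random


class DefaultSampler:
    """Simplified model: a per-second reservoir, then a fixed sampling rate."""

    def __init__(self, reservoir: int = 1, fixed_rate: float = 0.05, rng=random.random):
        self.reservoir = reservoir
        self.fixed_rate = fixed_rate
        self.rng = rng
        self._second = None  # wall-clock second currently being counted
        self._used = 0  # reservoir slots consumed in that second

    def should_trace(self, now: float) -> bool:
        second = int(now)
        if second != self._second:
            # A new second starts: refill the reservoir.
            self._second, self._used = second, 0
        if self._used < self.reservoir:
            self._used += 1
            return True
        # Reservoir exhausted: sample a fixed share of the remaining requests.
        return self.rng() < self.fixed_rate


# Deterministic run: the rng never falls below the rate, so only the reservoir traces.
sampler = DefaultSampler(rng=lambda: 1.0)
traced = sum(
    sampler.should_trace(s + i / 100)  # 100 requests in each of 10 seconds
    for s in range(10)
    for i in range(100)
)
print(traced)  # 10
```

With the real random source, the same 1,000 requests would yield roughly 10 reservoir traces plus about 5% of the remaining 990.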

Reading the service map

The service map colors each node by health: green for successful calls, yellow for client-side errors (4xx), red for server-side faults (5xx), and purple for throttling (429). When a request is slow, drill into the longest subsegment: if the DynamoDB subsegment shows 400 ms while the function ran 420 ms, the bottleneck is the data layer, not your code. This is the single most common "which downstream is slow" pattern the exam frames.
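The "drill into the longest subsegment" step can be sketched over a simplified trace. The dictionary layout, the serialize-response entry, and the helper slowest_subsegment are hypothetical; the 420 ms and 400 ms figures come from the example above.

```python
# Simplified, hand-written trace; real X-Ray segments use start and end times.
trace = {
    "name": "checkout-fn",
    "duration_ms": 420,
    "subsegments": [
        {"name": "DynamoDB", "duration_ms": 400},
        {"name": "serialize-response", "duration_ms": 8},  # hypothetical step
    ],
}


def slowest_subsegment(segment: dict) -> tuple:
    """Return the slowest subsegment name and its share of the segment time."""
    slowest = max(segment["subsegments"], key=lambda s: s["duration_ms"])
    share = slowest["duration_ms"] / segment["duration_ms"]
    return slowest["name"], share


name, share = slowest_subsegment(trace)
print(f"{name} {share:.0%}")  # DynamoDB 95%
```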

Choosing the right tool — worked scenario

Consider a checkout flow: API Gateway, a Lambda function, DynamoDB, and a third-party payment HTTP call. Three different failures map to three different tools:

  1. "Alert me when error rate exceeds 1%" — a CloudWatch alarm on the Lambda Errors/Invocations metric math, action to SNS.
  2. "Why did invocation abc123 throw at 14:02?" — Logs Insights query filtering the @requestId field across /aws/lambda/checkout.
  3. "Which step makes p99 latency 3 seconds?" — X-Ray trace and service map showing the payment HTTP subsegment dominating.

Mixing these up is the classic distractor: using Logs to "detect" (slow, no threshold) or using a metric to "trace" (no per-request path) both lose points.
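For the first scenario, an error-rate alarm built on metric math might look roughly like this. The alarm name, function name, and SNS topic ARN are placeholders; the keys follow the boto3 put_metric_alarm call, which is shown only in a comment.

```python
def lambda_metric(metric_id: str, metric_name: str, function_name: str) -> dict:
    """One input series for metric math: a per-minute Sum of a Lambda metric."""
    return {
        "Id": metric_id,
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/Lambda",
                "MetricName": metric_name,
                "Dimensions": [{"Name": "FunctionName", "Value": function_name}],
            },
            "Period": 60,
            "Stat": "Sum",
        },
        "ReturnData": False,  # input only; the alarm watches the expression
    }


def build_error_rate_alarm(function_name: str, topic_arn: str) -> dict:
    return {
        "AlarmName": f"{function_name}-error-rate",  # hypothetical name
        "Metrics": [
            lambda_metric("errors", "Errors", function_name),
            lambda_metric("invocations", "Invocations", function_name),
            {
                "Id": "error_rate",
                "Expression": "100 * errors / invocations",
                "Label": "ErrorRatePercent",
                "ReturnData": True,
            },
        ],
        "ComparisonOperator": "GreaterThanThreshold",
        "Threshold": 1.0,  # percent
        "EvaluationPeriods": 5,  # illustrative choice
        "TreatMissingData": "notBreaching",
        "AlarmActions": [topic_arn],
    }


alarm_kwargs = build_error_rate_alarm(
    "checkout", "arn:aws:sns:us-east-1:123456789012:ops-alerts"  # placeholder ARN
)
# With credentials configured:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)
print(alarm_kwargs["Metrics"][2]["Expression"])
```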

Structured logging & correlation

Log as structured JSON, not free text, so Logs Insights can filter and aggregate on fields directly. Include a correlation ID (the API Gateway request ID or X-Ray trace ID) on every line so a single user request can be reconstructed across functions. CloudWatch Contributor Insights can then rank top talkers (e.g. the IPs or keys generating the most 5xx responses), and metric filters can convert a recurring error pattern into an alarmable metric without changing code.
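A minimal sketch of structured logging with a correlation ID, assuming a Python Lambda function behind an API Gateway proxy integration. The log helper and the field names requestId and traceId are illustrative choices, not a required schema.

```python
import json
import os
import time


def log(level: str, message: str, request_id: str, **fields) -> dict:
    """Print one JSON log line carrying the correlation fields."""
    record = {
        "timestamp": int(time.time() * 1000),
        "level": level,
        "message": message,
        "requestId": request_id,
        # Lambda sets this environment variable when X-Ray tracing is active.
        "traceId": os.environ.get("_X_AMZN_TRACE_ID"),
        **fields,
    }
    print(json.dumps(record))
    return record


def handler(event, context):
    # Prefer the API Gateway request ID; fall back to the Lambda request ID.
    request_id = (
        event.get("requestContext", {}).get("requestId") or context.aws_request_id
    )
    log("INFO", "checkout started", request_id, orderId=event.get("orderId"))
    log("ERROR", "payment declined", request_id, gateway="paymentGateway")
    return {"statusCode": 402, "requestId": request_id}
```

Because every line shares the same requestId field, a Logs Insights filter on that field returns the full story of one request, across any function that logs the same ID.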

Test Your Knowledge

A developer wants to filter X-Ray traces in the console to find only requests where a custom "customerTier" value equals "premium". Which X-Ray construct should the value be recorded as?


A Lambda function needs to emit a custom "OrdersProcessed" metric on every invocation with the lowest cost and no extra API calls. What is the recommended approach?


A team enabled X-Ray active tracing on a Lambda function but the traces show no custom "paymentGateway" subsegment they expected. Why?
