3.6 Azure Monitor and Service Health

Key Takeaways

  • Azure Monitor is the umbrella platform that collects, analyzes, and acts on telemetry (metrics and logs) from Azure resources, applications, and on-premises systems.
  • Metrics are lightweight numeric data stored for 93 days; logs are richer records stored in a Log Analytics workspace and queried with Kusto Query Language (KQL).
  • Application Insights is the Application Performance Management (APM) feature of Azure Monitor for live web apps; Log Analytics is where you run KQL queries.
  • Alert rules define WHAT triggers an alert; action groups define WHO is notified and HOW; severity runs from Sev 0 (Critical) to Sev 4 (Verbose).
  • Service Health has three layers: Azure Status (global, all customers), Service Health (personalized to your subscriptions), and Resource Health (one specific resource).
Last updated: June 2026

Quick Answer: Azure Monitor is the full-stack telemetry platform. Metrics = fast numbers (93-day retention). Logs = detailed records you query with Kusto Query Language (KQL) in Log Analytics. Application Insights = live app performance. Service Health answers "is Azure broken?" at three zoom levels.

Azure Monitor: the telemetry umbrella

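Azure Monitor works with two fundamental data types: metrics and logs. Platform metrics and the Activity Log are collected automatically the moment a resource is created, with no agent to deploy; resource (diagnostic) logs need a diagnostic setting, and guest OS data needs the Azure Monitor Agent. Knowing the metrics-versus-logs split is the single most-tested idea in this section.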

| Pillar | What it is | Retention | How you read it |
|---|---|---|---|
| Metrics | Lightweight, time-series numbers (CPU %, request count, disk IOPS) sampled at one-minute granularity | 93 days by default | Metrics Explorer, charts, autoscale rules |
| Logs | Verbose, structured event records (traces, errors, sign-ins) | Configurable, 30 days up to 2 years (longer with archive) | KQL in a Log Analytics workspace |

A useful mental model: a metric tells you a number crossed a line; a log tells you the story of what happened. Autoscaling and most alerts fire on metrics because they are cheap and near-real-time; forensic troubleshooting uses logs.
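The mental model above can be sketched in a few lines of Python. This is only an illustration of the two data shapes: the class names, field names, and sample values are assumptions made for the example, not Azure Monitor's actual schema or API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class MetricSample:
    """One numeric time-series point (hypothetical shape)."""
    timestamp: datetime
    name: str
    value: float


@dataclass
class LogRecord:
    """One structured event record (hypothetical shape)."""
    timestamp: datetime
    level: str
    message: str


def crossed_line(samples, threshold):
    """Metric-style question: did any sample exceed the threshold?"""
    return any(s.value > threshold for s in samples)


def story(records, level):
    """Log-style question: what happened? Matching messages in time order."""
    matching = sorted(
        (r for r in records if r.level == level),
        key=lambda r: r.timestamp,
    )
    return [r.message for r in matching]


start = datetime(2026, 6, 1, 12, 0)
cpu = [
    MetricSample(start + timedelta(minutes=i), "cpu_percent", v)
    for i, v in enumerate([41.0, 67.5, 93.2])
]
events = [
    LogRecord(start + timedelta(minutes=2), "Error", "checkout: SQL timeout"),
    LogRecord(start + timedelta(minutes=1), "Info", "deployment finished"),
    LogRecord(start + timedelta(seconds=90), "Error", "checkout: retry 1 failed"),
]

print(crossed_line(cpu, 90.0))   # the metric answers "did it cross the line?"
print(story(events, "Error"))    # the log answers "what led up to it?"
```

The metric check is a cheap comparison over numbers, which is why it suits autoscale and alerting; the log query has to filter and order full records, which is the kind of work KQL does in a Log Analytics workspace.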

Data sources Azure Monitor ingests

  • Application — your code's performance, via Application Insights
  • Guest OS — metrics/logs from inside a virtual machine (needs the Azure Monitor Agent)
  • Azure resource — platform metrics and resource (diagnostic) logs
  • Azure subscription — Activity Log and Service Health events
  • Azure tenant — Microsoft Entra ID (formerly Azure AD) audit and sign-in logs

Log Analytics and KQL

Log Analytics is the tool where you write queries against log data using Kusto Query Language (KQL). You will not be asked to write KQL on AZ-900, but a classic distractor pairs the phrase "query logs" with Application Insights. The exam answer for querying log data is always Log Analytics.

Application Insights

Application Insights is the Application Performance Management (APM) feature for live web apps and APIs. It surfaces:

  • Response times, failure rates, and exception stacks
  • Dependency tracking — calls out to SQL, REST APIs, and queues
  • Availability tests — synthetic ping tests from worldwide locations
  • Usage analytics — page views, sessions, and user funnels

Trap: Application Insights monitors your application code. Service Health monitors Microsoft's platform. If a question is about a slow checkout page, the answer is Application Insights, never Service Health.

Alerts: rule + action group

An Azure Monitor alert has two independent halves, the alert rule and the action group, and AZ-900 loves to test that separation. Severity is a setting on the rule.

| Component | Answers the question | Example |
|---|---|---|
| Alert rule | What condition fires it? | CPU > 90% for 5 minutes |
| Action group | Who is told and how? | Email the on-call team, SMS the manager, run a Logic App |
| Severity | How urgent? | Sev 0 (Critical) → Sev 4 (Verbose) |

Three alert types map to the data they watch: metric alerts (a number crosses a threshold), log alerts (a KQL query returns a result), and activity log alerts (a management action such as "VM deleted" occurs). Action groups can automate remediation — calling an Azure Function or Logic App — not just send notifications.
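The rule-versus-group separation can be modeled with two small Python classes. This is a toy sketch, not the Azure SDK: `AlertRule`, `ActionGroup`, their fields, and the example address are hypothetical names chosen to mirror the "CPU > 90% for 5 minutes" example, with one sample per minute assumed.

```python
from dataclasses import dataclass, field


@dataclass
class AlertRule:
    """WHAT fires: a threshold that must hold for `window` consecutive samples."""
    metric: str
    threshold: float
    window: int        # consecutive one-minute samples required (assumption)
    severity: int = 2  # 0 (Critical) .. 4 (Verbose)

    def fires(self, samples):
        recent = samples[-self.window:]
        return len(recent) == self.window and all(v > self.threshold for v in recent)


@dataclass
class ActionGroup:
    """WHO is told and HOW: a reusable list of (channel, target) pairs."""
    name: str
    actions: list = field(default_factory=list)

    def notify(self, rule):
        return [
            f"{channel} -> {target}: Sev {rule.severity} {rule.metric} > {rule.threshold}"
            for channel, target in self.actions
        ]


rule = AlertRule("cpu_percent", threshold=90.0, window=5, severity=1)
on_call = ActionGroup("on-call", [("email", "oncall@example.com"), ("sms", "manager")])

cpu = [72, 91, 93, 95, 92, 96]  # the last five samples are all above 90
messages = on_call.notify(rule) if rule.fires(cpu) else []
for line in messages:
    print(line)
```

Because the group knows nothing about the condition, the same `on_call` object could be attached to any number of rules, which is the design point the exam keeps probing.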

Service Health: three levels of zoom

Azure Service Health tells you when Azure itself has a problem. Picture a zoom lens going from the whole planet down to one disk.

| Layer | Scope | Personalized? | Where |
|---|---|---|---|
| Azure Status | Every Azure service, every region, all customers | No | status.azure.com (public) |
| Service Health | Only the services and regions you use | Yes | Azure portal |
| Resource Health | One specific resource (a single VM) | Yes | The resource's blade |

Service Health (the middle layer) tracks four event kinds: service issues (active outages), planned maintenance, health advisories (action-required changes like a deprecation), and security advisories. You can configure Service Health alerts so an outage in your region emails the team automatically.

Resource Health reports four states: Available (healthy), Unavailable (a platform or non-platform event is affecting the resource), Degraded (reduced performance), and Unknown (no signal received for more than 10 minutes).

On the Exam: Memorize the zoom: Azure Status = global / everyone, Service Health = personalized to your subscriptions, Resource Health = one resource. If a question mentions a single named VM, the answer is Resource Health. If it mentions "any maintenance affecting my subscriptions," the answer is Service Health.
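As a memory aid, the zoom rule reduces to two yes/no questions asked from the narrowest scope outward. The function below is a hypothetical study helper written for this guide, not anything Azure exposes.

```python
def health_layer(single_resource: bool, my_subscriptions: bool) -> str:
    """Pick the narrowest Service Health layer that matches the question."""
    if single_resource:
        return "Resource Health"   # one named VM, database, disk...
    if my_subscriptions:
        return "Service Health"    # personalized to what you actually use
    return "Azure Status"          # global view, same for every customer


print(health_layer(single_resource=True, my_subscriptions=True))
print(health_layer(single_resource=False, my_subscriptions=True))
print(health_layer(single_resource=False, my_subscriptions=False))
```

Checking `single_resource` first reflects the exam habit of rewarding the narrowest layer that fits.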

Worked example: choosing the right tool

Walk through a realistic decision tree, because AZ-900 phrases these as "a company wants to..." scenarios:

  1. "We want to auto-scale a virtual machine scale set when average CPU exceeds 70%." — This is numeric and near-real-time, so it uses metrics plus an autoscale rule, not logs.
  2. "We need to search six months of historical errors and correlate them across services." — Six months exceeds the 93-day metric window and needs rich querying, so the answer is logs in Log Analytics queried with KQL.
  3. "Email the on-call engineer when a metric breaches a threshold." — An alert rule detects the breach; an action group sends the email.
  4. "Is the slowdown our code or Microsoft's platform?" — Check Application Insights for the app side and Service Health / Resource Health for the platform side.

Notice that metrics, logs, alerts, Application Insights, and Service Health each own a distinct job; the exam rewards you for picking the narrowest tool that fully answers the scenario.
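The four scenarios can be condensed into one small decision function. Treat it as a study sketch under stated assumptions: `pick_tool` and its parameter names are invented here, and the only hard number is the 93-day metric retention quoted earlier in this section.

```python
METRIC_RETENTION_DAYS = 93  # default platform-metric retention from this section


def pick_tool(*, about="resource", lookback_days=0, rich_query=False, notify=False):
    """Return the narrowest tool that answers an AZ-900 style scenario.

    about: "app" (your code), "platform" (Microsoft's side), or "resource".
    """
    if about == "app":
        return "Application Insights"
    if about == "platform":
        return "Service Health / Resource Health"
    if notify:
        return "alert rule + action group"
    if rich_query or lookback_days > METRIC_RETENTION_DAYS:
        return "logs in Log Analytics (KQL)"
    return "metrics"


print(pick_tool())                                     # scenario 1: autoscale on CPU
print(pick_tool(lookback_days=180, rich_query=True))   # scenario 2: six months of errors
print(pick_tool(notify=True))                          # scenario 3: email on breach
print(pick_tool(about="app"), "|", pick_tool(about="platform"))  # scenario 4
```

The order of the checks is a deliberate simplification: who owns the problem comes first, then whether a notification is wanted, and only then the metrics-versus-logs choice.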

Common AZ-900 traps in this section

  • "Query logs" always maps to Log Analytics / KQL, never to Application Insights (which generates app telemetry but is not the generic log-query tool).
  • Service Health is not Azure Monitor. Service Health is about Microsoft's platform outages; Azure Monitor is about your resources and apps.
  • Metrics ≠ logs. Metrics are numbers kept 93 days; logs are records kept in a workspace for up to two years.
  • Action group vs alert rule. The rule is the condition; the group is the response. Questions often hand you one and ask for the other.

The AZ-900 exam — roughly 40–60 questions in 45 minutes, a passing score of 700 on a 1000-point scale, US$99, delivered through Pearson VUE for Microsoft — reliably includes at least one Monitor-versus-Service-Health discrimination item, so over-learn this split.

Test Your Knowledge

An e-commerce team reports that their live checkout web page is responding slowly and occasionally throwing exceptions. Which Azure tool should they use to diagnose response times, failures, and dependency calls?

Which Azure Service Health layer is personalized to only the services and regions that your subscriptions actually use, and tracks planned maintenance affecting you?

For how long does Azure Monitor retain platform metrics by default before you must export them for longer storage?
