9.7 Monitoring and Recovery Case Lab

Key Takeaways

  • A complete operations design combines metrics, logs, alerts, action groups, insight tools, backup, and site recovery.
  • Validate that data is actually being collected before treating alert rules and dashboards as reliable.
  • Recovery design must separate restore requirements from failover requirements and test both.
  • Troubleshooting should move from symptom to signal, then from signal to configuration, then from configuration to remediation.
  • AZ-104 case studies often combine multiple services and require ordering the steps correctly.

Scenario

Contoso runs a three-tier application in Azure. The web tier uses two Azure VMs behind a load balancer. The application tier uses VMs in a private subnet. The data tier uses managed disks and a storage account for exports. The business requires operational visibility, alerting to the right team, VM backup, file-level restore, and regional disaster recovery for the application tier. Administrators must also troubleshoot intermittent storage authorization failures and a private connectivity issue.

This is an AZ-104-style monitoring and recovery case because no single feature solves everything. You need metrics for quick thresholds, logs for investigation, action groups for routing, processing rules for maintenance, insight tools for resource-specific views, Azure Backup for recovery points, and Site Recovery for regional failover.

Monitoring design

Start by mapping requirements to signals:

| Requirement | Signal or tool | Configuration |
| --- | --- | --- |
| Alert when web VM CPU stays high | Metric alert | Scope web VMs, average CPU, action group. |
| Detect missing VM agent reports | Log Analytics query | Azure Monitor Agent, DCR, Heartbeat table. |
| Investigate failed deployments | Activity log query | Activity log available; route to workspace for KQL retention. |
| Investigate storage 403 errors | Storage resource logs | Diagnostic setting to Log Analytics. |
| Troubleshoot app to data connectivity | Network Watcher and Connection Monitor | Test TCP path, next hop, NSG, DNS. |
| Monitor backup failures | Backup alerts and reports | Configure vault monitoring and Log Analytics reporting. |
| Validate regional recovery | ASR test failover | Isolated test network and recovery plan. |

Implementation order matters. Create the Log Analytics workspace first. Enable diagnostic settings and VM data collection next. Validate data with simple KQL queries. Create alert rules only after data appears. Configure action groups and alert processing rules. Then build workbooks and operational views.

Workspace and diagnostic setup

Create a workspace in the operations subscription or in the application subscription depending on the organization's access model. For this lab, use a central workspace so the operations team can query across web, app, storage, network, backup, and ASR signals. Apply resource-context access if application teams should see only their own resource logs.

Enable diagnostic settings on the storage account, and on the load balancer if its supported log categories are needed. Route activity log events to the workspace for KQL analysis and longer retention. Enable VM insights on both VM tiers using Azure Monitor Agent and a data collection rule.

Validation queries:

// Confirm activity log events are arriving in the workspace
AzureActivity
| where TimeGenerated > ago(24h)
| summarize Events=count() by ActivityStatusValue

// Confirm every VM is reporting through the agent
Heartbeat
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat=max(TimeGenerated) by Computer

// Confirm storage resource logs are flowing
StorageBlobLogs
| where TimeGenerated > ago(1h)
| take 10

If the third query returns no rows, do not create the storage log alert yet. Confirm the diagnostic setting, selected storage service categories, destination workspace, and time range.
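When it is unclear which tables are receiving data at all, a wildcard union can enumerate them. This is a minimal troubleshooting sketch, not a required lab step; it scans every table in the workspace, so keep the time range short:

// List every table that received rows in the last hour
union withsource=SourceTable *
| where TimeGenerated > ago(1h)
| summarize Rows=count() by SourceTable
| order by Rows desc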

Alert workflow

Create an action group named ag-platform-critical for severity 0 and 1 alerts. Include email for the operations group and a webhook or Logic App that creates incidents. Create a separate action group named ag-storage-ops for storage administrators. Use the common alert schema for automation receivers.

Create metric alerts for web CPU and load balancer health probe status. Use log alerts for failed deployments, missing heartbeat, and storage 403 spikes. Add an alert processing rule for planned patch windows. Scope it to the VM resource group, suppress notifications for selected severities, and schedule it only for the approved maintenance window.
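As one example, the missing-heartbeat log alert can evaluate a query like the following. The 10-minute threshold is an assumption; align it with the data collection interval and the team's tolerance:

// VMs that have not reported a heartbeat in the last 10 minutes
Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer
| where LastHeartbeat < ago(10m)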

Example storage 403 alert query:

StorageBlobLogs
| where TimeGenerated > ago(15m)
| where StatusCode == 403
| summarize ForbiddenRequests=count() by bin(TimeGenerated, 5m)

The alert condition can evaluate the count and route to ag-storage-ops. If the requirement is to identify caller IPs in the notification, refine the query or use an automation action that runs a richer query:

StorageBlobLogs
| where TimeGenerated > ago(15m)
| where StatusCode == 403
| summarize ForbiddenRequests=count() by CallerIpAddress, AuthenticationType, OperationName
| order by ForbiddenRequests desc

Troubleshooting scenario 1: storage failures

Users report export failures after a SAS token rotation. Metrics show transaction volume is normal, but availability dipped briefly. Storage logs show 403 responses from two build agents still using the old token. Resizing the storage account or changing replication is not the first move; neither addresses authorization. The correct remediation is to update the SAS configuration or access method on the affected agents, invalidate the old token where appropriate, and consider managed identities or stored access policies if they better meet operational control requirements.

Troubleshooting tree:

  1. Confirm the failure time from user reports and application logs.
  2. Check storage metrics for availability and response type changes.
  3. Query resource logs for 401 and 403 details.
  4. Group by caller IP, authentication type, operation, and user agent where available.
  5. Match failures to the SAS rotation timeline (see the sketch after this list).
  6. Fix credentials or authorization and monitor the alert until clear.
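A hedged sketch for step 5, assuming the rotation happened within the last 24 hours (widen the window as needed). A step change in the chart at the rotation time confirms the correlation:

// Chart 403 responses over time, split by authentication type
StorageBlobLogs
| where TimeGenerated > ago(24h)
| where StatusCode == 403
| summarize ForbiddenRequests=count() by bin(TimeGenerated, 15m), AuthenticationType
| render timechart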

Troubleshooting scenario 2: private connectivity

The app tier intermittently cannot reach a private endpoint used by the data tier. VM CPU and application logs do not explain the issue. Use Network Watcher next hop from the app VM to the private endpoint IP to confirm route behavior. Use IP flow verify for the app subnet and destination port. Use DNS lookup from the app VM to confirm the service FQDN resolves to the private IP. Use Connection Monitor to track failures over time.
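If Connection Monitor sends its results to the Log Analytics workspace, failures can also be reviewed in KQL. This sketch assumes the resource-specific NWConnectionMonitorTestResult table; verify the table and column names in your workspace, as they vary by configuration:

// Failed Connection Monitor tests, bucketed by hour
NWConnectionMonitorTestResult
| where TimeGenerated > ago(1d)
| where TestResult == "Fail"
| summarize Failures=count() by TestGroupName, TestConfigurationName, bin(TimeGenerated, 1h)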

If DNS sometimes returns a public address, inspect private DNS zone links and conditional forwarders. If next hop points to a virtual appliance unexpectedly, inspect UDRs and firewall logs. If IP flow verify denies traffic, inspect NSG priority and effective security rules. If the path works but the service rejects traffic, inspect private endpoint approval, service firewall, and identity.

Backup design

Protect all production VMs with Azure Backup in a Recovery Services vault. Configure vault redundancy according to recovery requirements before enabling protection. Use a policy with daily backups and retention that matches business rules. If the data tier requires more aggressive RPO than daily backup, use an additional data-specific protection strategy; do not pretend daily VM backup meets a one-hour RPO.

Backup decision table:

| Need | Design choice | Validation |
| --- | --- | --- |
| Restore deleted file | VM file recovery | Test file recovery from recent point. |
| Restore corrupted VM | VM or disk restore | Test alternate-location restore. |
| Protect backup data | Soft delete and immutability where required | Review vault security settings. |
| Central reporting | Backup Center and Log Analytics reports | Confirm failed jobs appear. |
| Regional app continuity | ASR, not backup alone | Run test failover. |
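For the central reporting row, failed jobs can be confirmed in KQL once the vault's diagnostic settings send Backup Reports data to the workspace. This sketch assumes the resource-specific AddonAzureBackupJobs table; legacy Azure Diagnostics mode uses different table and column names:

// Failed backup jobs over the last 7 days
AddonAzureBackupJobs
| where TimeGenerated > ago(7d)
| where JobOperation == "Backup" and JobStatus == "Failed"
| summarize FailedJobs=count() by bin(TimeGenerated, 1d)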

Site Recovery design

Enable ASR for the application tier VMs to a paired or approved recovery region. Configure target VNets, subnets, disk type, cache storage, and target VM size. Build a recovery plan that starts dependencies before application services. Run test failover into an isolated network and validate DNS, authentication, application startup, and connectivity to required data services.
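Replication job history can also be reviewed in the workspace if the vault's Site Recovery log categories are routed there. This sketch assumes Azure Diagnostics mode and the AzureSiteRecoveryJobs category; confirm the exact category names in the vault's diagnostic settings before relying on it:

// Recent Site Recovery job records from vault diagnostics
AzureDiagnostics
| where TimeGenerated > ago(7d)
| where Category == "AzureSiteRecoveryJobs"
| take 20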

ASR decision table:

| Decision | Correct approach | Common mistake |
| --- | --- | --- |
| Validate DR | Test failover | Performing unplanned failover as a drill. |
| Coordinate tiers | Recovery plan | Starting all VMs randomly. |
| Avoid duplicate production traffic | Isolated test network | Connecting test failover to production subnet. |
| Return protection after failover | Reprotect | Assuming replication continues automatically in reverse. |
| Meet file restore need | Azure Backup | Using ASR as a file restore product. |

Final exam workflow

When a case study asks for monitoring and recovery steps, order them by dependency. You cannot alert on logs that are not collected. You cannot trust backup without a successful restore test. You cannot claim disaster recovery without a successful ASR test failover and recovery plan validation. You cannot diagnose private endpoint failures from CPU metrics.

A strong administrator answer says: collect the correct signal, validate the data, alert with the right rule type, route through the right action group, suppress during maintenance with processing rules, use insights and Network Watcher for resource-specific diagnosis, back up for restore, and use Site Recovery for failover. That sequence is the core of this chapter and the level AZ-104 expects.

Test Your Knowledge

In the case lab, which implementation step should occur before creating a log alert for storage 403 responses?


A private endpoint connectivity issue might be caused by DNS, routes, or NSGs. Which toolset best supports the investigation?


Which pairing correctly matches the recovery requirement to the Azure feature?
