9.7 Monitoring and Recovery Case Lab

Key Takeaways

  • A complete operations design combines metrics, logs, alerts, action groups, insight tools, backup, and site recovery.
  • Validate that data is actually being collected before treating alert rules and dashboards as reliable.
  • Recovery design must separate restore requirements from failover requirements and test both.
  • Troubleshooting should move from symptom to signal, then from signal to configuration, then from configuration to remediation.
  • AZ-104 case studies often combine multiple services and require ordering the steps correctly.

Scenario

Contoso runs a three-tier application in Azure. The web tier uses two Azure VMs behind a load balancer. The application tier uses VMs in a private subnet. The data tier uses managed disks and a storage account for exports. The business requires operational visibility, alerting to the right team, VM backup, file-level restore, and regional disaster recovery for the application tier. Administrators must also troubleshoot intermittent storage authorization failures and a private connectivity issue.

This is an AZ-104-style monitoring and recovery case because no single feature solves everything. You need metrics for quick thresholds, logs for investigation, action groups for routing, processing rules for maintenance, insight tools for resource-specific views, Azure Backup for recovery points, and Site Recovery for regional failover.

Monitoring design

Start by mapping requirements to signals:

| Requirement | Signal or tool | Configuration |
| --- | --- | --- |
| Alert when web VM CPU stays high | Metric alert | Scope web VMs, average CPU, action group. |
| Detect missing VM agent reports | Log Analytics query | Azure Monitor Agent, DCR, Heartbeat table. |
| Investigate failed deployments | Activity log query | Activity log available; route to workspace for KQL retention. |
| Investigate storage 403 errors | Storage resource logs | Diagnostic setting to Log Analytics. |
| Troubleshoot app to data connectivity | Network Watcher and Connection Monitor | Test TCP path, next hop, NSG, DNS. |
| Monitor backup failures | Backup alerts and reports | Configure vault monitoring and Log Analytics reporting. |
| Validate regional recovery | ASR test failover | Isolated test network and recovery plan. |

Implementation order matters. Create the Log Analytics workspace first. Enable diagnostic settings and VM data collection next. Validate data with simple KQL queries. Create alert rules only after data appears. Configure action groups and alert processing rules. Then build workbooks and operational views.

Workspace and diagnostic setup

Create a workspace in the operations subscription or in the application subscription depending on the organization's access model. For this lab, use a central workspace so the operations team can query across web, app, storage, network, backup, and ASR signals. Apply resource-context access if application teams should see only their own resource logs.

Enable diagnostic settings on the storage account, and on the load balancer if its supported log categories are needed. Route activity log events to the workspace for KQL analysis and longer retention. Enable VM insights on both VM tiers using Azure Monitor Agent and a data collection rule.

Validation queries:

// Confirm activity log events are arriving in the workspace
AzureActivity
| where TimeGenerated > ago(24h)
| summarize Events=count() by ActivityStatusValue

// Confirm every VM is reporting through the agent
Heartbeat
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat=max(TimeGenerated) by Computer

// Confirm storage resource logs are flowing
StorageBlobLogs
| where TimeGenerated > ago(1h)
| take 10

If the third query returns no rows, do not create the storage log alert yet. Confirm the diagnostic setting, selected storage service categories, destination workspace, and time range.
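When it is unclear which tables are receiving data at all, a wildcard union can enumerate them. This is a minimal troubleshooting sketch, not a required lab step; it scans every table in the workspace, so keep the time range short:

// List every table that received rows in the last hour
union withsource=SourceTable *
| where TimeGenerated > ago(1h)
| summarize Rows=count() by SourceTable
| order by Rows desc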

Alert workflow

Create an action group named ag-platform-critical for severity 0 and 1 alerts. Include email for the operations group and a webhook or Logic App that creates incidents. Create a separate action group named ag-storage-ops for storage administrators. Use the common alert schema for automation receivers.

Create metric alerts for web CPU and load balancer health probe status. Use log alerts for failed deployments, missing heartbeat, and storage 403 spikes. Add an alert processing rule for planned patch windows. Scope it to the VM resource group, suppress notifications for selected severities, and schedule it only for the approved maintenance window.
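As one example, the missing-heartbeat log alert can evaluate a query like the following. The 10-minute threshold is an assumption; align it with the data collection interval and the team's tolerance:

// VMs that have not reported a heartbeat in the last 10 minutes
Heartbeat
| summarize LastHeartbeat=max(TimeGenerated) by Computer
| where LastHeartbeat < ago(10m)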

Example storage 403 alert query:

StorageBlobLogs
| where TimeGenerated > ago(15m)
| where StatusCode == 403
| summarize ForbiddenRequests=count() by bin(TimeGenerated, 5m)

The alert condition can evaluate the count and route to ag-storage-ops. If the requirement is to identify caller IPs in the notification, refine the query or use an automation action that runs a richer query:

StorageBlobLogs
| where TimeGenerated > ago(15m)
| where StatusCode == 403
| summarize ForbiddenRequests=count() by CallerIpAddress, AuthenticationType, OperationName
| order by ForbiddenRequests desc

Troubleshooting scenario 1: storage failures

Users report export failures after a SAS token rotation. Metrics show transaction volume is normal, but availability dipped briefly. Storage logs show 403 responses from two build agents still using the old token. Resizing the storage account or changing replication is not the first move; neither addresses authorization. The correct remediation is to update the SAS configuration or access method on the affected agents, invalidate the old token where appropriate, and consider managed identities or stored access policies if they better meet operational control requirements.

Troubleshooting tree:

  1. Confirm the failure time from user reports and application logs.
  2. Check storage metrics for availability and response type changes.
  3. Query resource logs for 401 and 403 details.
  4. Group by caller IP, authentication type, operation, and user agent where available.
  5. Match failures to the SAS rotation timeline (see the sketch after this list).
  6. Fix credentials or authorization and monitor the alert until clear.
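A hedged sketch for step 5, assuming the rotation happened within the last 24 hours (widen the window as needed). A step change in the chart at the rotation time confirms the correlation:

// Chart 403 responses over time, split by authentication type
StorageBlobLogs
| where TimeGenerated > ago(24h)
| where StatusCode == 403
| summarize ForbiddenRequests=count() by bin(TimeGenerated, 15m), AuthenticationType
| render timechart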

Troubleshooting scenario 2: private connectivity

The app tier intermittently cannot reach a private endpoint used by the data tier. VM CPU and application logs do not explain the issue. Use Network Watcher next hop from the app VM to the private endpoint IP to confirm route behavior. Use IP flow verify for the app subnet and destination port. Use DNS lookup from the app VM to confirm the service FQDN resolves to the private IP. Use Connection Monitor to track failures over time.
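If Connection Monitor sends its results to the Log Analytics workspace, failures can also be reviewed in KQL. This sketch assumes the resource-specific NWConnectionMonitorTestResult table; verify the table and column names in your workspace, as they vary by configuration:

// Failed Connection Monitor tests, bucketed by hour
NWConnectionMonitorTestResult
| where TimeGenerated > ago(1d)
| where TestResult == "Fail"
| summarize Failures=count() by TestGroupName, TestConfigurationName, bin(TimeGenerated, 1h)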

If DNS sometimes returns a public address, inspect private DNS zone links and conditional forwarders. If next hop points to a virtual appliance unexpectedly, inspect UDRs and firewall logs. If IP flow verify denies traffic, inspect NSG priority and effective security rules. If the path works but the service rejects traffic, inspect private endpoint approval, service firewall, and identity.

Backup design

Protect all production VMs with Azure Backup in a Recovery Services vault. Configure vault redundancy according to recovery requirements before enabling protection. Use a policy with daily backups and retention that matches business rules. If the data tier requires more aggressive RPO than daily backup, use an additional data-specific protection strategy; do not pretend daily VM backup meets a one-hour RPO.

Backup decision table:

| Need | Design choice | Validation |
| --- | --- | --- |
| Restore deleted file | VM file recovery | Test file recovery from recent point. |
| Restore corrupted VM | VM or disk restore | Test alternate-location restore. |
| Protect backup data | Soft delete and immutability where required | Review vault security settings. |
| Central reporting | Backup Center and Log Analytics reports | Confirm failed jobs appear. |
| Regional app continuity | ASR, not backup alone | Run test failover. |
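For the central reporting row, failed jobs can be confirmed in KQL once the vault's diagnostic settings send Backup Reports data to the workspace. This sketch assumes the resource-specific AddonAzureBackupJobs table; legacy Azure Diagnostics mode uses different table and column names:

// Failed backup jobs over the last 7 days
AddonAzureBackupJobs
| where TimeGenerated > ago(7d)
| where JobOperation == "Backup" and JobStatus == "Failed"
| summarize FailedJobs=count() by bin(TimeGenerated, 1d)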

Site Recovery design

Enable ASR for the application tier VMs to a paired or approved recovery region. Configure target VNets, subnets, disk type, cache storage, and target VM size. Build a recovery plan that starts dependencies before application services. Run test failover into an isolated network and validate DNS, authentication, application startup, and connectivity to required data services.
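Replication job history can also be reviewed in the workspace if the vault's Site Recovery log categories are routed there. This sketch assumes Azure Diagnostics mode and the AzureSiteRecoveryJobs category; confirm the exact category names in the vault's diagnostic settings before relying on it:

// Recent Site Recovery job records from vault diagnostics
AzureDiagnostics
| where TimeGenerated > ago(7d)
| where Category == "AzureSiteRecoveryJobs"
| take 20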

ASR decision table:

| Decision | Correct approach | Common mistake |
| --- | --- | --- |
| Validate DR | Test failover | Performing unplanned failover as a drill. |
| Coordinate tiers | Recovery plan | Starting all VMs randomly. |
| Avoid duplicate production traffic | Isolated test network | Connecting test failover to production subnet. |
| Return protection after failover | Reprotect | Assuming replication continues automatically in reverse. |
| Meet file restore need | Azure Backup | Using ASR as a file restore product. |

Final exam workflow

When a case study asks for monitoring and recovery steps, order them by dependency. You cannot alert on logs that are not collected. You cannot trust backup without a successful restore test. You cannot claim disaster recovery without a successful ASR test failover and recovery plan validation. You cannot diagnose private endpoint failures from CPU metrics.

A strong administrator answer says: collect the correct signal, validate the data, alert with the right rule type, route through the right action group, suppress during maintenance with processing rules, use insights and Network Watcher for resource-specific diagnosis, back up for restore, and use Site Recovery for failover. That sequence is the core of this chapter and the level AZ-104 expects.

Test Your Knowledge

In the case lab, which implementation step should occur before creating a log alert for storage 403 responses?


A private endpoint connectivity issue might be caused by DNS, routes, or NSGs. Which toolset best supports the investigation?


Which pairing correctly matches the recovery requirement to the Azure feature?
