8.7 Connectivity Troubleshooting Case Lab

Key Takeaways

  • Case labs require correlating symptoms across DNS, private endpoints, load balancers, NSGs, routes, and guest OS configuration.
  • The fastest path is to identify the failing scope: one client, one subnet, one backend, one protocol, or all users.
  • Use a hypothesis tree and prove each layer with a command or portal check before changing production settings.
  • Document the expected path and compare it to observed DNS, route, security, and application behavior.
  • AZ-104 case studies often include extra facts; focus on the recent change and the specific failing flow.
Last updated: May 2026

Lab scenario

You administer a hub-and-spoke Azure environment. The hub VNet contains Azure Firewall, a VPN gateway to the datacenter, and DNS forwarders. The spoke VNet contains two web VMs behind a public Standard Load Balancer, two API VMs behind an internal Standard Load Balancer, and a storage account accessed through a private endpoint. On-premises users connect over site-to-site VPN.

Three incidents arrive after a weekend change window:

  1. Internet users intermittently receive errors from www.contoso.com.
  2. Web VMs cannot connect to api.corp.internal by name, but one administrator says the API VMs are running.
  3. On-premises users cannot access the storage account by its normal blob FQDN after public network access was disabled.

Your job is to identify the most likely layer for each incident and choose diagnostics that prove the cause.

Build the expected path

Before testing, write the expected path. This is a professional habit and an exam advantage.

Flow | Expected DNS result | Expected network path | Main controls
Internet to www.contoso.com | Public IP of public load balancer | Internet to public frontend to healthy web backend | Public DNS, load balancer rule, probe, NSG, app listener
Web VM to api.corp.internal | Private IP of internal load balancer | Spoke subnet to internal frontend to healthy API backend | Private DNS, load balancer rule, probe, NSG, routes
On-premises to storage blob FQDN | Private endpoint IP | VPN to hub or spoke path to private endpoint | Conditional DNS forwarding, private DNS, VPN routes, firewalls, storage firewall

This table keeps you from confusing incidents. A public web error is not automatically related to the storage private endpoint. An API name failure is not solved by changing the public load balancer. Each flow has its own DNS answer, route, and security boundary.
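
One way to keep the expected paths separate is to record them as data and compare each observed DNS answer against its own flow. The sketch below is illustrative only: the class name, the flow names, and the addresses 198.18.0.10 and 10.30.3.5 are assumptions made for the example, not values from the lab.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedFlow:
    name: str
    fqdn: str
    expected_ip: str  # hypothetical addresses chosen for illustration
    port: int


# One entry per row of the expected-path table.
FLOWS = [
    ExpectedFlow("internet-to-web", "www.contoso.com", "198.18.0.10", 443),
    ExpectedFlow("web-to-api", "api.corp.internal", "10.30.1.10", 8443),
    ExpectedFlow("onprem-to-blob", "mystorage.blob.core.windows.net", "10.30.3.5", 443),
]


def dns_matches(flow: ExpectedFlow, observed_ips: list[str]) -> bool:
    """Return True when the expected IP is among the answers seen from the affected source."""
    return flow.expected_ip in observed_ips
```

Calling dns_matches with the answers collected from the affected client tells you whether DNS is still a suspect for that flow, without mixing it up with the other two incidents.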

Incident 1: public website intermittent errors

Intermittent public errors through a load balancer often mean some backends are healthy and others are not, session persistence is affecting distribution, or the application dependency fails on only one instance. Start by resolving the public name and confirming it points to the public load balancer frontend.

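# Confirm the public name resolves to the load balancer frontend IP.
nslookup www.contoso.com
# Request the health path; repeat the call, because intermittent failures may not show on the first try.
curl -v https://www.contoso.com/health
# List the load-balancing rules on the public load balancer.
az network lb rule list -g rg-network --lb-name lb-web-public -o table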

Then inspect backend health. If one web VM fails the probe, the load balancer should avoid it for new flows. If the probe is too shallow, it may mark a VM healthy even when the real app path fails. For example, a TCP probe on port 443 can pass while the app returns HTTP 500 because a local dependency is broken. An HTTP probe to /health gives a better signal when implemented correctly.
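
The difference between a shallow probe and an HTTP probe can be reproduced locally. The following sketch starts a throwaway HTTP server on 127.0.0.1 that always answers 500, then runs a TCP connect check and an HTTP status check against it. The handler, function names, and the plain-HTTP /health URL are assumptions for the demonstration; it imitates the idea of a probe and is not Azure's probe implementation.

```python
import http.server
import socket
import threading
import urllib.error
import urllib.request


class BrokenApp(http.server.BaseHTTPRequestHandler):
    """Simulates an app whose listener is up but whose dependency is broken."""

    def do_GET(self):
        self.send_response(500)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo quiet


def tcp_probe(host, port, timeout=2.0):
    """Shallow check: succeeds as soon as the port accepts a connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def http_probe(url, timeout=2.0):
    """Deeper check: healthy only when the endpoint answers HTTP 200."""
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))  # bypass any proxy
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def demo():
    server = http.server.HTTPServer(("127.0.0.1", 0), BrokenApp)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return tcp_probe("127.0.0.1", port), http_probe(f"http://127.0.0.1:{port}/health")
    finally:
        server.shutdown()
        server.server_close()
```

In this setup the TCP check passes while the HTTP check fails, which is the same mismatch that lets a broken backend keep receiving traffic behind a TCP-only probe.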

Troubleshooting tree:

Intermittent errors through public load balancer
|-- Does DNS point to the public frontend IP?
|   |-- No: fix public DNS record or TTL issue.
|   |-- Yes: continue.
|-- Are all backend probes healthy?
|   |-- No: inspect failed backend listener, path, NSG, and app logs.
|   |-- Yes: continue.
|-- Does each backend return the same app version and dependencies?
|   |-- No: fix deployment drift or dependency configuration.
|   |-- Yes: inspect client affinity, logs, and downstream services.

Do not remove the load balancer. Prove whether the issue follows a backend. Direct tests to each backend from a management subnet can reveal version drift or a stopped service.
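
A small helper can make "does the issue follow a backend?" an explicit yes or no. This is a sketch under assumptions: the input is a dictionary of HTTP status codes you collected by testing each backend directly, and the name vm-web-2 in the usage note is hypothetical.

```python
def failing_backends(results):
    """results maps backend name -> list of HTTP status codes from direct tests."""
    return sorted(
        name for name, codes in results.items() if any(code >= 500 for code in codes)
    )


def issue_follows_backend(results):
    """True when some, but not all, backends show server errors."""
    bad = failing_backends(results)
    return 0 < len(bad) < len(results)
```

For example, issue_follows_backend({"vm-web-1": [200, 200], "vm-web-2": [500, 200]}) reports that the errors are tied to a subset of backends, which points at deployment drift or a local dependency rather than the load balancer itself.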

Incident 2: internal API name failure

The phrase "cannot connect by name" is a clue. First test name resolution from a web VM, not from your laptop. The expected answer is the private frontend IP of the internal load balancer.

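# Run from a web VM. Expect the internal load balancer frontend IP.
nslookup api.corp.internal
# PowerShell: test name resolution and the TCP port in one step.
Test-NetConnection api.corp.internal -Port 8443
# Show the A record held in the private DNS zone.
az network private-dns record-set a show -g rg-network -z corp.internal -n api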

If DNS fails or returns an old IP, check the private DNS zone and VNet links. If DNS returns the internal load balancer IP, move to the load balancer rule and probe. The API load balancer should have a rule for the expected protocol and port, a backend pool with both API VMs, and a probe that matches the API health endpoint.
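
The branching in the paragraph above can be sketched as a small function that takes the DNS answer seen on the web VM and returns the next thing to check. The function names and the returned strings are illustrative assumptions; run resolve on the affected VM, not on your workstation.

```python
import socket


def resolve(fqdn):
    """Return the first IPv4 answer for fqdn, or None when resolution fails."""
    try:
        return socket.gethostbyname(fqdn)
    except socket.gaierror:
        return None


def next_step(observed_ip, expected_ip):
    """Map the observed DNS answer to the next layer to inspect."""
    if observed_ip is None:
        return "check private DNS zone and VNet links (no answer)"
    if observed_ip != expected_ip:
        return "check private DNS zone and VNet links (stale or wrong record)"
    return "check load balancer rule, backend pool, and probe"
```

A typical call would be next_step(resolve("api.corp.internal"), "10.30.1.10"), where the expected IP comes from your own record of the internal frontend.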

Use Network Watcher when Azure controls are suspect:

# IP flow verify: do the NSGs allow inbound TCP 8443 from the web VM to the API VM?
az network watcher test-ip-flow \
  -g rg-network \
  --vm vm-api-1 \
  --direction Inbound \
  --protocol TCP \
  --local 10.30.2.4:8443 \
  --remote 10.20.2.4:52000

# Next hop: where does Azure route traffic from the web VM to the destination IP?
az network watcher show-next-hop \
  -g rg-network \
  --vm vm-web-1 \
  --source-ip 10.20.2.4 \
  --dest-ip 10.30.1.10

If IP flow verify denies the app port, fix the NSG. If next hop points to a firewall when you expected direct VNet routing, check UDRs and firewall policy. If both are correct, test the app listener on each API VM.
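
If you capture both command outputs as JSON, the interpretation above can be scripted. The sketch assumes the outputs contain an access field and a nextHopType field, which is how these results are typically shaped; the rule name in the usage note is hypothetical, and the expected hop type VnetLocal is an assumption for traffic that stays inside one VNet.

```python
import json


def triage(ip_flow_json, next_hop_json, expected_hop_type="VnetLocal"):
    """Return the next action based on IP flow verify and next hop results."""
    flow = json.loads(ip_flow_json)
    hop = json.loads(next_hop_json)
    if flow.get("access") == "Deny":
        # The NSG is blocking the flow; the rule name says which rule to review.
        return "fix NSG rule " + flow.get("ruleName", "unknown")
    if hop.get("nextHopType") != expected_hop_type:
        # Traffic is being steered somewhere you did not expect.
        return "check UDRs and firewall policy"
    return "test the app listener on each API VM"
```

For instance, triage('{"access": "Deny", "ruleName": "securityRules/deny-8443"}', '{"nextHopType": "VnetLocal"}') returns the NSG action first, because a denied flow makes the route question moot until the rule is fixed.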

Incident 3: on-premises storage private endpoint failure

This incident includes the recent change: public network access was disabled. That makes private endpoint DNS the prime suspect. From an on-premises client, the storage blob FQDN must resolve to the private endpoint IP, not a public address.

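# Run from an on-premises client. Expect the private endpoint IP, not a public address.
nslookup mystorage.blob.core.windows.net
# PowerShell: confirm TCP 443 is reachable at the resolved address.
Test-NetConnection mystorage.blob.core.windows.net -Port 443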

If the FQDN resolves publicly from on-premises, fix conditional forwarding or enterprise DNS records for the private endpoint zone. If it resolves to the private endpoint IP, inspect VPN routes, Azure Firewall rules, NSGs if applicable, and storage firewall settings. Also confirm the private endpoint connection is approved and the private DNS zone group has the correct record.

Decision tree:

On-premises cannot reach storage after public access disabled
|-- Does storage FQDN resolve to private endpoint IP from on-premises?
|   |-- No: fix conditional forwarding to Azure private DNS or create correct internal records.
|   |-- Yes: continue.
|-- Does VPN route on-premises traffic to the private endpoint subnet?
|   |-- No: fix routes, BGP, or local network gateway prefixes.
|   |-- Yes: continue.
|-- Do firewalls allow TCP 443 to the private endpoint IP?
|   |-- No: update Azure Firewall or on-prem firewall rule.
|   |-- Yes: inspect private endpoint approval, storage firewall, and service logs.
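
The decision tree above can also be written as a function that stops at the first failed layer. This is a sketch under assumptions: the caller supplies the DNS answer seen on-premises and the results of the route and firewall checks, and the addresses in the usage note are hypothetical.

```python
import ipaddress


def storage_next_step(resolved_ip, private_endpoint_ip, vpn_route_ok, tcp_443_allowed):
    """Walk the tree top to bottom and return the first fix to attempt."""
    if resolved_ip is None:
        return "fix DNS: no answer, check conditional forwarding or internal records"
    if resolved_ip != private_endpoint_ip:
        # Tell a public answer apart from an unexpected private one.
        scope = "private" if ipaddress.ip_address(resolved_ip).is_private else "public"
        return f"fix DNS: name resolves to a {scope} address that is not the private endpoint"
    if not vpn_route_ok:
        return "fix routes, BGP, or local network gateway prefixes"
    if not tcp_443_allowed:
        return "update Azure Firewall or on-premises firewall rule"
    return "inspect private endpoint approval, storage firewall, and service logs"
```

Calling storage_next_step("20.60.0.1", "10.30.3.5", True, True) returns the DNS fix even though the route and firewall checks passed, which mirrors the order of the tree: DNS is proven first.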

How to answer case-study questions

In a Microsoft exam case study, not every fact is equally important. Highlight the failing source, destination, protocol, and recent change, then map the symptom to the layer. "By IP works, by name fails" means DNS. "One subnet works, another fails" means VNet link, NSG, route, or firewall scope. "One backend fails" means probe, backend membership, guest firewall, app listener, or deployment drift. "After public access was disabled" means private endpoint path and DNS.
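
As a study aid, the symptom-to-layer mapping in this paragraph can be kept as a lookup table. The dictionary keys and the fallback message are assumptions made for the sketch; the layer lists come from the paragraph above.

```python
SYMPTOM_TO_LAYERS = {
    "by IP works, by name fails": ["DNS"],
    "one subnet works, another fails": ["VNet link", "NSG", "route", "firewall scope"],
    "one backend fails": [
        "probe",
        "backend membership",
        "guest firewall",
        "app listener",
        "deployment drift",
    ],
    "after public access was disabled": ["private endpoint path", "DNS"],
}


def candidate_layers(symptom):
    """Return the layers to test first for a symptom, or a prompt to gather more facts."""
    return SYMPTOM_TO_LAYERS.get(symptom, ["unclassified: identify the failing scope first"])
```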

Use the smallest fix that satisfies the requirement. If the private DNS zone is not linked to the spoke VNet, link it; do not recreate the storage account. If the probe path is wrong, update the probe; do not scale the VM. If an NSG denies the app port, add a precise allow rule; do not remove the NSG.

Administrator checklist

For every connectivity incident, capture:
- Source IP, subnet, and network location.
- Destination FQDN, expected IP, protocol, and port.
- Actual DNS result from the affected source.
- Effective route or next hop from the source or backend.
- NSG decision for the flow.
- Load balancer frontend, rule, backend pool, and probe state if applicable.
- Guest OS listener and firewall state.
- Recent changes to DNS, routes, NSGs, private endpoints, or app deployment.
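
One way to enforce the checklist is to record each incident in a structure that reports which evidence is still missing. The class and field names below are assumptions chosen to mirror the list above, not part of any Azure tool.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class IncidentRecord:
    source: Optional[str] = None          # source IP, subnet, and network location
    destination: Optional[str] = None     # FQDN, expected IP, protocol, and port
    dns_result: Optional[str] = None      # actual answer from the affected source
    next_hop: Optional[str] = None        # effective route or next hop
    nsg_decision: Optional[str] = None    # allow or deny for the flow
    lb_state: Optional[str] = None        # frontend, rule, pool, probe; "n/a" if unused
    guest_state: Optional[str] = None     # listener and guest firewall
    recent_changes: Optional[str] = None  # DNS, routes, NSGs, private endpoints, deployments


def missing_evidence(record):
    """List the checklist items that have not been captured yet."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]
```

An incident that only has source and destination filled in would still list six open items, which is a quick signal that it is too early to change production settings.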

This checklist is intentionally repetitive. Connectivity problems are solved by proving the path, not by memorizing one favorite command. In AZ-104, the correct answer is often the diagnostic that isolates the layer, not the most dramatic configuration change.

Test Your Knowledge

In the case lab, on-premises users cannot reach a storage account by FQDN after public network access was disabled. What should be checked first?


Internet users receive intermittent errors through a public load balancer. Which cause best matches an uneven backend symptom?


A web VM cannot connect to api.corp.internal by name. Which diagnostic should be performed from the web VM first?
