8.4 Load Balancing Troubleshooting Workflow

Key Takeaways

  • Troubleshooting should follow the packet path from client DNS to frontend IP, rule, probe, backend listener, NSG, route, and application response.
  • A healthy VM is not the same as a healthy load balancer backend; probe status controls new flow eligibility.
  • Standard Load Balancer failures frequently involve NSG rules, probe mismatch, backend pool membership, or asymmetric routing.
  • Testing from the right source network is critical because public, private, peered, and hybrid clients can see different results.
  • Use Network Watcher and guest OS tests together; neither view alone proves the whole path.
Last updated: May 2026

Troubleshoot by path, not by guessing

When a load-balanced service fails, the symptoms can be misleading. Users say the website is down, but the cause might be DNS, a public IP change, an NSG deny, a wrong probe path, a backend pool that is empty, a user-defined route to a firewall, or an application process that stopped listening. The administrator's job is to reduce the problem to a specific hop.

Use this workflow for most Azure Load Balancer cases:

Client reports failure
|-- Step 1: Does the client resolve the expected name to the load balancer frontend IP?
|-- Step 2: Is the client connecting to the expected protocol and port?
|-- Step 3: Does the load balancer rule exist for that frontend, protocol, and port?
|-- Step 4: Does the rule reference the correct backend pool and health probe?
|-- Step 5: Are backends in the pool and marked healthy?
|-- Step 6: Do NSGs allow client traffic and probe traffic?
|-- Step 7: Do routes deliver traffic to the backend and return traffic correctly?
|-- Step 8: Is the application listening and returning a valid response?

This order matters because it prevents unnecessary rebuilds. If DNS points to an old public IP, changing the health probe is wasted effort. If the probe is unhealthy because the app is down, changing the load balancer SKU does not solve the problem.
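As a sketch, the ordered workflow behaves like a fail-fast runner: each check must pass before the next is attempted, so the first failure names the hop to investigate. Every function below is a placeholder stub, not a real Azure test; probe_check is hard-coded to fail purely for illustration.

```shell
# Fail-fast runner over the workflow above; stops at the first failing hop.
run_checks() {
  for check in dns_check rule_check pool_check probe_check nsg_check route_check app_check; do
    if ! "$check"; then
      echo "FAILED at: $check"
      return 1
    fi
  done
  echo "all checks passed"
}

# Placeholder stubs; each would wrap a real test (nslookup, az CLI, curl, ...).
dns_check()   { true; }
rule_check()  { true; }
pool_check()  { true; }
probe_check() { false; }   # simulated failure: this run stops at the probe step
nsg_check()   { true; }
route_check() { true; }
app_check()   { true; }

run_checks || true   # prints: FAILED at: probe_check
```

Running the checks in this order encodes the rule above: a DNS failure is surfaced before anyone touches the probe or the SKU.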

Step 1: DNS and frontend IP

Confirm that the user is connecting to the intended endpoint. Public clients should resolve to the public frontend IP. Internal clients should resolve to the internal frontend IP. Hybrid clients may use conditional forwarding and private DNS, so test from a client on the affected network.

Commands:

nslookup api.contoso.com
Test-NetConnection api.contoso.com -Port 443
curl -v https://api.contoso.com/health
az network public-ip show -g rg-network -n pip-web --query ipAddress -o tsv

If the name resolves incorrectly, fix DNS before changing the load balancer. For internal load balancers, check the private DNS zone record and VNet links. For public load balancers, check the public A or CNAME record, TTL, and whether the public IP changed during redeployment.
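The DNS comparison can be scripted so a stale record is caught before any load balancer change. Both IP values below are hypothetical; in practice RESOLVED_IP would come from `dig +short api.contoso.com` on the affected client and EXPECTED_IP from the `az network public-ip show` query above.

```shell
# Hypothetical values standing in for live lookups.
EXPECTED_IP="20.50.60.70"   # from: az network public-ip show ... --query ipAddress
RESOLVED_IP="20.50.60.71"   # from: dig +short api.contoso.com (on the affected client)

if [ "$RESOLVED_IP" = "$EXPECTED_IP" ]; then
  echo "DNS points at the current frontend"
else
  echo "DNS mismatch: resolved $RESOLVED_IP, expected $EXPECTED_IP; fix DNS first"
fi
```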

Step 2: Rule and backend pool

The rule must match the user traffic. A TCP rule on frontend port 443 does not handle UDP, and a rule on frontend port 80 only helps if clients actually connect on port 80; HTTPS clients on 443 never reach it. The backend pool must contain the intended instances.

Portal path: Load balancer > Load balancing rules, then inspect the frontend IP, protocol, frontend port, backend port, backend pool, and health probe. Then go to Backend pools and confirm membership.

CLI checks:

az network lb rule list -g rg-network --lb-name lb-web-public -o table
az network lb address-pool show -g rg-network --lb-name lb-web-public -n be-web
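A quick way to spot a port mismatch is to pull just the port fields from the rule output. The JSON below is a mocked, trimmed fragment standing in for real `az network lb rule list -o json` output; the rule name and port values are assumptions for this example.

```shell
# Mocked, trimmed JSON; the real data comes from:
#   az network lb rule list -g rg-network --lb-name lb-web-public -o json
rules_json='[{"name":"rule-https","protocol":"Tcp","frontendPort":443,"backendPort":8443}]'

# Pull just the ports; a 443 -> 8443 mapping means direct backend tests must hit 8443.
echo "$rules_json" | grep -o '"frontendPort":[0-9]*' | cut -d: -f2
echo "$rules_json" | grep -o '"backendPort":[0-9]*'  | cut -d: -f2
```

Seeing the frontend and backend ports side by side catches the common case where a direct test against the VM on 443 fails even though the rule is correct, because the backend actually listens on a different port.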

Step 3: Probe health

Probe state explains many one-backend and no-backend failures. If all probes fail, the load balancer has no healthy destinations for new flows. If one probe fails, traffic distribution may look uneven because Azure correctly avoids that instance.

Test the probe endpoint from the backend itself first. If curl http://localhost:8080/health fails, fix the app before changing Azure. Then test from another VM in the VNet. If local succeeds but network test fails, look at the OS firewall, NSG, route table, and application binding address. An app bound only to 127.0.0.1 may pass local tests but fail from the network.
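The two tests above (curl from the backend itself, then curl from another VM in the VNet) split the fault domain cleanly, and the decision can be written down explicitly. This sketch only encodes the interpretation of the results, not the tests themselves; the yes/no flags stand in for the curl outcomes.

```shell
# Interpret the pair of probe tests: local curl result, then remote curl result.
classify_probe() {
  local local_ok=$1 remote_ok=$2
  if [ "$local_ok" = "yes" ] && [ "$remote_ok" = "yes" ]; then
    echo "probe path healthy; look elsewhere"
  elif [ "$local_ok" = "yes" ]; then
    echo "check OS firewall, NSG, routes, and binding address"
  else
    echo "fix the application before touching Azure"
  fi
}

classify_probe yes no   # prints: check OS firewall, NSG, routes, and binding address
```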

The NSG must also allow the probe. In many designs, an inbound allow from the AzureLoadBalancer service tag to the probe port is required. For client traffic, allow the client source or internet source to the application port as appropriate. Do not open broad ranges if a narrow source and port satisfy the requirement.
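As a sketch, an inbound allow for probe traffic from the AzureLoadBalancer service tag could look like the following; the NSG name, rule name, priority 200, and destination port 8080 are all assumptions for this example and should be replaced with your own values.

```shell
# Narrow inbound allow: only Azure's probe source, only the probe port.
az network nsg rule create \
  -g rg-network \
  --nsg-name nsg-web \
  -n Allow-AzureLB-Probe \
  --priority 200 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --source-address-prefixes AzureLoadBalancer \
  --destination-port-ranges 8080
```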

Step 4: Routes and asymmetric paths

User-defined routes can break load balancing. A subnet route might send traffic through a network virtual appliance, firewall, or gateway. That can be valid, but return traffic must be symmetric enough for stateful devices, and the appliance must allow the flow. If the client is on-premises, check ExpressRoute or VPN routes and whether the on-premises firewall knows the return path.

Next hop inspection is useful when the packet seems to vanish:

az network watcher show-next-hop \
  -g rg-network \
  --vm vm-web-1 \
  --source-ip 10.20.2.4 \
  --dest-ip 10.10.1.20

For inbound public load balancing, remember that the backend sees the original client IP as the packet source, so its reply is addressed to the client and must leave along a path the client's connection can accept. A UDR that forces 0.0.0.0/0 to a firewall changes that outbound path; if the firewall does not SNAT or track the inbound flow, replies are dropped as asymmetric. If a firewall is in the path, inspect firewall logs and rules.
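To see which routes a backend NIC actually receives, including UDRs that pull traffic toward an appliance, the effective route table is often the quickest view. The NIC name below is an assumption for this example.

```shell
# Show the routes the NIC actually uses (system defaults, BGP, and UDRs merged).
az network nic show-effective-route-table \
  -g rg-network \
  -n vm-web-1-nic \
  -o table
```

A 0.0.0.0/0 entry with next hop type VirtualAppliance in this output is the usual signature of a firewall in the path.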

Step 5: Guest OS and application

Azure can route only to a listener that exists. On Linux, check ss -lntp, systemctl status, and app logs. On Windows, check netstat -ano, Windows Firewall, IIS bindings, service status, and event logs. If the process listens on IPv6 only, loopback only, or a different port, the load balancer will not fix it.
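The loopback-only case can be spotted directly in the listener output. The lines below are a mocked `ss -lnt` excerpt (real data comes from the backend VM itself), parsed with awk to flag listeners the load balancer cannot reach.

```shell
# Mocked excerpt of `ss -lnt` output; the 8080 bind is the hypothetical app.
listeners='LISTEN 0 128 127.0.0.1:8080 0.0.0.0:*
LISTEN 0 128 0.0.0.0:22 0.0.0.0:*'

# Column 4 is the local address; a 127.x bind passes local curl tests but is
# unreachable from the load balancer frontend and its health probe.
echo "$listeners" | awk '$4 ~ /^127\./ {print "loopback-only listener: " $4}'
```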

Guest OS firewalls are easy to miss because Azure NSGs may look correct. Windows Defender Firewall or Linux firewall rules can block the backend port or probe port. In exam scenarios, if Azure configuration is correct but only a specific VM fails, the guest firewall or application listener becomes likely.

Fast comparison of symptoms

Symptom | Most likely area | First diagnostic
FQDN resolves to old address | DNS | nslookup from affected client.
Frontend IP reachable from one network but not another | NSG, route, firewall, DNS split horizon | Test from both sources.
No backends receive traffic | Probe, rule, backend pool, NSG | Check probe health and pool membership.
Only one backend receives traffic | Other backends unhealthy or session persistence | Inspect probe result per instance.
Backend works by direct IP but not through load balancer | Rule, probe, frontend, DNS | Compare direct listener test with frontend test.
Probe healthy but app response broken | Application path or dependency | Review app logs and dependency connectivity.

Exam approach

AZ-104 often describes a failure after a single change. Anchor on that change. If public network access was disabled on a service, a private endpoint DNS problem becomes the likely cause. If an NSG was tightened, check the application and probe ports. If a VM was replaced, check backend pool membership. If a new /healthz endpoint was deployed but the probe still checks /, update the probe.

Do not jump to deleting and recreating a load balancer. Most failures are configuration mismatches. A disciplined path test is faster and safer.

Test Your Knowledge

Users cannot reach api.contoso.com through a public load balancer. What is the best first check in a path-based workflow?

Test Your Knowledge

All backend VMs are running, but the load balancer sends no new flows to them. Which condition most directly explains this?

Test Your Knowledge

A backend app passes curl localhost but fails from another VM. What should you investigate next?

A
B
C
D