2.1 Structured troubleshooting method & root-cause analysis

Key Takeaways

  • The troubleshooting cycle runs in a fixed order: identify the problem, theorize a probable cause, test it, plan, implement, verify, and document.
  • Verifying and documenting are mandatory, because an unverified fix is only a guess and an undocumented one forces the next technician to start over.
  • The 5 Whys drills past symptoms to the deepest removable cause, which is where a durable fix belongs.
  • Divide-and-conquer testing halves the suspects with each check, isolating a fault in a chain of eight components in about three tests instead of eight.
  • Changing only one variable at a time is what lets you attribute a result to a specific cause.
Last updated: July 2026

The six-step troubleshooting cycle

Cyber work rewards a repeatable method over guessing. The industry-standard model, mirrored on the DoD Cyber Test, walks through a fixed order of stages. Skipping a stage is a common and costly mistake, because it lets you "fix" a symptom while the real fault survives.

The steps in order

  1. Identify the problem. Gather information, question the user, and note what changed. A precise statement ("laptops on the third floor lost network access at nine o'clock") beats a vague one ("the internet is broken").
  2. Establish a theory of probable cause. Ask what could produce these exact symptoms, list candidate causes, then rank them from most to least likely.
  3. Test the theory. Prove or disprove the top candidate with one deliberate check. If it is confirmed, move on; if not, form a new theory or escalate.
  4. Create a plan of action. Decide the fix, its rollback, and its side effects before you touch anything.
  5. Implement the fix (or escalate). Apply one change at a time so you can tell what actually worked.
  6. Verify full functionality, add preventive measures, then document the cause, the fix, and the outcome.

A memory hook: Identify, Theorize, Test, Plan, Implement, Verify, Document. The hook has seven words because step six bundles verification and documentation. Neither is an optional pleasantry: an unverified fix is a guess, and an undocumented fix helps no one the next time the fault returns.
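The fixed order can be sketched in code as a ticket that refuses to complete a stage out of sequence. This is a minimal illustration, not a real ticketing API: the `Stage` and `Ticket` names are hypothetical, and the loop back from step three to step two when a theory is disproved is left out for brevity.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The six steps, numbered in the order they must happen."""
    IDENTIFY = 1
    THEORIZE = 2
    TEST = 3
    PLAN = 4
    IMPLEMENT = 5
    VERIFY_AND_DOCUMENT = 6


class Ticket:
    """Hypothetical ticket that only lets stages complete in order."""

    def __init__(self) -> None:
        self.completed: list[Stage] = []

    def is_closed(self) -> bool:
        # Closed only once every stage, including step six, is done.
        return len(self.completed) == len(Stage)

    def complete(self, stage: Stage) -> None:
        if self.is_closed():
            raise ValueError("ticket is already closed")
        expected = Stage(len(self.completed) + 1)
        if stage != expected:
            raise ValueError(
                f"cannot do {stage.name} before {expected.name}"
            )
        self.completed.append(stage)
```

Calling `complete(Stage.TEST)` straight after `complete(Stage.IDENTIFY)` raises an error in this sketch, which is the code equivalent of jumping to a check before you have a theory to check.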

One extra habit sharpens step one: always ask what changed just before the symptom appeared. Many faults trace to a recent change, whether a new patch, a configuration edit, or a cable someone moved, so the question "what is different since it last worked?" often hands you a strong theory for free. When two symptoms appear together, also ask whether one caused the other or whether both share a single upstream cause; scoping that relationship early prevents you from chasing two problems that are really one.
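If changes are logged with timestamps, the "what changed?" question becomes a simple filter. The sketch below assumes a change log held as (time, description) pairs; the `changes_before` helper, the 24-hour default window, and every log entry are made up for illustration.

```python
from datetime import datetime, timedelta


def changes_before(onset, change_log, window=timedelta(hours=24)):
    """Return changes made within `window` before `onset`, newest first."""
    recent = [
        (when, what)
        for when, what in change_log
        if onset - window <= when <= onset
    ]
    # The most recent change is usually the first one worth suspecting.
    return sorted(recent, key=lambda entry: entry[0], reverse=True)


# Hypothetical change log; entries and times are illustrative only.
change_log = [
    (datetime(2026, 7, 1, 14, 0), "patch applied to file server"),
    (datetime(2026, 7, 6, 8, 30), "switch port reconfigured"),
    (datetime(2026, 7, 6, 8, 50), "patch cable moved in closet"),
]
onset = datetime(2026, 7, 6, 9, 0)
suspects = changes_before(onset, change_log)
```

Here `suspects` holds the cable move and the switch reconfiguration, in that order, while the week-old patch falls outside the window. The list is a starting point for step two, not proof of cause.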

Test Your Knowledge

In the six-step troubleshooting cycle, what must you do immediately after you establish a theory of probable cause?


Worked example: mapping symptoms to steps

Suppose a web server stops responding at ten in the morning.

Step | Action taken | Result
Identify | Confirm the outage began at ten and hits only external users | Internal users are fine
Theory | Suspect the firewall rule changed during the nine forty-five deploy | Ranked most likely
Test | Compare current firewall rules to yesterday's backup | One inbound rule is missing
Plan | Re-add the rule; rollback is to remove it again | Low risk
Implement | Add the single missing rule | Change applied
Verify | Load the site from an outside network | Site responds
Document | Record cause (deploy dropped a rule) and fix | Ticket closed

Each step feeds the next. Because the problem was scoped to external users only, the theory pointed at the perimeter firewall rather than the server software, saving hours of looking in the wrong place.
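The test step in this example, comparing the live rules to yesterday's backup, amounts to a set difference. The sketch below assumes each rule can be represented as one line of text; the `missing_rules` helper and the rule strings are hypothetical and do not follow any particular firewall's syntax.

```python
def missing_rules(backup, current):
    """Rules present in the backup but absent from the live rule set."""
    return sorted(set(backup) - set(current))


# Hypothetical rule sets, written in a made-up one-line-per-rule format.
yesterday = [
    "allow tcp 22 from admin-net",
    "allow tcp 80 from any",
    "allow tcp 443 from any",
]
today = [
    "allow tcp 22 from admin-net",
    "allow tcp 443 from any",
]

print(missing_rules(yesterday, today))  # ['allow tcp 80 from any']
```

A single deliberate comparison like this either confirms the theory or rules it out, which is all step three asks for.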

Root-cause analysis

Fixing a symptom stops the pain; root-cause analysis stops the recurrence. Two techniques appear on aptitude tests: the 5 Whys and divide and conquer.

The 5 Whys

Ask "why" until you reach a cause you can actually remove.

  • Why is the site down? The firewall is blocking port eighty.
  • Why is it blocking port eighty? A rule was deleted.
  • Why was the rule deleted? The deploy script rebuilds all rules from a template.
  • Why does the template omit the rule? No one added it after the last change.
  • Why was it never added? Rule changes are not tracked in version control.

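The genuine root cause is the last answer, untracked rule changes, not the blocked port. Re-adding the rule by hand ends today's outage but leaves the next deploy free to delete it again; the durable fix is to track rule changes so the template stays current.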
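The chain of whys above can be modeled as a lookup from each effect to its cause, followed until no deeper cause is recorded. This is only a sketch of the idea: the `drill_down` helper is hypothetical, and the dictionary paraphrases the five answers rather than quoting them.

```python
def drill_down(symptom, causes):
    """Follow effect -> cause links until no deeper cause is recorded."""
    chain = [symptom]
    while chain[-1] in causes:
        deeper = causes[chain[-1]]
        if deeper in chain:
            # Guard against circular reasoning in the cause map.
            raise ValueError(f"circular cause: {deeper}")
        chain.append(deeper)
    return chain


# Each key is an effect; each value is the answer to "why?" for it.
causes = {
    "site is down": "firewall blocks port 80",
    "firewall blocks port 80": "rule was deleted",
    "rule was deleted": "deploy script rebuilds rules from a template",
    "deploy script rebuilds rules from a template":
        "no one added the rule to the template",
    "no one added the rule to the template":
        "rule changes are not tracked in version control",
}

chain = drill_down("site is down", causes)
root_cause = chain[-1]
```

The last element of `chain` is the untracked rule changes, five whys below the symptom. The code only walks links a person has already reasoned out; it does not discover causes on its own.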

Test Your Knowledge

A data path runs through eight components in series and exactly one is failing. Using divide-and-conquer, what is the maximum number of tests needed to isolate the faulty component?


Divide and conquer (binary search)

When a fault could live anywhere along a chain, cut the chain in half and test the midpoint. Each test eliminates half of the remaining possibilities. Imagine data passes through eight components in series, numbered one to eight, and one is failing.

  • Test after component four. If the data is still good there, the fault is in components five through eight.
  • Test after component six. If the data is bad there, the fault is component five or six.
  • Test after component five. If good, the culprit is component six.

Three tests located the fault among eight suspects, because each check halved the field: eight to four to two to one. Checking components one by one would have taken up to eight tests, which is why starting in the middle beats starting at the beginning for long chains.
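The halving procedure is an ordinary binary search. The sketch below assumes exactly one component is failing and that a fault corrupts the data at every point downstream of it; `find_faulty` and `good_after` are hypothetical names, with `good_after(k)` standing in for whatever real check tells you the data is still good after component k.

```python
from typing import Callable


def find_faulty(n: int, good_after: Callable[[int], bool]) -> tuple[int, int]:
    """Locate the single failing component in a chain numbered 1..n.

    Returns (faulty component, number of tests used).
    """
    lo, hi = 1, n  # the fault lies somewhere in components lo..hi
    tests = 0
    while lo < hi:
        mid = (lo + hi) // 2
        tests += 1
        if good_after(mid):
            lo = mid + 1  # data still good: fault is further downstream
        else:
            hi = mid      # data already bad: fault is at or before mid
    return lo, tests


# Simulate the walkthrough above, where component six is the culprit.
faulty = 6
result = find_faulty(8, lambda k: k < faulty)
print(result)  # (6, 3)
```

With eight components the sketch probes after components four, six, and five, matching the walkthrough, and it needs three tests no matter which component has failed.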

Knowing when to escalate is part of the method, not a failure of it. If your top two or three theories are all disproved, or the fix would exceed your authority or your tolerance for risk, hand the problem up with your notes attached rather than guessing at increasingly unlikely causes. A clean record of what you already tested spares the next responder from repeating your steps, and it turns your dead ends into useful eliminations for whoever picks up the ticket.

Common traps

  • Changing more than one thing at once. If you swap the cable and reboot the router and it works, you never learn which fix mattered, and you may have added a new fault.
  • Confirmation bias. Testing only the theory you like, and ignoring evidence against it, wastes the test step.
  • Stopping at the first good sign. A green light on the router does not prove end-to-end connectivity; verify at the layer the user cares about.
  • Skipping documentation. The fault will return, possibly to a different technician, and the clock restarts from zero.
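The first trap has a simple discipline as its cure: apply one candidate fix, verify, and undo it before trying the next. The sketch below assumes every change comes with a rollback; `apply_one_at_a_time`, the `state` dictionary, and both candidate fixes are hypothetical stand-ins for real actions.

```python
def apply_one_at_a_time(changes, verify):
    """Apply candidate fixes one by one, verifying after each.

    changes: list of (name, apply, rollback) entries.
    verify:  callable returning True once the system works.
    Returns the name of the change that fixed the fault, or None.
    """
    for name, apply, rollback in changes:
        apply()
        if verify():
            return name
        rollback()  # undo, so the next change is tested in isolation
    return None


# Toy system state; in this scenario only the cable is actually at fault.
state = {"cable": "bad", "router": "stale"}


def works():
    return state["cable"] == "good"


changes = [
    ("reboot router",
     lambda: state.update(router="fresh"),
     lambda: state.update(router="stale")),
    ("swap cable",
     lambda: state.update(cable="good"),
     lambda: state.update(cable="bad")),
]

fixed_by = apply_one_at_a_time(changes, works)
```

Here `fixed_by` comes back as the cable swap, and the router is left as it was found, so the result can be attributed to one cause and written up as such.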

Treat the method as a checklist: scope tightly, theorize, test one variable, then verify and record. The exam rewards the candidate who follows the order over the one who jumps straight to a plausible-sounding fix.

Test Your Knowledge

Using the 5 Whys, a site is down because a deploy script deleted a firewall rule while rebuilding rules from a template, and rule changes are never tracked in version control. What is the true root cause?
