11. Network Debugging: opening envelopes at every hop
~17 min read. Stop guessing. Inspect the path, the name, and the packet.
Built on the ELI5 in 00-eli5.md. The envelope analogy carries over: debugging means opening the envelope at each post office to see where the truth changes.
1) Start with a calm debugging ladder
When traffic fails, panic creates random commands and random conclusions. Do not debug like that. Use a fixed ladder every single time. First ask whether the name resolves correctly. Then ask whether the host is reachable at all. Then ask whether the route is sane. Then ask whether the service is listening. Then ask whether the application is rejecting the request. This order saves enormous time. A broken DNS answer can mimic a dead server. A firewall drop can mimic an application timeout. A TLS mismatch can mimic a network outage. So isolate one layer before touching the next. Keep this ladder in your head.
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│   DNS    │→│ Reachable│→│  Route   │→│ Port open│→│ App okay │
└──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘
Debugging is really controlled subtraction. Remove uncertainty, one layer at a time. Also capture facts while debugging. Exact hostname matters. Exact source machine matters. Exact timestamp matters. Exact destination port matters. Without those four facts, teammates cannot reproduce the problem reliably. In envelope terms, know the address on the packet (the IP) and the phone book entry (the DNS name) before anything else.
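The ladder can be sketched as an ordered list of checks that stops at the first failure. This is a minimal illustration, not a real tool: the function name `first_failing_layer` and the lambda checks are hypothetical stand-ins for commands such as dig, curl, or ss.

```python
from typing import Callable, List, Optional, Tuple

# Each rung pairs a layer name with a check that returns True when that layer is healthy.
Rung = Tuple[str, Callable[[], bool]]


def first_failing_layer(ladder: List[Rung]) -> Optional[str]:
    """Run the checks in order and stop at the first layer that fails."""
    for layer, check in ladder:
        if not check():
            # Stop here: later rungs stay untested until this one is fixed.
            return layer
    return None  # Every rung passed.


# Hypothetical stand-in checks; real ones would shell out or open sockets.
ladder = [
    ("DNS", lambda: True),
    ("Reachable", lambda: True),
    ("Route", lambda: True),
    ("Port open", lambda: False),
    ("App okay", lambda: True),
]
print(first_failing_layer(ladder))  # Port open
```

Returning on the first failure is the point: it prevents conclusions about the application while a lower layer is still unproven.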
2) Confirm names and basic reachability
Use nslookup or dig to check DNS answers.
Look for returned IPs, TTL values, and authoritative servers.
If production returns 34.120.10.8 today, note it exactly.
If staging returns 10.0.4.22, note that difference too.
A split-horizon DNS setup often changes answers by location.
That is normal, but you must notice it.
Worked example.
api.shop.internal returns 10.2.8.15 inside the VPC.
The same name returns nothing on your laptop.
That means private DNS is working as designed.
It does not mean the service is dead.
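A small sketch of the same check in Python, assuming you want to record what a name resolves to from the current machine and whether the answer is a private address. The helper names `resolve` and `scope` are made up for this example; the IPs are the ones used in this section.

```python
import ipaddress
import socket
from typing import List


def resolve(name: str) -> List[str]:
    """Return the sorted unique IPs the local resolver gives for name, or [] on failure."""
    try:
        infos = socket.getaddrinfo(name, None, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []  # No answer from this vantage point.
    return sorted({info[4][0] for info in infos})


def scope(ip: str) -> str:
    """Label an address as private or public."""
    return "private" if ipaddress.ip_address(ip).is_private else "public"


print(scope("10.2.8.15"))    # private
print(scope("34.120.10.8"))  # public
```

Running `resolve` for the same name from a laptop and from inside the VPC, then comparing the two lists, makes a split-horizon difference visible instead of surprising.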
After DNS, try ping if ICMP is allowed.
But learn the limitation immediately.
Many systems block ICMP even when the app is healthy.
So failed ping is not a final verdict.
It is one clue only.
Traceroute then shows the path hop by hop.
Each hop is one post office handling the envelope.
If the path dies after hop three, investigate there.
If latency jumps from 4 ms to 120 ms, note that jump.
Laptop
│ 1 ms
▼
Office router
│ 3 ms
▼
ISP edge
│ 5 ms
▼
Cloud ingress
│ 120 ms ← suspicious jump
▼
Service subnet
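To make "note that jump" concrete, here is a rough sketch that flags large latency increases between consecutive hops. The hop values come from the diagram above; the function name `suspicious_jumps` and the 50 ms threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple


def suspicious_jumps(
    hops: List[Tuple[str, float]], threshold_ms: float = 50.0
) -> List[Tuple[str, str, float]]:
    """Return (from_hop, to_hop, increase_ms) wherever latency rises past the threshold."""
    jumps = []
    for (prev_name, prev_ms), (name, ms) in zip(hops, hops[1:]):
        increase = ms - prev_ms
        if increase > threshold_ms:
            jumps.append((prev_name, name, increase))
    return jumps


# Latency to reach each hop, taken from the diagram; the threshold is an assumption.
hops = [
    ("Office router", 1.0),
    ("ISP edge", 3.0),
    ("Cloud ingress", 5.0),
    ("Service subnet", 120.0),
]
print(suspicious_jumps(hops))  # [('Cloud ingress', 'Service subnet', 115.0)]
```

A flagged hop is a place to look, not a verdict: some routers answer traceroute probes slowly while forwarding real traffic quickly, so check whether later hops stay slow too.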
Now use curl for application truth.
curl -I checks headers quickly.
curl -v shows DNS resolution, TCP connect, TLS, and HTTP status.
One verbose request often saves twenty vague theories.
If you get HTTP/1.1 503, the network path probably worked.
The failure moved upward into the application or dependency layer.
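The staged view that curl -v gives can be approximated in a few lines. This is a simplified sketch under stated assumptions: it speaks plain HTTP only (no TLS stage), the function name `probe` and its return labels are invented here, and it reports the first stage that fails.

```python
import http.client
import socket
from typing import Tuple, Union


def probe(host: str, port: int, path: str = "/", timeout: float = 3.0) -> Tuple[str, Union[str, int]]:
    """Return the first failing stage with a detail, or ("http", status) if a response arrived."""
    # Stage 1: name resolution.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return ("dns", str(exc))
    ip = infos[0][4][0]

    # Stage 2: TCP connect to the resolved address.
    try:
        sock = socket.create_connection((ip, port), timeout=timeout)
    except OSError as exc:
        return ("tcp", str(exc))
    sock.close()

    # Stage 3: a plain HTTP HEAD request, keeping the original name in the Host header.
    conn = http.client.HTTPConnection(ip, port, timeout=timeout)
    try:
        conn.request("HEAD", path, headers={"Host": host})
        status = conn.getresponse().status
    except (OSError, http.client.HTTPException) as exc:
        return ("http-transport", str(exc))
    finally:
        conn.close()
    return ("http", status)
```

A result such as `("http", 503)` says that resolution and the TCP connect both worked, which matches the reading above: the failure has moved up into the application or one of its dependencies.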
3) Inspect packets when the symptom still lies
Sometimes higher-level tools disagree with each other.
That is when packet capture becomes gold.
Use tcpdump on servers for lightweight terminal captures.
Use Wireshark when you need deeper packet timelines visually.
Packet capture answers brutally simple questions.
Did the SYN leave the client?
Did the SYN-ACK return?
Did TLS ClientHello arrive?
Did the server reset the connection?
These questions end arguments very quickly.
Look at a healthy TCP start.
Client                                   Server
│ SYN seq=100 ─────────────────────────▶ │
│ ◀───────── SYN,ACK seq=500 ack=101     │
│ ACK seq=101 ack=501 ─────────────────▶ │
│ TLS ClientHello ─────────────────────▶ │
Now look at a SYN flood symptom.
Client Server
│ SYN 100 ─────────▶ │
│ SYN 200 ─────────▶ │
│ SYN 300 ─────────▶ │
│ ...many half-open connections... │
Worked example with concrete numbers.
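Suppose curl hangs for 3.0 seconds before failing.
tcpdump shows the same SYN retransmitted, with no reply in between.
On Linux the retry gaps typically double (about 1 s, then 2 s), so do not expect even spacing.
That pattern means no SYN-ACK is returning.
So the failure is likely routing, a firewall drop, or a host that is down.
A closed port on a live host usually answers with RST instead of silence.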
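The reasoning in this example can be written as a tiny classifier over parsed capture records. The `(seconds, flags)` record format is a hypothetical simplification of tcpdump output, the timestamps are illustrative, and `handshake_verdict` is a name invented for this sketch.

```python
from typing import List, Tuple


def handshake_verdict(packets: List[Tuple[float, str]]) -> str:
    """Classify one connection attempt from (seconds, tcp_flags) records."""
    flags = [flag for _, flag in packets]
    syn_count = flags.count("SYN")
    if "RST" in flags or "RST,ACK" in flags:
        return "reset by peer"  # Something answered, and it said no.
    if "SYN,ACK" in flags:
        return "handshake answered"
    if syn_count > 1:
        return f"{syn_count} SYNs sent, no SYN-ACK returned"
    if syn_count == 1:
        return "single SYN seen, too early to judge"
    return "no SYN seen"


silent = [(0.0, "SYN"), (1.0, "SYN"), (3.0, "SYN")]
healthy = [(0.0, "SYN"), (0.004, "SYN,ACK"), (0.004, "ACK")]
print(handshake_verdict(silent))   # 3 SYNs sent, no SYN-ACK returned
print(handshake_verdict(healthy))  # handshake answered
```

Separating "reset" from "silence" matters because they point at different layers: a reset means a host answered, while silence points at the path or a drop rule.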
Different example now.
curl -v https://api.example.com connects instantly.
TLS handshake completes in 42 ms.
Then response stalls for 12 seconds.
Packets are flowing.
So the network path is not your main suspect anymore.
The app or downstream dependency probably blocks the response.
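One way to turn this example into numbers is to split a request into phases and find the slowest. The sketch below assumes cumulative timings named after curl's --write-out variables; the 42 ms TLS handshake and 12 second stall come from the example above, while the DNS and connect values are assumed for illustration.

```python
from typing import Dict, Tuple

# Cumulative timings in seconds, in the order the phases complete.
ORDER = ["time_namelookup", "time_connect", "time_appconnect", "time_starttransfer"]
LABELS = {
    "time_namelookup": "dns",
    "time_connect": "tcp_connect",
    "time_appconnect": "tls_handshake",
    # Approximation: this phase also includes sending the request.
    "time_starttransfer": "server_wait",
}


def slowest_phase(timings: Dict[str, float]) -> Tuple[str, float]:
    """Convert cumulative timings into per-phase durations and return the largest."""
    previous = 0.0
    durations = {}
    for key in ORDER:
        durations[LABELS[key]] = timings[key] - previous
        previous = timings[key]
    name = max(durations, key=durations.get)
    return name, durations[name]


sample = {
    "time_namelookup": 0.004,      # assumed
    "time_connect": 0.010,         # assumed: "connects instantly"
    "time_appconnect": 0.052,      # 42 ms TLS handshake
    "time_starttransfer": 12.052,  # 12 s stall before the first byte
}
name, seconds = slowest_phase(sample)
print(name, f"{seconds:.1f}")  # server_wait 12.0
```

When the dominant phase is the wait for the first byte, the connection itself was fast, so the application or a downstream dependency is the better suspect.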
Packet captures should answer a hypothesis, not replace thinking.
Otherwise you drown in beautiful but useless bytes.
4) Use a repeatable methodology under pressure
Start from the client side first.
Reproduce from a machine near the failing user if possible.
Then reproduce from inside the same VPC or cluster.
Then reproduce directly from the target host itself.
This triangle tells you where the problem begins.
If local host succeeds but remote client fails, inspect the middle.
If everybody fails, inspect the target service immediately.
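The triangle can be expressed as a small decision function. This is only a sketch of the rule described above; the function name `locate_fault` and the wording of the returned labels are assumptions.

```python
def locate_fault(near_user_ok: bool, same_vpc_ok: bool, target_host_ok: bool) -> str:
    """Suggest where to look, given success or failure from three vantage points."""
    if not target_host_ok:
        # Even a request from the host itself fails, so start at the service.
        return "target service"
    if not same_vpc_ok:
        return "between the VPC and the target host"
    if not near_user_ok:
        return "between the user and the VPC"
    return "not reproduced"


print(locate_fault(near_user_ok=False, same_vpc_ok=True, target_host_ok=True))
# between the user and the VPC
print(locate_fault(near_user_ok=False, same_vpc_ok=False, target_host_ok=False))
# target service
```

Checking the innermost vantage point first keeps the answer honest: there is no value in blaming the middle of the path while the service fails locally.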
Use a tiny checklist.
1. dig the name.
2. ping only if policy allows ICMP.
3. traceroute the path.
4. curl -v the endpoint.
5. ss -lntp or netstat for listening ports.
6. tcpdump or Wireshark when packets must settle disputes.
Now a full worked example.
A mobile client reports checkout timeout at 10:05 AM.
dig checkout.example.com returns 18.60.8.14 with TTL 60.
traceroute reaches the load balancer in 11 hops.
curl -vk https://18.60.8.14/health returns 200 in 95 ms (-k skips certificate verification, which a bare IP would normally fail).
But curl -v https://checkout.example.com/pay returns 502.
This tells you DNS and network reachability look fine.
The hostname path, host header, or upstream app target is broken.
Another example with private networking.
From a bastion host, dig db.internal returns 10.3.9.20.
nc -zv 10.3.9.20 5432 times out.
tcpdump on the database shows nothing arriving.
So packets die before reaching the database host.
Next checks are route tables, NACLs, and security groups.
That is disciplined narrowing, not magic.
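A rough Python equivalent of the nc -zv check, useful when you want to tell "refused" apart from "timed out" in a script. The function name `tcp_port_state` and its return labels are invented for this sketch.

```python
import socket


def tcp_port_state(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connect and describe the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but nothing is listening.
        return "refused"
    except socket.timeout:
        # Silence: packets are probably dropped before or at the host.
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
```

In the database example above, the expected result from the bastion would be "timeout", which agrees with tcpdump seeing nothing arrive and points at route tables, NACLs, or security groups rather than PostgreSQL itself.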
Please remember one final warning.
Do not run packet captures everywhere forever.
Capture narrowly, with a reason, and for a bounded duration.
Otherwise noise consumes your attention and storage.
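One way to keep captures narrow is to build the command from explicit limits. The sketch below assembles a tcpdump argument list with a host and port filter and a packet cap; the function name `capture_command`, the interface name, and the output file are assumptions for illustration.

```python
from typing import List


def capture_command(
    interface: str, host: str, port: int, max_packets: int, outfile: str
) -> List[str]:
    """Build a tcpdump argument list limited to one host, one port, and N packets."""
    return [
        "tcpdump",
        "-n",                    # Do not resolve addresses to names.
        "-i", interface,         # Capture on one interface only.
        "-c", str(max_packets),  # Stop after this many packets.
        "-w", outfile,           # Write raw packets to a file for later analysis.
        "host", host, "and", "tcp", "port", str(port),
    ]


cmd = capture_command("eth0", "10.3.9.20", 5432, 200, "db.pcap")
print(" ".join(cmd))
# tcpdump -n -i eth0 -c 200 -w db.pcap host 10.3.9.20 and tcp port 5432
```

The packet cap bounds storage; to bound time as well, the list could be passed to `subprocess.run` with a `timeout` argument.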
Where this lives in the wild
- SRE at Cloudflare: traces latency spikes hop by hop before blaming origin servers.
- Platform engineer at PhonePe: uses dig and curl -v to separate DNS drift from API failures.
- Production support lead at Google: captures SYN and RST patterns to confirm connection-path issues quickly.
- Security analyst at Microsoft: inspects packet captures in Wireshark during suspected exfiltration or scanning events.
- Infra engineer at Dream11: debugs private-database reachability from bastions using traceroute alternatives and tcpdump.
Pause and recall
- Why should DNS be checked before packet capture in most incidents?
- What does a successful curl -v but failed business request usually suggest?
- Why is failed ping not enough to declare a service down?
- What exact question should tcpdump answer before you start capturing?
Interview Q&A
Q1. How do you debug a timeout systematically?
Check DNS, reachability, route, port availability, and application response in order.
Then use packet capture only when those checks still leave ambiguity.
Common wrong answer to avoid: “I start with tcpdump because packets never lie.”
Q2. What is traceroute actually telling you?
It reveals the path hops and where latency or drops appear.
It does not prove the application itself is healthy.
Common wrong answer to avoid: “If traceroute reaches the host, the whole request path is fine.”
Q3. When is curl -v more useful than ping?
When you need DNS, TCP, TLS, and HTTP details in one view.
It tests the real application path more directly.
Common wrong answer to avoid: “Ping is always the best first network command.”
Q4. Why use Wireshark if tcpdump already exists?
Wireshark helps when visual timelines, streams, and protocol decoding matter.
Tcpdump is lighter for fast captures on live servers.
Common wrong answer to avoid: “Wireshark is only for beginners who fear the terminal.”
Apply now (5 min)
Pick one URL you can safely test right now.
Run dig, traceroute, and curl -v against it.
Write three facts: resolved IP, hop count, and HTTP status.
Then imagine one symptom and state the next command you would run.
Sketch from memory: draw the debugging ladder and place each tool beside its layer.
Bridge. Excellent. Once you can see the path clearly, the next question is control. How do we slow abusive traffic before it crushes the path? → 12-rate-limiting-at-network-layer.md