02. DNS Deep Dive: how the internet's phone book really answers

~13 min read. Name lookups feel tiny, but they shape every first request.

Builds on the ELI5 in 00-eli5.md. The phone book, a name-to-number lookup, decides which address your packets chase.

1) Start with the simple question

Humans remember names like api.bank.example. Networks forward packets using numeric destinations. DNS translates names into records that clients can use. Usually the browser asks a local stub resolver first. That stub forwards the hard work to a recursive resolver.

┌────────────┐     query      ┌──────────────┐
│  Browser   │ ─────────────▶ │ Stub resolver│
└────┬───────┘                └────┬─────────┘
     │                             │
     │                             ▼
     │                        ┌──────────────┐
     └──────────────────────▶ │ Recursive DNS│
                              └──────────────┘

The browser wants one answer. The recursive resolver does the walking. That difference creates recursive versus iterative behavior.
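To make the stub's request concrete, here is a minimal Python sketch of the wire-format question a stub might send to its recursive resolver, following the classic DNS message layout (12-byte header, then the question). The function name `build_query` and the fixed query ID are illustrative assumptions; a real stub picks a random ID per query. The detail worth noticing is the RD (recursion desired) flag, which is how a client asks for a final answer rather than a clue.

```python
import struct

def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Encode one DNS question. qtype 1 means an A record, class 1 means IN."""
    flags = 0x0100  # only the RD (recursion desired) bit is set
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, and a zero byte ends the name
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", qtype, 1)
    return header + question

packet = build_query("api.bank.example")
print(len(packet))          # 34
print(packet[2:4].hex())    # 0100
```

The sketch only builds bytes and sends nothing. Sending them over UDP port 53 and parsing the reply is left out on purpose.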

2) Recursive versus iterative, very clearly

In recursive resolution, the client asks for a final answer. The resolver must return the final records (an address, possibly reached through a CNAME chain) or a failure. In iterative resolution, each server gives the best next clue. It says, "I do not know, but ask this next server."

Root servers do not know your final host record. They know which TLD servers handle .com, .in, or .org. TLD servers do not know every host either. They know which authoritative servers own a specific domain. Authoritative servers hold the final source records.

A worked example keeps the flow memorable.
Query: shop.example.com.
Root reply: "Ask the .com TLD servers."
TLD reply: "Ask ns1.example.com."
Authoritative reply: shop.example.com is 203.0.113.10.

So the browser asked recursively once. The recursive resolver asked iteratively several times.
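The walk in the worked example can be sketched as a toy simulation. Everything here is in memory and illustrative: the `SERVERS` table, the server labels `root` and `tld-com`, and the helper names are assumptions made for the sketch, not a real resolver API. Each `ask` call plays one iterative query, and `resolve` plays the recursive resolver that keeps following referrals.

```python
# Toy "servers": each one knows either referrals (next clue) or final answers.
SERVERS = {
    "root": {"referrals": {"com": "tld-com"}},
    "tld-com": {"referrals": {"example.com": "ns1.example.com"}},
    "ns1.example.com": {"answers": {"shop.example.com": "203.0.113.10"}},
}

def ask(server, name):
    """One iterative query: ("answer", ip), ("referral", next_server), or ("unknown", None)."""
    zone = SERVERS[server]
    if name in zone.get("answers", {}):
        return "answer", zone["answers"][name]
    for suffix, next_server in zone.get("referrals", {}).items():
        if name == suffix or name.endswith("." + suffix):
            return "referral", next_server
    return "unknown", None

def resolve(name, max_hops=8):
    """What the recursive resolver does for a single client question."""
    server, path = "root", []
    for _ in range(max_hops):
        path.append(server)
        kind, value = ask(server, name)
        if kind == "answer":
            return value, path
        if kind == "unknown":
            return None, path
        server = value  # follow the referral to the next server
    return None, path

print(resolve("shop.example.com"))
# ('203.0.113.10', ['root', 'tld-com', 'ns1.example.com'])
```

The client sees one call to `resolve`. The three entries in the returned path are the iterative questions it never had to ask itself.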

3) Walk the hierarchy with concrete timing

Assume a cold lookup from Bengaluru to a public resolver.

Local stub to recursive resolver: 4 ms.
Recursive to root server: 18 ms.
Recursive to .com TLD server: 16 ms.
Recursive to authoritative server: 20 ms.
Recursive back to client with final answer: 4 ms.
Total visible DNS time: about 62 ms.

Now watch the same lookup with caching. If the recursive resolver already has the .com TLD servers cached, it skips the root query and saves 18 ms. If it also has the authoritative servers for example.com cached, it skips the TLD query and saves another 16 ms. If it also has the final A record cached, it skips the authoritative query and saves another 20 ms. Then the user sees only the local legs to the resolver and back, 4 ms + 4 ms = 8 ms. That is why cache locality feels magical.

A hierarchy diagram helps fix names in memory.

   ┌──────────────┐
   │ Root servers │
   └──────┬───────┘
          │ referral for .com
          ▼
   ┌──────────────┐
   │ TLD servers  │
   └──────┬───────┘
          │ referral for example.com
          ▼
┌──────────────────────┐
│  Authoritative DNS   │
│   ns1.example.com    │
└─────────┬────────────┘
          │ final A/AAAA/CNAME answer
          ▼
     203.0.113.10
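The arithmetic is easy to check by summing the legs listed above. This is a small sketch under the section's own assumed latencies; the leg names and the `lookup_ms` helper are illustrative. It counts both 4 ms local legs, so a fully cached answer costs 8 ms in this model.

```python
# Latency of each leg in milliseconds, taken from the example above.
LEGS_MS = {
    "stub_to_recursive": 4,
    "root": 18,
    "tld": 16,
    "authoritative": 20,
    "recursive_to_client": 4,
}

def lookup_ms(skipped=()):
    """Total lookup time when the legs in `skipped` are answered from cache."""
    return sum(ms for leg, ms in LEGS_MS.items() if leg not in skipped)

cold = lookup_ms()
tld_servers_cached = lookup_ms(skipped={"root"})
auth_servers_cached = lookup_ms(skipped={"root", "tld"})
answer_cached = lookup_ms(skipped={"root", "tld", "authoritative"})

print(cold, tld_servers_cached, auth_servers_cached, answer_cached)
# 62 44 28 8
```

Each deeper cache hit removes one upstream round trip, which is why the last number is almost an order of magnitude below the first.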

4) Caching and TTL decide freshness

Every DNS record carries a TTL value. TTL means "this answer stays reusable for this many seconds." Example record: A 203.0.113.10 TTL 300. That means caches may reuse it for five minutes. The browser cache may keep it briefly. The OS cache may keep it too. The recursive resolver's cache may keep it for many users. Three cache layers can hide the same lookup cost.

A worked example with traffic makes TTL practical. Suppose one million users hit api.foodapp.com every hour. If TTL is 60 seconds, the recursive resolver may refresh roughly once per minute per location. If TTL is 3600 seconds, refreshes become far rarer, roughly once per hour, but failover becomes slower. So TTL is a tradeoff between agility and lookup load. Low TTL helps cutovers. High TTL helps stability and resolver efficiency.

Negative caching matters too. If a record does not exist, resolvers may cache that miss briefly. That avoids repeated expensive lookups for broken names.

Record types matter during debugging too.
An A record maps a name to an IPv4 address.
An AAAA record maps a name to an IPv6 address.
A CNAME points one name at another canonical name.
MX points mail delivery toward mail servers.
TXT carries metadata, verification strings, or policy hints.
NS names the authoritative servers for a zone.

If a CNAME exists, the resolver may need another lookup. That extra step adds latency and another cached record with its own TTL. So record choice affects both behavior and troubleshooting. Interview answers sound stronger when you name common record types.
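A cache that honors TTL, including cached misses, can be sketched in a few lines. This is a minimal illustration, not how any particular resolver is implemented: the `TtlCache` class, the injected clock, the name typo.foodapp.com, and the 30-second negative TTL are all assumptions chosen for the example. The clock is injectable so that expiry can be shown without sleeping.

```python
import time

class TtlCache:
    """Tiny TTL cache. A stored value of None represents a cached miss."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expires_at)

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        """Return (hit, value). hit is False once the TTL has run out."""
        entry = self._entries.get(name)
        if entry is None:
            return False, None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: the next lookup must go upstream
            return False, None
        return True, value

def refreshes_per_hour(ttl_seconds):
    """Rough upper bound on upstream refreshes per hour for one busy name."""
    return 3600 // ttl_seconds

now = [0.0]                      # fake clock so the demo is deterministic
cache = TtlCache(clock=lambda: now[0])
cache.put("api.foodapp.com", "203.0.113.10", ttl=300)
cache.put("typo.foodapp.com", None, ttl=30)   # negative caching of a miss

print(cache.get("api.foodapp.com"))    # (True, '203.0.113.10')
print(cache.get("typo.foodapp.com"))   # (True, None)
now[0] = 300.0
print(cache.get("api.foodapp.com"))    # (False, None)
print(refreshes_per_hour(60), refreshes_per_hour(3600))  # 60 1
```

The last line restates the tradeoff in numbers: a 60-second TTL means up to about 60 upstream refreshes per hour per resolver cache, while a 3600-second TTL means about one.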

5) DNS over HTTPS and failure handling

Classic DNS often uses UDP on port 53. DNS over HTTPS (DoH) wraps the question inside HTTPS. That hides the query from many middleboxes on the path. It also gives the query the same TLS encryption and server authentication that web traffic uses. But DoH does not make wrong answers magically correct. Trust still depends on the resolver you choose.

Failure handling matters because DNS is rarely perfectly clean. Maybe the recursive resolver times out. Maybe the authoritative server is slow. Maybe the answer is stale. Maybe the domain has two authoritative nameservers, and one is broken. Resolvers usually retry, switch servers, or serve cached data briefly.

An example failure path makes this concrete.
The resolver waits 2 seconds on authoritative server one, then gives up.
It asks authoritative server two 200 ms later.
Server two answers in 18 ms.
The lookup takes roughly 2.2 seconds, so the user feels a slow lookup, not a total outage. Fallbacks hide pain, but they do not remove it.

Fallback order also shapes user experience. Browsers may try IPv6 first, then IPv4 after failure. Clients may rotate among several configured recursive servers. Applications may retry after 100 ms or 500 ms. Some mobile networks intercept or rewrite plain DNS traffic. DoH can avoid some of that interference. Corporate networks may still enforce their own resolver policies. So DNS bugs often look like geography-specific mysteries. Always ask which resolver answered and from where. That question saves hours during incident response.

The post office analogy still helps here. The phone book is consulted first. Packets start visiting post offices only after an address exists.

One migration example ties everything together. Suppose api.payments.com moves from Mumbai to Hyderabad. Current TTL is 1800 seconds. Users may keep old answers for up to 30 minutes. If you lower TTL to 60 seconds one day earlier, most fresh lookups converge within one minute after the cutover. That is why planned migrations start with TTL prep, not with the final record change alone. CDN flips, database failovers, and blue-green deploys often rely on this trick. Bad TTL hygiene makes rollback painfully slow. Good TTL hygiene makes cutovers boring.

Resolver logs reveal whether the new answer is spreading. Client reports reveal which caches are still stale. Both views matter during migrations. The phone book controls speed before the post office even starts.
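The migration numbers can be checked with a short calculation. This is a sketch under one simplifying assumption: every cache honors the TTL it was given. The function name `worst_case_stale_s` and its parameters are illustrative. It answers one question: if the TTL was lowered some time before the cutover, how long can the oldest cached copy of the old address survive after the record changes?

```python
def worst_case_stale_s(old_ttl_s, prep_ttl_s, lead_time_s):
    """Longest time an old answer can outlive the cutover, in seconds.

    old_ttl_s   : TTL before any preparation (1800 in the example)
    prep_ttl_s  : lowered TTL published ahead of the cutover (60)
    lead_time_s : how long before the cutover the lower TTL was published
    """
    # A cache that fetched just before the TTL was lowered still holds
    # the old TTL, minus however much of it has already elapsed.
    leftover_old = max(0, old_ttl_s - max(0, lead_time_s))
    # A cache that fetched just before the cutover holds the prep TTL.
    return max(prep_ttl_s, leftover_old)

print(worst_case_stale_s(1800, 60, 86400))  # 60   (TTL lowered one day early)
print(worst_case_stale_s(1800, 60, 600))    # 1200 (TTL lowered only 10 minutes early)
print(worst_case_stale_s(1800, 60, 0))      # 1800 (no TTL prep at all)
```

Under this model the lower TTL only needs to be published at least one old TTL (30 minutes here) before the cutover. Lowering it a day early, as in the example, leaves a wide margin.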

Where this lives in the wild

A Cloudflare DNS engineer tunes TTLs and resolver behavior for large customer domains. Small TTL choices change failover speed during real incidents.
A Google SRE cares about recursive resolver latency for billions of daily lookups. Every saved millisecond multiplies across enormous traffic.
A Netflix edge engineer manages authoritative DNS for regional traffic steering. DNS answers can decide which region receives a user.
A Razorpay platform engineer watches certificate renewal and DNS cutovers together. A stale DNS cache can send payments toward the wrong endpoint.
A Shopify infrastructure engineer lowers TTL before moving storefront traffic. Safe cutovers depend on expected cache expiry behavior.

Pause and recall

Who performs recursive work, the browser or the recursive resolver?
What does a TLD server usually return during lookup?
Why can a high TTL help and hurt at the same time?
What privacy problem does DNS over HTTPS try to reduce?

Interview Q&A

Q1. Explain recursive and iterative DNS in one clean answer.
Say the client asks recursively for a final answer. Then say the resolver walks iteratively through hierarchy referrals.
Common wrong answer to avoid: "Root servers always return the final IP address."

Q2. What is the job of root, TLD, and authoritative servers?
Root points to TLD, TLD points to authoritative, authoritative returns records. That layered design scales the global namespace cleanly.
Common wrong answer to avoid: "Authoritative servers are just faster recursive resolvers."

Q3. How does TTL affect system behavior during deployments?
Low TTL speeds traffic movement after a record change. High TTL lowers query volume but delays full cutover.
Common wrong answer to avoid: "TTL only changes cache memory usage."

Q4. What happens when one authoritative server is down?
Resolvers retry another listed authoritative server if available. Users may see latency spikes before complete failure.
Common wrong answer to avoid: "DNS always fails immediately if one server times out."
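For Q4, the retry behavior can be sketched as a loop over the listed authoritative servers. This is a simulation, not a real resolver: the `query` callable, the name ns2.example.com, and the default 2000 ms timeout and 200 ms retry delay are assumptions, with the timings borrowed from the failure example in section 5. Times are integers in milliseconds so the totals are exact.

```python
def resolve_with_failover(servers, query, timeout_ms=2000, retry_delay_ms=200):
    """Try each authoritative server in order. Return (answer, elapsed_ms)."""
    elapsed_ms = 0
    for index, server in enumerate(servers):
        if index > 0:
            elapsed_ms += retry_delay_ms   # pause before trying the next server
        answer, took_ms = query(server, timeout_ms)
        elapsed_ms += took_ms
        if answer is not None:
            return answer, elapsed_ms
    return None, elapsed_ms                # every listed server failed

# Simulated behavior: response time in ms, or None for "never answers".
BEHAVIOR = {"ns1.example.com": None, "ns2.example.com": 18}

def fake_query(server, timeout_ms):
    """Stand-in for a network call: returns (answer_or_None, time_spent_ms)."""
    latency = BEHAVIOR[server]
    if latency is None or latency > timeout_ms:
        return None, timeout_ms            # we waited the full timeout
    return "203.0.113.10", latency

print(resolve_with_failover(["ns1.example.com", "ns2.example.com"], fake_query))
# ('203.0.113.10', 2218)
```

The answer still arrives, but after about 2.2 seconds instead of 18 ms. That gap is the latency spike users see before a complete failure.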

Apply now (5 min)

Pick a familiar domain, like www.github.com or www.flipkart.com. Write the likely hierarchy: root, TLD, authoritative, final record. Assume the A record TTL is 300 seconds. Ask yourself when every cache layer would expire that answer. Now imagine you change the IP during an outage. Estimate how long stale answers may survive.

Sketch from memory

Draw the resolver asking root, TLD, and authoritative servers. Write one referral beside each arrow. Mark where TTL is attached. Mark where cached answers can short-circuit the walk.

Bridge. Once DNS finds the destination, the next question is delivery style. Do we want guaranteed, ordered transport, or fast best-effort transport? → 03-tcp-and-udp.md