
06. Load Balancing and Routing — The toll booth that decides who goes where

~14 min read. One smart entry point can save ten drowning servers behind it.

Built on the ELI5 in 00-eli5.md. The toll booth — the single entry point that inspects, routes, and limits traffic — now becomes the guard that keeps any one server from getting crushed.


First understand what the toll booth is doing

At HLD level, a load balancer is not just a traffic splitter. It is the toll booth sitting before a pool of servers, deciding which request goes to which instance. If you remove that booth, clients must know server addresses, hot spots form quickly, and failed nodes keep getting traffic.

See. The toll booth gives you indirection. That indirection is what makes horizontal scaling practical. A basic picture looks like this:

┌─────────┐      ┌───────────────┐
│ Clients │ ───→ │ Load balancer │
└─────────┘      └───────┬───────┘
                         │
              ┌──────────┼──────────┐
              ▼          ▼          ▼
          ┌────────┐ ┌────────┐ ┌────────┐
          │ App-1  │ │ App-2  │ │ App-3  │
          └────────┘ └────────┘ └────────┘

The client sees one address. The server fleet can grow, shrink, or heal behind it.

Now what is the real HLD question? Not "Should we use a load balancer?" Almost every serious system will. The real question is how smart the toll booth should be and what signal it should use.

Layer 4 vs Layer 7: route by transport or by meaning

Layer 4 load balancing works at transport level. It looks at IPs and ports. It does not inspect HTTP path, headers, cookies, or method.

Layer 7 load balancing works at application level. It can inspect URL path, host, headers, auth clues, and even request type.

Simple rule:

  • use Layer 4 when you want fast generic transport distribution
  • use Layer 7 when routing depends on request meaning

Layer 4 is like a toll booth checking lane type only. Layer 7 is like a booth reading the destination slip on each truck.

Example. Suppose api.example.com/payments must go to the payments service, while api.example.com/search must go to the search service. That needs Layer 7 understanding.

Now another case. Suppose you only need to spread TCP connections across identical game servers. That can be Layer 4.

A common comparison:

┌────────────────────┬──────────────────────────────┐
│ Layer 4            │ Layer 7                      │
├────────────────────┼──────────────────────────────┤
│ routes by IP/port  │ routes by path, host, header │
│ lower overhead     │ richer decisions             │
│ protocol-agnostic  │ HTTP/gRPC aware              │
│ less flexible      │ more flexible                │
└────────────────────┴──────────────────────────────┘

So what to do? If the backend pool is homogeneous, Layer 4 may be enough. If traffic needs content-aware routing, auth-aware routing, or canary release logic, Layer 7 usually wins.
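To make the Layer 4 versus Layer 7 difference concrete, here is a minimal Python sketch. It is an illustration under assumptions, not how any real balancer is built: the pool names, the Connection shape, and the toy hash are all hypothetical. The point is only what each function is allowed to look at.

```python
from dataclasses import dataclass

# Hypothetical pool names, used only for illustration.
GAME_POOL = ["game-1", "game-2", "game-3"]
L7_ROUTES = {"/payments": "payments-pool", "/search": "search-pool"}


@dataclass(frozen=True)
class Connection:
    """What a Layer 4 balancer can see: addresses and ports only."""
    src_ip: str
    src_port: int
    dst_port: int


def route_l4(conn: Connection) -> str:
    """Pick a backend from the connection tuple alone (no HTTP knowledge)."""
    # Toy deterministic hash: sum of the IP octets plus the source port.
    octets = sum(int(part) for part in conn.src_ip.split("."))
    return GAME_POOL[(octets + conn.src_port) % len(GAME_POOL)]


def route_l7(host: str, path: str) -> str:
    """Pick a pool by host and longest matching path prefix."""
    if host != "api.example.com":
        return "default-pool"
    matches = [p for p in L7_ROUTES if path == p or path.startswith(p + "/")]
    return L7_ROUTES[max(matches, key=len)] if matches else "default-pool"


print(route_l4(Connection("10.0.0.1", 5000, 7777)))
print(route_l7("api.example.com", "/payments/charge"))
print(route_l7("api.example.com", "/search"))
```

Notice that route_l4 could not separate /payments from /search even if it wanted to, because the path never reaches it.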

Core algorithms: round-robin, least-connections, consistent hashing

Now we come to the decision rules inside the booth.

Round-robin

Send request 1 to server A, request 2 to B, request 3 to C, then repeat. This is simple and often good enough when servers are identical and requests are short.

Worked example. Suppose incoming traffic is 24,000 requests per second and you have 6 equal servers.

24,000 ÷ 6 = 4,000 requests per second per server on average.

That looks neat. But now what is the hidden issue? If one request takes 5 ms and another takes 500 ms, equal request count does not mean equal load.
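The hidden issue can be shown with a small Python sketch. The request costs below are a hypothetical and deliberately unlucky pattern (every third request is slow), chosen so the effect is easy to see. Real traffic will be less extreme.

```python
from itertools import cycle


def round_robin(servers, costs_ms):
    """Assign each request to the next server in rotation, ignoring its cost."""
    rotation = cycle(servers)
    count = {s: 0 for s in servers}
    busy_ms = {s: 0 for s in servers}
    for cost in costs_ms:
        target = next(rotation)
        count[target] += 1        # what round-robin equalizes
        busy_ms[target] += cost   # what the server actually feels
    return count, busy_ms


# Hypothetical costs: two 5 ms requests, then one 500 ms request, repeated.
count, busy_ms = round_robin(["A", "B", "C"], [5, 5, 500] * 4)
print(count)    # {'A': 4, 'B': 4, 'C': 4}
print(busy_ms)  # {'A': 20, 'B': 20, 'C': 2000}
```

Every server got 4 requests, yet server C did 100 times the work of A or B.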

Least-connections

Send new traffic to the server with the fewest active connections. This helps when connection duration varies a lot, such as WebSockets, long polling, or slow uploads. Example with numbers. Assume three servers currently have 800, 200, and 150 active connections. A least-connections balancer sends the next connection to the third server, not the first one. That sounds obvious, but it is much smarter than plain round-robin for long-lived traffic.
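The selection rule itself is tiny. A minimal sketch, assuming the balancer already keeps an accurate count of open connections per server (the server names are hypothetical):

```python
def pick_least_connections(active: dict[str, int]) -> str:
    """Return the server with the fewest active connections.

    Ties go to whichever server appears first in the dict.
    """
    return min(active, key=active.get)


# Connection counts from the example above.
active = {"server-1": 800, "server-2": 200, "server-3": 150}

target = pick_least_connections(active)
active[target] += 1  # the balancer must also decrement when a connection closes

print(target)  # server-3
```

The hard part in practice is not the min() call but keeping the counts correct as connections open and close.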

Consistent hashing

Route traffic based on a stable key, such as user ID, session ID, or cache key. This is useful when the destination should stay sticky or when backend state is partitioned.

Now the common comparison. Naive modulo hashing uses user_id % N. Suppose you have 5 cache nodes and 10 million keys.

Average per node = 10,000,000 ÷ 5 = 2,000,000 keys.

Add one more node. N becomes 6.

With modulo hashing, most keys remap because the divisor changed. For evenly spread IDs, a key keeps its node only when user_id % 5 equals user_id % 6. That holds for 5 out of every 30 consecutive IDs, so about 5 ÷ 6 = 83.3% of keys move, roughly 8,330,000 keys.

With consistent hashing, the expected moved fraction is about 1 divided by the new node count.

1 ÷ 6 = 16.7%.
Moved keys = 10,000,000 × 0.167 = about 1,670,000 keys.

See the win. You move roughly one-sixth of the keys, not five-sixths of them. This is why stateful caches and sticky-session designs love consistent hashing.
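You can check the 5-node to 6-node comparison with a small experiment. This is a sketch under assumptions: a hash ring with 100 virtual nodes per server is one common way to build consistent hashing, not the only one. The node names, key format, and vnode count are hypothetical, and MD5 is used only as a stable hash, not for security. Measured fractions will land near the expected values, not exactly on them.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Place every node on the ring at `vnodes` pseudo-random points.
        self._points = sorted(
            (h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._hashes = [point[0] for point in self._points]

    def owner(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect_right(self._hashes, h(key)) % len(self._points)
        return self._points[idx][1]


def moved_fraction(keys, before, after) -> float:
    """Share of keys whose owner differs between two placement functions."""
    return sum(before(k) != after(k) for k in keys) / len(keys)


keys = [f"user:{i}" for i in range(20_000)]
old_nodes = [f"cache-{i}" for i in range(5)]
new_nodes = old_nodes + ["cache-5"]

modulo_moved = moved_fraction(
    keys,
    lambda k: h(k) % len(old_nodes),
    lambda k: h(k) % len(new_nodes),
)
old_ring, new_ring = HashRing(old_nodes), HashRing(new_nodes)
ring_moved = moved_fraction(keys, old_ring.owner, new_ring.owner)

print(f"modulo moved:     {modulo_moved:.1%}")  # expected near 5/6
print(f"consistent moved: {ring_moved:.1%}")    # expected near 1/6
```

One more property worth noticing: on the ring, every key that moves lands on the new node. Existing nodes never trade keys with each other, which is what keeps their caches warm.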

Routing strategies at HLD level

Load balancing is not only one algorithm choice. It is also a routing strategy decision.

1) One pool, identical workers

All instances serve the same API and can take any request. Round-robin or least-connections is fine.

2) Path-based routing

/search goes to search fleet. /checkout goes to payments fleet. This is usually Layer 7.

3) Weighted routing

Send 90% of traffic to stable version and 10% to canary version. This helps during gradual rollout.
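A minimal sketch of the 90/10 split, assuming a simple per-request weighted random pick. The version names and the fixed seed are illustrative only. Real balancers often use deterministic weighted schemes instead, so treat this as one possible approach.

```python
import random

WEIGHTS = {"stable": 90, "canary": 10}  # percentages from the text


def pick_version(rng: random.Random) -> str:
    """Choose a version at random, in proportion to its weight."""
    versions = list(WEIGHTS)
    return rng.choices(versions, weights=[WEIGHTS[v] for v in versions], k=1)[0]


rng = random.Random(42)  # fixed seed so the sketch is repeatable
picks = [pick_version(rng) for _ in range(10_000)]
canary_share = picks.count("canary") / len(picks)

print(f"canary share: {canary_share:.1%}")  # close to 10%, not exactly 10%
```

Because the pick is per request, one user may hit both versions. If that matters, combine weights with a sticky key.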

4) Sticky routing

Keep one user's requests on the same backend using a session key. This can reduce cache misses, but it can also create hot spots if one user is huge.

5) Geography-aware routing

Send users to the nearest healthy region. This reduces latency and helps with disaster boundaries.

A practical HLD diagram may look like this:

┌─────────┐  HTTPS   ┌──────────────┐
│ Clients │ ───────→ │ L7 balancer  │
└─────────┘          └──────┬───────┘
                            │
           ┌────────────────┼────────────────┐
           ▼                ▼                ▼
     /search fleet   /checkout fleet  /profile fleet

Inside each fleet, you may still run another internal toll booth. That second layer spreads traffic across service instances in that domain.

Health checks: do not keep sending cars to a broken lane

A load balancer is only useful if it knows who is healthy. That is where health checks enter. Active health checks ask servers, "Are you alive?" Passive health checks observe real failures and reduce traffic automatically.

Good health design checks more than "process exists." A server can accept TCP and still be useless because database access is dead, memory is exhausted, or thread pools are stuck.

So what should a health check test?

  • can the process respond quickly?
  • can it reach required dependencies?
  • is latency still below a safe threshold?
  • should this server receive new traffic or only drain old traffic?

Worked failure example. Suppose you have 8 app servers behind one toll booth and total demand is 80,000 RPS.

Normal share per server = 80,000 ÷ 8 = 10,000 RPS.

Now one server starts timing out. If health checks miss it, that server still receives about 10,000 RPS and users see avoidable errors. If health checks remove it, traffic redistributes across 7 servers.

80,000 ÷ 7 = about 11,429 RPS per server.

That is a 14.29% increase.

So the architecture question is not only "Can we detect failure?" It is also, "Can the remaining fleet absorb the rebalanced load?" This is where capacity planning meets routing.
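The redistribution math is easy to turn into a headroom check. A minimal sketch: the 12,000 and 11,000 RPS safe limits are hypothetical numbers added for illustration, since the text does not state a per-server capacity.

```python
def per_server_rps(total_rps: float, healthy_servers: int) -> float:
    """Average load per server when traffic is spread evenly."""
    if healthy_servers <= 0:
        raise ValueError("no healthy servers left")
    return total_rps / healthy_servers


def survives_failure(total_rps, servers, safe_rps_per_server, failures=1) -> bool:
    """True if the remaining servers stay at or below their safe limit."""
    return per_server_rps(total_rps, servers - failures) <= safe_rps_per_server


before = per_server_rps(80_000, 8)      # 10,000 RPS
after = per_server_rps(80_000, 7)       # about 11,429 RPS
increase = (after - before) / before    # 1/7, about 14.29%

print(f"{before:.0f} -> {after:.0f} RPS (+{increase:.2%})")
print(survives_failure(80_000, 8, safe_rps_per_server=12_000))  # True
print(survives_failure(80_000, 8, safe_rps_per_server=11_000))  # False
```

With a 12,000 RPS limit the fleet absorbs one failure. With 11,000 it does not, even though failure detection worked perfectly.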

The load balancer itself can become the bottleneck

Now what is the irony? The toll booth that prevents one server from drowning can itself drown. So you design the balancer tier with redundancy too.

Common protections:

  • multiple balancer instances or managed balancer nodes
  • health checks on the balancers themselves
  • anycast or DNS-based failover across regions
  • connection limits and rate limiting at the edge
  • autoscaling for the backend pools, not just the front door

Another worked example. Suppose one balancer instance can safely terminate 40,000 TLS handshakes per second. Traffic forecast says peak is 90,000 new TLS handshakes per second.

Step 1: required balancers without safety buffer. 90,000 ÷ 40,000 = 2.25, so you need at least 3 instances.

Step 2: add N+1 thinking. If one balancer dies, 2 must still carry the load. 90,000 ÷ 2 = 45,000 each. That exceeds 40,000 safe capacity. So three is not enough.

Step 3: try 4 instances. If one dies, 90,000 ÷ 3 = 30,000 per instance. Now you have headroom.

See. A resilient front door is capacity math plus failure math.
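The three steps collapse into one small formula: size for the peak, then add the number of failures you want to tolerate. A minimal sketch, where the function name and the spare parameter are illustrative:

```python
def balancers_needed(peak_per_sec: int, safe_per_instance: int, spare: int = 1) -> int:
    """Instances needed so the peak still fits after `spare` instances fail."""
    # Integer ceiling division avoids floating-point rounding surprises.
    base = -(-peak_per_sec // safe_per_instance)
    return base + spare


print(balancers_needed(90_000, 40_000, spare=0))  # 3: fits the peak, no failure budget
print(balancers_needed(90_000, 40_000))           # 4: survives one balancer failure
```

With 4 instances and one failure, each survivor handles 90,000 ÷ 3 = 30,000 handshakes per second, which matches Step 3.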

Choosing the right strategy for the workload

  • Use round-robin for simple stateless request pools.
  • Use least-connections for uneven request duration or long-lived sessions.
  • Use consistent hashing when destination stability matters, especially for caches, partitions, or sticky state.
  • Use Layer 7 when routing logic depends on path, host, headers, or progressive delivery rules.
  • Use Layer 4 when you need fast generic transport balancing across similar backends.

The wrong pattern usually shows up in one of three ways:

  1. one server runs hot while others look idle
  2. failed nodes still receive traffic
  3. adding a node causes huge reshuffling or cache misses

When you see these symptoms, inspect the toll booth logic before blaming the app code.


Where this lives in the wild

  • Cloudflare — edge routing distributes user traffic to nearby healthy locations and applies Layer 7 rules, rate limits, and DDoS protections before origin systems are touched.
  • AWS Application Load Balancer — path-based and host-based routing sends /api, /checkout, and /images traffic to different target groups with health checks and weighted rollouts.
  • Google Maglev — consistent hashing and efficient connection distribution keep traffic stable even as backend servers join or leave large service pools.
  • Netflix Zuul and edge gateways — front-door routing, resilience controls, and service-aware policies prevent single backend clusters from absorbing traffic blindly.
  • NGINX in high-traffic deployments like GitHub Enterprise and many SaaS platforms — least-connections, health checks, and Layer 7 routing are used to spread HTTP load across fleets of app servers.

Pause and recall

  1. Why is a load balancer more than just a simple traffic splitter?
  2. When would least-connections beat round-robin clearly?
  3. Why does consistent hashing move far less data than modulo hashing when nodes change?
  4. What two questions must you ask after a failed node is removed from rotation?

Interview Q&A

Q: Why choose Layer 7 load balancing instead of Layer 4 for a microservice API gateway?
A: Because routing decisions may depend on URL path, host, headers, auth state, or canary weights. Layer 7 understands application meaning, so it can send /search and /payments to different pools and apply richer policies.
Common wrong answer to avoid: "Layer 7 is always faster." It is usually more flexible, not cheaper in overhead.

Q: Why use least-connections instead of round-robin for WebSocket-heavy traffic?
A: Because long-lived connections make equal request counts misleading. Least-connections steers new sessions toward less-burdened servers, while round-robin can keep feeding already busy nodes.
Common wrong answer to avoid: "Round-robin is fair, so it is always balanced." Fairness by count is not fairness by actual load.

Q: Why choose consistent hashing for a distributed cache and not plain modulo hashing?
A: Because cache nodes change over time. With consistent hashing, adding or removing a node remaps only a small fraction of keys, so cache warmth and backend stability are preserved much better.
Common wrong answer to avoid: "Consistent hashing guarantees perfect balance." It reduces remapping pain, but real balance still depends on key distribution and virtual nodes.

Q: Why are health checks and capacity planning inseparable in load balancing?
A: Because removing an unhealthy node immediately increases load on the remaining fleet. A system is not truly healthy if it can detect failure but cannot survive the redistribution that follows.
Common wrong answer to avoid: "Health checks solve failures automatically." They only stop bad routing; the surviving servers still need enough headroom.


Apply now (5 min)

Imagine an API fleet with 5 identical app servers handling 25,000 RPS. First compute the average RPS per server. Then remove one server and recompute the new average. Now ask whether round-robin is still enough if one endpoint keeps long-lived connections open for 30 seconds. Write one sentence for the algorithm you would switch to and why.

Next, assume a cache cluster has 12 million keys across 4 nodes. Add a fifth node. Using the consistent-hashing shortcut, what fraction of keys move, and about how many keys is that? Show every step.

Sketch from memory: draw one toll booth, three backend servers, and labels for one routing rule, one health check, and one failover action. Do not peek back.


Bridge. Traffic is spread. But every request still hits the database, and that becomes the next hot path. → 07-caching-at-system-level.md