09. Scaling Dimensions — Bigger machine or more machines?¶

~18 min read. Growth is not one problem. Reads, writes, and state each push the system in a different direction.

Built on the ELI5 in 00-eli5.md. The kitchen can buy one giant stove, or add more prep stations — but the house rules decide when one machine stops being the sensible answer.

1) First picture: vertical and horizontal are different bets¶

See. Scaling is not magic. It is capacity plus coordination. Vertical scaling means a bigger machine. More CPU. More RAM. Faster disk. Horizontal scaling means more machines. More copies of the same service. More coordination too.

vertical scaling                    horizontal scaling
┌──────────────┐                    ┌──────────┐  ┌──────────┐  ┌──────────┐
│ one bigger   │                    │ node A   │  │ node B   │  │ node C   │
│ machine      │                    └────┬─────┘  └────┬─────┘  └────┬─────┘
└──────┬───────┘                         │             │             │
       │                                 └──────┬──────┴──────┬─────┘
       └── bigger box                           │ load balancer│
                                                └──────────────┘

Look. Vertical scaling is simpler first. No distributed coordination. No request routing changes. No cross-node state issues. That is why teams start there. Now what is the problem? The cost curve bends upward. A 2x bigger machine often costs more than 2x. There is also a ceiling. You cannot buy infinite single-box capacity. Horizontal scaling removes that ceiling for stateless tiers. But it adds failure modes. Retries. Load balancing. Uneven traffic. Cross-node debugging. Simple, no? So the question is not, "Which is better?" The question is, "Which limit is hurting us now?"

2) Horizontal scaling requires stateless services¶

This is the part candidates skip. You cannot spread traffic well if each node hides user state in memory. That is the session affinity trap.

bad:
client ──→ LB ──→ app-3 holds session in RAM
                     │
next request ────────┘ must return to app-3

good:
client ──→ LB ──→ any app node
                    │
                    └── session in Redis / DB / token

Why is sticky routing an anti-pattern? Because it couples scale to past placement. If app-3 dies, its sessions vanish or reconnect painfully. If app-3 gets hot, the load balancer cannot freely spread traffic. If you deploy a new version, draining gets harder. So what to do? Move per-user state out of process memory. Use signed tokens for lightweight session state. Use Redis or a database for shared session state when needed. Keep the service logic stateless. Then any node can serve the next order ticket. That is how auto-scaling becomes real. That is how rolling deploys become ordinary. That is how one hot node stops ruining the whole restaurant.

3) Database scaling usually happens in stages¶

Application servers are easier to scale than databases. So we scale the database more carefully. Stage one is vertical scaling. A bigger primary often buys time. Stage two is read replicas. Stage three is sharding.

writes ──→ primary DB
             │
reads ──→ replicas A, B, C

Read replicas help when reads dominate. Why? Because one node still owns writes. That keeps write correctness simpler. The replicas answer read traffic. Now what is the problem? Replica lag. A user writes a comment. Then refreshes immediately. Replica may not have it yet. So read-after-write paths may need primary reads or sticky freshness logic. Simple, no? Read replicas are great when: - reads are much higher than writes - some staleness is acceptable - write throughput still fits one primary Read replicas do not solve: - primary write bottlenecks - huge single-table growth forever - cross-region write contention When those become painful, sharding arrives. Not before. Read replicas are usually cheaper than sharding. Operationally and mentally.

4) Sharding splits the write path itself¶

Sharding means different subsets of data live on different primaries. Now the write load is spread too. There are three common schemes.

Range sharding¶

Users 1-1M on shard A. Users 1M-2M on shard B. Easy for range queries. Risky for hotspots.

Hash sharding¶

hash(user_id) % N picks the shard. Good distribution. Poor for range scans.

Directory-based sharding¶

A lookup service tells you where tenant X lives. Flexible. Operationally heavier. Look at the comparison.

┌───────────────┬────────────────────┬─────────────────────────┐
│ method        │ good at            │ main pain               │
├───────────────┼────────────────────┼─────────────────────────┤
│ range         │ ordered scans      │ hotspot ranges          │
│ hash          │ even distribution  │ cross-shard queries     │
│ directory     │ flexible placement │ routing metadata        │
└───────────────┴────────────────────┴─────────────────────────┘

Now what is the problem? Sharding pushes complexity upward. Cross-shard joins become expensive. Global unique ids matter more. Rebalancing data is harder. Hot tenants may still dominate one shard. So what to do? Shard on the axis your hottest queries already use. If most requests are by tenant_id, that is a strong candidate. If most requests are by user_id, start there. Do not shard by a field nobody queries. That only creates pain without locality.

5) Worked example: choosing the next scaling move¶

Suppose an API tier now handles 4,000 requests per second. Traffic is expected to reach 20,000 requests per second. Today one app node handles 2,000 requests per second safely. So current app fleet need is: 4,000 ÷ 2,000 = 2 nodes. At target traffic, app fleet need is: 20,000 ÷ 2,000 = 10 nodes. That sounds easy. Just add eight more nodes. But only if the app is stateless. Good. Now the database. Ninety percent of traffic is reads. At 20,000 total requests per second: reads = 20,000 × 0.9 = 18,000 reads/sec. writes = 20,000 × 0.1 = 2,000 writes/sec. Assume one primary can handle: - 3,000 writes/sec safely - 4,000 reads/sec if it also takes writes We should not send all reads to the primary. If one replica can handle 4,000 reads/sec, replicas needed are: 18,000 ÷ 4,000 = 4.5. Round up. We need 5 replicas for read traffic. Simple, no? So the first scaling plan is: - stateless app tier with 10 nodes behind a load balancer - one primary database for writes - five read replicas for reads What if next quarter writes jump to 8,000/sec? Read replicas do nothing for that. Now the primary is the bottleneck. That is when vertical scaling or sharding enters. Suppose the biggest safe primary gives 5,000 writes/sec. Still not enough. So we shard. If we shard by tenant_id across 4 primaries, write capacity becomes roughly 4 × 5,000 = 20,000 writes/sec. That gives headroom. See the sequence? Stateless app nodes first. Read replicas next for read-heavy pain. Sharding only when write capacity or data size forces it. That order saves teams from premature pain.

Where this lives in the wild¶

Netflix API tier — platform engineer keeps request handling stateless so auto-scaling can add or remove instances during traffic surges.
GitHub repository pages — database engineer lean on read replicas so heavy page loads do not push every read onto the write primary.
Slack workspace platform — backend engineer eventually needs tenant-aware routing because one shared cluster cannot hold every team's growth forever.
Uber trip storage — infra engineer partitions data by region or trip identity so one city's spike does not consume one database node alone.
Discord session-serving edge — platform engineer avoids sticky session dependence so connections can rebalance and recover faster during failures.

Pause and recall¶

Why is vertical scaling often the first move even though it does not scale forever?
Why does horizontal scaling require stateless services?
When do read replicas help, and what problem do they not solve?
Why can hash sharding distribute load better while making some queries worse?

Interview Q&A¶

Q: Why choose horizontal scaling instead of only buying a bigger machine forever? A: Bigger boxes are simple early, but the cost curve rises and hardware ceilings arrive. Horizontal scaling is how stateless tiers keep growing past one machine.

Common wrong answer to avoid: "Because horizontal is cheaper" — sometimes it is not. The real issue is ceiling, resilience, and growth shape.

Q: Why insist on stateless services instead of using session affinity? A: Stateless nodes let any instance serve the next request. Sticky sessions trap load, complicate failover, and make elastic scaling much less effective.

Common wrong answer to avoid: "Because sessions are bad" — sessions are fine. Hiding them inside one app node is the problem.

Q: Why add read replicas before jumping to sharding in many systems? A: Because reads often dominate first, and replicas preserve a single write authority. Sharding is much more invasive operationally.

Common wrong answer to avoid: "Because sharding is only for huge companies" — sharding is for huge write or size pressure, regardless of company logo.

Q: Why choose hash sharding instead of range sharding for some user-centric workloads? A: Hashing spreads load more evenly when adjacent ids would otherwise create hotspots. You trade away clean range locality for balance.

Common wrong answer to avoid: "Because hash sharding is always better" — if your main queries are ordered scans, range may still be the right trade.¶

Apply now (5 min)¶

Exercise: Take one service you know. Write current total QPS, read percentage, and write percentage. Then compute app-node count and replica count for 5x traffic. State whether statelessness is already true. Sketch from memory: Draw load balancer, app nodes, primary DB, and replicas. Then draw a second version with four shards. Mark where session state lives.

Bridge. We scaled across multiple machines. Data lives on multiple nodes now. But what happens during a network partition? Which nodes are right? That is the CAP question. → 10-cap-theorem-reality.md