04. Single Box to Breakdown — One server works, until physics objects

~15 min read. Before distributed systems, learn how one machine fails honestly.

Built on the ELI5 in 00-eli5.md. The kitchen — the compute that does the real work — starts as one box, and then hits hard limits.

1) Start with one server on purpose

Many candidates jump too quickly to distributed design. Load balancers. Shards. Queues. Replicas. Microservices. Look. That is backwards. A senior design usually starts with one machine. Why? Because one box teaches you the workload. It also gives you a baseline. If one box can handle the traffic, stop there. If one box fails, ask exactly why. That is how clean scaling stories begin. The mental model is simple.

┌───────────────────────────┐
│        one server         │
├──────────┬────────────────┤
│ CPU      │ does compute   │
├──────────┼────────────────┤
│ Memory   │ holds hot data │
├──────────┼────────────────┤
│ Disk     │ stores state   │
├──────────┼────────────────┤
│ Network  │ moves bytes    │
└──────────┴────────────────┘

Almost every single-server bottleneck is one of these four. Simple, no? When a machine feels slow, ask which resource is saturated. Not which trendy technology is missing. A server does not fail because it is lonely. It fails because one resource runs out before the others.

2) CPU saturation: the box is busy thinking

CPU is consumed by computation. Request parsing. Business logic. Encryption. Compression. Serialization. Ranking. Image transforms. Anything compute-heavy burns CPU.

Now what is the problem? CPU bottlenecks are easy to miss when latency still looks acceptable at low load. Then traffic climbs. Context switching rises. Queueing inside the process rises. P99 latency jumps.

The signs of CPU saturation are familiar. High utilization. Higher response times without larger payloads. Throughput flattening even though more requests arrive. You may also see one hot endpoint dominating cycles.

Example. Suppose one request takes 20 ms of CPU time. One CPU core can provide about 1,000 ms of CPU time per second.

Requests per second per core ≈ 1,000 ÷ 20 = 50
With 8 cores, rough upper bound ≈ 8 × 50 = 400 requests per second

That is the compute ceiling for that endpoint. If traffic grows to 800 requests per second, the box does not politely adapt. It queues. Latency stretches. Timeouts appear.

See. This is why CPU math matters. Not because interviews love formulas. Because formulas explain the break.
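The CPU arithmetic above can be sketched in a few lines. This is a back-of-envelope helper, not a capacity planner: the function name `cpu_ceiling_rps` is made up for illustration, and it assumes every request costs the same CPU time and that cores scale perfectly.

```python
def cpu_ceiling_rps(cpu_ms_per_request: float, cores: int) -> float:
    """Rough upper bound on requests/second when CPU is the only limit."""
    # One core supplies roughly 1,000 ms of CPU time per wall-clock second.
    per_core_rps = 1000.0 / cpu_ms_per_request
    return per_core_rps * cores


ceiling = cpu_ceiling_rps(cpu_ms_per_request=20, cores=8)
offered_rps = 800  # traffic level from the example above

print(f"ceiling: {ceiling:.0f} req/s")                               # ceiling: 400 req/s
print(f"offered load is {offered_rps / ceiling:.1f}x the ceiling")  # offered load is 2.0x the ceiling
```

Any ratio above 1.0 means requests arrive faster than the cores can finish them, so the excess can only queue.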

3) Memory exhaustion: the box forgets too much

Memory holds what the server needs to reach quickly. Active processes. Connection state. In-memory caches. Working sets. Buffers.

Now imagine the hot data no longer fits. Then the system starts doing extra work. Maybe cache hit rate collapses. Maybe garbage collection pauses increase. Maybe the operating system swaps. That is ugly. Disk is far slower than memory. So a memory problem often becomes a latency problem.

Suppose your app keeps 200 KB of active session data per connected client. And you support 100,000 concurrent clients.

Required memory = 200 KB × 100,000 = 20,000,000 KB ≈ 20,000 MB ≈ 20 GB

Now add application overhead. Now add cache. Now add headroom. A 32 GB server starts looking much smaller.

Look. Memory is not only about capacity. It is also about fit. If the hot working set fits in RAM, life is good. If it spills to disk or remote fetches, your kitchen slows sharply. That is why caching and memory budgets are always linked.
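The session-memory estimate above, as a minimal sketch. It assumes decimal units (1 GB = 1,000,000 KB), matching the rounding in the text, and `session_memory_gb` is a hypothetical helper name.

```python
KB_PER_GB = 1_000_000  # decimal units, matching the rounding used in the text


def session_memory_gb(kb_per_client: float, clients: int) -> float:
    """Memory needed for per-client session state, in GB."""
    return kb_per_client * clients / KB_PER_GB


sessions_gb = session_memory_gb(kb_per_client=200, clients=100_000)
server_ram_gb = 32

print(f"session state: {sessions_gb:.0f} GB")  # session state: 20 GB
# Whatever remains must cover the runtime, caches, the OS, and headroom.
print(f"left for everything else: {server_ram_gb - sessions_gb:.0f} GB")  # left for everything else: 12 GB
```

Seeing the remainder as a number makes the "32 GB looks smaller" point concrete: session state alone takes most of the box.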

4) Disk I/O: the box can think, but cannot fetch fast enough

Disk bottlenecks happen when persistent reads or writes pile up. Databases do this. Log-heavy systems do this. Analytics boxes do this. Image or video processing pipelines do this.

People often think only about disk capacity. Wrong focus. Disk usually hurts first on IOPS or throughput. Many small random reads can kill latency. Many fsync-heavy writes can kill throughput.

Suppose your service writes 5 KB to disk per request. Traffic is 10,000 writes per second.

Write volume per second = 5 KB × 10,000 = 50,000 KB per second ≈ 50 MB per second

That may sound manageable. But if each write forces sync and metadata updates, IOPS may become the real limit.

So what to do? Separate the two questions. How many bytes per second? How many operations per second? A disk can handle big sequential writes far better than tiny scattered ones. Same storage medium. Very different behavior.
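The two questions above (bytes per second and operations per second) can be kept apart in a small sketch. The helper `disk_write_demand` and its `device_ops_per_write` parameter are assumptions for illustration; the value 2 used below is an invented multiplier standing in for sync and metadata cost, not a measured figure.

```python
def disk_write_demand(kb_per_write: float, writes_per_sec: int,
                      device_ops_per_write: int = 1) -> tuple[float, int]:
    """Return (MB per second, device operations per second) for a write load."""
    mb_per_sec = kb_per_write * writes_per_sec / 1000  # decimal MB, as in the text
    ops_per_sec = writes_per_sec * device_ops_per_write
    return mb_per_sec, ops_per_sec


# Numbers from the example above: 5 KB per write, 10,000 writes per second.
mb, ops = disk_write_demand(kb_per_write=5, writes_per_sec=10_000)
print(f"{mb:.0f} MB/s, {ops} ops/s")  # 50 MB/s, 10000 ops/s

# Hypothetical: each write also triggers one extra device operation.
mb, ops = disk_write_demand(5, 10_000, device_ops_per_write=2)
print(f"{mb:.0f} MB/s, {ops} ops/s")  # 50 MB/s, 20000 ops/s
```

The byte rate stays at 50 MB per second in both cases while the operation count doubles, which is why the two numbers need separate budgets.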

5) Network bandwidth: the box spends its life moving bytes

If CPU computes and disk persists, network delivers. Every request and every response consumes bandwidth. Replication consumes bandwidth. Client uploads consume bandwidth. Inter-service calls consume bandwidth. CDN misses consume bandwidth.

Network bottlenecks are sneaky. The server may have spare CPU and memory. Still, large responses can saturate the link.

Example. Assume peak traffic is 6,000 responses per second. Each response is 100 KB.

Outbound bandwidth = 6,000 × 100 KB = 600,000 KB per second ≈ 600 MB per second ≈ 4.8 Gbps

A 1 Gbps NIC cannot carry that. A 10 Gbps NIC might. Same app logic. Different network ceiling.

See how a single server breaks for totally different reasons? That is why the four-resource model is powerful. You do not say, "The server is slow." You say, "Network is saturated on large feed responses." That is actionable.
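The bandwidth conversion above, sketched as code. It assumes decimal units (1 KB = 1,000 bytes) and ignores protocol overhead such as TCP and TLS framing, so real demand would be somewhat higher; `outbound_gbps` is a hypothetical helper name.

```python
def outbound_gbps(responses_per_sec: int, response_kb: float) -> float:
    """Outbound payload bandwidth in gigabits per second."""
    bits_per_sec = responses_per_sec * response_kb * 1000 * 8  # KB -> bytes -> bits
    return bits_per_sec / 1e9


demand = outbound_gbps(responses_per_sec=6_000, response_kb=100)
print(f"demand: {demand:.1f} Gbps")  # demand: 4.8 Gbps

for nic_gbps in (1, 10):
    verdict = "fits" if demand <= nic_gbps else "saturated"
    print(f"{nic_gbps} Gbps NIC: {verdict}")
# 1 Gbps NIC: saturated
# 10 Gbps NIC: fits
```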

6) Worked example: when one web app box breaks

Assume we run a simple web app on one server.

Specs:
- 8 CPU cores
- 32 GB RAM
- 1 TB SSD
- 1 Gbps network link

Traffic assumptions:
- 2,000 requests per second at peak
- Average response size = 60 KB
- Each request needs 8 ms of CPU time
- Working set = 18 GB
- Disk writes = 2 KB of logs and data per request

Let us test each resource.

CPU check

Per core capacity ≈ 1,000 ms ÷ 8 ms = 125 requests per second per core
With 8 cores, upper bound ≈ 8 × 125 = 1,000 requests per second

But peak load is 2,000 requests per second. CPU is already at twice its ceiling.

Memory check

Working set is 18 GB. Server RAM is 32 GB. That leaves about 14 GB for the app runtime, OS, page cache, and connection buffers. Still likely okay. Memory is not first to fail.

Disk check

Disk write rate = 2 KB × 2,000 = 4,000 KB per second ≈ 4 MB per second

That is fine for throughput. At 2,000 writes per second, IOPS is likely acceptable on an SSD too. Disk is not first.

Network check

Outbound traffic = 60 KB × 2,000 = 120,000 KB per second ≈ 120 MB per second ≈ 0.96 Gbps

That is almost the full 1 Gbps link. So network is also near the edge.

Conclusion. This one box fails first on CPU. Network is second: it has almost no headroom left, so even a small rise in traffic saturates the link too.

Now vertical scaling might help. A larger box with 16 cores and 10 Gbps NIC may extend life. But it does not remove the pattern. One machine still has ceilings.
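The checks in this worked example can be sketched as utilization ratios, where anything above 100% is already broken. The `Box` and `Workload` types are hypothetical names. Disk is left out because the example gives no disk throughput or IOPS limit to divide by, and the memory ratio counts only the working set, before runtime and OS overhead.

```python
from dataclasses import dataclass


@dataclass
class Box:
    cores: int
    ram_gb: float
    nic_gbps: float


@dataclass
class Workload:
    peak_rps: float
    cpu_ms_per_request: float
    response_kb: float
    working_set_gb: float


def utilization(box: Box, load: Workload) -> dict[str, float]:
    """Demand divided by capacity for each resource (1.0 means saturated)."""
    cpu = load.peak_rps * load.cpu_ms_per_request / (box.cores * 1000)
    memory = load.working_set_gb / box.ram_gb
    # KB/s * 8 gives kilobits/s; 1 Gbps is 1,000,000 kilobits/s (decimal units).
    network = load.peak_rps * load.response_kb * 8 / (box.nic_gbps * 1_000_000)
    return {"cpu": cpu, "memory": memory, "network": network}


ratios = utilization(
    Box(cores=8, ram_gb=32, nic_gbps=1),
    Workload(peak_rps=2_000, cpu_ms_per_request=8, response_kb=60, working_set_gb=18),
)

# Most saturated resource first: cpu 200%, network 96%, memory 56%.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.0%}")
```

Sorting by ratio reproduces the conclusion: CPU first, network close behind, memory with room to spare.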

7) Vertical scaling helps, but only for a while

Vertical scaling means buying a bigger box. More cores. More RAM. Faster SSD. Fatter NIC. This is often the right first move. It is simple operationally. No distributed coordination yet. No partitioning logic yet. No cache invalidation gymnastics yet.

Look. Vertical scaling is not bad. It is just finite. Costs rise non-linearly. Hardware sizes top out. A single box is still one failure domain. Maintenance is still disruptive. One giant database host can still become the critical risk.

So what to do? Use vertical scaling to buy time. Use the time to measure the real bottleneck. Then split only the constrained part. Maybe add more stateless app servers. Maybe move static assets to object storage and CDN. Maybe separate read replicas. Maybe isolate background jobs. But start by knowing why one box broke. Simple, no?
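To see how much time a bigger box buys, here is a rough sketch that reuses the worked example's per-request costs (8 ms of CPU, 60 KB per response) and compares the original 8-core, 1 Gbps box with the 16-core, 10 Gbps box mentioned there. The helper `ceiling_rps` is hypothetical and considers only CPU and network, under the same idealized assumptions as the arithmetic above.

```python
def ceiling_rps(cores: int, nic_gbps: float,
                cpu_ms_per_request: float, response_kb: float) -> tuple[str, float]:
    """Return (binding resource, max requests/second) for CPU and network only."""
    limits = {
        "cpu": cores * 1000 / cpu_ms_per_request,
        # 1 Gbps is 1,000,000 kilobits/s; each response costs response_kb * 8 kilobits.
        "network": nic_gbps * 1_000_000 / (response_kb * 8),
    }
    binding = min(limits, key=limits.get)
    return binding, limits[binding]


for label, cores, nic in [("8 cores, 1 Gbps", 8, 1), ("16 cores, 10 Gbps", 16, 10)]:
    resource, rps = ceiling_rps(cores, nic, cpu_ms_per_request=8, response_kb=60)
    print(f"{label}: {resource} caps the box at {rps:.0f} req/s")
# 8 cores, 1 Gbps: cpu caps the box at 1000 req/s
# 16 cores, 10 Gbps: cpu caps the box at 2000 req/s
```

Under these assumptions the larger box tops out at about 2,000 requests per second, which is exactly the stated peak. It buys time, not lasting headroom.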

Where this lives in the wild

Shopify storefront app tier — staff performance engineer: checks whether CPU, memory, or network breaks first before adding more pods.
Figma document server — senior realtime engineer: watches memory growth and network fanout before deciding when a single node must split rooms.
Notion page render backend — principal engineer: measures CPU-heavy serialization versus cache fit before scaling horizontally.
Razorpay API gateway host — senior platform engineer: validates NIC saturation and TLS CPU cost before fleet expansion.
GitHub monolith page serving — staff infrastructure engineer: isolates which single-host resource is actually red before changing topology.

Pause and recall

What are the four physical resources that explain most single-server bottlenecks?
Why is "the server is slow" a weak diagnosis?
In the worked example, which resource failed first and which one was close behind?
Why can vertical scaling be correct even though it is not the final scaling story?

Interview Q&A

Q: Why start with one server, not a distributed cluster immediately? A: A single-server baseline shows the natural bottleneck before you add coordination, replication, and network complexity. Once you know what actually broke, you can scale in a way that solves the real limit instead of cargo-culting a bigger diagram.

Common wrong answer to avoid: "Because interviewers prefer simple answers first." — Simplicity is useful here because it reveals evidence, not because the interviewer dislikes advanced systems.

Q: Why separate CPU, memory, disk, and network, not just talk about throughput? A: Throughput is the symptom you observe, but resource saturation is the reason the system stops scaling. Since each resource fails differently, naming the exact constraint leads to much better scaling choices and tradeoff discussion.

Common wrong answer to avoid: "Throughput already includes all resource behavior, so the breakdown is unnecessary." — Aggregate throughput hides the cause, which is the part you need in order to fix the bottleneck.

Q: Why can a server with plenty of CPU still fail under load? A: CPU is only one ceiling; large payloads can saturate the network, random access patterns can stall disk, and memory pressure can wreck cache behavior. If you do not name the constrained resource, low CPU usage can trick you into thinking the machine still has lots of safe headroom.

Common wrong answer to avoid: "If CPU is low, the server should be able to scale much further." — Headroom in one resource does not help once a different resource has already become the hard limit.

Q: Why choose vertical scaling first sometimes, not horizontal scaling immediately? A: Vertical scaling is often the fastest way to buy headroom while keeping the design and operations model simple. It lets you validate the bottleneck hypothesis before taking on distributed failure modes, coordination costs, and data-partitioning work.

Common wrong answer to avoid: "Horizontal scaling is always superior because cloud machines are cheap." — More nodes add operational and consistency complexity, so they are only better when that extra complexity buys something you actually need.

Apply now (5 min)

Take the prompt: design a photo upload API on one server. Invent four numbers:

- peak requests per second
- average response or upload size
- CPU milliseconds per request
- working set memory

Then sketch from memory:

- the four-resource box
- which resource fails first
- one vertical scaling move you would try
- one horizontal split you would consider next

If your answer names the resource clearly, your scaling story will sound grounded.

Bridge. One box broke. We need more. But more of what? There are only about 8 building blocks every system uses. → 05-building-blocks-toolkit.md