03. Back-of-Envelope Math — Turn vague scale into real pressure

~16 min read. "Millions of users" becomes useful only after division and multiplication.

Built on the ELI5 in 00-eli5.md. Each order ticket — one request moving through the system — creates load you can count, size, and budget.

1) Why napkin math matters before architecture

Candidates often do one of two bad things. They either avoid numbers completely. Or they throw random giant numbers around. Neither helps. Math is not a side show. Math is the bridge from requirements to design. Without math, words like scale and high traffic stay fuzzy. With math, bottlenecks start appearing. See this simple flow.

requirements
   │
   ├── users
   ├── actions per user
   ├── data per action
   ├── latency target
   │
   ▼
napkin math
   │
   ├── QPS
   ├── storage growth
   ├── network bandwidth
   ├── peak vs average
   │
   ▼
architecture choices

Simple, no? The goal is not perfect precision. The goal is correct order of magnitude. If your estimate says 500 QPS, a single machine might work. If your estimate says 500,000 QPS, the design conversation changes. If storage is 50 GB per year, one database is fine. If storage is 50 PB per year, partitioning is not optional. That is why math comes before clever components. You are trying to expose the pressure points.

2) Start with QPS, not with machines

The most common first estimate is throughput. How many requests per second hit the system? Start from user behavior, not from server guesses. Here is the base formula, where 86,400 is the number of seconds in a day.

Average QPS = total requests per day ÷ 86,400
Peak QPS = average QPS × peak factor

Peak factor depends on the product. For global consumer products, 3× to 10× is a useful first cut. For event-driven spikes, peak can be much higher. Always separate reads and writes. They usually stress different components.

Say the prompt gives 10 million DAU, and each user makes 6 reads per day.

Total daily reads = 10,000,000 × 6 = 60,000,000 reads per day
Average read QPS = 60,000,000 ÷ 86,400 ≈ 694 reads per second
Peak read QPS (peak factor 5) ≈ 694 × 5 ≈ 3,470 reads per second

Now do writes separately. Say each user makes 0.5 writes per day.

Total daily writes = 10,000,000 × 0.5 = 5,000,000 writes per day
Average write QPS = 5,000,000 ÷ 86,400 ≈ 58 writes per second
Peak write QPS ≈ 58 × 5 = 290 writes per second

See the shape now? This is read-heavy, about 12 reads for every write. That suggests cache might matter more than write batching. The math is already whispering design hints.
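The read and write estimates above can be sketched as one small helper. This is a minimal illustration, not a prescribed tool: the function name `estimate_qps` and its parameters are made up for this example, and the inputs are the assumed numbers from the text.

```python
SECONDS_PER_DAY = 86_400


def estimate_qps(dau: int, actions_per_user_per_day: float,
                 peak_factor: float) -> tuple[float, float]:
    """Return (average_qps, peak_qps) for one kind of action."""
    daily_requests = dau * actions_per_user_per_day
    average_qps = daily_requests / SECONDS_PER_DAY
    return average_qps, average_qps * peak_factor


# Assumed inputs from the text: 10 million DAU, peak factor 5.
read_avg, read_peak = estimate_qps(10_000_000, 6, 5)
write_avg, write_peak = estimate_qps(10_000_000, 0.5, 5)

print(f"reads:  avg {read_avg:,.0f}/s, peak {read_peak:,.0f}/s")
print(f"writes: avg {write_avg:,.0f}/s, peak {write_peak:,.0f}/s")
print(f"read:write ratio about {read_avg / write_avg:.0f}:1")
```

The script multiplies before rounding, so its peaks (about 3,472 reads and 289 writes per second) differ slightly from the hand math, which rounds the average first. Differences that small do not matter at napkin precision.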

3) Storage math is just rate × time

Next, ask how much data is stored. Again, start from one unit. How much data does one order ticket create? A post. A message. A photo metadata row. A click event. Whatever the unit is, size it. Then multiply. The simplest formula is this.

Storage per period = objects per period × bytes per object

Then remember replication separately. Logical storage is not physical storage.

Say one object is 2 KB, and you create 5 million objects per day.

Daily logical storage = 5,000,000 × 2 KB = 10,000,000 KB ≈ 10 GB per day
Yearly logical storage ≈ 10 GB × 365 ≈ 3,650 GB ≈ 3.65 TB per year
Physical storage (replication factor 3) ≈ 3.65 TB × 3 ≈ 10.95 TB per year

Now what is the problem? Candidates often forget metadata, indexes, and replicas. So their storage numbers look suspiciously small. A safer habit is this. Estimate logical data. Then add index overhead. Then add replication. Then add room for growth. That is much closer to reality.
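The "logical, then indexes, then replication" habit can be written down as a short sketch. The function name and parameters are hypothetical. The 30% index overhead in the last call is only a placeholder to show where that factor enters, since the text gives no figure for it.

```python
KB = 1_000                  # decimal units, as in rough interview math
TB = 1_000_000_000_000


def yearly_storage_bytes(objects_per_day: int, bytes_per_object: int,
                         replication_factor: int = 1,
                         index_overhead: float = 0.0,
                         days: int = 365) -> float:
    """Storage for one year: logical bytes, then indexes, then replicas."""
    logical = objects_per_day * bytes_per_object * days
    return logical * (1 + index_overhead) * replication_factor


# Numbers from the text: 5 million objects per day at 2 KB each.
logical = yearly_storage_bytes(5_000_000, 2 * KB)
physical = yearly_storage_bytes(5_000_000, 2 * KB, replication_factor=3)
# Illustrative only: 30% index overhead is an assumed placeholder value.
with_indexes = yearly_storage_bytes(5_000_000, 2 * KB,
                                    replication_factor=3, index_overhead=0.3)

print(f"logical:        {logical / TB:.2f} TB per year")
print(f"with 3 replicas: {physical / TB:.2f} TB per year")
print(f"plus indexes:    {with_indexes / TB:.2f} TB per year")
```

Keeping replication and index overhead as separate arguments makes it harder to forget either one, and it lets you state each assumption out loud.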

4) Bandwidth math exposes hidden pain

Many designs fail on network before database size looks scary. Especially feeds. Especially media products. Especially fanout systems. Bandwidth math is simple.

Bandwidth = requests per second × bytes per response

Say peak reads are 20,000 QPS, and each response is 50 KB.

Peak outbound bandwidth = 20,000 × 50 KB = 1,000,000 KB per second ≈ 1,000 MB per second ≈ 1 GB per second

That is serious. Now imagine media thumbnails are embedded and the response becomes 200 KB. Bandwidth jumps to about 4 GB per second. Same QPS. Different payload. Huge consequence. See why response size matters?

One more rule. Convert bits and bytes carefully. 8 bits = 1 byte. Network links are usually quoted in bits per second, so 1 GB per second is 8 Gbps on the wire. In rough interview math, treat 1 MB as 10^6 bytes. If you want binary precision, 1 MiB = 2^20 bytes. Use whichever is cleaner. Just stay consistent.
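A short sketch of the bandwidth formula, including the bytes-to-bits conversion. The function name `peak_bandwidth` is made up for this example, and the inputs are the assumed 20,000 QPS and the two payload sizes from the text.

```python
KB = 1_000
GB = 1_000_000_000


def peak_bandwidth(peak_qps: int, response_bytes: int) -> tuple[int, int]:
    """Return (bytes_per_second, bits_per_second) of outbound traffic."""
    bytes_per_second = peak_qps * response_bytes
    return bytes_per_second, bytes_per_second * 8  # 8 bits = 1 byte


# Same QPS, two payload sizes.
for size_kb in (50, 200):
    bytes_ps, bits_ps = peak_bandwidth(20_000, size_kb * KB)
    print(f"{size_kb} KB responses: {bytes_ps / GB:.0f} GB/s "
          f"= {bits_ps / GB:.0f} Gbit/s")
```

Returning both units side by side is a deliberate choice. Payload sizes come in bytes and link capacities come in bits, and mixing them silently is an easy factor-of-eight mistake.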

5) The powers of two and latency numbers worth carrying

You do not need a memory palace. You need a few anchors. Useful powers of two:

2^10 ≈ 1 thousand
2^20 ≈ 1 million
2^30 ≈ 1 billion
2^40 ≈ 1 trillion

So what to do? Use these to move quickly between bytes and bigger units.

1 KB ≈ 2^10 bytes
1 MB ≈ 2^20 bytes
1 GB ≈ 2^30 bytes
1 TB ≈ 2^40 bytes

A few latency anchors also help. CPU cache access is tiny. Memory access is much slower than cache. SSD access is much slower than memory. A network round trip inside one region is slower again. A cross-continent trip is far slower. You do not need exact nanoseconds in most interviews. You need the ordering.

fastest ──→ slowest
CPU cache → memory → SSD → local network → cross-region network

If a design adds a remote network hop on every read, latency budget changes. If a design serves from memory, latency is far lower. That is the mental use of latency numbers. Not trivia.
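How rough are the powers-of-two anchors from this section? The sketch below compares each binary unit with its decimal neighbour. The helper name `binary_vs_decimal` is hypothetical; the arithmetic is exact integer math.

```python
UNITS = [("KB", "KiB", 1), ("MB", "MiB", 2), ("GB", "GiB", 3), ("TB", "TiB", 4)]


def binary_vs_decimal(step: int) -> tuple[int, int, float]:
    """Return (decimal_bytes, binary_bytes, relative_gap) for the given unit step."""
    decimal = 10 ** (3 * step)   # 10^3, 10^6, 10^9, 10^12
    binary = 2 ** (10 * step)    # 2^10, 2^20, 2^30, 2^40
    return decimal, binary, binary / decimal - 1


for dec_name, bin_name, step in UNITS:
    decimal, binary, gap = binary_vs_decimal(step)
    print(f"1 {bin_name} = 2^{10 * step} = {binary:,} bytes, "
          f"1 {dec_name} = {decimal:,} bytes, gap {gap:.1%}")
```

The gap grows from about 2% at the kilobyte level to about 10% at the terabyte level. That stays well inside order-of-magnitude tolerance, which is why either convention works as long as you stay consistent.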

6) Worked example: the social home feed

Prompt: design the home feed for a social app. Assume these house rules.

50 million DAU.
Each DAU opens the app 4 times per day.
Each feed open returns 20 posts.
Each post payload in the feed is 2 KB after basic metadata packing.
Each DAU creates 1 post every 5 days on average.
Peak factor is 6.

Let us compute writes first.

Posts per day = 50,000,000 ÷ 5 = 10,000,000 posts per day
Average write QPS = 10,000,000 ÷ 86,400 ≈ 116 writes per second
Peak write QPS ≈ 116 × 6 ≈ 696 writes per second

Now feed reads.

Feed opens per day = 50,000,000 × 4 = 200,000,000 feed reads per day
Average read QPS = 200,000,000 ÷ 86,400 ≈ 2,315 reads per second
Peak read QPS ≈ 2,315 × 6 ≈ 13,890 reads per second

Now response size.

One feed response = 20 posts × 2 KB = 40 KB
Peak outbound bandwidth = 13,890 × 40 KB = 555,600 KB per second ≈ 556 MB per second ≈ 0.56 GB per second

Now storage for one year of posts. One post object = 2 KB in feed-visible metadata.

Logical yearly post storage = 10,000,000 × 2 KB × 365 = 7,300,000,000 KB ≈ 7,300,000 MB ≈ 7,300 GB ≈ 7.3 TB
Physical storage (replication factor 3) ≈ 7.3 TB × 3 = 21.9 TB

And this is only feed metadata. Not photos. Not videos. Not indexes. Not comments.

See. Even with rough math, three design hints appear.

Read path is far hotter than write path.
Feed payload size drives serious network use.
Media itself must live outside the main feed database.

That is why napkin math matters. It tells you where the kitchen gets hot.
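The whole feed estimate fits in one small script, which makes it easy to change a single assumption and watch the pressure move. This is a sketch only: the `FeedAssumptions` class, its field names, and the `estimate` function are invented for illustration, and the defaults are the house rules from the prompt.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400
KB = 1_000


@dataclass
class FeedAssumptions:
    dau: int = 50_000_000
    opens_per_user_per_day: float = 4
    posts_per_feed: int = 20
    post_bytes: int = 2 * KB
    days_between_posts: float = 5
    peak_factor: float = 6
    replication_factor: int = 3


def estimate(a: FeedAssumptions) -> dict[str, float]:
    """Derive peak QPS, peak egress, and yearly storage from the assumptions."""
    posts_per_day = a.dau / a.days_between_posts
    feed_opens_per_day = a.dau * a.opens_per_user_per_day
    peak_read_qps = feed_opens_per_day / SECONDS_PER_DAY * a.peak_factor
    logical_year = posts_per_day * a.post_bytes * 365
    return {
        "peak_write_qps": posts_per_day / SECONDS_PER_DAY * a.peak_factor,
        "peak_read_qps": peak_read_qps,
        "peak_egress_bytes_per_s": peak_read_qps * a.posts_per_feed * a.post_bytes,
        "logical_bytes_per_year": logical_year,
        "physical_bytes_per_year": logical_year * a.replication_factor,
    }


baseline = estimate(FeedAssumptions())
for name, value in baseline.items():
    print(f"{name:26s} {value:,.0f}")

# Change one assumption: a leaner 1 KB post payload halves peak egress.
lean = estimate(FeedAssumptions(post_bytes=1 * KB))
print(f"egress with 1 KB posts:    {lean['peak_egress_bytes_per_s']:,.0f}")
```

Because the script does not round intermediate values, it reports about 694 peak writes and 13,889 peak reads per second, a hair off the hand-rounded 696 and 13,890. The design hints are unchanged.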

7) Common estimation mistakes

Mistake one. Using monthly users directly for QPS. MAU helps with broad reach. DAU is usually better for active traffic estimates.
Mistake two. Forgetting peak factor. Average traffic rarely breaks systems. Peak traffic does.
Mistake three. Ignoring read and write asymmetry. Many products are extremely skewed.
Mistake four. Counting logical storage but forgetting replicas and indexes.
Mistake five. Not checking whether the answer even feels plausible. If a global app estimate gives 2 QPS, something is off. If a tiny internal tool estimate gives 5 million QPS, something is off. Do a smell test. Simple, no?

Where this lives in the wild

Instagram home feed — staff infrastructure engineer: estimates read amplification, fanout volume, and payload size before choosing cache layers.
WhatsApp message service — senior backend engineer: sizes daily sends, message durability, and media bandwidth before discussing queue partitions.
Netflix title page — performance engineer: calculates payload bytes and cache hit needs before changing edge strategy.
Razorpay ledger events — principal engineer: derives peak write rate, retention storage, and replay bandwidth before database planning.
Amazon product detail page — senior platform engineer: models read skew, seasonal peak factor, and image payload cost before CDN expansion.

Pause and recall

Why is peak QPS usually more useful than average QPS for architecture choices?
What three numbers do you need to estimate storage growth quickly?
Why should read QPS and write QPS be estimated separately?
In the social feed example, what two variables most strongly affected bandwidth?

Interview Q&A

Q: Why estimate from user actions, not from guessed server counts? A: User behavior is the source of demand, so it gives you a defensible starting point for QPS, storage, and bandwidth. Server count should fall out of that math later; treating it as an input hides the reasoning you are supposed to show.

Common wrong answer to avoid: "Because interviewers prefer product metrics over infrastructure metrics." — This is not about presentation style; it is about starting from the actual cause of system load.

Q: Why carry a peak factor, not design only for average traffic? A: Real systems break during bursts, launches, and daily spikes rather than during calm average periods. A peak factor gives you a simple way to approximate that pressure without pretending you know the exact traffic curve.

Common wrong answer to avoid: "Average traffic is enough if autoscaling exists." — Autoscaling reacts with delay, and some bottlenecks appear before extra capacity arrives.

Q: Why treat response bytes as a first-class number, not just QPS? A: QPS tells you how often requests happen, but payload size tells you how much network and storage work each request creates. Two services with identical QPS can have completely different bandwidth cost, latency, and CDN implications if object sizes differ.

Common wrong answer to avoid: "If QPS is reasonable, bandwidth will usually be reasonable too." — Small request counts can still overwhelm the network when each request carries a large payload.

Q: Why is rough order-of-magnitude math enough in interviews, not exact spreadsheets? A: Interview math is meant to expose pressure points and guide architecture, not produce finance-grade forecasting. Fast, sensible estimates are better than slow fake precision because they let you explain tradeoffs while keeping the discussion moving.

Common wrong answer to avoid: "Exact numbers matter less because system design is mostly qualitative." — The math still matters; it just needs to be directionally correct enough to shape decisions.

Apply now (5 min)

Take the prompt: design a photo sharing app. Assume your own DAU, reads per user, writes per user, and object size. In five minutes, compute four things: average read QPS, peak read QPS, yearly logical storage, and peak outbound bandwidth.

Then sketch from memory:

- the four formulas you used
- one assumption that most changed the answer
- one design hint the math suggests

If you can do that quickly, your whiteboard will stop being guesswork.

Bridge. Numbers are clear. Now put something on the whiteboard. Start with one box. See how far it goes. → 04-single-box-to-breakdown.md