12. Latency and throughput budgets: speed is a budget, not a wish

~17 min read. "Fast enough" becomes real only when every millisecond gets a job.

Built on the ELI5 in 00-eli5.md. The house rules — latency and uptime promises for the whole restaurant — now become numbers for each order ticket.

1) First picture: every request spends time somewhere

See. A slow system is rarely slow in one giant lump. Time leaks across many small steps.

user ─→ DNS ─→ TLS ─→ gateway ─→ app ─→ cache/DB ─→ downstream ─→ response
        15ms   20ms    10ms      25ms      60ms         80ms          20ms

That is the latency waterfall. The full request time is the sum of many pieces, here 15 + 20 + 10 + 25 + 60 + 80 + 20 = 230 ms. If you do not budget each piece, one hop quietly eats the whole meal.
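To make the sum concrete, here is a minimal Python sketch that adds up the hops from the diagram and lists the biggest spenders. The hop timings come from the picture above; the dictionary and function names are illustrative, and the sketch assumes the hops run strictly one after another.

```python
# Hop timings copied from the waterfall diagram, in milliseconds.
waterfall_ms = {
    "DNS": 15,
    "TLS": 20,
    "gateway": 10,
    "app": 25,
    "cache/DB": 60,
    "downstream": 80,
    "response": 20,
}

# Assumes the hops are sequential, so the request time is a plain sum.
total_ms = sum(waterfall_ms.values())


def slowest_hops(hops, top=3):
    """Return the `top` hops by time spent, largest first."""
    return sorted(hops.items(), key=lambda kv: kv[1], reverse=True)[:top]


print(f"total = {total_ms} ms")
for name, ms in slowest_hops(waterfall_ms):
    print(f"{name:<12}{ms:>4} ms  {ms / total_ms:.0%} of total")
```

Sorting by time spent is the point: the hop to argue about first is the one holding the largest share, not the one with the loudest owner.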

Look. Teams love averages. Users feel tails. If 99 users get a fast page and 1 user waits 3 seconds, the average may look decent. That one user still thinks the product is broken.

So what to do? Track percentiles. Track the waterfall. Track the fan-out. Simple, no?

2) P50, P95, P99: why averages lie

Suppose you sampled 100 requests. Their latencies are:

90 requests at 80 ms
8 requests at 250 ms
1 request at 900 ms
1 request at 3,000 ms

Now calculate the average.

90 × 80 = 7,200
8 × 250 = 2,000
1 × 900 = 900
1 × 3,000 = 3,000
total = 7,200 + 2,000 + 900 + 3,000 = 13,100 ms
average = 13,100 ÷ 100 = 131 ms

Average says 131 ms. That sounds fine. Now sort the 100 requests from fastest to slowest and look at percentiles.

P50 = the 50th request = 80 ms
P95 = the 95th request = 250 ms
P99 = the 99th request = 900 ms
worst case here = 3,000 ms

See the story? The mean hides the pain. A product manager hears 131 ms and smiles. A user stuck at 900 ms does not.
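The same numbers can be checked with a short Python sketch. It uses the nearest-rank definition of a percentile (sort, then take the value at position ceil(p × n ÷ 100)), which matches the "Nth request" reading above. Monitoring tools often interpolate instead, so their figures may differ slightly. The function name is illustrative.

```python
import math

# The 100 sampled latencies from the example, in milliseconds.
latencies_ms = [80] * 90 + [250] * 8 + [900] + [3000]


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: value at 1-indexed position ceil(p * n / 100) after sorting."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


mean_ms = sum(latencies_ms) / len(latencies_ms)
print(f"mean = {mean_ms:.0f} ms")
for p in (50, 95, 99, 100):
    print(f"P{p} = {percentile_nearest_rank(latencies_ms, p)} ms")
```

Run it and the mean sits at 131 ms while P99 is 900 ms, which is the gap the product manager never sees.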

The house rules should usually name a percentile. For user-facing APIs, P95 and P99 matter much more than mean. For batch systems, throughput may matter more. Context first.

3) Build the latency waterfall before tuning anything

Look. You need an end-to-end budget. Then you slice it.

Worked example. Suppose checkout must respond in 300 ms at P95. One possible budget is:

DNS and TLS = 25 ms
load balancer and gateway = 15 ms
auth check = 20 ms
application logic = 30 ms
cache lookup = 5 ms on hit
database query on miss = 70 ms
fraud service = 80 ms
serialization and response write = 15 ms

If this request is a cache miss, the 70 ms database query takes the place of the 5 ms cache hit. Total is:

25 + 15 + 20 + 30 + 70 + 80 + 15 = 255 ms

Remaining headroom is:

300 - 255 = 45 ms

That 45 ms is precious. It absorbs jitter, network variance, and normal noise. Without headroom, a tiny slowdown becomes an SLO miss.

┌───────────────────────┬────────┐
│ hop                   │ budget │
├───────────────────────┼────────┤
│ DNS + TLS             │ 25 ms  │
│ LB + gateway          │ 15 ms  │
│ auth check            │ 20 ms  │
│ app logic             │ 30 ms  │
│ DB query (cache miss) │ 70 ms  │
│ fraud service         │ 80 ms  │
│ response write        │ 15 ms  │
├───────────────────────┼────────┤
│ total                 │ 255 ms │
│ headroom vs 300 ms    │ 45 ms  │
└───────────────────────┴────────┘
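The budget is easy to keep honest in code. Below is a minimal Python sketch, assuming the cache-miss budget from the worked example. The `measured_ms` values are hypothetical, invented only to show how an overshoot eats headroom, and the function names are illustrative.

```python
SLO_P95_MS = 300  # end-to-end P95 target from the worked example

# Per-hop budget for the cache-miss path, in milliseconds.
budget_ms = {
    "DNS and TLS": 25,
    "load balancer and gateway": 15,
    "auth check": 20,
    "application logic": 30,
    "database query on miss": 70,
    "fraud service": 80,
    "serialization and response write": 15,
}


def headroom_ms(spend, target_ms):
    """Milliseconds left under the target after every hop spends its share."""
    return target_ms - sum(spend.values())


def over_budget(budget, measured):
    """Map each hop that exceeded its budget to its overshoot in ms."""
    return {
        hop: measured[hop] - limit
        for hop, limit in budget.items()
        if measured.get(hop, 0) > limit
    }


# Hypothetical measurement: the fraud service runs 30 ms over its 80 ms budget.
measured_ms = {**budget_ms, "fraud service": 110}

print(headroom_ms(budget_ms, SLO_P95_MS))    # planned headroom
print(over_budget(budget_ms, measured_ms))   # which hop broke its budget
print(headroom_ms(measured_ms, SLO_P95_MS))  # headroom that is left
```

In this made-up case the request still fits inside 300 ms, but one hop has spent two thirds of the shared headroom. A per-hop check flags that before the end-to-end number does.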

So what to do? Write the waterfall first. Then argue about optimizations. Otherwise teams optimize the loudest service, not the slowest path.

4) Tail latency amplifies when one request fans out

Now what is the trap in distributed systems? One order ticket often touches many prep stations. Even if each one is mostly fast, the combined chance of one slow call rises.

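Suppose each downstream service has a 1% chance of being slow, independently of the others. So each one is fast 99% of the time. The request needs every call, so it is fast only when all of them are fast.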

If your request calls 1 service:

probability it is fast = 0.99
probability it is slow = 1 - 0.99 = 1%

If your request calls 8 services:

probability all are fast = 0.99^8
0.99^8 ≈ 0.9227
probability at least one is slow = 1 - 0.9227 ≈ 7.7%

If your request calls 20 services:

probability all are fast = 0.99^20
0.99^20 ≈ 0.8179
probability at least one is slow = 1 - 0.8179 ≈ 18.2%

Simple, no? More fan-out means more tail pain. That is why deep service graphs feel slower than single-box designs, even when every team says, "our service is fast."

The kitchen may have many excellent cooks. The plate is late if one station delays the order.
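The fan-out arithmetic is one line of Python. This sketch assumes each call is slow independently with the same probability, which real systems only approximate because shared dependencies make slowness correlated. The function names are illustrative.

```python
import math


def p_at_least_one_slow(fan_out, p_slow=0.01):
    """Chance that at least one of `fan_out` independent calls is slow."""
    return 1 - (1 - p_slow) ** fan_out


def fan_out_for_risk(target, p_slow=0.01):
    """Smallest fan-out at which the chance of a slow call reaches `target`."""
    return math.ceil(math.log(1 - target) / math.log(1 - p_slow))


for n in (1, 8, 20):
    print(f"{n:>2} services -> {p_at_least_one_slow(n):.1%} chance of a slow call")

print(f"coin-flip odds at {fan_out_for_risk(0.5)} services")
```

Under these assumptions, a request that touches 69 services has a better-than-even chance of hitting at least one slow call, even though every single service is fast 99% of the time.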

Now connect this to throughput. Suppose a service has 40 workers, each handling one request at a time. Average service time is 20 ms. Theoretical max throughput is:

throughput = workers ÷ service time
throughput = 40 ÷ 0.02 = 2,000 requests per second

If traffic is 1,800 requests per second, utilization is:

1,800 ÷ 2,000 = 90%

At 90%, a tiny slowdown hurts badly. If service time rises from 20 ms to 25 ms, new capacity is:

40 ÷ 0.025 = 1,600 requests per second

Now arrival rate is 1,800. Capacity is 1,600. Queueing starts. Latency rises before error rate does. See why throughput and latency fight each other?
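The capacity math above fits in a few lines of Python. This is a sketch under the same simplifying assumptions: one request per worker at a time, a steady arrival rate, and nothing shed or timed out. The function names are illustrative.

```python
def capacity_rps(workers, service_time_ms):
    """Theoretical max throughput: each worker finishes 1000 / service_time_ms requests per second."""
    return workers * 1000 / service_time_ms


def utilization(arrival_rps, workers, service_time_ms):
    """Fraction of capacity in use; above 1.0 the service cannot keep up."""
    return arrival_rps / capacity_rps(workers, service_time_ms)


def backlog_growth_rps(arrival_rps, workers, service_time_ms):
    """Requests per second piling up in the queue once arrivals exceed capacity."""
    return max(0.0, arrival_rps - capacity_rps(workers, service_time_ms))


for service_time_ms in (20, 25):
    cap = capacity_rps(40, service_time_ms)
    util = utilization(1800, 40, service_time_ms)
    growth = backlog_growth_rps(1800, 40, service_time_ms)
    print(f"{service_time_ms} ms: capacity={cap:.0f} rps, "
          f"utilization={util:.1%}, backlog +{growth:.0f} req/s")
```

At 25 ms the queue grows by 200 requests every second. After ten seconds, 2,000 requests are waiting, and every one of them is getting slower without a single error being logged.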

5) SLA, SLO, SLI, and the error budget

Look. These terms sound similar. They do different jobs.

SLI is the measured indicator. Example: percentage of requests under 300 ms.
SLO is the target you aim for. Example: 99% of requests under 300 ms each month.
SLA is the external promise, often with penalties. Example: 99.9% availability in the customer contract.

Now the useful idea. Error budget. If your SLO is 99.9% availability for a 30-day month, allowed unavailability is 0.1%.

Thirty days has:

30 × 24 × 60 = 43,200 minutes

Allowed unavailable time is:

43,200 × 0.001 = 43.2 minutes

That is the monthly error budget. Use it carefully. If you burn it too fast, stop risky launches. If you have spent far less than the budget, you may accept more change velocity.
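A minimal Python sketch of the time-based budget follows. The 30-minute outage is hypothetical, included only to show how quickly the budget drains, and the function names are illustrative.

```python
def error_budget_minutes(slo, days=30):
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    return days * 24 * 60 * (1 - slo)


def budget_remaining_minutes(slo, downtime_minutes, days=30):
    """Budget left after the downtime already spent; negative means the SLO is missed."""
    return error_budget_minutes(slo, days) - downtime_minutes


for slo in (0.99, 0.999, 0.9999):
    print(f"{slo:.2%} -> {error_budget_minutes(slo):.2f} minutes per 30 days")

# Hypothetical: one 30-minute outage against a 99.9% SLO.
print(f"left after a 30-minute outage: {budget_remaining_minutes(0.999, 30):.1f} minutes")
```

Each extra nine cuts the budget by ten. At 99.9%, one half-hour outage leaves about 13 minutes for the rest of the month.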

Another example. Suppose you serve 50 million requests per month. Your latency SLO says 99% must finish under 300 ms. Allowed slow requests are:

50,000,000 × 0.01 = 500,000 requests

That number makes the tradeoff concrete. One bad deployment might burn a huge chunk quickly. The house rules become operational, not decorative.
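The request-based budget works the same way. In this Python sketch the monthly volume and SLO come from the example above, while the incident (1,000 requests per second for 15 minutes, with 25% of them over 300 ms) is a hypothetical chosen for illustration. The function names are illustrative too.

```python
def allowed_slow_requests(total_requests, slo):
    """How many requests may miss the latency target without breaking the SLO."""
    return round(total_requests * (1 - slo))


def budget_burned(slow_requests, total_requests, slo):
    """Fraction of the monthly budget consumed by `slow_requests`."""
    return slow_requests / allowed_slow_requests(total_requests, slo)


MONTHLY_REQUESTS = 50_000_000
LATENCY_SLO = 0.99  # 99% of requests under 300 ms

# Hypothetical bad deploy: 1,000 rps for 15 minutes, 25% of requests slow.
incident_slow = 1_000 * 15 * 60 * 25 // 100

print(allowed_slow_requests(MONTHLY_REQUESTS, LATENCY_SLO))
print(f"{budget_burned(incident_slow, MONTHLY_REQUESTS, LATENCY_SLO):.0%} of the budget burned")
```

With these made-up numbers, fifteen bad minutes spend 45% of a month's budget. That is the kind of figure that justifies pausing risky launches.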

Where this lives in the wild

Google Search frontend — a web performance engineer watches P99 render latency, not mean latency, because one slow tail still breaks the user’s search session.
DoorDash checkout — a reliability engineer budgets promo, tax, payment, and restaurant-availability calls inside one strict request waterfall.
Slack huddle join flow — a realtime infrastructure engineer tracks tail amplification across auth, room state, media allocation, and notification services.
Stripe API platform — an SRE uses error-budget burn to decide whether risky deploys should pause after reliability regressions.
Disney+ Hotstar live streaming — a streaming performance engineer trades buffering, concurrency, and throughput carefully during huge sports spikes.

Pause and recall

Why does the average latency hide user pain?
In the percentile example, what did P99 reveal that the mean did not?
Why does calling more downstream services amplify tail latency?
How do SLI, SLO, SLA, and error budget differ?

Interview Q&A

Q: Why optimize for P99, not just average latency, in a user-facing system? A: Because users experience individual requests, not the arithmetic mean. Tail latency captures the slow experiences that drive abandonment and complaints.

Common wrong answer to avoid: "Average is enough if traffic is high" — more traffic can actually hide worse tails inside a nice-looking mean.

Q: Why assign per-hop latency budgets, not only one end-to-end target? A: Because teams need local guardrails. A single global number does not tell auth, database, or downstream owners how much delay they are allowed to spend.

Common wrong answer to avoid: "The slowest hop will be obvious anyway" — without a budget, multiple moderate delays can quietly combine into failure.

Q: Why can higher throughput make latency worse? A: Because as utilization approaches capacity, queues grow quickly and waiting time dominates service time. A small slowdown can push the system into saturation.

Common wrong answer to avoid: "Throughput and latency improve together" — sometimes they do, but near saturation they often trade off sharply.
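If the interviewer asks for numbers, a textbook queueing model helps. The Python sketch below uses the M/M/1 result, where mean time in the system is service time ÷ (1 − utilization). M/M/1 is an assumption brought in here for illustration: it describes a single server with random arrivals and exponentially distributed service times, so a multi-worker service will not match it exactly. Only the shape of the curve carries over.

```python
def mm1_time_in_system_ms(service_time_ms, utilization):
    """Mean time in an M/M/1 queue (waiting plus service), in milliseconds."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1); at 1 or above the queue grows without bound")
    return service_time_ms / (1 - utilization)


for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilization {rho:.0%}: mean latency {mm1_time_in_system_ms(20, rho):.0f} ms")
```

With a 20 ms service time, this model gives 40 ms at 50% utilization, 200 ms at 90%, and 2,000 ms at 99%. The last few percent of utilization cost far more latency than the first fifty.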

Q: Why use SLOs and error budgets, not only a customer-facing SLA? A: Because SLOs guide daily engineering decisions before the contract is at risk. Error budgets turn reliability into an operating mechanism, not just a legal number.

Common wrong answer to avoid: "SLA already covers reliability" — SLA is too coarse and external to guide most internal decisions.

Apply now (5 min)

Take one API you know. Write a 300 ms end-to-end budget. Split it across network, gateway, app logic, storage, and one downstream call. Then compute how much headroom remains.

Sketch from memory:

the latency waterfall,
the percentile example with 100 requests,
and the fan-out probability formula.

Bridge. The design is solid technically. But can you walk the interviewer through it without rambling? The best architecture fails the interview if you cannot communicate it. → 13-structured-walkthrough.md