12. Load testing and benchmarking: how to measure serving honestly instead of believing one lucky run

~18 min read. A serving system without measurement discipline becomes folklore very quickly.

Built on the ELI5 in 00-eli5.md. The kitchen may look fast on one calm evening, but we need a repeatable stress test before we trust the service plan.

1) Picture first: a real benchmark has stages

A good benchmark is not one curl command. It has a warmup phase, a steady-state phase, a cooldown phase, and a clear request mix.

load test plan
   │
   ├── warmup ──→ caches settle, kernels compile, queues form
   ├── steady ──→ measure real tokens and latencies
   └── cooldown ──→ collect logs and errors

See. A benchmark is one planned dinner rush for the kitchen. Without warmup, you mix cold-start artifacts into the results. Without steady state, you benchmark only luck.
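The staged plan can be sketched in a few lines. This is a minimal illustration, not a real load-testing tool: the `Sample` type, the `steady_state_only` helper, and the sample values are all hypothetical, and it assumes each request is tagged with its send time relative to the start of the test.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    start_s: float    # request send time, seconds since the test began
    latency_s: float  # total completion time for the request

def steady_state_only(samples, warmup_s, steady_s):
    """Keep only samples whose request started inside the steady-state window."""
    lo, hi = warmup_s, warmup_s + steady_s
    return [s for s in samples if lo <= s.start_s < hi]

# Invented samples for illustration only.
samples = [
    Sample(start_s=1.0, latency_s=6.0),   # warmup: cold caches inflate latency
    Sample(start_s=12.0, latency_s=2.1),  # steady state
    Sample(start_s=30.0, latency_s=2.5),  # steady state
    Sample(start_s=55.0, latency_s=2.2),  # cooldown: load is already draining
]

steady = steady_state_only(samples, warmup_s=10.0, steady_s=40.0)
print([s.start_s for s in steady])  # [12.0, 30.0]
```

Filtering by send time keeps the cold-start sample out of the reported latencies while still letting the warmup traffic do its job.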

2) The core metrics

Track at least these metrics.

requests per second,
output tokens per second,
time to first token,
inter-token latency,
total completion time,
queueing delay,
P50, P95, and P99 latencies,
error rate.

Look. Throughput tells you capacity. Percentiles tell you pain. TTFT tells you when the plating line first moves. Queueing delay tells you when the batch window is too crowded. You need all four views. One number is never enough.
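Most of these metrics fall out of a few timestamps per request. The sketch below assumes the load generator records the arrival time, the time the server began processing, and the time each streamed token arrived. The `RequestTrace` type and its field names are hypothetical, not taken from any particular framework.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class RequestTrace:
    arrival_s: float            # when the client sent the request
    scheduled_s: float          # when the server began processing it
    token_times_s: List[float]  # arrival time of each streamed output token

def queueing_delay(t: RequestTrace) -> float:
    return t.scheduled_s - t.arrival_s

def time_to_first_token(t: RequestTrace) -> float:
    return t.token_times_s[0] - t.arrival_s

def total_completion_time(t: RequestTrace) -> float:
    return t.token_times_s[-1] - t.arrival_s

def mean_inter_token_latency(t: RequestTrace) -> float:
    # Average gap between consecutive tokens; 0.0 if only one token streamed.
    gaps = [b - a for a, b in zip(t.token_times_s, t.token_times_s[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

# Invented trace for illustration only.
trace = RequestTrace(arrival_s=0.0, scheduled_s=0.5,
                     token_times_s=[1.0, 1.25, 1.5, 1.75])
print(queueing_delay(trace))            # 0.5
print(time_to_first_token(trace))       # 1.0
print(mean_inter_token_latency(trace))  # 0.25
print(total_completion_time(trace))     # 1.75
```

Here TTFT includes the queueing delay, which is why a crowded queue shows up in TTFT before it shows up anywhere else.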

3) Worked example: read a small result table

Suppose the steady-state phase of a benchmark completes 100 requests. Each request generates 120 output tokens on average. The steady-state phase lasts 40 seconds.

Output throughput is:

100 × 120 = 12,000 tokens total
12,000 / 40 = 300 output tokens per second
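The same arithmetic as a small helper, which also gives the request rate. This is a minimal sketch: the function name `throughput` is hypothetical, and it assumes every counted request finished inside the measured window.

```python
def throughput(num_requests, avg_output_tokens, duration_s):
    """Return (requests per second, output tokens per second)."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    total_tokens = num_requests * avg_output_tokens
    return num_requests / duration_s, total_tokens / duration_s

rps, tps = throughput(num_requests=100, avg_output_tokens=120, duration_s=40)
print(rps, tps)  # 2.5 300.0
```

Reporting both numbers matters because 2.5 requests per second means very different work at 120 output tokens per request than at 1,200.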

Now latency percentiles:

P50 total latency = 2.4 s
P95 total latency = 5.8 s
P99 total latency = 9.1 s

Simple, no? Median users are fairly happy. Tail users are suffering: the P99 is almost four times the P50 (9.1 / 2.4 ≈ 3.8). If TTFT P95 is also high, new arrivals are waiting in the queue before they see anything.
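Percentiles like these can be computed with the nearest-rank rule: sort the latencies and take the value at rank ceil(p × n / 100). The sketch below uses an invented list of 20 latencies, chosen only so that the results match the figures above. It is not measured data, and other percentile definitions (for example, interpolating ones) can give slightly different values.

```python
import math
import statistics

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Invented total latencies in seconds, for illustration only.
latencies = [1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.3, 2.4,
             2.5, 2.6, 2.7, 2.9, 3.1, 3.4, 3.8, 4.5, 5.8, 9.1]

print(percentile(latencies, 50))   # 2.4
print(percentile(latencies, 95))   # 5.8
print(percentile(latencies, 99))   # 9.1
print(statistics.mean(latencies))  # about 3.0, which hides the 9.1 s tail
```

The mean sits near 3 seconds, so a report that quotes only the mean would never reveal that one user in this sample waited more than 9 seconds.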

4) Methodology traps that create fake wins

Now what is the problem? Benchmarks lie easily. Common traps are:

using only one short prompt length,
hiding cold-cache effects,
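measuring closed-loop traffic (each client waits for its reply before sending again) when production is open-loop (requests arrive regardless of how busy the server is),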
ignoring streaming metrics,
averaging away long-tail latency,
skipping real concurrency mixes,
benchmarking a warmed single tenant only.

So what to do? Match the request mix to the product. Report full methodology beside the numbers. That is what makes a benchmark defensible.
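The closed-loop trap is the easiest one to show in code. A closed-loop client sends its next request only after the previous one returns, so a slow server quietly lowers the offered load and hides queueing. An open-loop generator fixes the send schedule in advance. The sketch below builds such a schedule assuming Poisson arrivals (exponential gaps between requests), which is a common modeling assumption and not something real traffic is guaranteed to follow. The function name and parameters are hypothetical.

```python
import random

def open_loop_arrivals(rate_per_s, duration_s, seed=0):
    """Return send times (seconds) for Poisson arrivals at the given average rate."""
    rng = random.Random(seed)  # seeded so the schedule is repeatable across runs
    t, arrivals = 0.0, []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential gap between arrivals
        if t >= duration_s:
            return arrivals
        arrivals.append(t)

schedule = open_loop_arrivals(rate_per_s=2.5, duration_s=40.0)
print(len(schedule))  # close to 2.5 * 40 = 100 on average; varies with the seed
print(schedule[:3])   # first few send times
```

The load generator then sends each request at its scheduled time whether or not earlier requests have finished, so queueing delay shows up in the measurements instead of being absorbed by the client.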

5) How senior engineers read benchmark numbers

A mature reading asks: What was the prompt-length distribution? What was the output-length distribution? Was the cache warm? How many concurrent users? What hardware? What framework and quantization? What errors were excluded?
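One way to make those questions hard to skip is to store the answers next to the score. The record below is a hypothetical sketch: the `Methodology` class and its field names simply mirror the questions above and do not come from any benchmark tool.

```python
from dataclasses import dataclass, fields
from typing import Optional

@dataclass
class Methodology:
    prompt_length_distribution: Optional[str] = None
    output_length_distribution: Optional[str] = None
    cache_state: Optional[str] = None
    concurrent_users: Optional[int] = None
    hardware: Optional[str] = None
    framework: Optional[str] = None
    quantization: Optional[str] = None
    excluded_errors: Optional[str] = None

def missing_fields(m: Methodology):
    """List the methodology questions that the report leaves unanswered."""
    return [f.name for f in fields(m) if getattr(m, f.name) is None]

# Invented partial report for illustration only.
report = Methodology(cache_state="warm", concurrent_users=32)
print(missing_fields(report))
```

A report with a non-empty list of missing fields can still be published, but the gaps are then visible to the reader rather than silently assumed.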

If those answers are missing, the score is not useless, but it is incomplete. This honest skepticism leads naturally to the final file: what inference optimization still cannot model neatly, even after good benchmarking.

Where this lives in the wild

vLLM benchmark suites — mixed prompt and output lengths reveal scheduler and KV behavior better than one toy prompt.
NVIDIA TensorRT-LLM performance reports — tokens per second must be read alongside topology and precision details.
GitHub Copilot-like internal load tests — TTFT percentiles matter because developer flow breaks on tails, not just means.
Perplexity-style answer systems — long outputs make inter-token cadence and completion percentiles important together.
Enterprise chatbot rollouts — error rate under burst load matters as much as raw throughput headlines.

Pause and recall

Why is a warmup phase important before steady-state measurement?
What does P99 latency tell you that average latency hides?
In the worked example, what was the output throughput in tokens per second?
Why is a single short prompt benchmark usually misleading for production serving?

Interview Q&A

Q: Why report percentiles instead of only mean latency?

A: Because user pain lives in the tail. A nice average can hide severe slowdowns for the hardest or busiest requests.

Common wrong answer to avoid: "The mean summarizes everyone fairly." Tails are often what users remember.

Q: Why benchmark request mix, not only one canonical prompt?

A: Because serving behavior depends heavily on prompt length, output length, and concurrency mix. One neat prompt can flatter a system unrealistically.

Common wrong answer to avoid: "A representative prompt is enough." Production is a distribution, not one sample.

Q: Why track output tokens per second alongside requests per second?

A: Because requests vary in size. A system serving many tiny requests can look great on request throughput while doing little real generation work.

Common wrong answer to avoid: "RPS captures capacity fully." Token volume matters too.

Q: Why must benchmark methodology travel with the score?

A: Because hardware, precision, cache state, request mix, and concurrency shape the result. Without methodology, the number cannot be interpreted honestly.

Common wrong answer to avoid: "A benchmark headline is self-explanatory." It never is.

Apply now (5 min)

Invent a 50-request benchmark with two prompt lengths and two output lengths. Compute total output tokens and divide by test duration. Then write what P50 and P99 might mean for user experience. Sketch from memory:

the warmup-steady-cooldown plan,
the core metrics list,
and the percentile interpretation.

Bridge. Even honest benchmarks do not remove every uncertainty. Next we end with an honest admission: the open problems, couplings, and tradeoffs that still resist tidy optimization. → 13-honest-admission.md