02. Latency anatomy: separate first-token silence from total completion time
~22 min read. A lead AI engineer does not optimize a single number; they make the tradeoff visible, measured, and reversible.
Builds on 00-eli5.md. The ETA call is the first user-visible sign of progress, while the meter ticks, dispatch board, and boot space influence the full wait.
What previous chapters solved before this pressure appears
Cost anatomy taught us to count the whole workflow instead of one API call. That solved invoice surprise, but it did not explain why two requests with the same token cost can feel completely different to users. A feature can be cheap and still unbearable if it spends too long in queue, prefill, tool setup, or decode. This chapter turns the same request boundary into a latency trace.
The accumulated lesson is already visible in the taxi fleet. Meter ticks expose money, the ETA call exposes perceived wait, the dispatch board exposes route choice, and the fuel ledger keeps those choices honest. This file adds the next constraint without forgetting the earlier ones: every optimization relieves one pressure and creates another that some subsystem must absorb.
What this file solves
A research assistant shows a six-second average, but complaints cluster around “it just hangs.” This file shows how to split latency into queue time, retrieval/tool setup, prefill, time to first token, tokens per second, and final completion so the right team owns the right wait.
The opening failure shows up in a concrete artifact
The failure is not abstract: users abandon a feature even though average model latency improved after a model upgrade. Here is the early artifact a reviewer can inspect.
Trace excerpt for request research-219:
| Stage | p50 | p95 | Owner |
|---|---|---|---|
| gateway queue | 80 ms | 900 ms | serving |
| retrieval | 120 ms | 260 ms | search |
| prefill | 480 ms | 1,700 ms | prompt/runtime |
| first token | 720 ms | 2,950 ms | product experience |
| decode | 5.2 s | 12.6 s | output/runtime |
| post-process | 90 ms | 400 ms | app |
A smart team might try to fix the most visible line in that artifact. That is tempting, and it is incomplete. The root cause is metric collapse: one average latency number mixes waiting rooms that respond to different levers. So how do we name the room where the user is actually stuck? This is the root-cause pivot: not a local metric problem, a boundary-and-pressure problem.
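To make the pivot concrete, the sketch below loads the table above and asks two different questions of it. It assumes the "first token" row is a cumulative milestone rather than a separate stage (the p50 stages before it sum to 680 ms against a 720 ms first token), so it is left out of the stage list. `STAGES_P95_MS`, `PRE_FIRST_TOKEN`, and `largest_stage` are illustrative names, not part of any tracing library.

```python
# p95 per stage in milliseconds, copied from the trace table above.
# Assumption: "first token" is a cumulative milestone, so it is not a stage here.
STAGES_P95_MS = {
    "gateway queue": 900,
    "retrieval": 260,
    "prefill": 1700,
    "decode": 12600,
    "post-process": 400,
}

# Stages the user sits through before any token is visible.
PRE_FIRST_TOKEN = ("gateway queue", "retrieval", "prefill")


def largest_stage(stages, names=None):
    """Return the stage name with the largest value, optionally restricted to `names`."""
    candidates = names if names is not None else stages.keys()
    return max(candidates, key=lambda name: stages[name])


most_visible_line = largest_stage(STAGES_P95_MS)
dead_air_room = largest_stage(STAGES_P95_MS, PRE_FIRST_TOKEN)

print(f"largest p95 stage overall: {most_visible_line}")
print(f"largest p95 stage before first token: {dead_air_room}")
```

The two answers differ, which is the point: the biggest line on the chart is decode, but an "it just hangs" complaint lives in the rooms before the first token.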
A tiny version exposes the whole mechanism
A 300-token answer at 60 tokens/sec takes five seconds to decode. If prefill rises from 500 ms to 1,500 ms, total changes by one second but TTFT triples. The user experiences that as dead air, not merely as a slightly longer answer.
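The arithmetic can be checked in a few lines. This sketch assumes queue and retrieval time are zero, so TTFT equals prefill time, as the example implicitly does; the function name `latency_ms` is illustrative.

```python
def latency_ms(prefill_ms, output_tokens, tokens_per_sec):
    """Return (ttft_ms, total_ms), assuming TTFT is prefill time only."""
    decode_ms = output_tokens / tokens_per_sec * 1000
    return prefill_ms, prefill_ms + decode_ms


ttft_before, total_before = latency_ms(500, 300, 60)
ttft_after, total_after = latency_ms(1500, 300, 60)

ttft_ratio = ttft_after / ttft_before
total_growth = (total_after - total_before) / total_before

print(f"TTFT:  {ttft_before} ms -> {ttft_after} ms ({ttft_ratio:.0f}x)")
print(f"total: {total_before:.0f} ms -> {total_after:.0f} ms (+{total_growth:.0%})")
```

Total grows by about 18% while TTFT grows threefold, which is why the same change reads as minor on a total-latency chart and as dead air to the user.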
Rule: TTFT, tokens per second, queue time, and total latency are different product experiences and must be optimized separately.
Why this rule exists. Autoregressive generation has phases: waiting, reading the prompt, producing the first token, then producing the rest. The constraint is that each phase has a different bottleneck. A single average hides the root cause and leads teams to buy the wrong speedup. The fuel ledger matters because it shows whether the new pressure landed in cost, latency, memory, quality, or operator attention.
1) Split the request into rooms before prescribing a fix
Start with the workflow, not the vendor feature. In Maya's review, the team takes one request and follows it from API ingress to model call, tool call, runtime behavior, and resource consequence. That cross-layer trace is the shortest path from symptom to lever. If the symptom is cost, the trace follows meter ticks. If the symptom is silence, it follows the ETA call. If the symptom is serving pressure, it follows the carpool lane and boot space.
user request
     │
     ▼
API gateway ── route/version ──► model/runtime
     │                                │
     │                                ├─ tokens / queue / KV / output
     │                                ▼
     └──────── outcome ◄──────── fuel ledger row
The counterintuitive part is that the most obvious metric can improve while the product gets worse. A smaller bill can hide more failed outcomes. Higher tokens/sec can hide longer queueing. A shorter prompt can hide missing evidence. The mechanism in this chapter is useful only when the trace keeps the relieved pressure and the newly created pressure in the same picture.
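One way to keep the relieved pressure and the new pressure in the same picture is to record each request as a single row with a timestamp at every stage boundary. The sketch below is a hypothetical record shape, not a schema from any tracing product; the field names, the request ID, and the sample values are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    """Timestamps in seconds since request arrival, one per stage boundary."""
    request_id: str
    route: str
    dequeued_at: float      # left the gateway queue
    setup_done_at: float    # retrieval / tool setup finished
    first_token_at: float   # first streamed token reached the client
    last_token_at: float    # decode finished
    done_at: float          # post-processing finished
    output_tokens: int

    @property
    def queue_s(self):
        return self.dequeued_at

    @property
    def ttft_s(self):
        return self.first_token_at

    @property
    def prefill_s(self):
        # Assumption: prefill is everything between setup and the first token.
        return self.first_token_at - self.setup_done_at

    @property
    def decode_tokens_per_s(self):
        return self.output_tokens / (self.last_token_at - self.first_token_at)

    @property
    def total_s(self):
        return self.done_at


# Illustrative values only.
trace = RequestTrace("req-example", "route-a", 0.08, 0.20, 0.72, 5.72, 5.81, 300)
print(trace.queue_s, trace.ttft_s, trace.decode_tokens_per_s, trace.total_s)
```

Because TTFT, decode rate, and total are derived from the same row, a change that raises tokens per second while lengthening the queue shows both movements side by side.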
2) The wait map users cannot see¶
Picture the chapter as a pressure transfer, not a free lunch.
Before optimization                 After optimization
┌──────────────────────┐            ┌──────────────────────┐
│ visible pain         │            │ relieved pain        │
│ cost / wait / memory │──change──► │ lower local metric   │
└──────────┬───────────┘            └──────────┬───────────┘
           │                                   │
           ▼                                   ▼
hidden cause not named              new pressure appears
retries, route mix, context,        quality, queueing, cache,
output, provider limits             memory, fallback, ops
The diagram is the reason this module keeps returning to the fuel ledger. The ledger is where the second box becomes visible instead of surfacing as an invoice surprise, p99 incident, or quality complaint weeks later.
3) Maya explains why the research assistant feels frozen¶
Maya threads one workload through the design review: a production assistant with real traffic, route versions, prompt versions, and outcome labels.
Attempt A: optimize the visible line
The first attempt changes the local knob that seems responsible for the opening failure: users abandoning a feature even though average model latency improved after a model upgrade. The local dashboard improves. The team celebrates too early because the request boundary is still broken: retries, quality loss, queueing, cache misses, or memory pressure move elsewhere.
Attempt B: optimize with the pressure chain
The second attempt keeps the artifact, the rule, and the guardrail together. Maya writes the expected improvement, the pressure that may worsen, the owner of that pressure, and the rollback trigger. The dispatch board may change a route, the memorized route may change a prompt prefix, or the carpool lane may change scheduling, but the same request ID proves whether the user outcome survived.
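Attempt B can be written down as a small record that travels with the change. The sketch below is one hypothetical shape for it: the field names, the example values, and the 2,000 ms threshold are assumptions for illustration, not figures from Maya's review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeProposal:
    change: str
    expected_improvement: str
    pressure_that_may_worsen: str
    pressure_owner: str
    guardrail_metric: str
    rollback_above: float  # roll back if the guardrail metric exceeds this


def should_roll_back(proposal, observed):
    """Return True when the guardrail metric breaches its rollback trigger.

    `observed` maps metric names to their latest canary values. A missing
    guardrail metric also triggers rollback: an unobserved pressure is not safe.
    """
    value = observed.get(proposal.guardrail_metric)
    return value is None or value > proposal.rollback_above


proposal = ChangeProposal(
    change="widen batch window on the research route",
    expected_improvement="higher decode tokens/sec",
    pressure_that_may_worsen="queue time before prefill",
    pressure_owner="serving",
    guardrail_metric="p95_ttft_ms",
    rollback_above=2000.0,
)

print(should_roll_back(proposal, {"p95_ttft_ms": 1850.0}))
print(should_roll_back(proposal, {"p95_ttft_ms": 2950.0}))
print(should_roll_back(proposal, {"p95_total_ms": 9000.0}))
```

Treating a missing guardrail metric as a rollback condition is a deliberate choice in this sketch: if the dashboard cannot show where the pressure went, the change is not yet reviewable.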
4) Why one mean latency chart loses to stage percentiles¶
The plausible alternative is attractive because it is simpler to explain in a status update: change one knob, quote one percentage, and move on. That works for demos. It fails for lead-level ownership because it cannot answer which workload benefits and which workload pays.
Use this chapter's mechanism when the workload has the shape named in the opening artifact. Use the alternative when the product is small enough, stable enough, or low-risk enough that the extra machinery would cost more than it saves. The decision is not about elegance; it is about whether the extra signal is worth the operator effort it costs.
5) Prompt-heavy and answer-heavy regimes need opposite levers¶
Concrete numbers make the tradeoff review honest. The sample timings, token counts, and decode rates below are illustrative; replace them with the provider, hardware, and workload numbers in your own stack.
| Scenario | Queue (ms) | Prefill (ms) | Output (tokens) | Decode rate | Total / TTFT |
|---|---|---|---|---|---|
| Short prompt, concise answer | 120 | 480 | 300 | 60 tok/s | 5.6 s total / 0.6 s TTFT |
| Long prompt, same answer | 120 | 1,400 | 300 | 60 tok/s | 6.5 s total / 1.52 s TTFT |
| Same prompt, long answer | 120 | 480 | 900 | 45 tok/s | 20.6 s total / 0.6 s TTFT |
| Traffic spike | 900 | 480 | 300 | 60 tok/s | 6.4 s total / 1.38 s TTFT |
| Tool-heavy agent | 120 | 480 | 300 | 60 tok/s | 8.2 s total / 0.6 s model TTFT |
The table teaches the design habit: every row says what improved and what might have worsened. If a row cannot name both, the proposal is not ready for production review.
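The totals in the table can be reproduced if the first two numeric columns are read as milliseconds spent before the first token (queue, then prefill) and decode time is output tokens divided by the decode rate. That reading is an assumption, as is the 2,600 ms of tool time, which is not in the table and is inferred here from the tool-heavy row's 8.2 s total.

```python
# (scenario, queue_ms, prefill_ms, output_tokens, tokens_per_sec, tool_ms)
# Assumption: the first two numeric columns are pre-first-token waits in ms.
# tool_ms is not in the table; 2600 is inferred from the tool-heavy row's total.
ROWS = [
    ("Short prompt, concise answer", 120, 480, 300, 60, 0),
    ("Long prompt, same answer", 120, 1400, 300, 60, 0),
    ("Same prompt, long answer", 120, 480, 900, 45, 0),
    ("Traffic spike", 900, 480, 300, 60, 0),
    ("Tool-heavy agent", 120, 480, 300, 60, 2600),
]


def row_latency(queue_ms, prefill_ms, output_tokens, tokens_per_sec, tool_ms=0):
    """Return (ttft_s, total_s) for one table row."""
    ttft_s = (queue_ms + prefill_ms) / 1000
    decode_s = output_tokens / tokens_per_sec
    return ttft_s, ttft_s + decode_s + tool_ms / 1000


results = {name: row_latency(*params) for name, *params in ROWS}

for name, (ttft_s, total_s) in results.items():
    print(f"{name:30s} TTFT {ttft_s:.2f} s  total {total_s:.2f} s")
```

Under this reading the traffic-spike row comes out at 1.38 s TTFT and 6.38 s total, consistent with its 6.4 s total, and only the long-answer row moves total latency without moving TTFT at all.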
6) The faster model that made dead air worse¶
Walk the failure from top to bottom. The user action enters the API. The application builds a prompt or route. The runtime spends tokens, queue time, cache memory, or output steps. The dashboard records a local improvement. Then the user-visible metric moves the wrong way.
That failure is not bad luck. It is what happens when the optimization changes one layer and the observation stops one layer too early. In a review, Maya asks for the missing link: where did the pressure go after the local metric improved? If nobody can answer, the change ships behind a small canary or does not ship.
7) Signals that reveal whether latency anatomy is healthy¶
- Healthy behavior: p95 TTFT and p95 total latency are inside the product budget for each endpoint.
- First degrading metric: TTFT rises with prompt tokens before total latency looks alarming.
- Misleading beginner metric: mean latency, because it hides tails and phase shifts.
- Expert graph: stage-level percentile waterfall by route, prompt length, output length, and queue depth.
Mini-FAQ. "Why not watch the simplest metric?" Because the simplest metric is often the one the optimization directly manipulates. You need a paired guardrail that shows whether the system merely moved pain into another layer.
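The beginner metric and the expert graph can be computed from the same samples. The sketch below uses a nearest-rank percentile on made-up TTFT samples (illustrative numbers, not measurements) to show a mean that looks tolerable while p95 does not.

```python
import math
from statistics import mean


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("percentile of empty sample set")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative TTFT samples in ms: 18 fast requests and 2 that hit a long queue.
ttft_ms = [600] * 18 + [2900, 3100]

print(f"mean: {mean(ttft_ms):.0f} ms")
print(f"p50:  {percentile(ttft_ms, 50)} ms")
print(f"p95:  {percentile(ttft_ms, 95)} ms")
```

Here the mean sits at 840 ms while p95 is 2,900 ms. Running the same function per stage and per route is one way to build the waterfall named in the list above.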
8) Boundaries where the chapter's lever works and where it turns pathological¶
- Strong fit: interactive features where silence, progress, and completion mean different things.
- Pathology: offline jobs where only batch completion time matters.
- Scale or workload limit: when client rendering, human review, or external tools dominate the visible path.
This boundary is not a disclaimer. It is a routing rule for engineering attention. The best optimization in one endpoint can be the wrong default for another endpoint with different latency tolerance, risk, context length, or outcome value.
9) Wrong mental model to replace¶
The tempting model is “latency is one blob.” The better model is room-by-room: queue, setup, prefill, first token, decode, post-process. A fix only works if it attacks the room that dominates the complaint.
The replacement model should change how you speak in design review. Do not say, "this reduces cost" or "this improves latency" without naming the request slice, expected magnitude, guardrail, and rollback trigger. Say which meter ticks, ETA call, carpool lane, or boot space pressure changed.
10) Other failure shapes you will recognize¶
- P99 failure. P99 queueing hidden by a healthy p50.
- Long-output failure. Long outputs blamed on prefill.
- Retrieval failure. Retrieval latency excluded from feature latency.
- Client-side failure. Client-side rendering stalls mistaken for model slowness.
- Streaming failure. Streaming enabled but first useful content delayed.
- Batch failure. Batch windows widened globally across interactive endpoints.
- Tool failure. Tool retries counted as model latency instead of workflow latency.
11) Cross-topic reinforcement — the same pressure shape returns¶
- Cost anatomy uses the same workflow boundary; latency anatomy changes the unit from dollars to time.
- Prompt caching attacks prefill and fresh-input cost together.
- Streaming helps perceived latency even when total decode time remains.
- Batching improves throughput by intentionally adding queue time, so this trace decides where it is safe.
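The batching tradeoff in the last bullet can be estimated before it is measured. The sketch below assumes a fixed-interval batch window that flushes every `window_ms`, with each request waiting for the next flush. Real schedulers (continuous batching, for example) behave differently, so treat this as a toy model with hypothetical function names.

```python
def added_queue_ms(arrival_ms, window_ms):
    """Wait until the next batch flush for a request arriving at arrival_ms.

    Toy model: batches flush at multiples of window_ms; a request that arrives
    exactly on a flush boundary leaves immediately.
    """
    remainder = arrival_ms % window_ms
    return 0 if remainder == 0 else window_ms - remainder


def mean_added_queue_ms(arrivals_ms, window_ms):
    """Average extra queue time across all arrivals for a given window."""
    waits = [added_queue_ms(t, window_ms) for t in arrivals_ms]
    return sum(waits) / len(waits)


# Evenly spread arrivals, one every 10 ms over one second.
arrivals = list(range(0, 1000, 10))

for window in (50, 200):
    print(f"window {window} ms -> mean added queue "
          f"{mean_added_queue_ms(arrivals, window):.0f} ms")
```

In this toy model, widening the window from 50 ms to 200 ms raises the mean added queue time from 20 ms to 95 ms. That time lands before prefill, so it comes straight out of the TTFT budget.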
12) Design-review questions that catch shallow plans¶
- Can you name which stage dominates p95 complaints?
- Can you separate first raw token from first useful value?
- Can you slice latency by prompt length, output length, route, and endpoint?
- Can you state which user-facing budget batching or routing is allowed to spend?
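The slicing question can be answered with a simple group-by. The sketch below buckets TTFT samples by prompt length; the bucket boundaries and the samples are illustrative assumptions, and a real version would slice by route, output length, and endpoint in the same way.

```python
from collections import defaultdict

# (prompt_tokens, ttft_ms): illustrative samples, not measurements.
SAMPLES = [
    (400, 520), (450, 560), (500, 610),
    (2000, 1400), (2200, 1550), (2500, 1800),
    (8000, 4200), (9000, 4900),
]

# Upper bounds (exclusive) and labels; anything larger falls in the last bucket.
BUCKETS = [(1000, "short (<1k)"), (4000, "medium (1k-4k)")]


def bucket_for(prompt_tokens):
    """Return the label of the prompt-length bucket for one request."""
    for upper, label in BUCKETS:
        if prompt_tokens < upper:
            return label
    return "long (>=4k)"


def worst_ttft_by_bucket(samples):
    """Group TTFT samples by prompt-length bucket and return the max per bucket."""
    grouped = defaultdict(list)
    for prompt_tokens, ttft_ms in samples:
        grouped[bucket_for(prompt_tokens)].append(ttft_ms)
    return {label: max(values) for label, values in grouped.items()}


for label, worst in worst_ttft_by_bucket(SAMPLES).items():
    print(f"{label:15s} worst TTFT {worst} ms")
```

If the worst TTFT climbs with the bucket while the endpoint-wide number looks fine, prefill is the room to investigate.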
Where this shows up in production
- Enterprise support bot — turns route, token, cache, retry, and outcome rows into cost per resolved ticket rather than model spend per message.
- Coding assistant — separates inline completions from agentic edits because typing flow, repo context, and repair loops have different budgets.
- Search answer product — pays for rewriting, retrieval, reranking, synthesis, citations, and judge calls as one user-visible answer.
- Voice assistant — treats dead air, cancellation, and local fallback as product features because users notice 100 ms gaps.
- Back-office summarizer — uses larger queues and batches because humans care about daily throughput more than first-token immediacy.
- Commerce assistant — protects purchase-changing actions with stronger routes while letting read-only advice run cheaper.
- Internal data copilot — attributes spend by tenant, dataset, prompt version, and tool path so one team cannot hide another team's budget.
- Education tutor — spends tokens on safety and pedagogy rules, then watches whether shorter answers still teach well.
- Legal review workflow — keeps evidence and citation context even when compression pressure is high because unsupported claims are worse than cost.
- Healthcare intake helper — uses conservative routing and buffered streaming because safety checks are part of the latency path.
- Marketing content tool — controls output length and variant count because creative generation can silently explode spend.
- Incident-response copilot — prefers predictable latency and logs over clever savings during high-severity operations.
Recall: rebuild latency anatomy from memory
- What concrete failure opened this chapter, and which artifact made it inspectable?
- What root cause made the naive fix insufficient?
- State the rule in one sentence without using vendor language.
- Which pressure does the mechanism relieve, and which new pressure can it create?
- Which operational signal degrades first when the mechanism is misapplied?
- Where is the boundary where this lever becomes pathological?
- How does this chapter reuse the fuel ledger or dispatch board from earlier chapters?
- What would you put in the rollback trigger for this optimization?
Interview Q&A
Q: Why optimize TTFT and total latency separately?
A: TTFT captures silent time before visible progress; total latency captures task completion. A change can improve one while leaving the other unchanged, so separate metrics point to separate levers.
Common wrong answer to avoid: Average latency already includes both.
Q: When does output length dominate latency?
A: When decode time, roughly output tokens divided by tokens per second, is larger than queue plus prefill. Verbose answers can be slow even with excellent TTFT.
Common wrong answer to avoid: A fast first token means the feature is fast.
Q: Why include tool and retrieval time in the same trace?
A: Users experience the whole workflow. If retrieval or tools dominate, model-only optimization will not move product latency enough.
Common wrong answer to avoid: LLM latency is the same as feature latency.
Q: Why do percentiles matter more than one mean?
A: Tails reveal overload, long prompts, slow tools, and route failures that averages smooth away. P95 and P99 often match user complaints.
Common wrong answer to avoid: If the mean improved, users must be happier.
Q: How does prompt length affect TTFT?
A: Longer prompts increase prefill work because the runtime must read input and build attention state before decoding. Caching or compression can reduce that phase.
Common wrong answer to avoid: Prompt length only affects cost.
Q: Why can batching worsen TTFT while improving throughput?
A: A request may wait for a batch window before prefill begins. The GPU becomes more efficient, but the user sees extra queueing.
Common wrong answer to avoid: Batching only helps latency.
Q: What is first useful token?
A: It is the first chunk that reduces user uncertainty: answer, decision, citation, or honest progress event. Raw whitespace or boilerplate does not count.
Common wrong answer to avoid: The first byte from the server is always useful.
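The gap between first raw token and first useful token in the last answer can be measured from a stream log. The sketch below assumes chunks arrive as `(timestamp_ms, text)` pairs and uses a deliberately crude usefulness heuristic: the boilerplate prefixes are hypothetical, and a real product would define "useful" per feature.

```python
# Assumption: chunks that are only whitespace or start with these openers are not useful.
BOILERPLATE_PREFIXES = ("sure", "certainly", "let me")


def is_useful(chunk_text):
    """Heuristic: a chunk is useful if it is not blank and not a stock opener."""
    text = chunk_text.strip().lower()
    return bool(text) and not text.startswith(BOILERPLATE_PREFIXES)


def first_token_times(chunks):
    """Return (first_raw_ms, first_useful_ms) from (timestamp_ms, text) chunks.

    first_useful_ms is None if no chunk passes the usefulness heuristic.
    """
    first_raw = chunks[0][0] if chunks else None
    first_useful = next((t for t, text in chunks if is_useful(text)), None)
    return first_raw, first_useful


# Illustrative stream: whitespace and openers arrive long before real content.
stream = [
    (600, "\n"),
    (640, "Sure, "),
    (700, "let me look into that. "),
    (1900, "Three papers report the effect: "),
]

first_raw_ms, first_useful_ms = first_token_times(stream)
print(first_raw_ms, first_useful_ms)
```

In this stream the raw TTFT is 600 ms but the first useful chunk arrives at 1,900 ms, so a dashboard that tracks only the first byte would report a healthy number for an experience that still feels stalled.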
Apply now (10 min)
Step 1 — model the exercise. Use the modeled trace above for one feature. Then estimate queue, retrieval/tool setup, prefill, first useful token, decode, and post-process for your own endpoint. Reproduce the total twice from memory: once after cutting prompt size by 40%, once after cutting output length by 40%, and name which optimization matches the dominant regime.
Step 2 — your turn. Pick a real LLM feature and write the same artifact with your own rough numbers. Name the pressure relieved, the pressure created, the owner, and the metric that would prove the change unsafe.
Step 3 — reproduce from memory. Close the file and redraw the two diagrams: request trace and pressure transfer. Then restate the rule and the first degrading metric without looking.
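The two what-if totals in Step 1 can be sketched as below. The model assumes prefill time scales linearly with prompt size and decode time scales linearly with output length, which is a simplification that ignores queueing, caching, and batching effects. The baseline reuses the short-prompt row from section 5, and the function names are illustrative.

```python
def total_s(queue_s, prefill_s, output_tokens, tokens_per_sec):
    """Total latency in seconds for one request."""
    return queue_s + prefill_s + output_tokens / tokens_per_sec


def what_if(queue_s, prefill_s, output_tokens, tokens_per_sec, cut=0.40):
    """Return totals for the baseline, a prompt cut, and an output cut.

    Assumption: prefill time is proportional to prompt size and decode time
    is proportional to output length.
    """
    keep = 1 - cut
    return {
        "baseline": total_s(queue_s, prefill_s, output_tokens, tokens_per_sec),
        "prompt_cut": total_s(queue_s, prefill_s * keep, output_tokens, tokens_per_sec),
        "output_cut": total_s(queue_s, prefill_s, output_tokens * keep, tokens_per_sec),
    }


# Baseline: 0.12 s queue, 0.48 s prefill, 300 tokens at 60 tok/s.
totals = what_if(0.12, 0.48, 300, 60)
best = min(("prompt_cut", "output_cut"), key=totals.get)

for name, value in totals.items():
    print(f"{name:11s} {value:.2f} s")
print(f"lever that matches the dominant regime: {best}")
```

For this baseline the output cut saves about two seconds and the prompt cut about 0.2 seconds, which marks the request as answer-heavy. Rerun it with your own endpoint's numbers before trusting either lever.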
What you should remember
This chapter explained why the opening failure is not solved by changing one local knob. The useful move is to make the request boundary inspectable, apply the topic rule, and watch the paired guardrail so the optimization cannot hide its cost in another subsystem.
You learned to describe the lever as pressure movement: what it relieves, what it creates, and which team or resource absorbs the new cost. That is the difference between a trick and an operating practice.
Carry the diagnostic forward: if the dashboard cannot show the artifact, the route or version, the user outcome, and the first degrading signal in one place, the optimization is not yet reviewable.
Remember:
- Latency is a chain of waits, not one number.
- TTFT controls dead air; decode controls answer duration; total latency controls task completion.
- Stage-level p95/p99 traces beat mean model latency.
- The ETA call is only good when it exposes useful progress.
- Every latency fix spends or saves time in a named room.
Bridge. We can now tell whether a request is slow because the model rereads too much prompt. The next chapter uses stable prompt prefixes — the memorized route — to reduce both fresh input cost and prefill time without changing the user task.