06. Latency Budgeting: name every millisecond before it embarrasses you

~15 min read. Fast voice systems are designed with budgets, not with hope.

Built on the ELI5 in 00-eli5.md. The awkward pause — that uncomfortable silence after the user stops — becomes measurable only when the relay race is split into named stages.

First picture: budget the interpreter booth stage by stage

Picture the United Nations interpreter booth again. If the reply feels late, you do not say, "everything was bad." You ask sharper questions. Did the ear hear late? Did the brain think too long? Did the voice wait before speaking? Did the relay race lose time between handoffs?

That is latency budgeting: assigning milliseconds to named stages. Instead of one vague complaint, you get specific culprits. Simple, no?

user stops speaking
        │
        ▼
┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│ turn detect │─▶│   the ear   │─▶│  the brain  │─▶│  the voice  │─▶│  playback   │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       │                │                │                │                │
       └──── named milliseconds in the relay race ──────────────────────────┘

The awkward pause feels emotional to the user. But operationally, it is arithmetic. The ear, the brain, and the voice each consume part of the delay. When you budget them, you stop arguing in fog.
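The arithmetic can be made concrete with a small sketch. It assumes each turn logs one timestamp per handoff from a single monotonic clock; the field names and the sample values are illustrative, not taken from any particular SDK.

```python
from dataclasses import dataclass

# Stage names follow the diagram above.
STAGES = ["turn_detect", "ear", "brain", "voice", "playback"]

@dataclass
class TurnTimestamps:
    """Timestamps in milliseconds, one per handoff (hypothetical field names)."""
    user_stopped: float      # user stops speaking
    turn_detected: float     # end of turn is declared
    transcript_final: float  # the ear hands over final text
    first_token: float       # the brain emits its first token
    first_audio: float       # the voice emits its first audio chunk
    playback_started: float  # the client starts playing audio

def stage_splits(t: TurnTimestamps) -> dict[str, float]:
    """Return the milliseconds consumed by each named stage."""
    marks = [t.user_stopped, t.turn_detected, t.transcript_final,
             t.first_token, t.first_audio, t.playback_started]
    # Each stage runs from one handoff mark to the next.
    return {name: end - start for name, start, end in zip(STAGES, marks, marks[1:])}

turn = TurnTimestamps(0, 300, 450, 800, 1000, 1060)  # made-up turn
splits = stage_splits(turn)
total = sum(splits.values())
print(splits)
print("awkward pause:", total, "ms")
```

With these made-up marks, the brain owns 350 ms of a 1060 ms pause. The splits always sum to the full pause, so no millisecond goes unnamed.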

The starting budget table

Here is a practical starting budget for many voice assistants. These are not holy numbers. They are useful defaults.

Stage	Budget range	Why this range matters
End-of-turn detection	200-400 ms	Short enough to feel responsive, long enough to avoid clipping slow speakers
STT finalization	100-250 ms	The ear should stabilize text quickly after speech ends
LLM TTFT (time to first token)	200-500 ms	The brain must begin early, not after a long silent think
TTS first audio	100-300 ms	The voice should start with little setup delay
Playback start	50-100 ms	The client needs a tiny buffer, not a large cushion
Total p95 target	700-1200 ms	The awkward pause should usually stay under about a second

Look. A budget is not a guess about averages. It is a promise about stages. If your p95 target is 1000 milliseconds, and the brain alone takes 900, you already know the rest of the stack cannot save you. Yes?

This is why budgeting is liberating. It removes magical thinking.
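The budget check is simple enough to script. The sketch below copies the ranges from the table; the dictionary keys and helper names are made up for illustration. Adding stage numbers is a simplification, since per-stage percentiles do not strictly add up to the end-to-end percentile, but it is good enough to catch impossible plans.

```python
# (low_ms, high_ms) per stage, copied from the starting budget table.
BUDGET_MS = {
    "end_of_turn": (200, 400),
    "stt_finalization": (100, 250),
    "llm_ttft": (200, 500),
    "tts_first_audio": (100, 300),
    "playback_start": (50, 100),
}

def budget_bounds(budget):
    """Sum the low ends and the high ends of every stage range."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high

def remaining_ms(target_ms, spent):
    """Milliseconds left for the stages not listed in `spent`."""
    return target_ms - sum(spent.values())

low, high = budget_bounds(BUDGET_MS)
print(low, high)  # 650 1550

# The example from the text: a 1000 ms target with a 900 ms brain.
left = remaining_ms(1000, {"llm_ttft": 900})
min_others = sum(lo for name, (lo, _) in BUDGET_MS.items() if name != "llm_ttft")
print(left, min_others)  # 100 450
```

Two things fall out. The high ends sum to 1550 ms, above the 1200 ms ceiling, so not every stage may sit at the top of its range at once. And a 900 ms brain leaves 100 ms for stages that need at least 450 ms even at their low ends.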

Why percentiles beat averages

Many teams say, "average latency is fine." That sentence hides pain. Users do not experience averages. They experience actual turns, and the bad tail turns define trust. Track p50, p95, and p99. Do not lead with averages.

p50 tells you the typical experience: half of all turns are faster. p95 tells you the slow-but-common pain: one turn in twenty is slower. p99 tells you whether outliers are absurd. The awkward pause usually lives in the tail.

fast turns   fast turns   fast turns   slow turns   terrible turns
    │             │            │            │             │
    ├──── p50 ────┤                         ├── p95 ─────┤
                                                     ├─ p99 ─┤

Suppose ten calls feel great, and one call freezes badly. The average may still look decent. But the unhappy user remembers the freeze, not your spreadsheet mean. See. The relay race should be judged by reliable handoffs, not by lucky easy turns.
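Here is that scenario in numbers, as a minimal sketch. The latencies are invented, and the percentile helper uses the nearest-rank definition, which is one common convention among several.

```python
import math
from statistics import mean

def percentile(values, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Ten crisp turns and one freeze (made-up values, in ms).
latencies_ms = [500] * 10 + [3000]

summary = {
    "mean": round(mean(latencies_ms)),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)  # {'mean': 727, 'p50': 500, 'p95': 3000, 'p99': 3000}
```

The mean of 727 ms looks decent. The tail shows the 3000 ms freeze. With only eleven samples, p95 and p99 both collapse onto the worst turn, so real dashboards need far more turns before the tail numbers mean much.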

A small example helps.

Metric	Example value	What it tells you
Average	620 ms	Looks comforting, but hides ugly slow turns
p50	540 ms	Median user experience feels crisp
p95	1080 ms	Slow turns are noticeable but maybe acceptable
p99	1820 ms	Rare failures are still too painful

If p50 is good, but p95 is bad, your system is inconsistent. If p50 and p95 are both bad, then the whole design needs help. If p99 explodes while p95 looks fine, search for retry storms, network jitter, or overloaded dependencies.
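The three rules above can be sketched as a tiny triage function. The targets passed in are assumptions chosen for illustration; the source only gives the example measurements.

```python
def diagnose(p50, p95, p99, p50_target, p95_target, p99_target):
    """Map a percentile pattern to the reading suggested in the text."""
    p50_ok = p50 <= p50_target
    p95_ok = p95 <= p95_target
    p99_ok = p99 <= p99_target
    if not p50_ok and not p95_ok:
        return "whole design needs help"
    if not p50_ok:
        return "typical turns miss target"
    if not p95_ok:
        return "inconsistent: investigate the tail"
    if not p99_ok:
        return "outliers: check retry storms, network jitter, overloaded dependencies"
    return "within targets"

# Example values from the table, against hypothetical targets of 600/1200/1500 ms.
verdict = diagnose(540, 1080, 1820, p50_target=600, p95_target=1200, p99_target=1500)
print(verdict)
```

With those assumed targets, the table's numbers land in the outlier case: p50 and p95 pass, p99 does not.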

Optimization levers for each stage

Now let us make the budget actionable. Each stage has different levers. Do not attack every delay with the same hammer.

Stage	Common culprit	Useful levers
End-of-turn	Silence threshold too cautious	Tune voice activity detection (VAD) threshold, adapt by speaker speed, reduce forced wait
The ear	Speech recognition (ASR) decoder delay, large chunks	Send smaller chunks, use streaming ASR, reduce resampling overhead
The brain	Long prompt, slow model, tool waits	Shrink prompt, use smaller model, trim tool calls, prefetch context
The voice	TTS setup, large sentence batching	Start on smaller segments, use low-latency voice mode, reduce prosody warmup
Playback	Big client jitter buffer	Lower buffer floor, pre-open audio path, simplify browser pipeline

Simple, no? When the ear is slow, switching TTS vendors solves nothing. When the brain is slow, chasing playback tricks solves little. When the voice is slow, prompt pruning cannot rescue first audio. Senior teams isolate the guilty stage first. Then they spend effort where it matters. The relay race becomes manageable when every runner owns a split time.
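Isolating the guilty stage can be sketched as a lookup. The budgets below are the high ends of the starting table and the levers are copied from the lever table; the measured p95 values and the function name are invented for the example.

```python
# High end of each stage budget, in ms, from the starting budget table.
BUDGET_HIGH_MS = {"end_of_turn": 400, "ear": 250, "brain": 500, "voice": 300, "playback": 100}

# Levers copied from the table above.
LEVERS = {
    "end_of_turn": ["tune VAD threshold", "adapt by speaker speed", "reduce forced wait"],
    "ear": ["send smaller chunks", "use streaming ASR", "reduce resampling overhead"],
    "brain": ["shrink prompt", "use smaller model", "trim tool calls", "prefetch context"],
    "voice": ["start on smaller segments", "use low-latency voice mode", "reduce prosody warmup"],
    "playback": ["lower buffer floor", "pre-open audio path", "simplify browser pipeline"],
}

def guilty_stage(p95_ms):
    """Return (stage, overrun_ms) for the largest budget overrun, or (None, 0)."""
    overruns = {stage: p95_ms[stage] - BUDGET_HIGH_MS[stage] for stage in BUDGET_HIGH_MS}
    stage = max(overruns, key=overruns.get)
    return (stage, overruns[stage]) if overruns[stage] > 0 else (None, 0)

# Made-up per-stage p95 measurements, in ms.
measured = {"end_of_turn": 350, "ear": 220, "brain": 900, "voice": 280, "playback": 80}
stage, over = guilty_stage(measured)
print(stage, over, LEVERS.get(stage, []))
```

Here the brain overruns by 400 ms, so the only levers worth discussing are the brain's. Ranking by absolute overrun is one choice; ranking by overrun relative to budget would be another.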

Shrink prompts, switch vendors, or add caching?

This is the practical decision section. Many interviews ask this badly. They ask, "How do you make it faster?" A better answer names the stage, then chooses the lever. Use this decision logic.

When to shrink prompts

Shrink prompts when the brain dominates latency, TTFT grows with context size, and the answer quality does not justify the bloat. If the model receives six pages to answer one short question, you are probably paying for noise. Trim instructions. Compress memory. Retrieve fewer chunks. Move long policy text into structured rules when possible. The awkward pause often starts with overfeeding the brain.
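One way to check whether TTFT really grows with context size is to fit a line through logged turns. This is a rough sketch using an ordinary least-squares slope; the sample points are invented, and real TTFT is noisier and not guaranteed to be linear in prompt size.

```python
def slope_ms_per_1k_tokens(samples):
    """Least-squares slope of TTFT (ms) against prompt size, per 1000 tokens.

    `samples` is a list of (prompt_tokens, ttft_ms) pairs.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return 1000 * sxy / sxx

# Made-up measurements: (prompt_tokens, ttft_ms).
samples = [(1000, 250), (2000, 330), (4000, 490), (8000, 810)]
slope = slope_ms_per_1k_tokens(samples)
print(f"each extra 1000 prompt tokens costs about {slope:.0f} ms of TTFT")
```

In this invented data, every 1000 tokens of prompt costs about 80 ms. A slope like that makes the trade explicit: trimming 3000 tokens would buy back roughly a quarter of a second, if the quality survives.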

When to switch vendors or models

Switch vendors or models when a stage is already optimized locally, but the dependency itself misses the budget consistently. If the ear misses p95 even with healthy chunks and network, try another ASR stack. If the voice misses first-audio targets across regions, compare TTS providers. If the brain has irreducible TTFT on your current model, try a faster class of model first. Do not switch blindly. Measure before and after.
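"Measure before and after" can be as small as the sketch below. It assumes you have collected the same stage's latencies under comparable traffic on both sides of the switch; the numbers, the 300 ms budget, and the helper names are illustrative.

```python
import math

def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[math.ceil(95 * len(ordered) / 100) - 1]

def compare(before_ms, after_ms, budget_ms):
    """Summarize whether a switch helped the tail and whether it meets the budget."""
    before, after = p95(before_ms), p95(after_ms)
    return {
        "before_p95": before,
        "after_p95": after,
        "improved": after < before,
        "meets_budget": after <= budget_ms,
    }

# Made-up TTS first-audio latencies (ms) on the old and the new provider.
before = [180, 200, 220, 240, 260, 280, 300, 320, 340, 400]
after = [150, 160, 170, 180, 190, 200, 210, 220, 230, 290]
result = compare(before, after, budget_ms=300)
print(result)
```

Ten samples per side is only enough to show the shape of the check. A real comparison would use many more turns and the same regions, hours, and traffic mix on both sides.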

When to add caching

Add caching when repeat work dominates, not when every turn is unique. Cache prompt scaffolds. Cache tool results that change slowly. Cache TTS for repeated fixed phrases. Cache routing decisions for common intents. But do not pretend caching solves live reasoning. The user's fresh utterance still needs the ear, the brain, and the voice in the relay race. Look. Caching is a stage-specific accelerant, not a universal prayer.
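As one instance, caching TTS for fixed phrases might look like this sketch. The `synthesize` function is a stand-in that returns fake bytes, not a real TTS API, and the phrase list is hypothetical.

```python
from functools import lru_cache

synth_calls = 0  # counts how often the stand-in TTS engine actually runs

def synthesize(text: str) -> bytes:
    """Stand-in for a real TTS call (hypothetical); returns fake audio bytes."""
    global synth_calls
    synth_calls += 1
    return text.encode("utf-8")

@lru_cache(maxsize=256)
def cached_fixed_phrase(text: str) -> bytes:
    """Synthesize once, then serve repeats from memory."""
    return synthesize(text)

# Phrases the assistant repeats verbatim (illustrative list).
FIXED_PHRASES = {"One moment, please.", "Could you repeat that?"}

def speak(text: str) -> bytes:
    # Only repeated fixed phrases use the cache; fresh replies always synthesize.
    return cached_fixed_phrase(text) if text in FIXED_PHRASES else synthesize(text)

speak("One moment, please.")
speak("One moment, please.")          # served from cache
speak("Your balance is 42 dollars.")
speak("Your balance is 42 dollars.")  # fresh reply path, synthesized again
print(synth_calls)  # 3
```

Four utterances cost three synthesis calls. The fixed phrase got faster; the fresh reply did not, which is the point of the paragraph above.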

A simple debugging script for slow turns

When someone says, "the system is slow," respond with a checklist.

1. Confirm the complaint is about p95 or p99, not one dramatic anecdote.
2. Split the turn into end-of-turn, ear, brain, voice, and playback timings.
3. Find the stage that most often breaks budget.
4. Fix that stage first.
5. Re-measure the full relay race.

This sounds obvious. But many teams skip step two. Then they optimize randomly. That wastes weeks. The awkward pause survives because no one gave it a home address.
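The middle of the checklist, splitting turns and finding the stage that most often breaks budget, can be sketched like this. The budgets are the high ends of the starting table; the logged turns are invented.

```python
from collections import Counter

# High end of each stage budget, in ms, from the starting budget table.
BUDGET_HIGH_MS = {"end_of_turn": 400, "ear": 250, "brain": 500, "voice": 300, "playback": 100}

def most_frequent_offender(turns):
    """Return (stage, break_count) for the stage that breaks budget most often, or None."""
    breaks = Counter()
    for splits in turns:
        for stage, ms in splits.items():
            if ms > BUDGET_HIGH_MS[stage]:
                breaks[stage] += 1
    return breaks.most_common(1)[0] if breaks else None

# Made-up slow turns, already split into stage timings (ms).
turns = [
    {"end_of_turn": 300, "ear": 150, "brain": 700, "voice": 200, "playback": 60},
    {"end_of_turn": 450, "ear": 150, "brain": 650, "voice": 200, "playback": 60},
    {"end_of_turn": 300, "ear": 150, "brain": 400, "voice": 350, "playback": 60},
]
offender = most_frequent_offender(turns)
print(offender)  # ('brain', 2)
```

Three stages misbehave somewhere in this sample, but only the brain does it twice. That is the home address: fix it first, then re-measure.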

Where this lives in the wild

OpenAI Realtime API voice app — performance engineer: watches TTFT and first-audio percentiles to keep conversations feeling immediate.
Alexa-style household assistant — speech systems engineer: budgets the ear, brain, and voice separately across noisy home environments.
Banking call-center bot — platform SRE: tracks p95 tails because one slow identity step breaks trust fast.
In-car assistant stack — embedded AI engineer: budgets every stage tightly because road conversations punish hesitation.
Hospital scheduling voice agent — reliability engineer: uses stage splits to decide whether ASR, LLM, or TTS is causing the awkward pause.

Pause and recall

What does latency budgeting add beyond saying, "the system is slow"?
Why should voice teams track p50, p95, and p99 instead of averages?
When should you shrink prompts instead of switching vendors?
Which stage budgets usually define the awkward pause most strongly?

Interview Q&A

Q: What is latency budgeting in a voice assistant?
A: It is the practice of assigning milliseconds to named stages so slow experience can be traced to a specific culprit instead of blamed on the whole system.
Common wrong answer to avoid: "It means trying to make the average latency low."

Q: Why are p95 and p99 more informative than averages for voice UX?
A: Because users remember slow tail turns and interruptions in rhythm, while averages can look healthy even when many important calls feel bad.
Common wrong answer to avoid: "Average is enough if the sample size is large."

Q: When is prompt shrinking the right optimization?
A: When the brain is the main bottleneck and extra prompt context is inflating TTFT without delivering proportional quality.
Common wrong answer to avoid: "Always shrink prompts first because it is the easiest optimization."

Q: When should you consider switching vendors?
A: When a specific stage keeps missing budget after local fixes, making the dependency itself the likely source of the tail.
Common wrong answer to avoid: "Whenever the full pipeline feels slow, change every provider together."

Apply now (5 min)

Exercise. Pick one voice product you use or admire. Write a rough budget for end-of-turn, the ear, the brain, the voice, and playback. Then say which stage would scare you most at p95.

Sketch from memory. Redraw the budget staircase. Write p50, p95, and p99 beside it. Circle where the awkward pause becomes visible to a user.

Bridge. A good budget sets the target. Now barge-in breaks that plan and forces the system to recover gracefully. → 07-interruption-barge-in.md