06. Latency Budgeting: name every millisecond before it embarrasses you
~15 min read. Fast voice systems are designed with budgets, not with hope.
Built on the ELI5 in 00-eli5.md. The awkward pause — that uncomfortable silence after the user stops — becomes measurable only when the relay race is split into named stages.
First picture: budget the interpreter booth stage by stage
Picture the United Nations interpreter booth again. If the reply feels late, you do not say, "everything was bad." You ask sharper questions. Did the ear hear late? Did the brain think too long? Did the voice wait before speaking?
Did the relay race lose time between handoffs? That is latency budgeting. Latency budgeting means assigning milliseconds to named stages. Instead of one vague complaint, you get specific culprits. Simple, no?
user stops speaking
│
▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ turn detect │─▶│ the ear │─▶│ the brain │─▶│ the voice │─▶│ playback │
└──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
│ │ │ │ │
└──── named milliseconds in the relay race ──────────────────────────┘
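
One way to make the named stages concrete is to record a timestamp at each handoff and subtract neighbors. The sketch below is a minimal illustration, not a reference implementation: the `TurnTrace` class, its field names, and the sample numbers are all assumptions made for this example.

```python
from dataclasses import dataclass

# Stage names follow the diagram above; the identifiers are illustrative.
STAGES = ["turn_detect", "ear", "brain", "voice", "playback"]


@dataclass
class TurnTrace:
    """Handoff timestamps in milliseconds, all taken from one monotonic clock."""
    speech_end: float       # user stops speaking
    turn_detected: float    # end-of-turn decision made
    stt_final: float        # final transcript available
    llm_first_token: float  # first LLM token received
    tts_first_audio: float  # first synthesized audio chunk ready
    playback_start: float   # first audio actually played to the user

    def stage_ms(self) -> dict:
        """Return the time spent in each named stage."""
        marks = [
            self.speech_end, self.turn_detected, self.stt_final,
            self.llm_first_token, self.tts_first_audio, self.playback_start,
        ]
        return {name: marks[i + 1] - marks[i] for i, name in enumerate(STAGES)}

    def total_ms(self) -> float:
        """The awkward pause: speech end to first audible reply."""
        return self.playback_start - self.speech_end


# Made-up timestamps for one turn.
trace = TurnTrace(0, 300, 450, 800, 1000, 1070)
splits = trace.stage_ms()
print(splits, trace.total_ms())
```

Because each stage is the difference between adjacent handoffs, the splits always sum to the total pause, so no millisecond goes unnamed.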
The starting budget table
Here is a practical starting budget for many voice assistants. These are not holy numbers. They are useful defaults.
| Stage | Budget range | Why this range matters |
|---|---|---|
| End-of-turn detection | 200-400 ms | Short enough to feel responsive, long enough to avoid clipping slow speakers |
| STT finalization | 100-250 ms | The ear should stabilize text quickly after speech ends |
| LLM TTFT | 200-500 ms | The brain must begin early, not after a long silent think |
| TTS first audio | 100-300 ms | The voice should start with little setup delay |
| Playback start | 50-100 ms | The client needs a tiny buffer, not a large cushion |
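| Total p95 target | 700-1200 ms | The awkward pause should usually stay under about a second; the stage ceilings above sum to 1550 ms, so not every stage can run at its ceiling in the same turn |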

Look. A budget is not a guess about averages. It is a promise about stages. If your p95 target is 1000 milliseconds, and the brain alone takes 900, you already know the rest of the stack cannot save you. Yes?
This is why budgeting is liberating. It removes magical thinking.
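
The arithmetic behind that promise can be checked mechanically. The sketch below copies the ranges from the table and adds them up; the dictionary keys and helper names are illustrative assumptions. Treat the sums as a rough guide only, since per-stage percentiles do not add up exactly to the end-to-end percentile.

```python
# Budget ranges in milliseconds, copied from the table: (floor, ceiling).
BUDGET_MS = {
    "end_of_turn": (200, 400),
    "stt_final": (100, 250),
    "llm_ttft": (200, 500),
    "tts_first_audio": (100, 300),
    "playback_start": (50, 100),
}


def budget_bounds(budget):
    """Sum the floors and the ceilings across all stages."""
    floor = sum(lo for lo, _ in budget.values())
    ceiling = sum(hi for _, hi in budget.values())
    return floor, ceiling


def remaining_ms(target_ms, spent_ms):
    """Milliseconds left for every other stage once one stage has spent its share."""
    return target_ms - spent_ms


floor, ceiling = budget_bounds(BUDGET_MS)
# What the non-brain stages need even at their best.
floor_without_brain = floor - BUDGET_MS["llm_ttft"][0]
# The example from the text: 1000 ms target, brain alone takes 900 ms.
left_for_rest = remaining_ms(1000, 900)

print(floor, ceiling, floor_without_brain, left_for_rest)
```

With these numbers the floors sum to 650 ms and the ceilings to 1550 ms, so a 1200 ms target cannot be met if every stage runs at its ceiling. A 900 ms brain against a 1000 ms target leaves 100 ms for stages whose floors alone need 450 ms.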
Why percentiles beat averages
Many teams say, "average latency is fine." That sentence hides pain. Users do not experience averages. They experience actual turns, and the bad tail turns define trust. Track p50, p95, and p99. Do not lead with averages. p50 tells you the typical experience.
p95 tells you the slow-but-common pain. p99 tells you whether outliers are absurd. The awkward pause usually lives in the tail.
fast turns fast turns fast turns slow turns terrible turns
│ │ │ │ │
├──── p50 ────┤ ├── p95 ─────┤
├─ p99 ─┤
A small example helps.
| Metric | Example value | What it tells you |
|---|---|---|
| Average | 620 ms | Looks comforting, but hides ugly slow turns |
| p50 | 540 ms | Median user experience feels crisp |
| p95 | 1080 ms | Slow turns are noticeable but maybe acceptable |
| p99 | 1820 ms | Rare failures are still too painful |

If p50 is good, but p95 is bad, your system is inconsistent. If p50 and p95 are both bad, then the whole design needs help. If p99 explodes while p95 looks fine, search for retry storms, network jitter, or overloaded dependencies.
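
To see how an average can hide the tail, here is a small sketch that computes p50, p95, and p99 with the nearest-rank method. The latency samples are invented for illustration, and nearest-rank is only one of several common percentile definitions; a monitoring system may interpolate differently.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples):
    """Average plus the three percentiles the text recommends tracking."""
    return {
        "avg": sum(samples) / len(samples),
        "p50": percentile(samples, 50),
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
    }


# Invented turn latencies in ms: 90 crisp turns, 8 slow ones, 2 terrible ones.
turns_ms = [500] * 90 + [1100] * 8 + [2500] * 2
stats = summarize(turns_ms)
print(stats)
```

In this made-up sample the average is 588 ms and the median is 500 ms, yet one turn in ten takes more than a second. Only the p95 and p99 values expose that.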
Optimization levers for each stage
Now let us make the budget actionable. Each stage has different levers. Do not attack every delay with the same hammer.
| Stage | Common culprit | Useful levers |
|---|---|---|
| End-of-turn | Silence threshold too cautious | Tune VAD threshold, adapt by speaker speed, reduce forced wait |
| The ear | ASR decoder delay, large chunks | Send smaller chunks, use streaming ASR, reduce resampling overhead |
| The brain | Long prompt, slow model, tool waits | Shrink prompt, use smaller model, trim tool calls, prefetch context |
| The voice | TTS setup, large sentence batching | Start on smaller segments, use low-latency voice mode, reduce prosody warmup |
| Playback | Big client jitter buffer | Lower buffer floor, pre-open audio path, simplify browser pipeline |

Simple, no? When the ear is slow, switching TTS vendors solves nothing. When the brain is slow, chasing playback tricks solves little. When the voice is slow, prompt pruning cannot rescue first audio. Senior teams isolate the guilty stage first.
Then they spend effort where it matters. The relay race becomes manageable when every runner owns a split time.
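
Isolating the guilty stage can be as simple as comparing each stage's measured p95 with its budget ceiling and looking up the matching levers. The sketch below assumes the ceilings from the starting budget table and the levers from the table above; the measured values and the function name are hypothetical.

```python
# Ceilings in ms from the starting budget table, keyed by stage nickname.
BUDGET_CEILING_MS = {
    "end_of_turn": 400, "ear": 250, "brain": 500, "voice": 300, "playback": 100,
}

# Levers copied from the table above.
LEVERS = {
    "end_of_turn": ["tune VAD threshold", "adapt by speaker speed", "reduce forced wait"],
    "ear": ["send smaller chunks", "use streaming ASR", "reduce resampling overhead"],
    "brain": ["shrink prompt", "use smaller model", "trim tool calls", "prefetch context"],
    "voice": ["start on smaller segments", "use low-latency voice mode", "reduce prosody warmup"],
    "playback": ["lower buffer floor", "pre-open audio path", "simplify browser pipeline"],
}


def guilty_stage(p95_ms):
    """Return (stage, overage_ms) for the stage furthest over budget, or (None, 0)."""
    overage = {s: p95_ms[s] - BUDGET_CEILING_MS[s] for s in BUDGET_CEILING_MS}
    stage = max(overage, key=overage.get)
    if overage[stage] <= 0:
        return None, 0
    return stage, overage[stage]


# Hypothetical measured p95 per stage, in ms.
measured = {"end_of_turn": 380, "ear": 240, "brain": 820, "voice": 330, "playback": 90}
stage, over_by = guilty_stage(measured)
print(stage, over_by, LEVERS.get(stage))
```

Here the voice is 30 ms over budget, but the brain is 320 ms over, so the brain's levers come first.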
Shrink prompts, switch vendors, or add caching?
This is the practical decision section. Many interviews ask this badly. They ask, "How do you make it faster?" A better answer names the stage, then chooses the lever. Use this decision logic.
When to shrink prompts
Shrink prompts when the brain dominates latency, TTFT grows with context size, and the answer quality does not justify the bloat. If the model receives six pages to answer one short question, you are probably paying for noise. Trim instructions. Compress memory. Retrieve fewer chunks.
Move long policy text into structured rules when possible. The awkward pause often starts with overfeeding the brain.
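
One way to test whether TTFT grows with context size is to fit a straight line through logged (prompt tokens, TTFT) pairs and read off the slope. This is a rough sketch: the sample points are invented, a linear fit is an assumption about how your model behaves, and the function name is illustrative.

```python
def slope_ms_per_1k_tokens(samples):
    """Ordinary least-squares slope of TTFT against prompt size.

    samples: list of (prompt_tokens, ttft_ms) pairs.
    Returns the extra TTFT in ms for every additional 1000 prompt tokens.
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("need at least two distinct prompt sizes")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return 1000 * sxy / sxx


# Invented measurements: (prompt tokens, TTFT in ms).
samples = [(1000, 250), (2000, 310), (4000, 430), (8000, 670)]
cost = slope_ms_per_1k_tokens(samples)
print(round(cost, 1))
```

In this invented data every extra 1000 prompt tokens costs about 60 ms of TTFT, so trimming 4000 tokens would buy back roughly a quarter of a second. A slope near zero would suggest that prompt size is not the lever to pull.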
When to switch vendors or models
Switch vendors or models when a stage is already optimized locally, but the dependency itself misses the budget consistently. If the ear misses p95 even with healthy chunks and network, try another ASR stack. If the voice misses first-audio targets across regions, compare TTS providers.
If the brain has irreducible TTFT on your current model, try a faster class of model first. Do not switch blindly. Measure before and after.
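
"Consistently" can be given a number. The sketch below counts how many measurement windows (for example, daily p95 values) miss the stage budget and only suggests a switch when local fixes are already done. The 0.8 threshold, the function names, and the sample values are assumptions for illustration, not recommended defaults.

```python
def miss_rate(window_p95_ms, budget_ms):
    """Fraction of measurement windows whose p95 exceeded the stage budget."""
    if not window_p95_ms:
        raise ValueError("no measurement windows")
    misses = sum(1 for value in window_p95_ms if value > budget_ms)
    return misses / len(window_p95_ms)


def should_consider_switch(window_p95_ms, budget_ms, locally_optimized, threshold=0.8):
    """Suggest comparing vendors only after local fixes, and only on a consistent miss."""
    return locally_optimized and miss_rate(window_p95_ms, budget_ms) >= threshold


# Hypothetical daily p95 for TTS first audio, in ms, against a 300 ms budget.
daily_p95 = [290, 340, 360, 310, 355, 330, 345]
rate = miss_rate(daily_p95, 300)
decision = should_consider_switch(daily_p95, 300, locally_optimized=True)
print(round(rate, 2), decision)
```

Running the same check on the candidate vendor's numbers gives the before-and-after comparison the text asks for.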
When to add caching
Add caching when repeat work dominates, not when every turn is unique. Cache prompt scaffolds. Cache tool results that change slowly. Cache TTS for repeated fixed phrases. Cache routing decisions for common intents. But do not pretend caching solves live reasoning.
The user's fresh utterance still needs the ear, the brain, and the voice in the relay race. Look. Caching is a stage-specific accelerant, not a universal prayer.
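
As an illustration of stage-specific caching, the sketch below caches synthesized audio for repeated fixed phrases and leaves fresh replies uncached. The `synthesize` function is a hypothetical stand-in for a real TTS call, and the cache size is an arbitrary assumption.

```python
from functools import lru_cache

# Counts how often the (stand-in) TTS backend is actually invoked.
calls = {"synth": 0}


def synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a real TTS request; returns fake audio bytes."""
    calls["synth"] += 1
    return text.encode("utf-8")


@lru_cache(maxsize=256)
def cached_fixed_phrase(text: str) -> bytes:
    """Synthesize a fixed phrase once, then serve it from memory."""
    return synthesize(text)


def speak(text: str, fixed: bool = False) -> bytes:
    """Use the cache only for phrases known to repeat verbatim."""
    return cached_fixed_phrase(text) if fixed else synthesize(text)


speak("Hello, how can I help?", fixed=True)
speak("Hello, how can I help?", fixed=True)  # served from cache
speak("Your balance is 42 dollars.")          # fresh reply, always synthesized
speak("Your balance is 42 dollars.")
print(calls["synth"])
```

The greeting costs one synthesis no matter how often it is spoken, while each fresh reply still pays the full price of the voice stage.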
A simple debugging script for slow turns
When someone says, "the system is slow," respond with a checklist.
1. Confirm the complaint is about p95 or p99, not one dramatic anecdote.
2. Split the turn into end-of-turn, ear, brain, voice, and playback timings.
3. Find the stage that most often breaks budget.
4. Fix that stage first.
5. Re-measure the full relay race.

This sounds obvious. But many teams skip step two. Then they optimize randomly. That wastes weeks. The awkward pause survives because no one gave it a home address.
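
The split-and-count steps of the checklist can be sketched in a few lines. This assumes each slow turn has already been logged as per-stage timings; the stage keys, the ceilings (taken from the starting budget table), and the sample turns are illustrative.

```python
from collections import Counter

# Ceilings in ms from the starting budget table.
BUDGET_CEILING_MS = {
    "end_of_turn": 400, "ear": 250, "brain": 500, "voice": 300, "playback": 100,
}


def breach_counts(turns):
    """Count, per stage, how many turns exceeded that stage's ceiling."""
    counts = Counter()
    for turn in turns:
        for stage, ceiling in BUDGET_CEILING_MS.items():
            if turn[stage] > ceiling:
                counts[stage] += 1
    return counts


def first_stage_to_fix(turns):
    """The stage that most often breaks budget, or None if nothing breaks."""
    counts = breach_counts(turns)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical per-stage timings (ms) for three slow turns.
slow_turns = [
    {"end_of_turn": 350, "ear": 200, "brain": 700, "voice": 250, "playback": 80},
    {"end_of_turn": 450, "ear": 220, "brain": 650, "voice": 280, "playback": 90},
    {"end_of_turn": 300, "ear": 180, "brain": 400, "voice": 320, "playback": 70},
]
counts = breach_counts(slow_turns)
target = first_stage_to_fix(slow_turns)
print(dict(counts), target)
```

In this sample the brain breaks budget in two of three turns, so it gets fixed first; after that, the same script is rerun on fresh turns to re-measure.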
Where this lives in the wild
- OpenAI Realtime API voice app — performance engineer: watches TTFT and first-audio percentiles to keep conversations feeling immediate.
- Alexa-style household assistant — speech systems engineer: budgets the ear, brain, and voice separately across noisy home environments.
- Banking call-center bot — platform SRE: tracks p95 tails because one slow identity step breaks trust fast.
- In-car assistant stack — embedded AI engineer: budgets every stage tightly because road conversations punish hesitation.
- Hospital scheduling voice agent — reliability engineer: uses stage splits to decide whether ASR, LLM, or TTS is causing the awkward pause.
Pause and recall
- What does latency budgeting add beyond saying, "the system is slow"?
- Why should voice teams track p50, p95, and p99 instead of averages?
- When should you shrink prompts instead of switching vendors?
- Which stage budgets usually define the awkward pause most strongly?
Interview Q&A
Q: What is latency budgeting in a voice assistant?
A: It is the practice of assigning milliseconds to named stages so slow experience can be traced to a specific culprit instead of blamed on the whole system.
Common wrong answer to avoid: "It means trying to make the average latency low."

Q: Why are p95 and p99 more informative than averages for voice UX?
A: Because users remember slow tail turns and interruptions in rhythm, while averages can look healthy even when many important calls feel bad.
Common wrong answer to avoid: "Average is enough if the sample size is large."

Q: When is prompt shrinking the right optimization?
A: When the brain is the main bottleneck and extra prompt context is inflating TTFT without delivering proportional quality.
Common wrong answer to avoid: "Always shrink prompts first because it is the easiest optimization."

Q: When should you consider switching vendors?
A: When a specific stage keeps missing budget after local fixes, making the dependency itself the likely source of the tail.
Common wrong answer to avoid: "Whenever the full pipeline feels slow, change every provider together."
Apply now (5 min)
Exercise. Pick one voice product you use or admire. Write a rough budget for end-of-turn, the ear, the brain, the voice, and playback. Then say which stage would scare you most at p95.
Sketch from memory. Redraw the budget staircase. Write p50, p95, and p99 beside it. Circle where the awkward pause becomes visible to a user.
Bridge. A good budget sets the target. Now barge-in breaks that plan and forces the system to recover gracefully. → 07-interruption-barge-in.md