10. Evaluation and debugging: voice systems punish vibes

~17 min read. If you cannot name the slow or broken stage, you do not yet understand the system.

Built on the ELI5 in 00-eli5.md. The awkward pause — our name for felt latency — becomes measurable only when every stage logs its own timestamps.

First picture: one user turn becomes a chain of timed events

Look at the flow before the metrics. A caller stops speaking. The system detects speech end. ASR finalizes text. The model starts responding. TTS emits first audio. Playback begins. Only then does the user hear the reply begin.

user stops speaking
        │
        ▼
┌──────────────┐
│ speech end   │
└──────┬───────┘
       ▼
┌──────────────┐
│ ASR final    │
└──────┬───────┘
       ▼
┌──────────────┐
│ LLM 1st token│
└──────┬───────┘
       ▼
┌──────────────┐
│ TTS 1st audio│
└──────┬───────┘
       ▼
┌──────────────┐
│ playback     │
└──────────────┘

See. The user only feels one delay. Engineers must see five or six smaller delays. That is the whole discipline.

So what should you log? At minimum, log five timestamps per turn (speech end, ASR final, LLM first token, TTS first audio, playback start), plus barge-in count, false endpoint count, and WER by accent slice. Without these, you are debating mood, not system behavior.
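Here is a minimal sketch of one such per-turn record in Python. The `TurnTrace` class and its field names are hypothetical, and it assumes all five timestamps are in milliseconds on one shared clock. The numbers are made up.

```python
from dataclasses import dataclass


@dataclass
class TurnTrace:
    """One spoken turn. Timestamps are milliseconds on one shared clock (an assumption)."""
    speech_end_ms: float
    asr_final_ms: float
    llm_first_token_ms: float
    tts_first_audio_ms: float
    playback_start_ms: float

    def stage_delays(self) -> dict[str, float]:
        """Return the delay added by each handoff, in pipeline order."""
        return {
            "asr_final": self.asr_final_ms - self.speech_end_ms,
            "llm_first_token": self.llm_first_token_ms - self.asr_final_ms,
            "tts_first_audio": self.tts_first_audio_ms - self.llm_first_token_ms,
            "playback_start": self.playback_start_ms - self.tts_first_audio_ms,
        }

    def felt_latency_ms(self) -> float:
        """The single delay the user feels: speech end to playback start."""
        return self.playback_start_ms - self.speech_end_ms


# Made-up numbers for illustration only.
trace = TurnTrace(0.0, 250.0, 600.0, 800.0, 900.0)
delays = trace.stage_delays()
slowest_stage = max(delays, key=delays.get)
```

The stage delays always sum to the felt latency, so the user's one delay and the engineer's several delays are the same number seen at two resolutions.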

Name the stage, measure the stage, fix the stage

This is the working rule. Name the stage. Measure the stage. Fix the stage. Then test again. Voice punishes vague thinking. So debugging must stay stage-specific.

problem felt by user
        │
        ▼
┌──────────────┐
│ name stage   │
└──────┬───────┘
       ▼
┌──────────────┐
│ measure it   │
└──────┬───────┘
       ▼
┌──────────────┐
│ fix only that│
└──────┬───────┘
       ▼
┌──────────────┐
│ re-measure   │
└──────────────┘

Suppose users say, "It feels slow after I stop talking." That complaint spans several stages. Maybe VAD waits too long. Maybe ASR finalization is slow. Maybe the model delays its first token. Maybe TTS buffers too much before sending audio. Maybe playback starts late on the client. Same symptom. Different fixes.

That is why the relay race is the right picture. One baton handoff can fail, while every other runner looks healthy. If you average everything together, you hide the broken handoff.
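A small Python sketch shows how a blended average hides one broken handoff. The stage names and all numbers are made up. In this invented data, the LLM gets faster while playback gets slower, and the end-to-end average barely moves.

```python
from statistics import mean

# Hypothetical per-turn stage delays in milliseconds (made-up numbers).
last_week = [
    {"asr": 200, "llm": 400, "tts": 150, "playback": 100},
    {"asr": 220, "llm": 420, "tts": 150, "playback": 110},
]
this_week = [
    {"asr": 200, "llm": 310, "tts": 150, "playback": 200},
    {"asr": 220, "llm": 330, "tts": 150, "playback": 210},
]


def mean_total(turns):
    """Average end-to-end delay: the blended number that hides detail."""
    return mean(sum(turn.values()) for turn in turns)


def mean_by_stage(turns):
    """Average delay for each stage separately."""
    return {stage: mean(turn[stage] for turn in turns) for stage in turns[0]}


total_change = mean_total(this_week) - mean_total(last_week)
before, after = mean_by_stage(last_week), mean_by_stage(this_week)
stage_change = {stage: after[stage] - before[stage] for stage in before}
```

Here `total_change` is only 10 ms, while `stage_change` shows playback 100 ms worse and the LLM 90 ms better. Only the per-stage view points at the runner who dropped the baton.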

Now assign ownership carefully. The ear owns speech end detection, ASR final timing, and word error patterns. The brain owns first-token delay, tool stalls, and reasoning pauses. The voice owns synthesis startup, chunk pacing, and playback readiness. The system is one experience, but the fixes are rarely in one component.

The core metrics that matter in production

Let us walk through the main metrics plainly. Picture first. Math later.

Speech-end timestamp. This marks when the user truly stopped speaking. It anchors every later delay calculation. If this event is wrong, all downstream latency numbers are misleading.

ASR final timestamp. This tells you how fast the ear turns speech into stable text. Track partial and final results separately. A fast partial can feel good, but a late final can still block the next stage.

LLM first token timestamp. This is the moment the brain starts visibly responding. It is more useful than full completion time for conversation feel. If this number jumps, inspect prompt size, retrieval stalls, tool calls, and provider queue time.

TTS first audio timestamp. This shows how fast the voice begins to speak. Many teams forget it. That is a mistake. TTS startup can quietly dominate perceived lag, especially when text arrives in bursts.

Playback start timestamp. This is when the user actually hears sound. Client buffering, network jitter, and platform APIs can make this later than first audio generation. If you skip playback start, you may blame the wrong server stage.

Barge-in count. How often do users interrupt playback? Some interruptions are healthy. Many interruptions mean impatience, mis-tuned pacing, or long robotic preambles. Measure rate, not just anecdotes.

False endpoint count. How often did the system decide a turn ended when it did not? This is a silent killer. It makes the assistant feel rude, jumpy, and confused. Telephony conditions make it worse.
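Both counts become comparable across days and channels once you turn them into rates. Here is a minimal sketch, assuming each turn already carries two boolean flags. The `TurnFlags` record is hypothetical, and how each flag gets detected is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class TurnFlags:
    """Hypothetical per-turn flags; detection logic is not shown here."""
    barged_in: bool        # user interrupted playback
    false_endpoint: bool   # system ended the turn while the user was still speaking


def rate(turns: list[TurnFlags], field: str) -> float:
    """Fraction of turns where the given flag is set; 0.0 for an empty list."""
    if not turns:
        return 0.0
    return sum(getattr(turn, field) for turn in turns) / len(turns)


# Made-up turns for illustration only.
turns = [
    TurnFlags(barged_in=True, false_endpoint=False),
    TurnFlags(barged_in=False, false_endpoint=False),
    TurnFlags(barged_in=False, false_endpoint=True),
    TurnFlags(barged_in=True, false_endpoint=False),
]
barge_in_rate = rate(turns, "barged_in")             # 2 of 4 turns
false_endpoint_rate = rate(turns, "false_endpoint")  # 1 of 4 turns
```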

WER by accent slice. Do not report one global WER and feel proud. Break it by accent, noise profile, channel, and device class. Benchmarks flatter systems when clean speech dominates. Production does not.
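To make the slicing concrete, here is a sketch that computes WER as word-level edit distance divided by reference word count, pooled per slice. The slice labels and transcripts are made up. A real pipeline would usually normalize text (case, punctuation, numerals) before scoring, and this sketch skips that step.

```python
from collections import defaultdict


def word_edit_distance(reference: list[str], hypothesis: list[str]) -> int:
    """Levenshtein distance over words (substitutions, deletions, insertions)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer_by_slice(samples):
    """samples: iterable of (slice_label, reference_text, hypothesis_text)."""
    errors = defaultdict(int)
    ref_words = defaultdict(int)
    for label, reference, hypothesis in samples:
        ref, hyp = reference.split(), hypothesis.split()
        errors[label] += word_edit_distance(ref, hyp)
        ref_words[label] += len(ref)
    return {label: errors[label] / ref_words[label] for label in errors if ref_words[label]}


# Made-up transcripts; the slice labels are placeholders.
samples = [
    ("slice_a", "book a table for two", "book a table for two"),
    ("slice_a", "cancel my order", "cancel my order"),
    ("slice_b", "book a table for two", "book the table for you"),
    ("slice_b", "cancel my order", "cancel order"),
]
per_slice = wer_by_slice(samples)
total_errors = sum(word_edit_distance(r.split(), h.split()) for _, r, h in samples)
global_wer = total_errors / sum(len(r.split()) for _, r, _ in samples)
```

In this toy data the global WER is 0.1875, while `slice_b` alone sits at 0.375 and `slice_a` at 0.0. The single number looks tolerable. The slice does not.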

Example latency budgets keep teams honest

Now let us put realistic numbers on the table. For a strong browser path, roughly 900 milliseconds end-to-end can feel decent. For mobile, roughly 1100 milliseconds is a useful working expectation. For PSTN, roughly 1260 milliseconds p95 is a realistic production target. These are not laws. They are grounding numbers.

channel      end-to-end budget
browser   ── ~900 ms
mobile    ── ~1100 ms
PSTN      ── ~1260 ms (p95)

Simple, no? The channel sets the budget before model quality even enters the picture. So a browser success does not prove PSTN success. We saw that already in telephony. Now we measure it directly.

Also remember this. The awkward pause is not only model time. It includes endpointing, transport, ASR finalization, LLM start, TTS start, and playback start. Teams that optimize only inference often miss the larger win.
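The budgets above can be checked mechanically. This is a sketch under stated assumptions: felt latency is measured from speech end to playback start, and p95 uses the nearest-rank method. The text gives a p95 only for PSTN, so applying p95 to every channel is an assumption. All latency samples are made up.

```python
import math

# Budgets in milliseconds, taken from the figures above.
# Only the PSTN figure is stated as a p95; using p95 for every channel is an assumption.
BUDGET_MS = {"browser": 900, "mobile": 1100, "pstn": 1260}


def p95(values: list[float]) -> float:
    """95th percentile using the nearest-rank method."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def over_budget(latencies_by_channel: dict[str, list[float]]) -> dict[str, bool]:
    """True for each channel whose p95 felt latency exceeds its budget."""
    return {
        channel: p95(latencies) > BUDGET_MS[channel]
        for channel, latencies in latencies_by_channel.items()
    }


# Made-up felt latencies (speech end to playback start), 20 turns per channel.
observed = {
    "browser": [700.0] * 19 + [2000.0],
    "pstn": [1000.0] * 18 + [1400.0, 1500.0],
}
result = over_budget(observed)
```

With these invented samples the browser path passes and the PSTN path fails, even though most PSTN turns sit comfortably under budget. That is the point of checking a tail percentile per channel.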

Debugging stage by stage beats arguing by vibes

A good debug review sounds boring. That is good. It sounds like this. "Speech end was stable. ASR final got 180 milliseconds worse on carrier B. LLM first token stayed flat. TTS was unchanged. Playback start regressed on Android." That is engineering. Everything else is storytelling.

If ASR looks bad, inspect audio samples, VAD thresholds, packet loss, and accent slices. If LLM start looks bad, inspect prompt length, retrieval latency, provider queueing, and tool policies. If TTS looks bad, inspect chunk size, voice choice, cache hits, and synthesis mode. If playback looks bad, inspect client buffers, web audio setup, mobile APIs, and jitter handling.

See. Each stage has its own knobs. That is why the relay race helps again. Debug the runner, not the whole stadium.

You also need traces that stitch the stages together. One request ID should follow capture, ASR, LLM, TTS, and client playback events. Without a shared ID, you cannot connect symptom to cause. With a shared ID, patterns appear quickly.
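Here is a minimal sketch of that stitching, assuming each service emits small event records. The field names (`request_id`, `stage`, `ts_ms`) and the stage labels are hypothetical, not taken from any particular tracing library.

```python
from collections import defaultdict

# Hypothetical event records as they might arrive from different services.
# The field names ("request_id", "stage", "ts_ms") are assumptions.
events = [
    {"request_id": "r1", "stage": "speech_end", "ts_ms": 0},
    {"request_id": "r2", "stage": "speech_end", "ts_ms": 50},
    {"request_id": "r1", "stage": "asr_final", "ts_ms": 240},
    {"request_id": "r1", "stage": "llm_first_token", "ts_ms": 610},
    {"request_id": "r2", "stage": "asr_final", "ts_ms": 300},
    {"request_id": "r1", "stage": "tts_first_audio", "ts_ms": 790},
    {"request_id": "r1", "stage": "playback_start", "ts_ms": 905},
]

STAGES = ["speech_end", "asr_final", "llm_first_token", "tts_first_audio", "playback_start"]


def stitch(events):
    """Group events into one {stage: timestamp} trace per request ID."""
    traces = defaultdict(dict)
    for event in events:
        traces[event["request_id"]][event["stage"]] = event["ts_ms"]
    return dict(traces)


def missing_stages(trace):
    """Stages that never reported for this request, in pipeline order."""
    return [stage for stage in STAGES if stage not in trace]


traces = stitch(events)
```

Request `r1` yields a complete trace. Request `r2` stops after ASR final, and `missing_stages` names exactly where the chain went quiet.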

A/B testing voice systems is harder than teams expect

Text A/B tests are already tricky. Voice A/B tests are harder. Quality labels are noisy. Human raters disagree. Users tolerate some mistakes, then suddenly rage at one awkward interruption. The tail matters more than the average.

A new model may improve mean latency, yet worsen false endpoints on one accent group. A new voice may sound warmer, yet increase interruption rate. A new ASR setting may improve clean audio, yet fail on noisy PSTN calls. So what should you do? Log structured stage metrics, review audio samples, and inspect long-tail failures directly. Do not worship the average score alone.
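The first trade-off above can be sketched in a few lines. The data is made up and the tuple layout is an assumption. Variant B wins on mean latency and still introduces false endpoints on one slice.

```python
from collections import defaultdict
from statistics import mean

# Made-up turns: (variant, slice_label, felt_latency_ms, false_endpoint).
turns = [
    ("A", "slice_a", 1000, False), ("A", "slice_a", 1000, False),
    ("A", "slice_b", 1000, False), ("A", "slice_b", 1000, False),
    ("B", "slice_a", 800, False), ("B", "slice_a", 800, False),
    ("B", "slice_b", 800, True), ("B", "slice_b", 800, False),
]


def mean_latency(turns, variant):
    """Mean felt latency for one variant: the number a naive A/B test compares."""
    return mean(latency for v, _, latency, _ in turns if v == variant)


def false_endpoint_rate_by_slice(turns, variant):
    """False endpoint rate per slice for one variant."""
    flags = defaultdict(list)
    for v, label, _, false_endpoint in turns:
        if v == variant:
            flags[label].append(false_endpoint)
    return {label: sum(values) / len(values) for label, values in flags.items()}


latency_a, latency_b = mean_latency(turns, "A"), mean_latency(turns, "B")
rates_a = false_endpoint_rate_by_slice(turns, "A")
rates_b = false_endpoint_rate_by_slice(turns, "B")
```

A review that looked only at `latency_b < latency_a` would ship B. The per-slice rates show what that decision costs the `slice_b` callers.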

And please remember this. The ear, the brain, and the voice can each improve separately while the full system still feels worse. If the handoffs break, the user experience breaks. That is why the awkward pause remains a system property, not one provider metric.

Where this lives in the wild

OpenAI Realtime support agent — observability engineer: stage timestamps reveal whether delay comes from ASR, model, or TTS.
Duolingo-style speaking tutor — evaluation scientist: WER by accent slice prevents clean-speech averages from hiding real failures.
Call-center automation platform — reliability engineer: false endpoints and barge-in counts show when turn-taking settings need retuning.
Mobile voice assistant rollout — client engineer: playback-start metrics separate device buffering issues from server delays.
PSTN appointment bot — staff engineer: p95 channel budgets stop browser numbers from misleading the launch decision.

Pause and recall

Which timestamps create the minimum useful latency trace for one spoken turn?
Why is first token more useful than full completion time for conversational feel?
Why can a healthy average score still hide a bad production voice system?
What mistake does stage-by-stage debugging prevent teams from making?

Interview Q&A

Q: Why should a voice team log playback-start time in addition to TTS first audio time?
A: Because the user feels actual playback, not server generation alone, and client buffering can shift that moment later.
Common wrong answer to avoid: "Once TTS starts generating audio, the latency problem is solved."

Q: Why is WER by accent slice more useful than one global WER number?
A: A single average can hide serious performance gaps across accents, channels, and noise conditions that matter in production.
Common wrong answer to avoid: "If global WER looks good, fairness and robustness are already covered."

Q: Why is stage-by-stage debugging better than saying the whole system feels slow?
A: Each stage has different causes and knobs, so naming and measuring the exact stage leads to targeted fixes.
Common wrong answer to avoid: "The fastest fix is to switch the LLM provider first."

Q: Why are A/B tests unusually tricky for voice systems?
A: Labels are noisy, user tolerance is nonlinear, and rare failures like false endpoints can outweigh average improvements.
Common wrong answer to avoid: "Just compare mean user ratings and pick the higher one."

Apply now (5 min)

Exercise. Write one spoken-turn timeline with five timestamps. Then pretend users say, "It feels slow after I stop speaking." Circle the two timestamps you would inspect first, and explain why.

Sketch from memory. Draw the pipeline from speech end to ASR final to LLM first token to TTS first audio to playback start. Write one possible failure beside each stage.

Bridge. Once we can measure every stage, we can also admit something uncomfortable: some problems remain genuinely hard even with good instrumentation. → 11-honest-admission.md