03. Turning a lossy stream into words, and knowing when the caller is actually done¶
~17 min read. The model now hears frames. But frames are not words, and the harder problem is not "what did they say" — it's "have they finished saying it?" Guess too early and you cut the caller off mid-account-number. Guess too late and you add a second of dead air to every turn.
Built on 02-telephony-and-audio-integration.md. This chapter turns forked frames into transcripts and decides turn boundaries. Endpointing here is the same turn-detection pressure that barge-in (chapter 02) opened and that orchestration (chapter 04) spends from the turn budget. The partial vs final transcript tradeoff named in the overview lands here.
Note: the voice-agents module covers acoustic models, CTC/attention decoders, and beam search in depth. This chapter assumes that and focuses on the contact-center seam: streaming transcription over a lossy phone channel, the interim-vs-final tradeoff, endpointing/turn detection, diarization, and the accent/noise robustness the billing line actually needs.
What the media layer handed us, and what it cannot tell us¶
Chapter 02 got 20 ms μ-law frames to the model during the turn, not after it, and warned that the channel drops frames and mangles digits. That solved delivery. It said nothing about understanding. A stream of frames is just energy over time. The bot still needs two things from it: the words ("what's my balance"), and the boundary ("the caller has finished their turn, my turn now").
The first is transcription. The second is endpointing — and it is the one that quietly decides whether the bot feels human. A perfect transcript that arrives 900 ms after the caller stopped feels robotic. A fast transcript that the bot acts on before the caller finished their sentence cuts them off. Both are failures, and both are decided here, before the LLM has seen a single word.
By the end you can name the difference between an interim and a final transcript, why acting on the wrong one breaks the call, how endpointing decides the caller is done rather than just pausing, why two speakers on one channel need diarization, and why phone accents and background noise are correctness problems on the highest-stakes tokens.
What this file solves¶
A bot can transcribe a clean studio recording perfectly and still mishear "4417" as "417" on a phone, or cut a caller off the moment they breathe, or transcribe the bot's own voice as caller speech. This file shows how streaming ASR emits unstable interim transcripts and stable finals, how endpointing detects end-of-turn from silence and meaning, how diarization separates caller from agent on a mixed channel, and how to confirm high-stakes tokens — so the bot takes its turn at the right moment with the right words.
Why "did they stop talking" is the wrong question¶
The naive view of turn-taking: the caller is talking when there's audio energy, and done when it goes quiet. Detect 300 ms of silence, declare the turn over, send the transcript to the LLM. A voice-activity detector (VAD) gives you exactly this for almost no cost.
It works on a demo where people speak in clean, complete sentences. It falls apart on a real billing call, because humans pause inside a turn constantly. "My account number is four... four... one... seven" has three silences longer than 300 ms, none of which mean "I'm done." Trust silence alone and the bot interrupts after the first "four," asks "is that your whole number?", and the caller — who was mid-thought — is now furious. The opposite tuning (wait 1500 ms to be safe) adds 1.5 seconds of dead air to every turn, and the bot feels slow and dim.
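To make the silence-only failure concrete, here is a minimal sketch of a pure silence endpointer run over the digit-by-digit account number. The word timings, the `WORDS` list, and the function name are illustrative assumptions, not measurements from a real call.

```python
from typing import List, Tuple

# Hypothetical word timings for "four... four... one... seven":
# (word, start_ms, end_ms). Every gap is longer than 300 ms.
WORDS: List[Tuple[str, int, int]] = [
    ("four", 0, 300),
    ("four", 750, 1050),
    ("one", 1600, 1900),
    ("seven", 2400, 2750),
]

def silence_only_endpoint(words: List[Tuple[str, int, int]],
                          threshold_ms: int) -> Tuple[List[str], int]:
    """Return (words heard, time the endpoint fires) for a silence-only endpointer."""
    heard: List[str] = []
    for i, (word, _start, end) in enumerate(words):
        heard.append(word)
        next_start = words[i + 1][1] if i + 1 < len(words) else None
        # The turn ends at the first gap that reaches the threshold,
        # or one threshold after the last word.
        if next_start is None or next_start - end >= threshold_ms:
            return heard, end + threshold_ms
    return heard, 0  # no words at all

eager_words, eager_fire_ms = silence_only_endpoint(WORDS, 300)
patient_words, patient_fire_ms = silence_only_endpoint(WORDS, 1500)
```

With these timings the 300 ms setting fires after the first "four", while the 1500 ms setting hears all four digits but only acts 1.5 s after the caller finished. Neither threshold is right, which is the point of the paragraph above.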
So the real problem is not "detect silence faster." It is that silence does not equal end-of-turn — a pause can be hesitation, breathing, or thinking, and only meaning distinguishes them. How can the bot tell a mid-thought pause from a genuine finish?
That question is why modern voice agents pair acoustic silence detection with semantic turn detection: a model that looks at what was said so far and asks "is this a complete thought, or is the caller obviously mid-utterance?" "My account number is four four one—" is acoustically silent for a moment but semantically unfinished. AssemblyAI's Universal-Streaming and similar systems use a neural turn-detection model on top of acoustics for exactly this; Deepgram exposes both silence-based endpointing (default ~300 ms) and an UtteranceEnd signal that survives noisy gaps.
Rule: the bot acts on finals and endpoints, not on partials and silence¶
The load-bearing rule of the perception layer: stream interim transcripts to feel responsive, but only commit an action when you have a final transcript and a confident end-of-turn signal — and confirm high-stakes tokens regardless. Interims are for the human feel and for speculative prep; finals plus endpointing are for decisions.
Why this rule exists. The primitive is that streaming ASR is revisable: it emits a best guess early (is_final: false) and corrects it as more audio arrives, settling on a stable is_final: true. The constraint is that the early guess is often wrong on exactly the tokens that matter: digits, names, amounts. Act on a partial and you act on a guess the model itself was about to revise. Act on silence alone and you act mid-sentence. The rule keeps actions tied to stable inputs while still using the unstable ones to hide latency.
1) Interim vs final — the revisable transcript, frame by frame¶
Watch the transcript evolve as the caller says "what's my current balance" on the billing line. Each line is what the ASR emits as more 20 ms frames arrive.
t=120ms "what" is_final:false
t=300ms "what's my" is_final:false
t=520ms "what's my current ballast" is_final:false ← wrong guess!
t=700ms "what's my current balance" is_final:false ← corrected
t=900ms "what's my current balance" is_final:true ← stable
t≈1200ms UtteranceEnd / end-of-turn ← caller is done
The interim at 520 ms said "ballast." If the bot had acted on it, it would have asked about cargo weight. By 700 ms more audio disambiguated it, and the final at 900 ms is stable. The end-of-turn signal comes later still, once endpointing is confident the caller stopped for real.
This is the whole interim/final tension. Interims arrive in ~150–300 ms (Deepgram quotes ~150 ms model latency; AssemblyAI Universal-Streaming ~300 ms P50 for immutable transcripts) and are unstable. Finals arrive a few hundred milliseconds later and are stable. The bot uses interims to show life — to begin thinking, to start a "let me check that" — but commits only on finals.
| | PARTIAL | FINAL |
|---|---|---|
| Speed | fast (~150–300 ms) | slower (+a few hundred ms) |
| Stability | unstable, gets revised | stable, won't change |
| Use | feel, speculative prep | commit the action |
| Example | "ballast" → "balance" | "balance" |
For the billing bot, the move is: stream interims into a fast-feedback path (start formulating the balance lookup speculatively), but do not fire the CRM lookup or read back a number until the final lands and endpointing says the caller is done.
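The commit rule can be sketched as a small event handler replayed over the trace above. `AsrEvent`, `TurnHandler`, and the field names are hypothetical and do not correspond to any vendor's SDK; the sketch also assumes one final per turn, where a real stream may emit several finals that need concatenating.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AsrEvent:
    t_ms: int
    text: str = ""
    is_final: bool = False
    end_of_turn: bool = False  # stands in for an UtteranceEnd-style signal

@dataclass
class TurnHandler:
    speculative: List[str] = field(default_factory=list)  # revisable prep work
    last_final: Optional[str] = None
    committed: Optional[str] = None

    def on_event(self, ev: AsrEvent) -> None:
        if ev.end_of_turn:
            # Commit only when a stable final exists AND the turn has ended.
            if self.last_final is not None and self.committed is None:
                self.committed = self.last_final
        elif ev.is_final:
            self.last_final = ev.text
        else:
            # Interim: fine for feedback and prep, never for side effects.
            self.speculative.append(ev.text)

EVENTS = [
    AsrEvent(120, "what"),
    AsrEvent(300, "what's my"),
    AsrEvent(520, "what's my current ballast"),
    AsrEvent(700, "what's my current balance"),
    AsrEvent(900, "what's my current balance", is_final=True),
    AsrEvent(1200, end_of_turn=True),
]

handler = TurnHandler()
for ev in EVENTS:
    handler.on_event(ev)
```

After the replay, "ballast" exists only in the speculative list, and the committed text is the stable final. The CRM lookup would hang off `committed`, not off anything in `speculative`.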
2) Picture: two clocks racing inside one turn¶
The mental model that keeps endpointing honest: every turn runs two clocks at once — a transcription clock (how fast words stabilize) and a turn clock (how fast you decide the caller finished). They are not the same, and the felt latency is the later of the two.
ONE CALLER TURN ("my account is four four one seven")
audio ▕████ ████ ████ ████▏ ........ (pauses between digits)
│
TRANSCRIPTION CLOCK
│ interims: "four" "four four" "four four one" "...seven"
│ final: ┌──── stable
▼ │
TURN CLOCK (endpointing) │
│ silence? ──300ms──▶ "maybe done?" ◀── semantic: "incomplete, wait"
│ ▼
│ confident end-of-turn ──▶ ACT
└──────────────────────────────────────────────────────────▶ time
Too-eager turn clock → cuts caller off after first "four"
Too-patient turn clock → dead air after caller truly finished
Right: acoustic silence + semantic completeness agree → act
The felt responsiveness of the bot is governed by the turn clock, not the transcription clock. You can have the fastest ASR in the world and still feel slow if endpointing waits 1.5 s "to be safe." This is why turn detection, not raw ASR speed, is where senior engineers spend their tuning budget.
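As a rough worked example of "the later of the two", the sketch below compares one ASR under two endpointing settings. The function name and all millisecond values are illustrative assumptions.

```python
def felt_latency_ms(final_after_speech_ms: int, endpoint_after_speech_ms: int) -> int:
    """Delay after the caller stops before the bot may act: the later of the two clocks."""
    return max(final_after_speech_ms, endpoint_after_speech_ms)

# Same fast ASR (final lands 200 ms after speech ends), two turn-clock settings.
patient = felt_latency_ms(final_after_speech_ms=200, endpoint_after_speech_ms=1500)
tuned = felt_latency_ms(final_after_speech_ms=200, endpoint_after_speech_ms=400)
```

In both cases the transcription clock finished long before the turn clock, so shaving more time off the ASR would change nothing the caller can feel.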
3) The running example: collecting the account number over a lossy leg¶
Thread the billing call. After "I think I was double-charged," the bot must authenticate, which starts with the account number. The caller is on a mobile carrier with the jitter from chapter 02. They say: "four... four... one... seven."
Attempt A — VAD-only endpointing, act on first final¶
The bot uses 300 ms silence detection. The caller pauses after "four four" (thinking). 300 ms of silence trips the endpoint. The bot commits a final of "four four," fires the lookup, finds no account ending "44," and says "I couldn't find that account." The caller, mid-number, is baffled and angry. Worse, a dropped frame (chapter 02) on the lossy leg turned the spoken "one" into near-silence, so even the digits it did capture are suspect.
Attempt B — semantic endpointing + read-back confirmation¶
The bot streams interims but recognizes, semantically, that "four four" inside an account-number collection is incomplete — it's expecting more digits. Endpointing holds. It waits for either a long confident silence or a completeness signal. When the caller finishes, the final stabilizes to "four four one seven." Because digits are high-stakes and the channel is lossy, the bot does not trust the single pass: it reads back "I have account ending four-four-one-seven, is that right?" The caller confirms. Now it fires the chapter-07 CRM lookup.
The hard part hiding here: the bot needed context to endpoint well. A bare VAD has no idea it's in the middle of a digit sequence. The dialog state ("I am collecting an account number") feeds the turn clock. This is why endpointing is not purely an ASR concern — it couples to orchestration (chapter 04).
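One way to picture dialog state feeding the turn clock is a decision function that refuses to end the turn while an expected entity is still incomplete. This is a minimal sketch: `DialogState`, the four-digit expectation, and the thresholds are assumptions made for illustration, and a production system would use a learned completeness signal rather than a digit count.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DialogState:
    collecting: Optional[str] = None  # e.g. "account_number", or None for free speech
    expected_digits: int = 0          # hypothetical: digits the bot is waiting for

def end_of_turn(silence_ms: int, digits_heard: int, state: DialogState,
                base_silence_ms: int = 300, max_wait_ms: int = 2500) -> bool:
    if silence_ms >= max_wait_ms:
        return True   # long, confident silence: stop waiting regardless
    if state.collecting == "account_number" and digits_heard < state.expected_digits:
        return False  # semantically incomplete: hold the turn open
    return silence_ms >= base_silence_ms

collecting = DialogState(collecting="account_number", expected_digits=4)
mid_number = end_of_turn(silence_ms=450, digits_heard=2, state=collecting)
finished = end_of_turn(silence_ms=450, digits_heard=4, state=collecting)
gave_up = end_of_turn(silence_ms=3000, digits_heard=2, state=collecting)
```

The same 450 ms pause yields "keep listening" after two digits and "act" after four. The `max_wait_ms` escape hatch keeps the bot from waiting forever on a caller who stopped partway.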
4) Why streaming ASR with semantic endpointing instead of VAD + batch ASR — choosing under a conversational workload¶
The plausible alternative under a "keep it simple" mindset: use a cheap VAD to chop the audio into utterances, then send each finished chunk to a high-accuracy batch ASR. It's simpler and the batch model is often slightly more accurate.
- VAD chop + batch ASR per utterance — simple, accurate per chunk, cheap to wire. But it can only endpoint on silence (interrupts mid-thought), it has no interims (no responsiveness, the bot can't start thinking until the chunk is done and transcribed), and chunk boundaries cut words. Fine for voicemail transcription; wrong for conversation.
- Streaming ASR + semantic endpointing — interims for feel and speculative prep, finals for commit, turn detection that understands completeness. Costs more per minute and is more complex, but it's the only option that hits a sub-second turn under a conversational workload.
For a billing line where the bot and caller take dozens of turns and every turn must feel snappy, streaming with semantic endpointing wins. The deciding question: is the audio a conversation (needs turn-taking) or a recording (needs only transcription)? Conversation forces streaming. Recording (chapter 06) can batch.
5) The property that changes the design: phone accents and noise hit the highest-stakes tokens¶
The dimension people underestimate is robustness on the specific tokens that carry verification weight, over a channel that's already degraded. Phone audio is 8 kHz, band-limited, and lossy (chapter 02). Now add a caller with a strong regional accent, a noisy street, or a code-switching speaker. General ASR word-error-rate might still look fine — because common words ("the," "I," "want") are robust and dominate the average. But digits, names, and account numbers have little redundancy, and they're exactly what authentication and payment depend on.
Overall WER on the call: 4% ← "looks fine"
WER on common words: 2%
WER on digits/names/IDs: 11% ← the tokens that gate auth & payment
A 4% average hides an 11% failure rate on the only tokens that matter.
This is the asymmetry that should change your design: never grade ASR on average WER for a contact-center bot; grade it on entity accuracy — digits, names, account numbers, amounts. And design the dialog to confirm those entities (read-back, spell-back, "did you say one-five or five-zero?") rather than trust a single transcription. Phone-tuned models (Deepgram's telephony models, accepting μ-law directly) and domain keyword boosting raise entity accuracy specifically.
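A small sketch of grading the two metrics side by side: word-level edit distance for WER, exact match for entities. The sample turns and entity pairs are made up for illustration, and the function names are not from any evaluation library.

```python
from typing import List, Tuple

def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def wer(pairs: List[Tuple[str, str]]) -> float:
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words

def entity_accuracy(entities: List[Tuple[str, str]]) -> float:
    """Fraction of entities transcribed exactly right; one wrong digit fails the entity."""
    return sum(ref == hyp for ref, hyp in entities) / len(entities)

# (reference, hypothesis) transcripts for four illustrative turns.
TURNS = [
    ("i think i was double charged on my last bill",
     "i think i was double charged on my last bill"),
    ("my account number is four four one seven",
     "my account number is four one seven"),          # one dropped digit
    ("can you tell me what my current balance is please",
     "can you tell me what my current balance is please"),
    ("the amount was fifty nine dollars",
     "the amount was fifty nine dollars"),
]
# (reference, hypothesis) for the entities extracted from those turns.
ENTITIES = [("4417", "417"), ("59", "59")]

overall_wer = wer(TURNS)
entities_ok = entity_accuracy(ENTITIES)
```

One dropped word out of 34 gives a WER under 3%, yet half the entities are wrong, and the wrong one is the account number that gates authentication.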
6) One failure walked through: the bot that transcribed its own voice¶
Incident: the billing bot occasionally "hears" things the caller never said — phantom turns, sudden topic jumps, balance read-backs the caller didn't ask for. ASR confidence on these phantom turns is high. The audio is clean. It only happens while or just after the bot is speaking.
The chain: the call is a single mixed-audio channel (chapter 02 warned single-channel recording can't separate speakers). The bot's TTS, played into the caller leg, bleeds back through the bridge into the same channel the ASR is transcribing. Without diarization (separating who-spoke-which-words) or echo handling, the ASR happily transcribes the bot's own "your balance is fifty-nine dollars" as caller speech, and the orchestrator treats it as a new caller turn. The bot is talking to itself.
The root cause is not a bad ASR model — the transcription was correct, it just transcribed the wrong speaker. The fix is two-fold: separate channels where the platform allows (caller on one channel, bot/agent on another, so diarization is structural), and where you can't, use diarization plus knowing which audio is the bot's own output (the same self-echo problem barge-in faced in chapter 02). This is the same failure geometry as chapter 02's self-echo barge-in — the bot's own output contaminating its input — now at the transcription layer instead of the turn-detection layer.
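Where channels cannot be separated, "knowing which audio is the bot's own output" can be approximated by dropping transcript segments that overlap the bot's playback windows. This is a crude sketch under assumed names (`Transcript`, `caller_turns`, a 300 ms echo tail); it would also discard a genuine barge-in, so it is a guard rather than a substitute for channel separation or echo cancellation.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Transcript:
    start_ms: int
    end_ms: int
    text: str

def overlaps_bot_speech(seg: Transcript, tts_windows: List[Tuple[int, int]],
                        tail_ms: int = 300) -> bool:
    """True if the segment overlaps any TTS playback window, padded by an echo tail."""
    return any(seg.start_ms < end + tail_ms and seg.end_ms > start
               for start, end in tts_windows)

def caller_turns(segments: List[Transcript], tts_windows: List[Tuple[int, int]],
                 tail_ms: int = 300) -> List[Transcript]:
    return [s for s in segments if not overlaps_bot_speech(s, tts_windows, tail_ms)]

TTS_WINDOWS = [(1000, 3000)]  # the bot spoke from 1.0 s to 3.0 s
SEGMENTS = [
    Transcript(0, 800, "what's my balance"),
    Transcript(1100, 2900, "your balance is fifty-nine dollars"),  # bot's own voice
    Transcript(3100, 3250, "dollars"),                             # echo tail
    Transcript(3500, 4000, "thanks"),
]
kept = [s.text for s in caller_turns(SEGMENTS, TTS_WINDOWS)]
```

Only the two real caller segments survive, so the orchestrator never sees the phantom turn.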
7) Cost and latency movement: where milliseconds and accuracy trade in the perception layer¶
Per-turn contributions and rough costs (illustrative; varies by vendor, model, and contract):
| Knob | Effect on turn budget | Effect on accuracy | What it buys |
|---|---|---|---|
| Interim results on | none (parallel) | n/a | responsiveness, speculative prep |
| Endpointing silence 300→700 ms | +400 ms/turn | fewer false ends | fewer mid-thought cutoffs |
| Semantic turn detection | ~neutral to small add | far fewer cutoffs | natural turn-taking |
| Phone-tuned model | ~same latency | +entity accuracy | survives 8 kHz μ-law |
| Keyword/entity boosting | negligible | +digit/name accuracy | better auth success |
| Read-back confirmation | +1 turn (~1–2 s) | catches most entity errors (the caller can still mis-confirm) | correctness on high-stakes tokens |
Rough ASR cost: streaming STT runs ~$0.01–0.02/min. The dominant latency knob is endpointing silence, and the dominant correctness knob is entity confirmation. The pressure evolution: semantic endpointing relieves the cut-off-vs-dead-air pressure but creates a dependency on dialog state (the turn clock now needs to know what the bot is collecting), absorbed by the orchestration layer. Read-back relieves entity-error risk but creates turn-budget cost (an extra round trip), absorbed by the caller's patience — which is why you confirm only the high-stakes entities, not everything.
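The two numbers above can be turned into per-call arithmetic. The sketch assumes a hypothetical 20-turn, 6-minute call; the helper names are illustrative and the rates are the rough range quoted above, not a price list.

```python
from typing import Tuple

def added_dead_air_s(turns: int, old_silence_ms: int, new_silence_ms: int) -> float:
    """Extra waiting per call when the endpointing silence threshold is raised."""
    return turns * (new_silence_ms - old_silence_ms) / 1000

def asr_cost_range(minutes: float, low_per_min: float = 0.01,
                   high_per_min: float = 0.02) -> Tuple[float, float]:
    """Rough streaming STT cost for one call, in dollars."""
    return round(minutes * low_per_min, 4), round(minutes * high_per_min, 4)

extra_wait = added_dead_air_s(turns=20, old_silence_ms=300, new_silence_ms=700)
low_cost, high_cost = asr_cost_range(minutes=6)
```

Under these assumptions the 300→700 ms change costs the caller 8 seconds of accumulated waiting, while the ASR bill for the whole call stays around 6 to 12 cents. The endpointing knob is paid in caller patience rather than dollars.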
8) Signals that the perception layer is the problem¶
Healthy: entity accuracy high on a held-out set of real account-number turns, end-of-turn timing that matches human raters' sense of "they were done," near-zero phantom (self-transcribed) turns.
First metric to degrade: mid-utterance cut-off rate — turns where the bot took over while the caller was still speaking. It climbs the moment endpointing is mistuned or a noisy carrier segment confuses the turn clock, well before overall WER moves.
Misleading metric people watch: average word error rate. A great-looking 4% average can hide an 11% digit error rate that fails authentication. WER on filler words is irrelevant; entity accuracy is everything.
First graph an expert opens: end-of-turn timing distribution (how long after the caller actually stopped did the bot act) overlaid with cut-off rate, plus entity accuracy segmented by token type (digits vs names vs free speech) and by carrier. A spike in cut-offs on one carrier points back to chapter 02's jitter confusing endpointing.
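A minimal sketch of the numbers behind that graph, computed from per-turn logs and segmented by carrier. `TurnLog`, its fields, and the sample data are assumptions; in practice `caller_stopped_ms` would come from human raters or forced alignment.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List

@dataclass
class TurnLog:
    carrier: str
    caller_stopped_ms: int  # when the caller actually finished
    bot_acted_ms: int       # when the bot took its turn

def cutoff_rate(turns: List[TurnLog]) -> float:
    """Share of turns where the bot acted before the caller had finished."""
    return sum(t.bot_acted_ms < t.caller_stopped_ms for t in turns) / len(turns)

def by_carrier(turns: List[TurnLog]) -> Dict[str, Dict[str, float]]:
    groups: Dict[str, List[TurnLog]] = {}
    for t in turns:
        groups.setdefault(t.carrier, []).append(t)
    return {
        carrier: {
            "cutoff_rate": cutoff_rate(ts),
            # Negative gap = the bot started talking over the caller.
            "median_gap_ms": median(t.bot_acted_ms - t.caller_stopped_ms for t in ts),
        }
        for carrier, ts in groups.items()
    }

LOGS = [
    TurnLog("carrier_a", 1000, 1400), TurnLog("carrier_a", 2000, 2450),
    TurnLog("carrier_a", 3000, 3500),
    TurnLog("carrier_b", 1000, 600), TurnLog("carrier_b", 2000, 2400),
    TurnLog("carrier_b", 3000, 2700),
]
report = by_carrier(LOGS)
```

In this made-up data the overall cut-off rate is one in three, but all of it sits on one carrier, which is the kind of segmentation that points back at the media path instead of the ASR model.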
9) Boundary: where streaming ASR + endpointing shines, where it breaks¶
It shines on structured, transactional, turn-based phone conversations — the billing line — where turns are clear, entities are confirmable, and a phone-tuned model plus semantic endpointing covers most callers cleanly.
It becomes pathological on overlapping, multi-party, or highly disfluent speech: two people talking on a speakerphone, a caller arguing with someone in the room, heavy stuttering or code-switching mid-word. Endpointing thrashes (it can't find clean turn boundaries), diarization smears speakers together, and entity accuracy collapses. The scale limit that invalidates intuition: a model that endpoints beautifully for native, single-speaker, quiet callers can degrade sharply across a population with many accents, noisy environments, and bad connections — and the tail (the hardest 10% of callers) generates a disproportionate share of failed authentications and escalations, because those are exactly the callers the entity errors strand.
10) Wrong assumption: "lower word error rate means a better voice agent"¶
The seductive idea: pick the ASR with the lowest WER on a benchmark and the bot will be best. Two problems. First, benchmark WER is usually on clean, wideband, accent-light audio — nothing like an 8 kHz lossy phone call. Second, WER averages over all words, hiding the entity errors that actually break authentication and payment. A model with slightly higher overall WER but better digit accuracy and better endpointing makes a better voice agent.
Replace it with: for a voice agent, turn-taking feel and entity accuracy beat average WER. Evaluate on phone-like audio, measure entity accuracy and end-of-turn timing, and treat overall WER as a weak proxy. This reorders vendor selection entirely — and it's why chapter 06's post-call analytics, which has no turn-taking pressure and can batch high-accuracy models, makes different ASR choices than this live layer.
11) Other ways the perception layer bites¶
- Endpointing too aggressive on hold/IVR audio — the bot endpoints on a beep or hold-music gap and "takes a turn" into silence.
- No diarization on single channel — caller and bot words interleave into one transcript; the orchestrator loses track of who said what.
- Numbers spoken as words vs digits — "fourteen" vs "one four"; the bot must normalize both to the same digits, and because "fourteen" and "forty" collide on a lossy channel, the normalized value still needs confirmation.
- Hot-mic background speech — a TV or coworker is transcribed as caller turns, injecting noise into the dialog.
- Interim acted on prematurely — the bot fires a CRM lookup on an interim that later revises, querying the wrong account.
- Accent-driven entity errors — a caller's accent turns "fifteen" into "fifty" on the payment amount, and with no read-back in the flow, nothing catches it.
- Code-switching mid-sentence — caller switches language; a monolingual model produces garbage right at the entity.
- Long pause = false hang-up detection — a thinking caller is treated as gone; the bot disconnects or escalates wrongly.
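For the words-vs-digits item above, normalization and read-back can be sketched as two small helpers. The word tables and function names are illustrative assumptions, and the rule that "forty four" means 44 rather than 40 followed by 4 is a guess that a real bot would need to confirm with the caller.

```python
UNITS = {"zero": 0, "oh": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TEENS = {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
         "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def spoken_to_digits(text: str) -> str:
    """Normalize a spoken number sequence to a digit string."""
    tokens = text.lower().replace("-", " ").split()
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok in TENS:
            value = TENS[tok]
            # Assumption: "forty four" is 44, so fold a following 1-9 into the tens word.
            if i + 1 < len(tokens) and UNITS.get(tokens[i + 1], 0) > 0:
                value += UNITS[tokens[i + 1]]
                i += 1
            out.append(str(value))
        elif tok in TEENS:
            out.append(str(TEENS[tok]))
        elif tok in UNITS:
            out.append(str(UNITS[tok]))
        else:
            raise ValueError(f"not a number word: {tok!r}")
        i += 1
    return "".join(out)

def readback(digits: str) -> str:
    """Digit-by-digit confirmation prompt text, e.g. for an account number."""
    return "-".join(DIGIT_WORDS[int(d)] for d in digits)

account = spoken_to_digits("forty-four seventeen")
prompt = readback(account)
```

Both "four four one seven" and "forty-four seventeen" land on the same digit string, and the read-back always speaks single digits, which sidesteps the "-teen/-ty" ambiguity on the way back to the caller.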
12) Pattern transfer¶
- Interim vs final is speculative execution — same shape as a CPU running speculatively past a branch and squashing on misprediction: act early on the cheap guess for latency, but be ready to discard it when the stable answer disagrees. "Ballast → balance" is a squashed misprediction. The shared pressure: hide latency with a revisable guess without committing to it.
- Endpointing is the same boundary-detection problem as message framing — deciding where one turn ends mirrors deciding where one message ends in a byte stream. Silence-only endpointing is like idle-timeout framing, where a quiet gap on the wire is taken as the end of a message: it works until the data has gaps that aren't boundaries. Semantic endpointing adds a content-aware delimiter.
- Entity accuracy over average WER — structurally identical to tail-latency over average latency: the average hides the cases that actually hurt. Watch p99 on the tokens that gate a transaction, not the mean over all tokens.
13) Design test¶
- Does the bot commit actions on finals, or does it ever fire a lookup on an interim it later revises?
- Does endpointing use semantic completeness, or only silence — and does it know what entity it's collecting?
- Do you measure entity accuracy (digits, names, IDs) separately from average WER?
- Are high-stakes entities read back and confirmed, given the channel drops frames?
- Can the bot transcribe its own TTS as a caller turn — i.e., is there diarization or channel separation?
Where this appears in production¶
- Deepgram Nova-3 — streaming telephony ASR accepting μ-law; ~150 ms model latency, interim results, configurable endpointing (~300 ms default) and UtteranceEnd for noisy gaps.
- AssemblyAI Universal-Streaming — immutable low-latency transcripts (~300 ms P50) with neural turn detection using semantic + acoustic cues; tunable end_of_turn_confidence_threshold.
- NVIDIA Riva — self-hostable streaming ASR/TTS for on-prem contact centers with data-residency constraints.
- Google Cloud Speech-to-Text — streaming recognition with phone-call models and word-level confidence.
- Azure AI Speech — streaming STT with phrase lists and custom models for domain entities.
- Speechmatics — streaming ASR emphasizing accent robustness across a global caller base.
- Amazon Transcribe (streaming) — real-time transcription feeding Connect's live analytics.
- Deepgram interim_results + utterance_end_ms — the exact knobs that separate the responsiveness path from the commit path.
- AssemblyAI pre-emptive LLM generation — starting the LLM before the turn officially ends to save 200–500 ms, a direct interim-driven optimization.
- Pyannote / speaker diarization — separates caller from agent/bot on mixed-channel audio so the orchestrator knows who spoke.
- Picovoice / Silero VAD — fast voice-activity detection feeding the acoustic half of endpointing.
- Krisp — noise/echo suppression upstream of ASR, raising entity accuracy on noisy legs.
- Keyword/entity boosting (Deepgram keyterms, Google adaptation) — biases the model toward account-number and product vocabulary.
- Pipecat / LiveKit Agents — wire VAD + ASR + turn detection into the pipeline with interruption handling.
- Twilio ConversationRelay — packages streaming ASR + endpointing for Programmable Voice bots.
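As an illustration of the Deepgram knobs named in the list, the sketch below only builds a streaming URL with those query parameters; it opens no connection and needs no API key. The specific values (model name, 300 ms endpointing, 1000 ms utterance end) are assumptions for a phone leg, and current parameter names and defaults should be checked against the vendor documentation.

```python
from urllib.parse import parse_qs, urlencode, urlparse

def build_listen_url(endpointing_ms: int = 300, utterance_end_ms: int = 1000) -> str:
    """Build a Deepgram-style streaming URL for 8 kHz mu-law phone audio."""
    params = {
        "model": "nova-3",                     # assumption: model named in the list above
        "encoding": "mulaw",                   # phone audio as delivered by the media layer
        "sample_rate": 8000,
        "channels": 1,
        "interim_results": "true",             # responsiveness path
        "endpointing": endpointing_ms,         # silence-based finalization
        "utterance_end_ms": utterance_end_ms,  # gap-based end-of-utterance signal
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)

url = build_listen_url()
query = parse_qs(urlparse(url).query)
```

Keeping these values in one builder makes the responsiveness path (`interim_results`) and the commit path (`endpointing`, `utterance_end_ms`) visible and tunable in one place.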
Recall¶
- What is the difference between an interim and a final transcript, and which one may an action be committed on?
- Why does silence-only endpointing cut off a caller reading an account number?
- What does semantic turn detection add over a VAD, and what does it need from the dialog?
- Why is entity accuracy a better metric than average WER for a voice agent?
- How can a bot end up transcribing its own voice, and what two fixes prevent it?
- Which metric degrades first when endpointing is mistuned, before WER moves?
- Why does the hardest tail of callers (accents, noise) generate a disproportionate share of escalations?
Interview Q&A¶
Q1. Your bot keeps interrupting callers who are reading out their account number. What's wrong and how do you fix it? Endpointing is firing on the natural pauses between digits — silence-only turn detection treats a mid-number pause as end-of-turn. The fix is semantic turn detection that knows the utterance is incomplete (you're collecting digits), plus feeding dialog state to the turn clock so it waits for a complete entity. Lengthening the silence threshold alone trades cut-offs for dead air on every turn. Common wrong answer to avoid: "just increase the endpointing silence to 1.5 seconds" — that adds dead air to every single turn and still mis-handles a slow speaker.
Q2. A vendor shows you 3% WER and a competitor shows 4%. Which do you pick for a billing bot? Not enough information — average WER hides entity accuracy. Ask for digit/name/account-number accuracy on phone-like (8 kHz, lossy) audio, plus endpointing/turn-detection quality. The 4% model with better entity accuracy and turn detection makes the better voice agent, because the tokens that gate authentication and payment are digits and names, not filler words. Common wrong answer to avoid: "the 3% one, lower WER is better" — average WER on clean benchmark audio is a weak proxy for a lossy phone channel and ignores entity accuracy.
Q3. The bot occasionally responds to things the caller never said, with high confidence. Diagnose it. It's transcribing its own TTS. On a single mixed-audio channel, the bot's spoken output bleeds back and the ASR transcribes it as a caller turn. The transcription is correct; it's the wrong speaker. Fix with channel separation (caller and bot on separate channels) and/or diarization plus knowing which audio is the bot's own output — the same self-echo geometry as barge-in in chapter 02, now at the transcription layer. Common wrong answer to avoid: "the ASR is hallucinating, switch models" — the model heard real audio; the bug is that it was the bot's own audio on a shared channel.
Q4. Why use interim transcripts at all if you can only safely act on finals? Interims hide latency. They let the bot feel alive (start a "let me check that"), and they enable speculative prep — beginning the CRM-lookup formulation or even pre-emptive LLM generation before the turn officially ends, which can save 200–500 ms. You just don't commit an irreversible action on a revisable guess; the commit waits for the final plus a confident endpoint. Common wrong answer to avoid: "interims are pointless if they're unstable" — they're the main lever for sub-second feel; the discipline is acting on them speculatively, not committing on them.
Q5. A caller's payment amount keeps coming through wrong — "fifteen" vs "fifty." Is this a chapter 2 media problem or a chapter 3 ASR problem, and what do you do? Likely both interacting. The lossy 8 kHz channel (ch 02) strips the acoustic cues that separate "-teen" from "-ty," and the ASR (ch 03) then guesses wrong on a low-redundancy entity. Inspect packet loss/jitter for that segment first; clean audio points to ASR/accent, lossy points to media. Regardless, the design fix is read-back confirmation of the amount — never commit a payment figure on a single pass. Common wrong answer to avoid: "swap the ASR vendor" — even a perfect model can't recover information the channel destroyed, and no amount of model tuning replaces confirming the entity.
Q6. How does endpointing couple to the rest of the pipeline — isn't it purely an ASR concern? No. Good endpointing needs dialog state: knowing the bot is collecting an account number tells the turn clock to expect more digits and not end early. The felt latency of the whole turn is the later of the transcription clock and the turn clock, so endpointing is shared between ASR (chapter 03) and orchestration (chapter 04). Tuning it in isolation ignores the context that makes it correct. Common wrong answer to avoid: "endpointing is just a VAD setting in the ASR config" — pure VAD has no idea what the caller is in the middle of, which is exactly why it cuts people off.
Q7. At scale, authentication failures cluster among a subset of callers. What's happening? The tail of hard callers — strong accents, noisy environments, bad connections — drives entity errors on digits and names, and those are exactly the tokens authentication depends on. Average WER looks fine because the easy majority dominates it. Segment entity accuracy by accent/carrier/noise, add phone-tuned models and keyword boosting, and lean on read-back confirmation for the tail. The failures concentrate where the channel and the speaker are both hard. Common wrong answer to avoid: "overall accuracy is 96%, so it's fine" — the 4% is not random; it concentrates on the high-stakes tokens for the hardest callers, which is where escalations come from.
Design/debug exercise (10 min)¶
Step 1 — Modeled example. Trace the account-number turn (section 3, Attempt B): interims stream → semantic endpointing recognizes "four four" is incomplete and holds → final stabilizes to "four four one seven" → read-back confirmation → commit the chapter-07 lookup. For each step, write the failure that occurs if you skip it.
Step 2 — Your turn. The billing bot must capture a payment amount — "fifty-nine dollars" — over a noisy leg where "fifteen/fifty" is ambiguous. Design the capture: what does it read back, how does it disambiguate "-teen/-ty," when does it commit, and how does it tie into the chapter-04 orchestration decision to take the payment? Note where you'd measure entity accuracy for amounts.
Step 3 — Reproduce from memory. Redraw the two-clocks diagram (section 2) cold: the transcription clock and the turn clock racing within one turn, with the too-eager and too-patient failure modes marked. Then connect it back to chapter 02: show where a lossy carrier segment makes both clocks misfire (dropped digit + confused endpoint).
Operational memory¶
This chapter explained why a bot can transcribe perfectly and still feel broken: it cuts callers off mid-number, mishears the digits that gate authentication, or transcribes its own voice as a caller turn. The important idea is that perception is two jobs running on two clocks — transcribing words (interim then final) and detecting the end of a turn (silence and meaning) — and the bot must commit actions only on finals and confident endpoints while confirming high-stakes entities.
You learned to stream interims for feel and speculative prep, commit on finals, endpoint on semantic completeness rather than raw silence, separate speakers via channels or diarization, and grade ASR on entity accuracy instead of average WER. That solves the opening cut-off, mishear, and phantom-turn failures because each was a turn-clock, entity-accuracy, or speaker-separation problem, not a generic "the model is bad" problem.
Carry this diagnostic forward: when the bot interrupts callers, look at endpointing and dialog state before WER. When authentication fails for some callers, segment entity accuracy by token type and carrier. When the bot responds to nothing, check for self-transcription on a mixed channel.
Remember:
- Silence is not end-of-turn; pair acoustic detection with semantic completeness, fed by dialog state.
- Stream interims to feel fast and prep speculatively; commit actions only on stable finals plus a confident endpoint.
- Grade a voice agent on entity accuracy (digits, names, IDs) and turn-taking feel, not average WER.
- Confirm high-stakes entities by read-back; the lossy phone channel guarantees occasional mishears.
- A bot can transcribe its own TTS on a mixed channel — use channel separation or diarization.
Bridge. The bot now has reliable words and knows when the caller is done. But knowing what was said and when is not the same as deciding what to do — and it must decide, fetch, and speak back inside a sub-second turn that ASR has already partly spent. Spending that budget across NLU, dialog, an LLM, and TTS without the call feeling dead is the next pressure. → 04-bot-orchestration-and-latency-budget.md