03. Streaming ASR: Teach the ear to listen live
~16 min read. Realtime voice begins with how fast the ear can hear meaning, not just sound.
Built on the ELI5 in 00-eli5.md. The ear — the part that turns speech into text — must work incrementally, because waiting for perfect text creates the awkward pause.
1) Batch ASR feels accurate, streaming ASR feels alive
See. Batch ASR waits for the whole recording. Then it transcribes the whole recording. That can be excellent for offline quality. It is terrible for live conversation rhythm. The ear needs to work while the speaker is still talking. That is what streaming ASR means. Audio arrives in chunks. The ear keeps updating its guess. The assistant starts reacting before the turn fully ends. That is the relay race again.
Whisper is a useful quality baseline. It teaches many engineers what good transcription can look like. But a pure batch-style setup is often too slow for live back-and-forth. So teams usually compare against streaming providers such as Deepgram or AssemblyAI. The exact vendor matters less than the behavioral shift. Batch says, wait for completeness. Streaming says, move partial evidence forward carefully. Simple, no?
Here is the picture.
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│ audio chunks │─→│   the ear    │─→│ partial text │
│ every 20 ms  │  │  streaming   │  │  then final  │
└──────────────┘  └──────────────┘  └──────────────┘
The ear is not producing one sacred transcript. It is producing a changing hypothesis over time. That mental model prevents many downstream bugs. If the brain treats early text as final truth, bad things happen quickly. The user says one phrase. The ear revises a word. The brain has already committed to the wrong intent. Now the whole response drifts. So what to do? Treat the stream as provisional until confidence or endpoint rules make it stable. That is mature streaming design.
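The 20 ms chunks in the picture can be made concrete. Here is a minimal sketch, assuming 16 kHz, 16-bit mono PCM. The sample rate, the sample width, and the helper names `frame_size_bytes` and `iter_frames` are illustrative assumptions, not something a particular provider requires.

```python
from typing import Iterator

SAMPLE_RATE_HZ = 16_000   # assumed sample rate
BYTES_PER_SAMPLE = 2      # assumed 16-bit mono PCM
FRAME_MS = 20             # matches the 20 ms chunks in the picture


def frame_size_bytes(sample_rate_hz: int = SAMPLE_RATE_HZ,
                     frame_ms: int = FRAME_MS,
                     bytes_per_sample: int = BYTES_PER_SAMPLE) -> int:
    """Number of bytes in one frame of mono PCM audio."""
    return sample_rate_hz * frame_ms // 1000 * bytes_per_sample


def iter_frames(pcm: bytes, frame_bytes: int) -> Iterator[bytes]:
    """Yield complete fixed-size frames; a trailing partial frame is dropped."""
    for start in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
        yield pcm[start:start + frame_bytes]


# One second of silence at the assumed format, cut into 20 ms frames.
one_second = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
frames = list(iter_frames(one_second, frame_size_bytes()))
```

Under these assumptions each frame is 640 bytes and one second of audio yields 50 frames. Each frame would be handed to the ear as it arrives instead of waiting for the whole recording.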
2) Partials, finals, and why revisions are normal
A streaming system often emits partial transcripts first. These are useful, but they are not promises. The phrase may grow. A word may change. Punctuation may appear later. Entity boundaries may move. Accents and jargon make this even more obvious. So never treat partials as final truth by default. Yes?
Imagine the user says "model context protocol." The ear may first emit "modern context protocol." A few chunks later, it corrects itself. That is healthy behavior. It means the ear kept listening. It does not mean the system failed. The failure happens when downstream code cannot tolerate revision. Maybe the brain fires a tool too early. Maybe the UI shows unstable claims as settled facts. Maybe the voice starts speaking back from a bad partial. That is how a small ASR wobble becomes a product bug.
Use states. Mark text as partial, stable, or final. Pass timestamps with each segment if possible. Keep revision logic explicit. A simple rule helps beginners. Use partials for anticipation. Use finals for commitment. The relay race still benefits. But your system stays honest.
audio ──→ partial 1 ──→ partial 2 ──→ partial 3 ──→ final
              │             │             │           │
              ▼             ▼             ▼           ▼
            hint       draft only    tentative UI   commit
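The partial, stable, and final states can be written down as an explicit gate. The following is only a sketch under stated assumptions: the action names (`prefetch`, `call_tool`, and so on), the `ALLOWED_ACTIONS` policy, and the `TranscriptEvent` shape are hypothetical. Your own rules may be stricter.

```python
from dataclasses import dataclass
from enum import Enum


class Stability(Enum):
    PARTIAL = "partial"
    STABLE = "stable"
    FINAL = "final"


# Hypothetical policy: anticipation is allowed early, commitment only on final.
ALLOWED_ACTIONS = {
    Stability.PARTIAL: {"prefetch", "show_tentative_caption"},
    Stability.STABLE: {"prefetch", "show_tentative_caption", "draft_reply"},
    Stability.FINAL: {"prefetch", "show_tentative_caption", "draft_reply",
                      "call_tool", "speak_reply"},
}


@dataclass(frozen=True)
class TranscriptEvent:
    text: str
    stability: Stability
    start_ms: int   # segment timestamps travel with the text
    end_ms: int


def is_allowed(event: TranscriptEvent, action: str) -> bool:
    """True if the event's stability level permits the given action."""
    return action in ALLOWED_ACTIONS[event.stability]


partial = TranscriptEvent("modern context", Stability.PARTIAL, 0, 600)
final = TranscriptEvent("model context protocol", Stability.FINAL, 0, 1200)
```

With this policy, `is_allowed(partial, "call_tool")` is false and `is_allowed(final, "call_tool")` is true. Downstream code asks the gate instead of guessing.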
Word-level timestamps help here too. They tell you where the wobble happened. They help align captions, barge-in logic, and debugging dashboards. When the user says the assistant keeps missing their name, those timestamps let you inspect exactly where the ear got confused. Named evidence beats vague complaints.
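To see how timestamps point at the wobble, here is a small sketch that compares two successive hypotheses word by word. The `Word` shape, the `first_revision` helper, and the millisecond values are made up for illustration. Real providers return their own field names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Word:
    text: str
    start_ms: int
    end_ms: int


def first_revision(old: List[Word],
                   new: List[Word]) -> Optional[Tuple[int, Word, Word]]:
    """Return (index, old_word, new_word) for the first word whose text changed.

    Only positions present in both hypotheses are compared, so a hypothesis
    that merely grew longer is not reported as a revision.
    """
    for i, (a, b) in enumerate(zip(old, new)):
        if a.text != b.text:
            return i, a, b
    return None


# Illustrative timestamps only.
earlier = [Word("modern", 0, 380), Word("context", 400, 820)]
later = [Word("model", 0, 360), Word("context", 400, 820),
         Word("protocol", 840, 1400)]

hit = first_revision(earlier, later)
if hit is not None:
    index, before, after = hit
    print(f"word {index}: {before.text!r} -> {after.text!r} "
          f"at {before.start_ms}-{before.end_ms} ms")
```

Logging the changed word together with its time range gives a dashboard something specific to replay.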
3) VAD and endpointing decide when a turn exists
The ear does not only transcribe words. It must also decide when speech starts and stops. That is where VAD enters. VAD means voice activity detection. It answers a smaller question than ASR. Is there human speech energy here or not? WebRTC VAD is lightweight and widely used. Silero VAD is a strong open-source option too. Neither one is magical. Both need tuning against your environment. Keyboard noise, car hum, breathing, and cross-talk complicate the job.
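The question VAD answers can be shown with a deliberately crude energy gate. This is a stand-in for illustration, not how WebRTC VAD or Silero VAD work internally, since real detectors are considerably more sophisticated. The 16-bit little-endian mono format and the threshold of 500 are assumptions you would tune against your own recordings.

```python
import math
import struct


def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM."""
    n = len(frame) // 2
    if n == 0:
        return 0.0
    samples = struct.unpack(f"<{n}h", frame[:n * 2])
    return math.sqrt(sum(s * s for s in samples) / n)


def is_speech(frame: bytes, threshold: float = 500.0) -> bool:
    """Crude energy gate; the threshold is an assumed value that needs tuning."""
    return rms(frame) >= threshold


silent_frame = bytes(640)                               # 20 ms of zeros
loud_frame = struct.pack("<320h", *([2000] * 320))      # constant amplitude
```

A gate like this opens on keyboard noise and car hum just as happily as on speech, which is exactly why the tuning warning above matters.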
Why does this matter so much? Because endpointing decides when the system believes the turn is over. If you endpoint too early, you cut the user off. If you endpoint too late, you add the awkward pause. That trade-off is emotional, not only technical. Silence-based endpointing is simple. Wait for a threshold of quiet and then close the turn. Semantic endpointing is smarter. It considers whether the utterance sounds complete. Hybrid endpointing mixes both ideas. That is usually the practical path.
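Silence-based endpointing fits in a few lines. The sketch below assumes a per-frame speech flag from some VAD, 20 ms frames, and a 700 ms quiet threshold. The function name `find_endpoint` and both numbers are illustrative assumptions, not recommended values.

```python
from typing import Iterable, Optional


def find_endpoint(speech_flags: Iterable[bool],
                  frame_ms: int = 20,
                  silence_ms: int = 700) -> Optional[int]:
    """Return the frame index at which the turn closes, or None.

    The turn closes once `silence_ms` of continuous non-speech has followed
    at least one speech frame.
    """
    needed = silence_ms // frame_ms
    heard_speech = False
    quiet_run = 0
    for i, speaking in enumerate(speech_flags):
        if speaking:
            heard_speech = True
            quiet_run = 0          # any speech resets the silence timer
        elif heard_speech:
            quiet_run += 1
            if quiet_run >= needed:
                return i
    return None


# Speech, a 100 ms pause, more speech, then a long silence.
flags = [True] * 5 + [False] * 5 + [True] * 5 + [False] * 40
patient = find_endpoint(flags, silence_ms=700)
hasty = find_endpoint(flags, silence_ms=80)
```

With the 700 ms threshold the mid-turn pause is ignored and the turn closes only in the long silence. With the 80 ms threshold the same pause ends the turn while the user is still mid-thought, which is the cut-off failure described above.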
See the state picture.
┌────────┐   speech   ┌───────────┐  silence  ┌────────────┐
│  idle  │───────────→│ listening │──────────→│ maybe done │
└────────┘            └───────────┘           └────────────┘
    ▲                       │                       │
    └──────── noise ────────┴──── resume speech ────┘
Real users pause mid-thought. They say "uh," "wait," "actually," and "let me think." A rigid silence timer mistakes those for turn endings. That is why hybrid logic often wins. The ear watches energy. The brain can contribute a semantic hint. The relay race continues, but commitment waits for stronger evidence. Simple, no?
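One way to sketch the hybrid idea is a small state machine with two silence thresholds. A short silence closes the turn only if a semantic hint agrees. A long silence closes it regardless. Everything here is an assumption for illustration: the class name, the 300 ms and 1200 ms thresholds, the extra `DONE` state, and especially `naive_looks_complete`, which is a toy stand-in for whatever hint the brain provides.

```python
from enum import Enum
from typing import Callable


class TurnState(Enum):
    IDLE = "idle"
    LISTENING = "listening"
    MAYBE_DONE = "maybe_done"
    DONE = "done"            # terminal state added for this sketch


def naive_looks_complete(text: str) -> bool:
    """Toy semantic hint: a turn ending in a filler word is not complete."""
    trailing = ("uh", "um", "wait", "actually", "and", "but", "so")
    words = text.lower().rstrip(".,?! ").split()
    return bool(words) and words[-1] not in trailing


class HybridEndpointer:
    def __init__(self, looks_complete: Callable[[str], bool],
                 frame_ms: int = 20, short_ms: int = 300, long_ms: int = 1200):
        self.looks_complete = looks_complete
        self.short_frames = short_ms // frame_ms
        self.long_frames = long_ms // frame_ms
        self.state = TurnState.IDLE
        self.quiet = 0

    def step(self, speaking: bool, transcript: str) -> TurnState:
        """Advance by one audio frame and return the new state."""
        if self.state is TurnState.DONE:
            return self.state
        if speaking:                          # speech starts or resumes
            self.state = TurnState.LISTENING
            self.quiet = 0
            return self.state
        if self.state is TurnState.IDLE:      # silence before any speech
            return self.state
        self.quiet += 1
        self.state = TurnState.MAYBE_DONE
        if self.quiet >= self.long_frames:
            self.state = TurnState.DONE       # acoustic evidence alone
        elif (self.quiet >= self.short_frames
              and self.looks_complete(transcript)):
            self.state = TurnState.DONE       # acoustic plus semantic evidence
        return self.state
```

A caller would feed one VAD decision per frame along with the latest partial text. An utterance that trails off in "uh" then gets the longer grace period, while a complete-sounding sentence closes quickly.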
4) Latency tactics and the cracks exposed by real speech
Accents expose weak phoneme assumptions. Jargon exposes vocabulary gaps. Noisy rooms expose fragile preprocessing. Children, elders, and mixed-language speakers expose brittle benchmarks. So never judge the ear only on clean demo clips. Judge it on the people who actually pay for the product. That is where cracks appear.
Latency tactics start with boring discipline. Warm connections before the user speaks. Use short chunks without going absurdly tiny. Co-locate regions so packets do not cross continents unnecessarily. Reuse sessions when the provider supports it. Avoid extra encode-decode loops between browser, server, and vendor. Measure first partial time separately from final transcript time. Those two numbers tell different stories.
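Measuring the two numbers separately might look like the sketch below. It assumes first-partial latency is counted from speech start and final latency from speech end. That choice of reference points, the `TurnLatency` class, and its hook names are assumptions for illustration. The clock is injectable so the logic can be checked without real audio.

```python
import time
from typing import Callable, Dict, Optional


class TurnLatency:
    """Collects per-turn timestamps and reports two separate latencies."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.speech_start: Optional[float] = None
        self.speech_end: Optional[float] = None
        self.first_partial: Optional[float] = None
        self.final: Optional[float] = None

    def on_speech_start(self) -> None:
        self.speech_start = self.clock()

    def on_partial(self) -> None:
        if self.first_partial is None:      # only the first partial counts
            self.first_partial = self.clock()

    def on_speech_end(self) -> None:
        self.speech_end = self.clock()

    def on_final(self) -> None:
        self.final = self.clock()

    def report_ms(self) -> Dict[str, Optional[float]]:
        def delta(a: Optional[float], b: Optional[float]) -> Optional[float]:
            if a is None or b is None:
                return None
            return round((b - a) * 1000, 1)

        return {
            "first_partial_ms": delta(self.speech_start, self.first_partial),
            "final_after_speech_end_ms": delta(self.speech_end, self.final),
        }
```

The first number says how quickly the ear starts reacting. The second says how long the user waits after they stop talking. Tracking them per turn keeps a fast first partial from hiding a slow final, or the reverse.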
Also instrument error buckets. Was speech missed because VAD never opened? Did endpointing close too early? Did partials revise too often? Did a proper noun fail repeatedly? Did region latency spike? Was the microphone itself clipping? The ear is a pipeline, not a black box. Once you name the stages, improvement becomes systematic. The awkward pause stops feeling mystical. It becomes a budget problem.
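The questions above map naturally onto named buckets. In the sketch below, every trace field name, the check order, and the thresholds (more than 3 revisions, more than 150 ms round trip) are hypothetical. Substitute whatever your own logs actually record.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def bucket_for(trace: Dict[str, Any]) -> str:
    """Map one failed turn to the first matching bucket.

    Checks run roughly in pipeline order: microphone, VAD, endpointing,
    partial stability, vocabulary, network.
    """
    if trace.get("mic_clipping"):
        return "mic_clipping"
    if not trace.get("vad_opened", True):
        return "vad_never_opened"
    if trace.get("cut_off_mid_speech"):
        return "endpoint_too_early"
    if trace.get("revision_count", 0) > 3:        # assumed threshold
        return "unstable_partials"
    if trace.get("proper_noun_miss"):
        return "proper_noun_miss"
    if trace.get("network_rtt_ms", 0) > 150:      # assumed threshold
        return "region_latency"
    return "unclassified"


def summarize(traces: Iterable[Dict[str, Any]]) -> Counter:
    """Count failed turns per bucket."""
    return Counter(bucket_for(t) for t in traces)


sample = [
    {"vad_opened": False},
    {"cut_off_mid_speech": True},
    {"cut_off_mid_speech": True},
    {"network_rtt_ms": 220},
    {},
]
summary = summarize(sample)
```

Sorting the summary with `summary.most_common()` shows which stage to fix first.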
One more practical lesson. Warm models and warm sockets matter more than clever dashboards. The best monitoring in the world cannot rescue a cold start during a call. So what to do? Keep hot paths hot. Test accents intentionally. Log timestamps precisely. And remember that the ear is only the first runner. If it starts late, everybody behind it loses.
Where this lives in the wild
- AI receptionist — backend engineer: tune VAD and endpointing so callers are not cut off mid-sentence.
- Sales call copilot — applied scientist: separate partial transcript quality from final transcript quality.
- Medical dictation assistant — platform engineer: inspect word-level timestamps when clinical terms drift.
- Language tutoring app — product engineer: handle accent variation without treating every pause as turn end.
- Drive-thru ordering bot — reliability engineer: co-locate streaming ASR regions to shrink first partial delay.
Pause and recall
- Why is batch ASR a poor default for live conversation?
- Why should partial transcripts not be treated as final truth?
- What is the difference between VAD and endpointing?
- Why do timestamps help debug the ear more effectively?
Interview Q&A
Q: What is the key behavioral difference between batch ASR and streaming ASR?
A: Batch waits for the full recording, while streaming updates the transcript incrementally so the relay race can begin early.
Common wrong answer to avoid: Streaming ASR just means a faster batch model.

Q: Why are partial transcripts dangerous when handled carelessly?
A: Because they can revise. If downstream systems treat them as final truth, tools fire early and responses drift.
Common wrong answer to avoid: Partials are unreliable, so never use them for anything.

Q: What job does VAD do that ASR does not?
A: VAD decides whether speech is present at all. ASR then turns speech into words.
Common wrong answer to avoid: VAD and transcription are basically the same task.

Q: Why do hybrid endpointing rules often beat pure silence timers?
A: Real users pause mid-thought, so silence alone cuts turns too early. Hybrid logic uses both acoustic and semantic clues.
Common wrong answer to avoid: Just wait a fixed long silence and the problem disappears.
Apply now (5 min)
Exercise: Write three event labels for a streaming ASR pipeline: partial, stable, and final. For each label, note what downstream actions are allowed. Keep the rules short and strict.
Sketch from memory: Draw the ear receiving chunks, emitting partial text, and closing on endpointing. Label VAD, timestamps, the relay race, and the awkward pause. If you cannot label revision points, redraw once.
Bridge. The ear heard the user. Now the voice must answer without sounding slow or robotic. → 04-text-to-speech.md