05. Streaming Pipeline: the relay race must keep moving
~16 min read. Voice feels magical when work overlaps instead of waiting politely.
Built on the ELI5 in 00-eli5.md. The relay race — one baton passed at the right instant — shows how the ear, the brain, and the voice cooperate without dropping time.
First picture: think interpreter booth, not voicemail inbox
Picture the United Nations interpreter booth. One speaker talks. The interpreter does not wait for the whole speech. They hear a phrase, process a phrase, and speak a phrase. That is the relay race. The ear catches sound. The brain builds meaning. The voice starts speaking before everything is finished.
Simple, no? A voice app works best in exactly that shape. It should not upload one giant audio file, wait silently, and answer after a painful gap. That gap is the awkward pause. Users forgive imperfect wording sooner than dead air. So the pipeline must keep flowing.
   mic audio
       │
       ▼
┌──────────────┐   chunks   ┌──────────────┐   tokens   ┌──────────────┐
│   the ear    ├───────────▶│  the brain   ├───────────▶│  the voice   │
└──────┬───────┘            └──────┬───────┘            └──────┬───────┘
       │                           │                           │
       └───────────── relay race ──┴───────────── relay race ──┘
                                                               │
                                                               ▼
                                                       user hears audio
That is why WebSockets fit realtime voice work. HTTP request-response feels like posting letters. WebSocket feels like an open headset line. Look. The ear, the brain, and the voice need that open line. Otherwise the relay race keeps stopping at every handoff.
Why the socket stays open
See the shape first. One conversation contains many tiny events. Audio chunks arrive every few tens of milliseconds. Partial transcripts appear before final transcripts. Model tokens show up gradually. TTS audio chunks follow the token stream. Interrupts can arrive anytime. Heartbeats keep the line healthy. So what to do?
Keep one channel alive, and pass many events through it.
client headset          server stack
│                                  │
├── audio chunk ──────────────────▶│
├── vad: speech_started ──────────▶│
│                                  ├── partial transcript ───────▶
│                                  ├── final transcript ─────────▶
│◀──────── model token ────────────┤
│◀──────── tts audio chunk ────────┤
├── interrupt signal ─────────────▶│
├── heartbeat ────────────────────▶│
│◀──────── heartbeat ack ──────────┤
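One way to picture "one channel, many events" is a small envelope around every message. The sketch below is only an illustration, not a real protocol. The field names `type`, `seq`, and `payload`, and the helpers `encode_event` and `decode_event`, are assumptions made for this example. A plain list stands in for the open socket.

```python
import json
from itertools import count

_seq = count(1)  # hypothetical per-session sequence counter


def encode_event(event_type: str, payload: dict) -> str:
    """Wrap one event in a JSON envelope so many event kinds can share one channel."""
    return json.dumps({"type": event_type, "seq": next(_seq), "payload": payload})


def decode_event(raw: str) -> tuple[str, int, dict]:
    """Unwrap an envelope back into (type, seq, payload)."""
    msg = json.loads(raw)
    return msg["type"], msg["seq"], msg["payload"]


# A plain list stands in for the socket: every frame travels on the same line.
wire = [
    encode_event("client.audio.chunk", {"bytes": 640}),
    encode_event("client.vad.state", {"state": "speech_started"}),
    encode_event("session.heartbeat", {}),
]

for raw in wire:
    event_type, seq, payload = decode_event(raw)
    print(seq, event_type, payload)
```

The sequence number is there so a receiver could notice gaps or reordering. Whether a real stack needs it depends on the transport.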
What moves in the relay race
A good pipeline names its events clearly. Do not send mystery blobs. Give each baton a label. Then the voice orchestrator can route traffic safely. Think of the voice orchestrator as the traffic controller. It does not perform every job itself. It decides who goes next, who waits, and who gets cancelled. The ear, the brain, and the voice stay simpler when orchestration is explicit. Common events look like this.
| Event | Direction | What it means | Why it matters |
|---|---|---|---|
| `client.audio.chunk` | client → server | Raw mic bytes or encoded frames | Feeds the ear continuously |
| `client.vad.state` | client → server | Speech started, speech ended, maybe silence | Helps turn-taking logic |
| `asr.partial` | server → client or internal bus | Early text guess from the ear | Lets UI show live hearing |
| `asr.final` | server → internal bus | Stable transcript from the ear | Safe input for the brain |
| `llm.token` | server internal | One token or token group from the brain | Starts answer early |
| `tts.audio.chunk` | server → client | Synthesized audio bytes from the voice | Starts playback early |
| `client.interrupt` | client → server | User spoke or pressed stop | Cancels talking fast |
| `session.heartbeat` | both ways | Keepalive pulse | Detects dead links |
| `error.stage` | server → client | Named failure point | Makes debugging honest |

Notice the pattern. Everything is small, typed, and time-sensitive. The relay race fails when one baton hides inside a giant package. The relay race also fails when names are vague. `event: update` is too fuzzy. Update of what exactly? The ear? The brain? The voice? Be precise. That precision helps logs, metrics, and debugging under pressure.
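Naming rules are easier to keep when code enforces them. The sketch below is a minimal example under stated assumptions: the event names come from the table, while `EventType`, `Event`, and `parse_event` are hypothetical helpers invented for this illustration.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(str, Enum):
    """The named batons from the event table."""
    CLIENT_AUDIO_CHUNK = "client.audio.chunk"
    CLIENT_VAD_STATE = "client.vad.state"
    ASR_PARTIAL = "asr.partial"
    ASR_FINAL = "asr.final"
    LLM_TOKEN = "llm.token"
    TTS_AUDIO_CHUNK = "tts.audio.chunk"
    CLIENT_INTERRUPT = "client.interrupt"
    SESSION_HEARTBEAT = "session.heartbeat"
    ERROR_STAGE = "error.stage"


@dataclass(frozen=True)
class Event:
    type: EventType
    payload: dict


def parse_event(name: str, payload: dict) -> Event:
    """Accept only named batons; reject mystery blobs such as 'update'."""
    try:
        return Event(EventType(name), payload)
    except ValueError:
        raise ValueError(f"unknown event type: {name!r}") from None


print(parse_event("asr.partial", {"text": "book a tab"}).type.value)
try:
    parse_event("update", {})
except ValueError as exc:
    print(exc)
```

Rejecting unknown names at the edge keeps vague events out of logs and metrics. A stricter version could also validate each payload shape.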
Chunked processing means never waiting for complete files
This idea matters more than many people expect. Chunked processing means moving work in small pieces. Never wait for the complete recording if you can avoid it. The ear should transcribe from rolling windows. The brain should begin once meaning is stable enough.
The voice should synthesize as answer segments appear. Look. The relay race works because each runner trusts the next checkpoint. No runner waits for the full marathon to finish.
bad path
user speaks ──▶ record full file ──▶ upload ──▶ transcribe ──▶ generate ──▶ synthesize ──▶ play
better path
user speaks ──▶ chunk 1 ──▶ ear hears ──▶ brain starts ──▶ voice starts ──▶ play chunk 1
                chunk 2 ──▶ ear hears ──▶ brain continues ──▶ voice continues ──▶ play chunk 2
If TTFT (time to first token) is healthy, the voice can begin sooner. A practical mental model is this.
- The ear wants stable enough speech quickly.
- The brain wants enough context to begin safely.
- The voice wants text segments early, even if later segments change.
- The relay race wants overlap everywhere possible.
- The awkward pause punishes any stage that waits for full completion.

See. Pipeline design is mostly overlap design.
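The chunk-by-chunk handoff can be sketched with Python async generators. Everything here is a stand-in: `mic`, `stage`, and `run_turn` are hypothetical names, and the stages only forward text instead of doing real speech work. The point is the order of events, which the log records.

```python
import asyncio
from typing import AsyncIterator


async def mic(chunks: list[str], log: list[str]) -> AsyncIterator[str]:
    """Stand-in microphone: hands over one small chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(0)  # give other tasks a turn, as a capture loop would
        log.append(f"mic:{chunk}")
        yield chunk


async def stage(name: str, upstream: AsyncIterator[str], log: list[str]) -> AsyncIterator[str]:
    """Stand-in for the ear, the brain, or the voice: forwards each piece at once."""
    async for piece in upstream:
        log.append(f"{name}:{piece}")
        yield piece


async def run_turn(chunks: list[str]) -> list[str]:
    """Chain mic -> ear -> brain -> voice and record who touched what, in order."""
    log: list[str] = []
    pipeline = stage("voice", stage("brain", stage("ear", mic(chunks, log), log), log), log)
    async for piece in pipeline:
        log.append(f"play:{piece}")
    return log


log = asyncio.run(run_turn(["chunk1", "chunk2"]))
print(log)
```

In the log, `play:chunk1` appears before `mic:chunk2`, so the first chunk is heard before the second is even captured. This pull-based sketch moves one chunk at a time in lockstep. A production pipeline would likely put queues and separate tasks between stages so they truly run at the same time.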
Serial timing versus overlapping timing
Now let us make the win concrete. Assume one user finishes speaking. Then the system responds. In a serial design, each stage waits for the previous stage to finish fully. That feels safe, but it burns time.
| Stage | Serial timing |
|---|---|
| End-of-turn detection | 220 ms |
| STT finalize | 260 ms |
| LLM TTFT | 420 ms |
| TTS first audio | 310 ms |
| Playback start buffer | 160 ms |
| Total | 1370 ms |

That is over one second before the user hears anything useful. Now overlap the same work. VAD closes the turn. The ear finalizes quickly. The brain starts as soon as the transcript is good enough. The voice starts on the first answer segment. Playback begins with a tiny safety buffer.

| Stage | Overlapping timing |
|---|---|
| End-of-turn detection | 220 ms |
| STT running overlap | 180 ms |
| LLM TTFT from stable text | 300 ms |
| TTS first audio overlap | 170 ms |
| Playback start buffer | 50 ms |
| Total heard delay | 920 ms |

Same product goal. Very different feel. The relay race shaved about 450 milliseconds. Users notice that immediately. Here is the timeline picture.

serial
┌────220────┐┌────260────┐┌────420────┐┌────310────┐┌──160──┐
  VAD done     STT final    LLM TTFT     TTS first    play
                                                            ▼
                                                         1370 ms
overlap
┌────220────┐
  VAD done
            └────180────┐
             STT stable └────300────┐
                          LLM TTFT  └────170────┐┌─50─┐
                                      TTS first   play
                                                      ▼
                                                   920 ms
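The arithmetic behind the two timing tables can be checked in a few lines. The stage names and numbers are copied from the tables. The variable names are just for this sketch.

```python
# Per-stage delays in milliseconds, copied from the serial and overlapping tables.
SERIAL_MS = {
    "End-of-turn detection": 220,
    "STT finalize": 260,
    "LLM TTFT": 420,
    "TTS first audio": 310,
    "Playback start buffer": 160,
}
OVERLAP_MS = {
    "End-of-turn detection": 220,
    "STT running overlap": 180,
    "LLM TTFT from stable text": 300,
    "TTS first audio overlap": 170,
    "Playback start buffer": 50,
}

serial_total = sum(SERIAL_MS.values())
overlap_total = sum(OVERLAP_MS.values())
saved = serial_total - overlap_total

# Stage-by-stage saving, pairing the rows in table order.
per_stage = [s - o for s, o in zip(SERIAL_MS.values(), OVERLAP_MS.values())]

print(serial_total, overlap_total, saved)  # 1370 920 450
print(per_stage)                           # [0, 80, 120, 140, 110]
```

End-of-turn detection saves nothing in this example. Every later stage contributes part of the 450 ms.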
A practical latency budget by stage
Even before formal budgeting, you should sketch stage targets. That keeps arguments honest. A sample starting budget looks like this.
| Stage | Good starting target | What to watch |
|---|---|---|
| Mic capture and network send | 30-80 ms | Chunk size, mobile jitter |
| End-of-turn detection | 200-400 ms | Silence threshold too long |
| Ear partial-to-final | 100-250 ms | Decoder speed, language mix |
| Brain TTFT | 200-500 ms | Prompt size, model load |
| Voice first audio | 100-300 ms | TTS setup and buffering |
| Playback start | 50-100 ms | Client jitter buffer |

This is not the full budgeting lesson yet. But you can already see the suspects. If the awkward pause is large, check stage by stage. Do not say, "the system is slow," and stop there. Ask better questions. Is the ear late? Is the brain verbose before answering? Is the voice waiting for too much text? Is the relay race blocked by orchestration? That is how senior debugging begins.
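Checking stage by stage can be made mechanical. The sketch below takes the upper bound of each target range from the budget table. The stage keys, the `over_budget` helper, and the sample measurements are all made up for illustration.

```python
# Upper bounds (ms) from the "Good starting target" column of the budget table.
BUDGET_MAX_MS = {
    "mic_capture_and_send": 80,
    "end_of_turn_detection": 400,
    "ear_partial_to_final": 250,
    "brain_ttft": 500,
    "voice_first_audio": 300,
    "playback_start": 100,
}


def over_budget(measured_ms: dict[str, float]) -> list[tuple[str, float]]:
    """Return (stage, overshoot_ms) for every stage above its target, worst first."""
    late = [
        (stage, ms - BUDGET_MAX_MS[stage])
        for stage, ms in measured_ms.items()
        if ms > BUDGET_MAX_MS[stage]
    ]
    return sorted(late, key=lambda item: item[1], reverse=True)


# Hypothetical measurements for one slow turn (invented numbers).
sample = {
    "mic_capture_and_send": 60,
    "end_of_turn_detection": 650,
    "ear_partial_to_final": 200,
    "brain_ttft": 540,
    "voice_first_audio": 180,
    "playback_start": 90,
}
print(over_budget(sample))
```

For this sample, end-of-turn detection is the worst suspect and brain TTFT is a smaller one. That gives a better starting question than "the system is slow."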
Where this lives in the wild
- OpenAI Realtime voice demo — realtime engineer: streams mic frames, model events, and audio replies over one persistent socket.
- Google Meet translated captions — speech platform engineer: overlaps hearing, understanding, and speaking so conversations stay natural.
- Twilio voice bot stack — voice application engineer: treats interrupts, partial transcripts, and TTS chunks as live events.
- Duolingo speaking tutor — conversational AI engineer: starts feedback quickly instead of waiting for perfect full-turn analysis.
- Zoom AI Companion voice features — orchestration engineer: coordinates many event types while keeping the awkward pause under control.
Pause and recall
- Why does a WebSocket fit the relay race better than repeated HTTP requests?
- What does chunked processing change for the ear, the brain, and the voice?
- Why is TTFT useful even though voice systems care about more than text?
- In the timing example, what exactly created the drop from 1370 ms to 920 ms?
Interview Q&A
Q: Why do realtime voice systems prefer long-lived, bidirectional WebSockets?
A: Because one conversation carries many tiny events in both directions, and reopening connections adds avoidable latency and state complexity.
Common wrong answer to avoid: "Because WebSockets are newer, so they are always faster."

Q: What is the voice orchestrator doing in a cascaded pipeline?
A: It acts like a traffic controller, routing audio, transcript, token, and interruption events so each specialist service stays coordinated.
Common wrong answer to avoid: "It is just another name for the LLM."

Q: Why is chunked processing essential for voice UX?
A: Small pieces let downstream stages begin early, which cuts the awkward pause and keeps the relay race moving.
Common wrong answer to avoid: "Chunking mainly helps storage, not responsiveness."

Q: What does TTFT tell you in a voice stack?
A: It tells you when the brain first emits usable output, which strongly affects how soon the voice can begin speaking.
Common wrong answer to avoid: "TTFT is only relevant for text chat interfaces."
Apply now (5 min)
Exercise. Draw a WebSocket event list for one user turn. Include `client.audio.chunk`, `client.vad.state`, `asr.partial`, `asr.final`, `llm.token`, `tts.audio.chunk`, and `client.interrupt`. Then circle which component owns each event.
Sketch from memory. Redraw the relay race diagram. Label where the ear, the brain, and the voice overlap. Mark the awkward pause with a red warning in your notebook.
Bridge. The pipeline works because stages overlap. Next we assign formal budgets to each stage, instead of relying on vibes. → 06-latency-budgeting.md