05. Streaming Pipeline: the relay race must keep moving

~16 min read. Voice feels magical when work overlaps instead of waiting politely.

Built on the ELI5 in 00-eli5.md. The relay race — one baton passed at the right instant — shows how the ear, the brain, and the voice cooperate without dropping time.

First picture: think interpreter booth, not voicemail inbox

Picture the United Nations interpreter booth. One speaker talks. The interpreter does not wait for the whole speech. They hear a phrase, process a phrase, and speak a phrase. That is the relay race. The ear catches sound. The brain builds meaning. The voice starts speaking before everything is finished.

Simple, no? A voice app works best in exactly that shape. It should not upload one giant audio file, wait silently, and answer after a painful gap. That gap is the awkward pause. Users forgive imperfect wording sooner than dead air. So the pipeline must keep flowing.

mic audio
   │
   ▼
┌──────────────┐   chunks   ┌──────────────┐   tokens   ┌──────────────┐
│   the ear    ├───────────▶│  the brain   ├───────────▶│  the voice   │
└──────┬───────┘            └──────┬───────┘            └──────┬───────┘
       │                            │                            │
       └────────────── relay race ──┴────────────── relay race ──┘
                                      │
                                      ▼
                                user hears audio

Now add transport. Usually that transport is a WebSocket. A WebSocket is a long-lived connection. It stays open across many small messages. And it is bidirectional. Client sends audio up. Server sends text and audio down. Both sides can also send control signals.

That is why WebSockets fit realtime voice work. HTTP request-response feels like posting letters. WebSocket feels like an open headset line. Look. The ear, the brain, and the voice need that open line. Otherwise the relay race keeps stopping at every handoff.

Why the socket stays open

See the shape first. One conversation contains many tiny events. Audio chunks arrive every few milliseconds. Partial transcripts appear before final transcripts. Model tokens show up gradually. TTS audio chunks follow the token stream. Interrupts can arrive anytime. Heartbeats keep the line healthy. So what to do?

Keep one channel alive, and pass many events through it.

client headset                     server stack
     │                                  │
     ├── audio chunk ──────────────────▶│
     ├── vad: speech_started ──────────▶│
     │                                  ├── partial transcript ───────▶
     │                                  ├── final transcript ─────────▶
     │◀──────── model token ────────────┤
     │◀──────── tts audio chunk ────────┤
     ├── interrupt signal ─────────────▶│
     ├── heartbeat ────────────────────▶│
     │◀──────── heartbeat ack ──────────┤

This long-lived socket reduces setup overhead. You authenticate once, attach session state once, and keep moving. Yes? If you reopen a fresh connection for every stage, latency grows, state handling gets messy, and failure paths multiply. That is how the awkward pause sneaks back.
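
The heartbeat and heartbeat ack in the flow above can become a simple dead-link check. Below is a minimal sketch, not a definitive implementation. The `HeartbeatMonitor` class, the `FakeClock` helper, and the 10 second timeout are all assumptions made for illustration. The clock is injectable so the logic can be exercised without real waiting.

```python
import time
from typing import Callable


class HeartbeatMonitor:
    """Tracks the last heartbeat ack and reports when the link looks dead."""

    def __init__(self, timeout_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s   # assumed value, tune per product
        self._clock = clock          # injectable so tests need no sleeping
        self._last_ack = clock()

    def on_ack(self) -> None:
        """Call whenever a heartbeat ack arrives from the peer."""
        self._last_ack = self._clock()

    def is_dead(self) -> bool:
        """True when no ack has arrived within the timeout window."""
        return self._clock() - self._last_ack > self.timeout_s


class FakeClock:
    """Manual clock for demos and tests (hypothetical helper)."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now
```

In a real session the orchestrator would call `on_ack()` from its receive loop and poll `is_dead()` on a timer. When the link is dead, it can tear down the ear, the brain, and the voice together instead of letting them wait on a silent socket.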

What moves in the relay race

A good pipeline names its events clearly. Do not send mystery blobs. Give each baton a label. Then the voice orchestrator can route traffic safely. Think of the voice orchestrator as the traffic controller. It does not perform every job itself. It decides who goes next, who waits, and who gets cancelled. The ear, the brain, and the voice stay simpler when orchestration is explicit. Common events look like this.

| Event | Direction | What it means | Why it matters |
| --- | --- | --- | --- |
| `client.audio.chunk` | client → server | Raw mic bytes or encoded frames | Feeds the ear continuously |
| `client.vad.state` | client → server | Speech started, speech ended, maybe silence | Helps turn-taking logic |
| `asr.partial` | server → client or internal bus | Early text guess from the ear | Lets UI show live hearing |
| `asr.final` | server → internal bus | Stable transcript from the ear | Safe input for the brain |
| `llm.token` | server internal | One token or token group from the brain | Starts answer early |
| `tts.audio.chunk` | server → client | Synthesized audio bytes from the voice | Starts playback early |
| `client.interrupt` | client → server | User spoke or pressed stop | Cancels talking fast |
| `session.heartbeat` | both ways | Keepalive pulse | Detects dead links |
| `error.stage` | server → client | Named failure point | Makes debugging honest |

Notice the pattern. Everything is small, typed, and time-sensitive. The relay race fails when one baton hides inside a giant package. The relay race also fails when names are vague. `event: update` is too fuzzy. Update of what exactly? The ear? The brain? The voice? Be precise.

That precision helps logs, metrics, and debugging under pressure.
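
Typed events can be sketched in a few lines. The event names below come from the table above. Everything else is an assumption for illustration: the JSON envelope, the `seq` field, and the `Event`, `parse_event`, and `Orchestrator` names. Real systems often send audio as binary frames instead of JSON, so treat this as a sketch of the control-event path only.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

# Event names from the table above; the envelope shape is an assumption.
KNOWN_EVENTS = {
    "client.audio.chunk", "client.vad.state", "asr.partial", "asr.final",
    "llm.token", "tts.audio.chunk", "client.interrupt",
    "session.heartbeat", "error.stage",
}


@dataclass(frozen=True)
class Event:
    type: str                 # one of KNOWN_EVENTS
    seq: int                  # hypothetical per-session sequence number
    payload: Dict[str, Any]


def parse_event(raw: str) -> Event:
    """Decode one JSON message and reject vague or unknown event names."""
    data = json.loads(raw)
    if data.get("type") not in KNOWN_EVENTS:
        raise ValueError(f"unknown event type: {data.get('type')!r}")
    return Event(type=data["type"], seq=int(data["seq"]),
                 payload=data.get("payload", {}))


class Orchestrator:
    """Routes each typed event to the handlers registered for it."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Callable[[Event], None]]] = {}

    def on(self, event_type: str, handler: Callable[[Event], None]) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def dispatch(self, event: Event) -> int:
        """Run matching handlers and return how many were called."""
        handlers = self._handlers.get(event.type, [])
        for handler in handlers:
            handler(event)
        return len(handlers)
```

A fuzzy name like `update` fails at `parse_event`, before it can reach any stage. That is the point of labeling every baton.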

Chunked processing means never waiting for complete files

This idea matters more than many people expect. Chunked processing means moving work in small pieces. Never wait for the complete recording if you can avoid it. The ear should transcribe from rolling windows. The brain should begin once meaning is stable enough. The voice should synthesize as answer segments appear.

Look. The relay race works because each runner trusts the next checkpoint. No runner waits for the full marathon to finish.

bad path
user speaks ──▶ record full file ──▶ upload ──▶ transcribe ──▶ generate ──▶ synthesize ──▶ play

better path
user speaks ──▶ chunk 1 ──▶ ear hears ──▶ brain starts ──▶ voice starts ──▶ play chunk 1
                 chunk 2 ──▶ ear hears ──▶ brain continues ──▶ voice continues ──▶ play chunk 2
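
The better path can be sketched with `asyncio` queues, where each stage runs as its own task and hands over every result the moment it is ready. This is a toy sketch under stated assumptions. The `ear`, `brain`, and `voice` functions are stand-ins that sleep briefly instead of calling real STT, LLM, or TTS services, and `run_stage` and `run_pipeline` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional

Stage = Callable[[str], Awaitable[str]]


async def run_stage(work: Stage, inbox: asyncio.Queue,
                    outbox: asyncio.Queue) -> None:
    """Process items one at a time and pass each result on immediately."""
    while True:
        item: Optional[str] = await inbox.get()
        if item is None:              # sentinel: upstream is finished
            await outbox.put(None)
            return
        await outbox.put(await work(item))


async def ear(chunk: str) -> str:
    await asyncio.sleep(0.01)         # stand-in for streaming STT work
    return f"text({chunk})"


async def brain(text: str) -> str:
    await asyncio.sleep(0.01)         # stand-in for token generation
    return f"reply({text})"


async def voice(reply: str) -> str:
    await asyncio.sleep(0.01)         # stand-in for TTS synthesis
    return f"audio({reply})"


async def run_pipeline(chunks: List[str]) -> List[str]:
    mic, heard, answered, spoken = (asyncio.Queue() for _ in range(4))
    stages = [
        asyncio.create_task(run_stage(ear, mic, heard)),
        asyncio.create_task(run_stage(brain, heard, answered)),
        asyncio.create_task(run_stage(voice, answered, spoken)),
    ]
    for chunk in chunks:
        await mic.put(chunk)
    await mic.put(None)

    played: List[str] = []
    while (item := await spoken.get()) is not None:
        played.append(item)           # playback would start here, per chunk
    await asyncio.gather(*stages)
    return played
```

Calling `asyncio.run(run_pipeline(["c1", "c2"]))` returns the audio pieces in order. While the voice works on chunk 1, the ear is already free to work on chunk 2. No stage waits for the full recording.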

This is also where TTFT appears. TTFT means time to first token. For voice systems, you should care about more than TTFT. Still, TTFT is a very useful checkpoint. It tells you when the brain first contributes visible output. If TTFT is slow, the awkward pause grows. If TTFT is healthy, the voice can begin sooner.

A practical mental model is this.

The ear wants stable enough speech quickly.
The brain wants enough context to begin safely.
The voice wants text segments early, even if later segments change.
The relay race wants overlap everywhere possible.
The awkward pause punishes any stage that waits for full completion.

See. Pipeline design is mostly overlap design.
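
TTFT is easy to measure once the brain's output is a stream. Here is a minimal sketch, assuming the tokens arrive as a plain Python iterable. The `measure_ttft` name is hypothetical. The timer starts when the consumer first pulls from the wrapper, so wrap the stream right where the request is sent.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple


def measure_ttft(tokens: Iterable[str],
                 clock: Callable[[], float] = time.perf_counter
                 ) -> Iterator[Tuple[str, Optional[float]]]:
    """Yield (token, ttft_ms); ttft_ms is set only on the first token."""
    start = clock()                   # runs on the consumer's first pull
    first = True
    for token in tokens:
        if first:
            yield token, (clock() - start) * 1000.0
            first = False
        else:
            yield token, None         # later tokens carry no TTFT value
```

The wrapper passes every token through untouched, so the voice can keep consuming the stream while the first pair feeds a metric or a log line.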

Serial timing versus overlapping timing

Now let us make the win concrete. Assume one user finishes speaking. Then the system responds. In a serial design, each stage waits for the previous stage to finish fully. That feels safe, but it burns time.

| Stage | Serial timing |
| --- | --- |
| End-of-turn detection | 220 ms |
| STT finalize | 260 ms |
| LLM TTFT | 420 ms |
| TTS first audio | 310 ms |
| Playback start buffer | 160 ms |
| Total | 1370 ms |

That is over one second before the user hears anything useful. Now overlap the same work. VAD closes the turn. The ear finalizes quickly. The brain starts as soon as the transcript is good enough. The voice starts on the first answer segment. Playback begins with a tiny safety buffer.

| Stage | Overlapping timing |
| --- | --- |
| End-of-turn detection | 220 ms |
| STT running overlap | 180 ms |
| LLM TTFT from stable text | 300 ms |
| TTS first audio overlap | 170 ms |
| Playback start buffer | 50 ms |
| Total heard delay | 920 ms |

Same product goal. Very different feel. The relay race shaved about 450 milliseconds. Users notice that immediately. Here is the timeline picture.

serial
┌────220────┐┌────260────┐┌────420────┐┌────310────┐┌──160──┐
VAD done     STT final    LLM TTFT     TTS first    play
                                                    ▼
                                                  1370 ms

overlap
┌────220────┐
VAD done
     └────180────┐
     STT stable  └────300────┐
                    LLM TTFT  └────170────┐┌─50─┐
                                 TTS first play
                                           ▼
                                         920 ms

Simple, no? Overlap does not mean chaos. Overlap means planned concurrency with named handoffs.
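
The totals in the two tables above can be checked with a few lines of arithmetic. The numbers are copied from the tables. The variable names are only illustrative.

```python
# Stage timings in milliseconds, copied from the two tables above.
SERIAL_MS = [
    ("End-of-turn detection", 220),
    ("STT finalize", 260),
    ("LLM TTFT", 420),
    ("TTS first audio", 310),
    ("Playback start buffer", 160),
]
OVERLAP_MS = [
    ("End-of-turn detection", 220),
    ("STT running overlap", 180),
    ("LLM TTFT from stable text", 300),
    ("TTS first audio overlap", 170),
    ("Playback start buffer", 50),
]


def total_ms(stages):
    """Sum the per-stage contributions to the heard delay."""
    return sum(ms for _, ms in stages)


serial_total = total_ms(SERIAL_MS)        # 1370
overlap_total = total_ms(OVERLAP_MS)      # 920
saved_ms = serial_total - overlap_total   # 450
saved_pct = round(100 * saved_ms / serial_total, 1)  # 32.8

# Per-stage savings, matched by row position in the two tables.
per_stage_saved = [s - o for (_, s), (_, o) in zip(SERIAL_MS, OVERLAP_MS)]
# [0, 80, 120, 140, 110]
```

End-of-turn detection saves nothing, because nothing can overlap with it in this example. The gain comes from the four stages after it, with the TTS and LLM rows contributing the most.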

A practical latency budget by stage

Even before formal budgeting, you should sketch stage targets. That keeps arguments honest. A sample starting budget looks like this.

| Stage | Good starting target | What to watch |
| --- | --- | --- |
| Mic capture and network send | 30-80 ms | Chunk size, mobile jitter |
| End-of-turn detection | 200-400 ms | Silence threshold too long |
| Ear partial-to-final | 100-250 ms | Decoder speed, language mix |
| Brain TTFT | 200-500 ms | Prompt size, model load |
| Voice first audio | 100-300 ms | TTS setup and buffering |
| Playback start | 50-100 ms | Client jitter buffer |

This is not the full budgeting lesson yet. But you can already see the suspects. If the awkward pause is large, check stage by stage. Do not say, "the system is slow," and stop there. Ask better questions. Is the ear late? Is the brain verbose before answering? Is the voice waiting for too much text? Is the relay race blocked by orchestration? That is how senior debugging begins.
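
Checking stage by stage can be automated. Here is a minimal sketch, assuming you already log one measured duration per stage. The upper bounds come from the sample budget table above. The stage keys and the `find_suspects` name are hypothetical.

```python
from typing import Dict, List, Tuple

# Upper bounds in milliseconds, taken from the sample budget table above.
BUDGET_MAX_MS: Dict[str, int] = {
    "mic_capture_and_send": 80,
    "end_of_turn_detection": 400,
    "ear_partial_to_final": 250,
    "brain_ttft": 500,
    "voice_first_audio": 300,
    "playback_start": 100,
}


def find_suspects(measured_ms: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (stage, overshoot_ms) for stages over budget, worst first."""
    over = [
        (stage, ms - BUDGET_MAX_MS[stage])
        for stage, ms in measured_ms.items()
        if stage in BUDGET_MAX_MS and ms > BUDGET_MAX_MS[stage]
    ]
    return sorted(over, key=lambda pair: pair[1], reverse=True)
```

For a turn where `brain_ttft` took 900 ms and `end_of_turn_detection` took 450 ms, the function lists the brain first with a 400 ms overshoot, then end-of-turn detection with 50 ms. That is a sharper answer than "the system is slow."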

Where this lives in the wild

OpenAI Realtime voice demo — realtime engineer: streams mic frames, model events, and audio replies over one persistent socket.
Google Meet translated captions — speech platform engineer: overlaps hearing, understanding, and speaking so conversations stay natural.
Twilio voice bot stack — voice application engineer: treats interrupts, partial transcripts, and TTS chunks as live events.
Duolingo speaking tutor — conversational AI engineer: starts feedback quickly instead of waiting for perfect full-turn analysis.
Zoom AI Companion voice features — orchestration engineer: coordinates many event types while keeping the awkward pause under control.

Pause and recall

Why does a WebSocket fit the relay race better than repeated HTTP requests?
What does chunked processing change for the ear, the brain, and the voice?
Why is TTFT useful even though voice systems care about more than text?
In the timing example, what exactly created the drop from 1370 ms to 920 ms?

Interview Q&A

Q: Why do realtime voice systems prefer long-lived, bidirectional WebSockets?
A: Because one conversation carries many tiny events in both directions, and reopening connections adds avoidable latency and state complexity.
Common wrong answer to avoid: "Because WebSockets are newer, so they are always faster."

Q: What is the voice orchestrator doing in a cascaded pipeline?
A: It acts like a traffic controller, routing audio, transcript, token, and interruption events so each specialist service stays coordinated.
Common wrong answer to avoid: "It is just another name for the LLM."

Q: Why is chunked processing essential for voice UX?
A: Small pieces let downstream stages begin early, which cuts the awkward pause and keeps the relay race moving.
Common wrong answer to avoid: "Chunking mainly helps storage, not responsiveness."

Q: What does TTFT tell you in a voice stack?
A: It tells you when the brain first emits usable output, which strongly affects how soon the voice can begin speaking.
Common wrong answer to avoid: "TTFT is only relevant for text chat interfaces."

Apply now (5 min)

Exercise. Draw a WebSocket event list for one user turn. Include `client.audio.chunk`, `client.vad.state`, `asr.partial`, `asr.final`, `llm.token`, `tts.audio.chunk`, and `client.interrupt`. Then circle which component owns each event.

Sketch from memory. Redraw the relay race diagram. Label where the ear, the brain, and the voice overlap. Mark the awkward pause with a red warning in your notebook.

Bridge. The pipeline works because stages overlap. Next we assign formal budgets to each stage, instead of relying on vibes. → 06-latency-budgeting.md