
05. Streaming Pipeline — the relay race must keep moving

~16 min read. Voice feels magical when work overlaps instead of waiting politely.

Built on the ELI5 in 00-eli5.md. The relay race — one baton passed at the right instant — shows how the ear, the brain, and the voice cooperate without dropping time.


First picture: think interpreter booth, not voicemail inbox

Picture the United Nations interpreter booth. One speaker talks. The interpreter does not wait for the whole speech. They hear a phrase, process a phrase, and speak a phrase. That is the relay race. The ear catches sound. The brain builds meaning. The voice starts speaking before everything is finished.

Simple, no? A voice app works best in exactly that shape. It should not upload one giant audio file, wait silently, and answer after a painful gap. That gap is the awkward pause. Users forgive imperfect wording sooner than dead air. So the pipeline must keep flowing.

mic audio
┌──────────────┐   chunks   ┌──────────────┐   tokens   ┌──────────────┐
│   the ear    ├───────────▶│  the brain   ├───────────▶│  the voice   │
└──────┬───────┘            └──────┬───────┘            └──────┬───────┘
       │                            │                            │
       └────────────── relay race ──┴────────────── relay race ──┘
                                user hears audio
Now add transport. Usually that transport is a WebSocket. A WebSocket is a long-lived connection. It stays open across many small messages. And it is bidirectional. Client sends audio up. Server sends text and audio down. Both sides can also send control signals.

That is why WebSockets fit realtime voice work. HTTP request-response feels like posting letters. WebSocket feels like an open headset line. Look. The ear, the brain, and the voice need that open line. Otherwise the relay race keeps stopping at every handoff.
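Here is a minimal sketch of that open headset line. It is not a real WebSocket: two asyncio queues stand in for the two directions of the socket, and the names `speak`, `server`, `client`, and the `session.end` event are hypothetical. What it shows is the shape: the server keeps talking downstream while it still listens upstream for a control signal.

```python
import asyncio
from typing import List


async def speak(down: asyncio.Queue, chunks: List[str]) -> None:
    """Stream TTS chunks down the line, one at a time."""
    for chunk in chunks:
        await down.put(("tts.audio.chunk", chunk))
        await asyncio.sleep(0.05)  # stand-in for synthesis time


async def server(up: asyncio.Queue, down: asyncio.Queue) -> None:
    """Talk downstream while still listening upstream for control signals."""
    speaking = asyncio.create_task(speak(down, ["a", "b", "c", "d", "e"]))
    while True:
        event_type, _ = await up.get()
        if event_type == "client.interrupt":
            speaking.cancel()  # stop talking right away
            try:
                await speaking
            except asyncio.CancelledError:
                pass
            await down.put(("session.end", None))  # hypothetical closing event
            return


async def client(up: asyncio.Queue, down: asyncio.Queue) -> List[str]:
    """Play chunks as they arrive and barge in after the first one."""
    heard: List[str] = []
    while True:
        event_type, payload = await down.get()
        if event_type == "tts.audio.chunk":
            heard.append(payload)
            if len(heard) == 1:
                await up.put(("client.interrupt", None))
        else:
            return heard  # any other event ends this toy session


async def main() -> List[str]:
    up: asyncio.Queue = asyncio.Queue()
    down: asyncio.Queue = asyncio.Queue()
    server_task = asyncio.create_task(server(up, down))
    heard = await client(up, down)
    await server_task
    return heard


if __name__ == "__main__":
    print(asyncio.run(main()))
```

With request-response, the interrupt would have to wait for its own round trip. On one open line it simply travels up while audio travels down.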

Why the socket stays open

See the shape first. One conversation contains many tiny events. Audio chunks arrive every few tens of milliseconds (20 ms frames are a common choice). Partial transcripts appear before final transcripts. Model tokens show up gradually. TTS audio chunks follow the token stream. Interrupts can arrive anytime. Heartbeats keep the line healthy. So what to do?

Keep one channel alive, and pass many events through it.

client headset                     server stack
     │                                  │
     ├── audio chunk ──────────────────▶│
     ├── vad: speech_started ──────────▶│
     │                                  ├── partial transcript ───────▶
     │                                  ├── final transcript ─────────▶
     │◀──────── model token ────────────┤
     │◀──────── tts audio chunk ────────┤
     ├── interrupt signal ─────────────▶│
     ├── heartbeat ────────────────────▶│
     │◀──────── heartbeat ack ──────────┤
This long-lived socket reduces setup overhead. You authenticate once, attach session state once, and keep moving. Yes? If you reopen a fresh connection for every stage, latency grows, state handling gets messy, and failure paths multiply. That is how the awkward pause sneaks back.
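A long-lived line needs a way to notice when it has quietly died. The sketch below is one possible heartbeat watchdog, not a prescribed design. The class name `HeartbeatWatchdog` and the 5 second timeout are assumptions for illustration, and the clock is injectable so the logic can be tested without sleeping.

```python
import time
from typing import Callable


class HeartbeatWatchdog:
    """Tracks the last heartbeat ack and reports when the link looks dead."""

    def __init__(
        self,
        timeout_s: float = 5.0,  # assumed threshold, tune per product
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.timeout_s = timeout_s
        self.clock = clock
        # Treat session start as the first sign of life.
        self.last_ack = clock()

    def on_ack(self) -> None:
        """Call whenever a heartbeat ack arrives from the other side."""
        self.last_ack = self.clock()

    def is_dead(self) -> bool:
        """True when no ack has arrived within the timeout window."""
        return self.clock() - self.last_ack > self.timeout_s
```

The orchestrator would call `on_ack()` from its receive loop and check `is_dead()` on a timer. When the check fires, it can tear down the session cleanly instead of streaming audio into a dead line.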

What moves in the relay race

A good pipeline names its events clearly. Do not send mystery blobs. Give each baton a label. Then the voice orchestrator can route traffic safely. Think of the voice orchestrator as the traffic controller. It does not perform every job itself. It decides who goes next, who waits, and who gets cancelled. The ear, the brain, and the voice stay simpler when orchestration is explicit. Common events look like this.

| Event | Direction | What it means | Why it matters |
| --- | --- | --- | --- |
| client.audio.chunk | client → server | Raw mic bytes or encoded frames | Feeds the ear continuously |
| client.vad.state | client → server | Speech started, speech ended, maybe silence | Helps turn-taking logic |
| asr.partial | server → client or internal bus | Early text guess from the ear | Lets UI show live hearing |
| asr.final | server → internal bus | Stable transcript from the ear | Safe input for the brain |
| llm.token | server internal | One token or token group from the brain | Starts answer early |
| tts.audio.chunk | server → client | Synthesized audio bytes from the voice | Starts playback early |
| client.interrupt | client → server | User spoke or pressed stop | Cancels talking fast |
| session.heartbeat | both ways | Keepalive pulse | Detects dead links |
| error.stage | server → client | Named failure point | Makes debugging honest |

Notice the pattern. Everything is small, typed, and time-sensitive. The relay race fails when one baton hides inside a giant package. The relay race also fails when names are vague. "event: update" is too fuzzy. Update of what exactly? The ear? The brain? The voice? Be precise. That precision helps logs, metrics, and debugging under pressure.
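One way to keep every baton labelled is a small typed envelope plus a router that refuses unknown labels. This is a sketch under assumptions: the `seq` and `payload` fields, the JSON encoding, and the `Router` class are illustrative choices, not a fixed protocol. Real audio bytes would more likely travel as binary frames than inside JSON.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Event:
    """A labelled baton: a precise type name, an order number, and a small payload."""

    type: str
    seq: int
    payload: Dict[str, Any] = field(default_factory=dict)

    def to_json(self) -> str:
        return json.dumps({"type": self.type, "seq": self.seq, "payload": self.payload})

    @classmethod
    def from_json(cls, raw: str) -> "Event":
        data = json.loads(raw)
        return cls(type=data["type"], seq=data["seq"], payload=data.get("payload", {}))


class Router:
    """Sends each event to the one handler registered for its exact type."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[Event], Any]] = {}

    def on(self, event_type: str, handler: Callable[[Event], Any]) -> None:
        self._handlers[event_type] = handler

    def dispatch(self, event: Event) -> Any:
        handler = self._handlers.get(event.type)
        if handler is None:
            # Vague or unknown names fail loudly instead of passing silently.
            raise ValueError(f"unknown event type: {event.type!r}")
        return handler(event)
```

A quick usage example:

```python
router = Router()
router.on("asr.final", lambda e: print("brain gets:", e.payload["text"]))
router.dispatch(Event.from_json('{"type": "asr.final", "seq": 7, "payload": {"text": "hello"}}'))
```

An event typed only as `update` has no handler here, so it raises instead of being guessed at. That is the precision paying off.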

Chunked processing means never waiting for complete files

This idea matters more than many people expect. Chunked processing means moving work in small pieces. Never wait for the complete recording if you can avoid it. The ear should transcribe from rolling windows. The brain should begin once meaning is stable enough. The voice should synthesize as answer segments appear.

Look. The relay race works because each runner trusts the next checkpoint. No runner waits for the full marathon to finish.

bad path
user speaks ──▶ record full file ──▶ upload ──▶ transcribe ──▶ generate ──▶ synthesize ──▶ play

better path
user speaks ──▶ chunk 1 ──▶ ear hears ──▶ brain starts ──▶ voice starts ──▶ play chunk 1
                 chunk 2 ──▶ ear hears ──▶ brain continues ──▶ voice continues ──▶ play chunk 2
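
The better path can be sketched with plain Python generators. This is a toy, assuming each stage is a trivial function named after the ear, the brain, and the voice. Lazy generators only imitate streaming here, and a real system would run the stages concurrently. Still, the log makes the key point visible: chunk 1 reaches the voice before chunk 2 reaches the ear.

```python
from typing import Iterable, Iterator, List, Tuple


def ear(chunks: Iterable[str], log: List[str]) -> Iterator[str]:
    """Hear each chunk as soon as it arrives and pass it on."""
    for chunk in chunks:
        log.append(f"ear {chunk}")
        yield chunk


def brain(heard: Iterable[str], log: List[str]) -> Iterator[str]:
    """Start thinking on each piece without waiting for the rest."""
    for piece in heard:
        log.append(f"brain {piece}")
        yield piece


def voice(thoughts: Iterable[str], log: List[str]) -> Iterator[str]:
    """Synthesize each piece as it appears."""
    for piece in thoughts:
        log.append(f"voice {piece}")
        yield f"audio-{piece}"


def run_pipeline(chunks: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Chain the stages and return (audio pieces, order of work)."""
    log: List[str] = []
    audio = list(voice(brain(ear(chunks, log), log), log))
    return audio, log


audio, log = run_pipeline(["c1", "c2"])
print(log)
# ['ear c1', 'brain c1', 'voice c1', 'ear c2', 'brain c2', 'voice c2']
```

In the bad path the log would read ear, ear, brain, brain, voice, voice. Nothing plays until everything is done.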
This is also where TTFT appears. TTFT means time to first token. For voice systems, you should care about more than TTFT. Still, TTFT is a very useful checkpoint. It tells you when the brain first contributes visible output. If TTFT is slow, the awkward pause grows. If TTFT is healthy, the voice can begin sooner.

A practical mental model is this.

  • The ear wants stable enough speech quickly.
  • The brain wants enough context to begin safely.
  • The voice wants text segments early, even if later segments change.
  • The relay race wants overlap everywhere possible.
  • The awkward pause punishes any stage that waits for full completion.

See. Pipeline design is mostly overlap design.
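The TTFT checkpoint is easy to measure without blocking the stream. The sketch below wraps any token iterator and reports the delay once, when the first token appears. The helper name `with_ttft` and its parameters are hypothetical. It assumes the caller records `started_at` at the moment the request goes to the brain.

```python
import time
from typing import Callable, Iterable, Iterator


def with_ttft(
    tokens: Iterable[str],
    started_at: float,
    on_first: Callable[[float], None],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Pass tokens through unchanged and report time to first token once."""
    first = True
    for token in tokens:
        if first:
            on_first(clock() - started_at)  # seconds from request to first token
            first = False
        yield token  # the voice can consume this immediately
```

A quick usage example, with a list standing in for a streaming model:

```python
start = time.monotonic()
for token in with_ttft(["Hel", "lo"], start, lambda s: print(f"TTFT {s * 1000:.0f} ms")):
    pass  # hand each token to the voice here
```

Because the wrapper yields as it goes, measuring does not add waiting. The voice still gets each token the moment the brain emits it.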

Serial timing versus overlapping timing

Now let us make the win concrete. Assume one user finishes speaking. Then the system responds. In a serial design, each stage waits for the previous stage to finish fully. That feels safe, but it burns time.

| Stage | Serial timing |
| --- | --- |
| End-of-turn detection | 220 ms |
| STT finalize | 260 ms |
| LLM TTFT | 420 ms |
| TTS first audio | 310 ms |
| Playback start buffer | 160 ms |
| Total | 1370 ms |

That is over one second before the user hears anything useful. Now overlap the same work. VAD closes the turn. The ear finalizes quickly. The brain starts as soon as the transcript is good enough. The voice starts on the first answer segment. Playback begins with a tiny safety buffer.

| Stage | Overlapping timing |
| --- | --- |
| End-of-turn detection | 220 ms |
| STT running overlap | 180 ms |
| LLM TTFT from stable text | 300 ms |
| TTS first audio overlap | 170 ms |
| Playback start buffer | 50 ms |
| Total heard delay | 920 ms |

Same product goal. Very different feel. Read each overlapping number as what that stage still adds to the wait, not as its full run time. The relay race shaved 450 milliseconds, about a third of the serial delay. Users notice that immediately. Here is the timeline picture.

serial
┌────220────┐┌────260────┐┌────420────┐┌────310────┐┌──160──┐
VAD done     STT final    LLM TTFT     TTS first    play
                                                  1370 ms

overlap
┌────220────┐
VAD done
     └────180────┐
     STT stable  └────300────┐
                    LLM TTFT  └────170────┐┌─50─┐
                                 TTS first play
                                         920 ms
Simple, no? Overlap does not mean chaos. Overlap means planned concurrency with named handoffs.
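The arithmetic behind the two timing tables can be checked in a few lines. The numbers are copied from the tables. The only assumption is that rows are matched by position, since the stage names differ slightly between the serial and overlapping versions.

```python
# Stage timings from the serial and overlapping tables, in milliseconds.
SERIAL_MS = [
    ("End-of-turn detection", 220),
    ("STT", 260),
    ("LLM TTFT", 420),
    ("TTS first audio", 310),
    ("Playback start buffer", 160),
]
OVERLAP_MS = [
    ("End-of-turn detection", 220),
    ("STT", 180),
    ("LLM TTFT", 300),
    ("TTS first audio", 170),
    ("Playback start buffer", 50),
]

serial_total = sum(ms for _, ms in SERIAL_MS)
overlap_total = sum(ms for _, ms in OVERLAP_MS)
saved_ms = serial_total - overlap_total
saved_pct = round(100 * saved_ms / serial_total)

# Per-stage savings, matched by row position.
per_stage = {name: s - o for (name, s), (_, o) in zip(SERIAL_MS, OVERLAP_MS)}
biggest = max(per_stage, key=per_stage.get)

print(serial_total, overlap_total, saved_ms, saved_pct)  # 1370 920 450 33
print(biggest, per_stage[biggest])  # TTS first audio 140
```

Notice where the win comes from. End-of-turn detection saves nothing, because nothing can overlap with it. Every later stage gives some time back, and the voice gives back the most.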

A practical latency budget by stage

Even before formal budgeting, you should sketch stage targets. That keeps arguments honest. A sample starting budget looks like this.

| Stage | Good starting target | What to watch |
| --- | --- | --- |
| Mic capture and network send | 30-80 ms | Chunk size, mobile jitter |
| End-of-turn detection | 200-400 ms | Silence threshold too long |
| Ear partial-to-final | 100-250 ms | Decoder speed, language mix |
| Brain TTFT | 200-500 ms | Prompt size, model load |
| Voice first audio | 100-300 ms | TTS setup and buffering |
| Playback start | 50-100 ms | Client jitter buffer |

This is not the full budgeting lesson yet. But you can already see the suspects. If the awkward pause is large, check stage by stage. Do not say, "the system is slow," and stop there. Ask better questions. Is the ear late? Is the brain verbose before answering? Is the voice waiting for too much text? Is the relay race blocked by orchestration? That is how senior debugging begins.
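Checking stage by stage can be turned into a tiny script. The sketch below holds the starting targets from the budget table and flags any stage that exceeds its upper bound. The stage keys, the `over_budget` helper, and the measured numbers in the usage example are all made up for illustration.

```python
from typing import Dict, List, Tuple

# (low, high) starting targets in milliseconds, from the budget table.
BUDGET_MS: Dict[str, Tuple[int, int]] = {
    "mic_capture_and_send": (30, 80),
    "end_of_turn_detection": (200, 400),
    "ear_partial_to_final": (100, 250),
    "brain_ttft": (200, 500),
    "voice_first_audio": (100, 300),
    "playback_start": (50, 100),
}


def over_budget(measured_ms: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (stage, ms over the upper target), worst offender first."""
    suspects = []
    for stage, ms in measured_ms.items():
        _, high = BUDGET_MS[stage]
        if ms > high:
            suspects.append((stage, ms - high))
    return sorted(suspects, key=lambda item: item[1], reverse=True)
```

A quick usage example with hypothetical measurements:

```python
measured = {"end_of_turn_detection": 450, "brain_ttft": 720, "voice_first_audio": 180}
print(over_budget(measured))
# [('brain_ttft', 220), ('end_of_turn_detection', 50)]
```

Now the complaint is no longer "the system is slow." It is "the brain is 220 ms late," which is a question someone can actually answer.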


Where this lives in the wild

  • OpenAI Realtime voice demo — realtime engineer: streams mic frames, model events, and audio replies over one persistent socket.
  • Google Meet translated captions — speech platform engineer: overlaps hearing, understanding, and speaking so conversations stay natural.
  • Twilio voice bot stack — voice application engineer: treats interrupts, partial transcripts, and TTS chunks as live events.
  • Duolingo speaking tutor — conversational AI engineer: starts feedback quickly instead of waiting for perfect full-turn analysis.
  • Zoom AI Companion voice features — orchestration engineer: coordinates many event types while keeping the awkward pause under control.

Pause and recall

  • Why does a WebSocket fit the relay race better than repeated HTTP requests?
  • What does chunked processing change for the ear, the brain, and the voice?
  • Why is TTFT useful even though voice systems care about more than text?
  • In the timing example, what exactly created the drop from 1370 ms to 920 ms?

Interview Q&A

Q: Why do realtime voice systems prefer long-lived, bidirectional WebSockets?
A: Because one conversation carries many tiny events in both directions, and reopening connections adds avoidable latency and state complexity.
Common wrong answer to avoid: "Because WebSockets are newer, so they are always faster."

Q: What is the voice orchestrator doing in a cascaded pipeline?
A: It acts like a traffic controller, routing audio, transcript, token, and interruption events so each specialist service stays coordinated.
Common wrong answer to avoid: "It is just another name for the LLM."

Q: Why is chunked processing essential for voice UX?
A: Small pieces let downstream stages begin early, which cuts the awkward pause and keeps the relay race moving.
Common wrong answer to avoid: "Chunking mainly helps storage, not responsiveness."

Q: What does TTFT tell you in a voice stack?
A: It tells you when the brain first emits usable output, which strongly affects how soon the voice can begin speaking.
Common wrong answer to avoid: "TTFT is only relevant for text chat interfaces."


Apply now (5 min)

Exercise. Draw a WebSocket event list for one user turn. Include client.audio.chunk, client.vad.state, asr.partial, asr.final, llm.token, tts.audio.chunk, and client.interrupt. Then circle which component owns each event.

Sketch from memory. Redraw the relay race diagram. Label where the ear, the brain, and the voice overlap. Mark the awkward pause with a red warning in your notebook.


Bridge. The pipeline works because stages overlap. Next we assign formal budgets to each stage, instead of relying on vibes. → 06-latency-budgeting.md