03. Week 18: Study Material

Theme

Voice AI is a latency engineering problem disguised as a model problem. Read this beside 02_explainer.md, not after it.

First pass: read 02_explainer.md §1-§6 for the story.
Second pass: use this file for vendor choices, protocol notes, and production references.
Third pass: answer the self-check in §9 aloud.

Option	Mode	Why teams use it	Watch-outs
Whisper	Open baseline	High quality, multilingual, strong benchmark reputation	Batch-shaped by default; extra work for realtime
Distil-Whisper	Faster open baseline	Lower cost and faster inference	Quality drop on harder audio
Deepgram	Hosted streaming	Low-latency streaming, timestamps, production convenience	Ongoing vendor cost
AssemblyAI	Hosted streaming	Strong speech features and analytics ecosystem	Latency and pricing depend on SKU
whisper.cpp / MLX	On-device	Privacy, offline use, no network round-trip	Device constraints and tuning burden

Use streaming-native ASR when natural turn-taking matters.
Keep word-level timestamps when you want alignment, analytics, or better debugging.
Test on accents, noisy rooms, and channel conditions that mirror real users.
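One way to make the last point concrete is to score each recording condition separately instead of reporting a single average. The sketch below is a minimal illustration, not a benchmark harness: the condition labels and sample sentences are made up, and it averages per-utterance word error rate (WER) rather than computing a corpus-level figure.

```python
from collections import defaultdict


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the processed part of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def wer_by_condition(samples):
    """samples: iterable of (condition, reference, hypothesis) tuples."""
    scores = defaultdict(list)
    for condition, reference, hypothesis in samples:
        scores[condition].append(word_error_rate(reference, hypothesis))
    # Mean of per-utterance WER for each condition.
    return {cond: sum(vals) / len(vals) for cond, vals in scores.items()}


# Hypothetical samples; in practice these come from recordings of real users.
samples = [
    ("quiet_room", "book a table", "book a table"),
    ("noisy_cafe", "book a table for two", "book the table for you"),
]
report = wer_by_condition(samples)
```

A gap between conditions in a report like this is usually more informative than the overall number, because it shows which users the chosen ASR option serves badly.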

Option	Mode	Why teams use it	Watch-outs
ElevenLabs	Streaming	Strong naturalness and voice quality	Premium pricing; voice cloning needs governance
Cartesia	Streaming	Low-latency first audio	Smaller ecosystem than hyperscalers
OpenAI TTS	Streaming	Fits simply into a stack that already uses OpenAI	Voice choices and controls vary by offering
Google Cloud TTS / Amazon Polly	API TTS	Enterprise familiarity and broad language inventory	May not win on naturalness
Coqui / open models	Self-host	Cost or data control	More integration and infra work

The user pauses to think, and the agent interrupts.
The user says “uh… tomorrow morning,” and silence-only logic commits too early.
Background noise trips voice activity detection (VAD) and creates phantom turns.
The words are transcribed correctly, but a conservative endpointing policy still makes the reply feel slow.
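The failure modes above suggest that silence duration alone is a weak end-of-turn signal. The sketch below is one possible policy, not a reference implementation: it waits longer when the partial transcript ends on a filler or connector word, and it ignores silence when no words were heard at all. The word lists, the class name `EndpointPolicy`, and both thresholds are hypothetical and would need tuning against real recordings.

```python
import re
from dataclasses import dataclass

# Hypothetical word lists; a production system might use a small classifier instead.
FILLERS = {"uh", "um", "er", "hmm"}
CONNECTORS = {"and", "but", "so", "because", "to", "the", "a"}


@dataclass
class EndpointPolicy:
    base_silence_ms: int = 300      # assumed default wait before committing a turn
    extended_silence_ms: int = 900  # assumed longer wait when the turn looks unfinished

    def looks_unfinished(self, transcript: str) -> bool:
        """True when the last word is a filler or a dangling connector."""
        words = re.findall(r"[a-z']+", transcript.lower())
        return bool(words) and (words[-1] in FILLERS or words[-1] in CONNECTORS)

    def should_commit(self, transcript: str, silence_ms: int) -> bool:
        """Decide whether to end the user's turn after silence_ms of silence."""
        if not re.search(r"[a-z]", transcript.lower()):
            # VAD fired but no words were recognized: treat it as noise, not a turn.
            return False
        if self.looks_unfinished(transcript):
            return silence_ms >= self.extended_silence_ms
        return silence_ms >= self.base_silence_ms


policy = EndpointPolicy()
early = policy.should_commit("uh", 400)                      # still thinking
done = policy.should_commit("uh... tomorrow morning", 400)   # complete phrase
```

The trade-off is visible in the two thresholds: raising `base_silence_ms` reduces interruptions but adds that delay to every reply, which is why the semantic check only extends the wait for turns that look unfinished.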

WebSockets are persistent and bidirectional, which suits streaming audio and event traffic.
A voice client usually sends audio chunks, keepalives, and interruption signals.
A server usually returns partial transcripts, final transcripts, tokens, audio chunks, and control events.
A good orchestrator tracks stable transcript state, in-flight LLM calls, playback state, and cancellation.
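The orchestrator state listed above can be sketched with `asyncio` tasks. This is a simplified illustration under stated assumptions: `TurnOrchestrator`, its method names, and the event tuples are hypothetical, the LLM and TTS stages are replaced by `asyncio.sleep` stand-ins, and no real WebSocket library is involved.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TurnOrchestrator:
    stable_transcript: str = ""                # last finalized user text
    playing: bool = False                      # whether reply audio is playing
    in_flight: Optional[asyncio.Task] = None   # current LLM + TTS work, if any
    events: list = field(default_factory=list)

    async def _respond(self, text: str) -> None:
        # Stand-in for an LLM call followed by streamed TTS playback.
        try:
            await asyncio.sleep(0.05)          # pretend time to first token
            self.playing = True
            self.events.append(("audio_start", text))
            await asyncio.sleep(0.5)           # pretend playback duration
            self.events.append(("audio_end", text))
        except asyncio.CancelledError:
            self.events.append(("cancelled", text))
            raise
        finally:
            self.playing = False

    def cancel_in_flight(self) -> None:
        if self.in_flight is not None and not self.in_flight.done():
            self.in_flight.cancel()

    def on_final_transcript(self, text: str) -> None:
        # A new final transcript supersedes any reply still being produced.
        self.stable_transcript = text
        self.cancel_in_flight()
        self.in_flight = asyncio.create_task(self._respond(text))

    async def on_interruption(self) -> None:
        # Barge-in: stop generation and playback, then wait for cleanup.
        self.cancel_in_flight()
        if self.in_flight is not None:
            try:
                await self.in_flight
            except asyncio.CancelledError:
                pass


async def demo() -> list:
    orch = TurnOrchestrator()
    orch.on_final_transcript("book a table")
    await asyncio.sleep(0.1)       # the reply has started playing
    await orch.on_interruption()   # the user talks over it
    return orch.events
```

Running `asyncio.run(demo())` should record an `audio_start` event followed by a `cancelled` event and no `audio_end`. Keeping a handle to the in-flight task is the design point: without it, an interruption signal has nothing to cancel.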

Native speech-to-speech systems reduce glue code and can feel more natural.
They often win for fast prototypes, consumer assistants, and demos.
Cascaded pipelines still win when observability, control, or compliance matter most.
Keep both answers ready in interviews.

Stage	Target p95	What to inspect if slow
End-of-turn detection	200-400 ms	Silence threshold, semantic checks, VAD noise
STT finalization	100-250 ms	Region placement, chunking, vendor behavior
LLM TTFT	200-500 ms	Prompt size, model size, cache, provider load
TTS first audio	100-300 ms	Chunking strategy, vendor, playback buffering
Playback / jitter	50-100 ms	Client buffers, device output, network smoothing

If you cannot name which stage owns the missing milliseconds, you are not debugging yet.
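A per-stage budget check makes that ownership explicit. The sketch below uses the upper end of each p95 target from the table above as its limit; the stage keys, the function name `over_budget`, and the measured values are hypothetical. Summing per-stage p95 targets gives a planning figure only, not the true p95 of the end-to-end latency.

```python
# Upper p95 targets from the table above, in milliseconds.
BUDGET_MS = {
    "end_of_turn_detection": 400,
    "stt_finalization": 250,
    "llm_ttft": 500,
    "tts_first_audio": 300,
    "playback_jitter": 100,
}


def over_budget(measured_ms: dict) -> list:
    """Return (stage, overshoot_ms) pairs for stages above target, worst first."""
    overs = [
        (stage, measured_ms[stage] - limit)
        for stage, limit in BUDGET_MS.items()
        if measured_ms.get(stage, 0) > limit
    ]
    return sorted(overs, key=lambda pair: pair[1], reverse=True)


# Hypothetical p95 measurements from one deployment.
measured = {
    "end_of_turn_detection": 350,
    "stt_finalization": 240,
    "llm_ttft": 900,
    "tts_first_audio": 320,
    "playback_jitter": 80,
}
total_budget = sum(BUDGET_MS.values())
offenders = over_budget(measured)
```

With these made-up numbers the LLM stage owns most of the overshoot, so the table points at prompt size, model size, cache, and provider load before anything else.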

Explain why voice latency feels socially harsher than text latency. See 02_explainer.md §1-§2.
Give a high-level Whisper architecture summary. See 02_explainer.md §3.
What is the difference between word-level timestamps and endpointing? See 02_explainer.md §3.
Why is time to first audio (TTFA) more user-relevant than total TTS time? See 02_explainer.md §4.
Draw the relay-race pipeline and name one failure mode at each handoff. See 02_explainer.md §5.
When does an end-to-end voice model beat a cascaded pipeline? See 02_explainer.md §6.
Name two honest limitations of present-day voice AI. See 02_explainer.md §8.
Which of the four foundation gaps still needs repair for you? See 02_explainer.md §9.