03. Week 18 — Study Material
Theme
Voice AI is a latency engineering problem disguised as a model problem. Read this beside 02_explainer.md, not after it.
How to use this file
- First pass: read 02_explainer.md §1-§6 for the story.
- Second pass: use this file for vendor choices, protocol notes, and production references.
- Third pass: answer the self-check in §9 aloud.
§1. Reading order
- 02_explainer.md §1-§2 — why voice latency feels socially broken.
- 02_explainer.md §3 — ASR, sample rate, timestamps, VAD, endpointing.
- 02_explainer.md §4 — TTS, first-audio latency, prosody, cloning.
- 02_explainer.md §5 — WebSocket pipeline, TTFT, barge-in, turn-taking.
- 02_explainer.md §6 — end-to-end voice models.
- Return here for the tables, reading links, and interview notes.
§2. ASR stack reference
| Option | Mode | Why teams use it | Watch-outs |
|---|---|---|---|
| Whisper | Open baseline | High quality, multilingual, strong benchmark reputation | Batch-shaped by default; extra work for realtime |
| Distil-Whisper | Faster open baseline | Lower cost and faster inference | Quality drop on harder audio |
| Deepgram | Hosted streaming | Low-latency streaming, timestamps, production convenience | Ongoing vendor cost |
| AssemblyAI | Hosted streaming | Strong speech features and analytics ecosystem | Latency and pricing depend on SKU |
| whisper.cpp / MLX | On-device | Privacy, offline use, no network round-trip | Device constraints and tuning burden |
Notes
- Use streaming-native ASR when natural turn-taking matters.
- Keep word-level timestamps when you want alignment, analytics, or better debugging.
- Test on accents, noisy rooms, and channel conditions that mirror real users.
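As one illustration of why word-level timestamps help debugging, the sketch below computes the silence between adjacent words so you can see where a speaker paused. The `Word` record and its field names are hypothetical; real ASR responses use vendor-specific schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical word record; not any vendor's actual response format."""
    text: str
    start_s: float
    end_s: float


def inter_word_gaps(words):
    """Return (previous word, next word, silence in seconds) for each adjacent pair."""
    gaps = []
    for prev, nxt in zip(words, words[1:]):
        gaps.append((prev.text, nxt.text, round(nxt.start_s - prev.end_s, 3)))
    return gaps


def longest_gap(words):
    """Return the adjacent pair with the longest silence, or None if fewer than two words."""
    gaps = inter_word_gaps(words)
    return max(gaps, key=lambda g: g[2]) if gaps else None


# Made-up timings for a hesitant utterance.
words = [
    Word("uh", 0.00, 0.20),
    Word("tomorrow", 1.10, 1.60),
    Word("morning", 1.65, 2.05),
]
print(longest_gap(words))
```

A long gap in the middle of an utterance is exactly the kind of pause that a silence-only endpointing policy may mistake for the end of a turn.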
§3. TTS stack reference
| Option | Mode | Why teams use it | Watch-outs |
|---|---|---|---|
| ElevenLabs | Streaming | Strong naturalness and voice quality | Premium cost and cloning governance needed |
| Cartesia | Streaming | Low-latency first audio | Smaller ecosystem than hyperscalers |
| OpenAI TTS | Streaming | Simple stack fit when already using OpenAI | Voice choices and controls vary by offering |
| Google Cloud TTS / Amazon Polly | API TTS | Enterprise familiarity and broad language inventory | May not win on naturalness |
| Coqui / open models | Self-host | Cost or data control | More integration and infra work |
Notes
- First-audio latency matters more than total synthesis time for voice UX.
- Prosody control is a product decision, not just a model feature.
- Voice cloning should never skip consent, disclosure, and abuse policy.
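To make the first-audio point measurable, here is a minimal sketch that times a stream of audio chunks and reports both first-audio latency and total time. `fake_tts_stream` is a stand-in invented for this example; a real streaming TTS client would replace it, and its chunk sizes and delays are assumptions.

```python
import time
from typing import Callable, Iterable, Iterator


def measure_stream(chunks: Iterable[bytes],
                   clock: Callable[[], float] = time.monotonic) -> dict:
    """Consume an audio chunk stream; report first-audio and total latency in ms."""
    start = clock()
    first_audio_ms = None
    total_bytes = 0
    for chunk in chunks:
        # First non-empty chunk marks the moment the user could start hearing audio.
        if first_audio_ms is None and chunk:
            first_audio_ms = (clock() - start) * 1000.0
        total_bytes += len(chunk)
    total_ms = (clock() - start) * 1000.0
    return {"first_audio_ms": first_audio_ms, "total_ms": total_ms, "bytes": total_bytes}


def fake_tts_stream(n_chunks: int = 5, delay_s: float = 0.02) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS client: silence after a fixed delay."""
    for _ in range(n_chunks):
        time.sleep(delay_s)
        yield b"\x00" * 320


stats = measure_stream(fake_tts_stream())
print(stats)
```

With streaming, playback can begin at `first_audio_ms` while the rest of the audio is still being synthesized, which is why that number tracks perceived responsiveness more closely than `total_ms`.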
§4. Turn-taking and endpointing notes
- VAD detects speech presence, frame by frame.
- Endpointing decides that the utterance is truly over.
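- Silence-only endpointing is simple and fast, but it can cut off slow or hesitant speakers.
- Semantic endpointing uses what was said, not just how long the silence lasted, to judge whether the thought is complete; it is smarter, but may add latency.
- Hybrid endpointing, a silence timer combined with a semantic check, is the default interview answer for real products.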
Common failure cases
- The user pauses to think, and the agent interrupts.
- The user says “uh… tomorrow morning,” and silence-only logic commits too early.
- Background noise trips VAD and creates phantom turns.
- The model hears the words correctly, but the endpointing policy still feels slow.
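A hybrid policy can be sketched as a short silence threshold that only fires when the transcript looks finished, plus a longer hard cap. The word lists, the `looks_complete` heuristic, and the threshold defaults below are illustrative assumptions; a production system would more likely use a trained model for the semantic check.

```python
import re

# Illustrative word lists; assumptions for this sketch, not a standard.
FILLERS = {"uh", "um", "er", "hmm"}
DANGLING = {"and", "but", "or", "so", "because", "to", "the", "a", "at", "on", "for"}


def looks_complete(partial_transcript: str) -> bool:
    """Crude semantic check: the last word is neither a filler nor a dangling connector."""
    words = re.findall(r"[a-z']+", partial_transcript.lower())
    if not words:
        return False
    return words[-1] not in FILLERS | DANGLING


def should_end_turn(partial_transcript: str, silence_ms: int,
                    short_ms: int = 300, long_ms: int = 1200) -> bool:
    """Hybrid endpointing: short silence plus a complete-looking transcript, or a hard cap."""
    if silence_ms >= long_ms:
        return True  # hard cap so the agent never waits forever
    return silence_ms >= short_ms and looks_complete(partial_transcript)


print(should_end_turn("uh…", 500))                   # user is still thinking
print(should_end_turn("uh… tomorrow morning", 500))  # thought looks finished
```

The design choice is that silence alone never commits a turn before the hard cap; the transcript has to look finished as well, which addresses the "uh… tomorrow morning" failure case at the cost of one extra check.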
§5. WebSocket and pipeline notes
- WebSockets are persistent and bidirectional, which suits streaming audio and event traffic.
- A voice client usually sends audio chunks, keepalives, and interruption signals.
- A server usually returns partial transcripts, final transcripts, tokens, audio chunks, and control events.
- A good orchestrator tracks stable transcript state, in-flight LLM calls, playback state, and cancellation.
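The orchestrator state described in the last bullet can be sketched with `asyncio`. Everything here is a simplified assumption: `TurnOrchestrator` is a hypothetical class, and `_respond` stands in for real LLM and TTS streaming calls.

```python
import asyncio


class TurnOrchestrator:
    """Tracks transcript, in-flight LLM call, playback, and cancellation for one session."""

    def __init__(self):
        self.stable_transcript = ""
        self.llm_task = None       # in-flight LLM call, if any
        self.playing = False       # playback state
        self.cancelled_turns = 0   # cancellation bookkeeping

    async def _respond(self, text):
        # Stand-in for LLM + TTS streaming; real vendor clients would be awaited here.
        self.playing = True
        try:
            await asyncio.sleep(0.2)
            return f"reply to: {text}"
        finally:
            self.playing = False   # runs on normal completion and on cancellation

    def on_final_transcript(self, text):
        self.stable_transcript = text
        self.llm_task = asyncio.create_task(self._respond(text))

    async def on_interrupt(self):
        """Barge-in: cancel the in-flight response and stop playback."""
        task = self.llm_task
        if task is not None and not task.done():
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass
            self.cancelled_turns += 1
        self.llm_task = None


async def demo():
    orch = TurnOrchestrator()
    orch.on_final_transcript("book a table")
    await asyncio.sleep(0.05)      # the response is now mid-flight
    was_playing = orch.playing
    await orch.on_interrupt()
    return was_playing, orch.playing, orch.cancelled_turns


result = asyncio.run(demo())
print(result)
```

Putting the playback reset in a `finally` block means an interrupted turn cannot leave the session stuck in a "playing" state.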
Minimal event list
audio_chunk
vad_state
partial_transcript
final_transcript
llm_first_token
tts_audio_chunk
interrupt
session_heartbeat
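One way to carry these events over a WebSocket is a small JSON envelope. The sketch below is an assumption about shape, not a vendor protocol: the `Event` fields (`session_id`, `seq`, `payload`) are hypothetical, and raw audio is often sent as binary frames rather than inside JSON.

```python
import json
from dataclasses import asdict, dataclass, field

# Event names taken from the minimal event list above.
EVENT_TYPES = {
    "audio_chunk", "vad_state", "partial_transcript", "final_transcript",
    "llm_first_token", "tts_audio_chunk", "interrupt", "session_heartbeat",
}


@dataclass
class Event:
    """Hypothetical envelope: type plus session, sequence number, and free-form payload."""
    type: str
    session_id: str
    seq: int
    payload: dict = field(default_factory=dict)

    def __post_init__(self):
        if self.type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.type}")


def encode(event: Event) -> str:
    """Serialize an event to a compact JSON text frame."""
    return json.dumps(asdict(event), separators=(",", ":"))


def decode(raw: str) -> Event:
    """Parse a JSON text frame back into a validated Event."""
    return Event(**json.loads(raw))


evt = Event("partial_transcript", "sess-1", 7, {"text": "tomorrow mor"})
wire = encode(evt)
print(wire)
print(decode(wire) == evt)
```

A per-session sequence number lets the receiver detect dropped or reordered events, which matters when an `interrupt` must invalidate audio chunks that are already in flight.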
§6. End-to-end voice model notes
- Native speech-to-speech systems reduce glue code and can feel more natural.
- They often win for fast prototypes, consumer assistants, and demos.
- Cascaded pipelines still win when observability, control, or compliance matter most.
- Keep both answers ready in interviews.
§7. Latency budget template
| Stage | Target p95 | What to inspect if slow |
|---|---|---|
| End-of-turn detection | 200-400 ms | Silence threshold, semantic checks, VAD noise |
| STT finalization | 100-250 ms | Region placement, chunking, vendor behavior |
| LLM TTFT | 200-500 ms | Prompt size, model size, cache, provider load |
| TTS first audio | 100-300 ms | Chunking strategy, vendor, playback buffering |
| Playback / jitter | 50-100 ms | Client buffers, device output, network smoothing |
One-line rule
If you cannot name which stage owns the missing milliseconds, you are not debugging yet.
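The one-line rule can be turned into a small check. The sketch below encodes the table's targets, sums them (650 to 1550 ms across the five stages), and names the stages that exceed their upper target. The stage keys and the measured numbers are made up for illustration; note that summing per-stage p95 values gives a planning bound, not the true p95 of end-to-end latency.

```python
# (low, high) p95 targets in ms, copied from the latency budget table.
BUDGET_P95_MS = {
    "end_of_turn_detection": (200, 400),
    "stt_finalization": (100, 250),
    "llm_ttft": (200, 500),
    "tts_first_audio": (100, 300),
    "playback_jitter": (50, 100),
}


def budget_range(budget=BUDGET_P95_MS):
    """Sum the low and high targets across stages (a planning bound only)."""
    low = sum(lo for lo, _ in budget.values())
    high = sum(hi for _, hi in budget.values())
    return low, high


def over_budget(measured_p95_ms, budget=BUDGET_P95_MS):
    """Return (stage, overshoot in ms) for stages above their high target, worst first."""
    overs = {
        stage: measured_p95_ms[stage] - hi
        for stage, (_, hi) in budget.items()
        if measured_p95_ms.get(stage, 0) > hi
    }
    return sorted(overs.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical measurements for one deployment.
measured = {
    "end_of_turn_detection": 700,
    "stt_finalization": 180,
    "llm_ttft": 620,
    "tts_first_audio": 250,
    "playback_jitter": 80,
}
print(budget_range())
print(over_budget(measured))
```

In this made-up case, end-of-turn detection owns most of the missing milliseconds, so the first things to inspect are the silence threshold and VAD noise rather than the model.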
§8. Interview framing
- Start with the user experience problem, not model fandom.
- Define the stage-level latency budget before naming vendors.
- Separate VAD, endpointing, and barge-in clearly.
- Mention p95, not only averages.
- Mention governance if voice cloning appears.
- Mention telephony constraints if the use case is customer support.
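A quick worked example shows why p95 belongs in the answer alongside the average. The latency samples are invented, and the nearest-rank method used here is one of several common percentile definitions.

```python
import math
import statistics


def percentile_nearest_rank(samples, p):
    """Nearest-rank percentile: smallest value with at least p percent of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented response latencies in ms: most turns are fast, two are very slow.
latencies_ms = [300] * 18 + [2000, 2500]

mean_ms = statistics.mean(latencies_ms)
median_ms = statistics.median(latencies_ms)
p95_ms = percentile_nearest_rank(latencies_ms, 95)
print(mean_ms, median_ms, p95_ms)
```

The average here is under half a second, yet one turn in ten takes two seconds or more; the p95 surfaces that tail and the mean hides it.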
§9. Self-check with references
- Explain why voice latency feels socially harsher than text latency. See 02_explainer.md §1-§2.
- Give a high-level Whisper architecture summary. See 02_explainer.md §3.
- What is the difference between word-level timestamps and endpointing? See 02_explainer.md §3.
- Why is time to first audio (TTFA) more user-relevant than total TTS time? See 02_explainer.md §4.
- Draw the relay-race pipeline and name one failure mode at each handoff. See 02_explainer.md §5.
- When does an end-to-end voice model beat a cascaded pipeline? See 02_explainer.md §6.
- Name two honest limitations of present-day voice AI. See 02_explainer.md §8.
- Which of the four foundation gaps still needs repair for you? See 02_explainer.md §9.
§10. Reading list
- Whisper paper (Radford et al., 2022).
- OpenAI Realtime API docs.
- Deepgram or AssemblyAI streaming documentation.
- ElevenLabs or Cartesia streaming TTS docs.
- Silero VAD documentation.
- LiveKit Agents or Pipecat architecture docs.
- One telephony vendor guide for PSTN bridging, such as Twilio Voice.
Study completion health check
- [ ] I can explain the relay-race metaphor without notes.
- [ ] I can tell VAD, endpointing, and barge-in apart in one minute.
- [ ] I know my default ASR, LLM, and TTS stack and can say why.
- [ ] I can defend a p95 latency budget with stage names.
- [ ] I know when an end-to-end voice model is the better abstraction.