03. Week 18 — Study Material

Theme

Voice AI is a latency engineering problem disguised as a model problem. Read this beside 02_explainer.md, not after it.

How to use this file

  • First pass: read 02_explainer.md §1-§6 for the story.
  • Second pass: use this file for vendor choices, protocol notes, and production references.
  • Third pass: answer the self-check in §9 aloud.

§1. Reading order

  1. 02_explainer.md §1-§2 — why voice latency feels socially broken.
  2. 02_explainer.md §3 — ASR, sample rate, timestamps, VAD, endpointing.
  3. 02_explainer.md §4 — TTS, first-audio latency, prosody, cloning.
  4. 02_explainer.md §5 — WebSocket pipeline, TTFT, barge-in, turn-taking.
  5. 02_explainer.md §6 — end-to-end voice models.
  6. Return here for the tables, reading links, and interview notes.

§2. ASR stack reference

| Option | Mode | Why teams use it | Watch-outs |
|---|---|---|---|
| Whisper | Open baseline | High quality, multilingual, strong benchmark reputation | Batch-shaped by default; extra work for realtime |
| Distil-Whisper | Faster open baseline | Lower cost and faster inference | Quality drop on harder audio |
| Deepgram | Hosted streaming | Low-latency streaming, timestamps, production convenience | Ongoing vendor cost |
| AssemblyAI | Hosted streaming | Strong speech features and analytics ecosystem | Latency and pricing depend on SKU |
| whisper.cpp / MLX | On-device | Privacy, offline use, no network round-trip | Device constraints and tuning burden |

Notes

  • Use streaming-native ASR when natural turn-taking matters.
  • Keep word-level timestamps when you want alignment, analytics, or better debugging.
  • Test on accents, noisy rooms, and channel conditions that mirror real users.

§3. TTS stack reference

| Option | Mode | Why teams use it | Watch-outs |
|---|---|---|---|
| ElevenLabs | Streaming | Strong naturalness and voice quality | Premium cost and cloning governance needed |
| Cartesia | Streaming | Low-latency first audio | Smaller ecosystem than hyperscalers |
| OpenAI TTS | Streaming | Simple stack fit when already using OpenAI | Voice choices and controls vary by offering |
| Google Cloud TTS / Amazon Polly | API TTS | Enterprise familiarity and broad language inventory | May not win on naturalness |
| Coqui / open models | Self-host | Cost or data control | More integration and infra work |

Notes

  • First-audio latency matters more than total synthesis time for voice UX.
  • Prosody control is a product decision, not just a model feature.
  • Voice cloning should never skip consent, disclosure, and abuse policy.
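
To make the first-audio point concrete, here is a minimal sketch that times a chunked audio stream and reports first-audio latency separately from total synthesis time. `fake_tts_stream` is a hypothetical stand-in for a vendor's streaming response, not a real SDK call; the chunk size and delays are invented for illustration.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def measure_stream(
    chunks: Iterable[bytes],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (first_audio_s, total_s, total_bytes) for a stream of audio chunks."""
    start = clock()
    first_audio = None
    total_bytes = 0
    for chunk in chunks:
        # Record the moment the first non-empty chunk arrives.
        if first_audio is None and chunk:
            first_audio = clock() - start
        total_bytes += len(chunk)
    total = clock() - start
    if first_audio is None:
        raise ValueError("stream produced no audio")
    return first_audio, total, total_bytes


def fake_tts_stream(n_chunks: int = 5, delay_s: float = 0.01) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS response (assumed chunk size)."""
    for _ in range(n_chunks):
        time.sleep(delay_s)      # pretend synthesis and network time per chunk
        yield b"\x00" * 320      # placeholder audio bytes


first, total, size = measure_stream(fake_tts_stream())
print(f"first audio: {first * 1000:.0f} ms, total: {total * 1000:.0f} ms, bytes: {size}")
```

The user starts hearing speech at the first number, so that is the one to track per vendor; the second mostly matters for cost and throughput.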

§4. Turn-taking and endpointing notes

  • VAD detects speech presence, frame by frame.
  • Endpointing decides that the utterance is truly over.
  • Silence-only endpointing is simple and fast, but cuts off slow speakers.
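  • Semantic endpointing uses the transcript so far to judge whether the thought is complete; it is smarter, but the extra check may add latency.
  • Hybrid endpointing (a silence threshold combined with a semantic check) is the default interview answer for real products.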
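
The hybrid idea can be sketched in a few lines. This is a toy policy under stated assumptions: the thresholds (300 ms and 1000 ms) and the `HOLD_WORDS` list are invented for illustration, and a real semantic check would more likely be a small model than a word list.

```python
# Assumed trailing words that suggest the speaker has not finished the thought.
HOLD_WORDS = {"uh", "um", "and", "but", "so", "because", "to"}


def looks_complete(partial_transcript: str) -> bool:
    """Crude semantic check: does the utterance end on a non-filler word?"""
    words = [w.strip(".,!?…") for w in partial_transcript.lower().split()]
    words = [w for w in words if w]
    return bool(words) and words[-1] not in HOLD_WORDS


def should_end_turn(
    silence_ms: int,
    partial_transcript: str,
    short_ms: int = 300,    # assumed: commit quickly if the text looks finished
    long_ms: int = 1000,    # assumed: silence-only fallback so we never wait forever
) -> bool:
    if silence_ms >= long_ms:
        return True
    if silence_ms >= short_ms:
        return looks_complete(partial_transcript)
    return False


print(should_end_turn(400, "uh…"))               # filler at the end, keep listening
print(should_end_turn(400, "tomorrow morning"))  # looks finished, commit the turn
```

The two thresholds are the tuning surface: the short one controls how snappy the agent feels, the long one controls how long a hesitant speaker is given before the agent speaks anyway.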

Common failure cases

  1. The user pauses to think, and the agent interrupts.
  2. The user says “uh… tomorrow morning,” and silence-only logic commits too early.
  3. Background noise trips VAD and creates phantom turns.
  4. The model hears the words correctly, but the endpointing policy still feels slow.

§5. WebSocket and pipeline notes

  • WebSockets are persistent and bidirectional, which suits streaming audio and event traffic.
  • A voice client usually sends audio chunks, keepalives, and interruption signals.
  • A server usually returns partial transcripts, final transcripts, tokens, audio chunks, and control events.
  • A good orchestrator tracks stable transcript state, in-flight LLM calls, playback state, and cancellation.
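
Cancellation is the part of the orchestrator that is easiest to get wrong, so a small sketch may help. It assumes an asyncio-based server; `TurnOrchestrator` and its methods are hypothetical names, and the `asyncio.sleep` call stands in for real LLM, TTS, and playback work.

```python
import asyncio
from typing import List, Optional


class TurnOrchestrator:
    """Tracks one in-flight reply and cancels it when the user interrupts."""

    def __init__(self) -> None:
        self.reply_task: Optional[asyncio.Task] = None
        self.played: List[str] = []   # playback state: chunks already sent out

    async def _speak(self, chunks: List[str]) -> None:
        for chunk in chunks:
            await asyncio.sleep(0.01)  # placeholder for LLM + TTS + playback time
            self.played.append(chunk)

    def start_reply(self, chunks: List[str]) -> None:
        self.cancel_reply()            # never run two replies at once
        self.reply_task = asyncio.create_task(self._speak(chunks))

    def cancel_reply(self) -> bool:
        """Barge-in handler: stop the current reply if one is still running."""
        if self.reply_task is not None and not self.reply_task.done():
            self.reply_task.cancel()
            return True
        return False


async def demo() -> List[str]:
    orch = TurnOrchestrator()
    orch.start_reply(["Sure,", "your", "appointment", "is", "at", "nine."])
    await asyncio.sleep(0.025)         # the user starts talking mid-reply
    orch.cancel_reply()
    # Wait for the cancelled task to finish unwinding before reading state.
    await asyncio.gather(orch.reply_task, return_exceptions=True)
    return orch.played


played = asyncio.run(demo())
print(played)  # only the chunks spoken before the interruption
```

Keeping `played` around matters in practice: after a barge-in, the conversation history should reflect what the user actually heard, not the full reply the LLM generated.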

Minimal event list

  • audio_chunk
  • vad_state
  • partial_transcript
  • final_transcript
  • llm_first_token
  • tts_audio_chunk
  • interrupt
  • session_heartbeat
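
One way to pin this event list down is a typed envelope that both client and server share. The sketch below is an assumption about shape, not a vendor protocol: the `seq` and `payload` fields are invented here, and real systems often send raw audio as binary frames rather than JSON.

```python
import json
from enum import Enum
from typing import Any, Dict


class EventType(str, Enum):
    AUDIO_CHUNK = "audio_chunk"
    VAD_STATE = "vad_state"
    PARTIAL_TRANSCRIPT = "partial_transcript"
    FINAL_TRANSCRIPT = "final_transcript"
    LLM_FIRST_TOKEN = "llm_first_token"
    TTS_AUDIO_CHUNK = "tts_audio_chunk"
    INTERRUPT = "interrupt"
    SESSION_HEARTBEAT = "session_heartbeat"


def encode_event(kind: EventType, seq: int, **payload: Any) -> str:
    """Serialize one event as a JSON text message (assumed envelope shape)."""
    return json.dumps({"type": kind.value, "seq": seq, "payload": payload})


def decode_event(raw: str) -> Dict[str, Any]:
    """Parse a message; unknown event types raise ValueError instead of passing through."""
    msg = json.loads(raw)
    msg["type"] = EventType(msg["type"])
    return msg


raw = encode_event(EventType.PARTIAL_TRANSCRIPT, 7, text="uh tomorrow")
msg = decode_event(raw)
print(msg["type"].value, msg["seq"], msg["payload"])
```

A monotonically increasing `seq` is a cheap way to detect dropped or reordered messages when debugging a session log.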

§6. End-to-end voice model notes

  • Native speech-to-speech systems reduce glue code and can feel more natural.
  • They often win for fast prototypes, consumer assistants, and demos.
  • Cascaded pipelines still win when observability, control, or compliance matter most.
  • Keep both answers ready in interviews.

§7. Latency budget template

| Stage | Target p95 | What to inspect if slow |
|---|---|---|
| End-of-turn detection | 200-400 ms | Silence threshold, semantic checks, VAD noise |
| STT finalization | 100-250 ms | Region placement, chunking, vendor behavior |
| LLM TTFT | 200-500 ms | Prompt size, model size, cache, provider load |
| TTS first audio | 100-300 ms | Chunking strategy, vendor, playback buffering |
| Playback / jitter | 50-100 ms | Client buffers, device output, network smoothing |

One-line rule

If you cannot name which stage owns the missing milliseconds, you are not debugging yet.
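
As a worked version of that rule, the sketch below encodes the budget table and reports which stages exceed their upper target. The stage keys and the `measured` numbers are made up for illustration. The summed range is a rough planning figure only, since per-stage p95 values do not add up exactly to an end-to-end p95.

```python
from typing import Dict, List, Tuple

# (low, high) p95 targets in milliseconds, copied from the budget table.
BUDGET_P95_MS: Dict[str, Tuple[int, int]] = {
    "end_of_turn_detection": (200, 400),
    "stt_finalization": (100, 250),
    "llm_ttft": (200, 500),
    "tts_first_audio": (100, 300),
    "playback_jitter": (50, 100),
}


def total_budget_ms(budget: Dict[str, Tuple[int, int]] = BUDGET_P95_MS) -> Tuple[int, int]:
    """Sum the low and high targets across stages."""
    lows = sum(low for low, _ in budget.values())
    highs = sum(high for _, high in budget.values())
    return lows, highs


def stages_over_budget(
    measured_p95_ms: Dict[str, float],
    budget: Dict[str, Tuple[int, int]] = BUDGET_P95_MS,
) -> List[Tuple[str, float]]:
    """Return (stage, overrun_ms) pairs, worst offender first."""
    overruns = [
        (stage, measured_p95_ms[stage] - high)
        for stage, (_, high) in budget.items()
        if stage in measured_p95_ms and measured_p95_ms[stage] > high
    ]
    return sorted(overruns, key=lambda item: item[1], reverse=True)


# Hypothetical measurements from one load test.
measured = {
    "end_of_turn_detection": 350,
    "stt_finalization": 220,
    "llm_ttft": 720,
    "tts_first_audio": 280,
    "playback_jitter": 90,
}

print(total_budget_ms())             # (650, 1550)
print(stages_over_budget(measured))  # [('llm_ttft', 220)]
```

In this made-up run the LLM stage owns the missing 220 ms, which points the investigation at prompt size, model size, cache, and provider load rather than at ASR or TTS.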

§8. Interview framing

  • Start with the user experience problem, not model fandom.
  • Define the stage-level latency budget before naming vendors.
  • Separate VAD, endpointing, and barge-in clearly.
  • Mention p95, not only averages.
  • Mention governance if voice cloning appears.
  • Mention telephony constraints if the use case is customer support.
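
A quick numeric illustration of the p95 point, using invented samples: most turns are fast, but a couple are very slow, and the average hides them. The percentile here uses the simple nearest-rank method, which is one of several common definitions.

```python
import statistics
from typing import Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]


# Invented response latencies in ms: 18 good turns and 2 bad ones.
latencies_ms = [400] * 18 + [2000, 2500]

mean_ms = statistics.mean(latencies_ms)
p95_ms = p95(latencies_ms)
print(f"mean: {mean_ms} ms, p95: {p95_ms} ms")  # mean: 585 ms, p95: 2000 ms
```

A 585 ms average sounds acceptable, yet one turn in ten in this sample takes two seconds or more, which is what a caller remembers.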

§9. Self-check with references

  1. Explain why voice latency feels socially harsher than text latency. See 02_explainer.md §1-§2.
  2. Give a high-level Whisper architecture summary. See 02_explainer.md §3.
  3. What is the difference between word-level timestamps and endpointing? See 02_explainer.md §3.
  4. Why is time to first audio (TTFA) more user-relevant than total TTS time? See 02_explainer.md §4.
  5. Draw the relay-race pipeline and name one failure mode at each handoff. See 02_explainer.md §5.
  6. When does an end-to-end voice model beat a cascaded pipeline? See 02_explainer.md §6.
  7. Name two honest limitations of present-day voice AI. See 02_explainer.md §8.
  8. Which of the four foundation gaps still needs repair for you? See 02_explainer.md §9.

§10. Reading list

  1. Whisper paper (Radford et al., 2022).
  2. OpenAI Realtime API docs.
  3. Deepgram or AssemblyAI streaming documentation.
  4. ElevenLabs or Cartesia streaming TTS docs.
  5. Silero VAD documentation.
  6. LiveKit Agents or Pipecat architecture docs.
  7. One telephony vendor guide for PSTN bridging, such as Twilio Voice.

Study completion health check

  • [ ] I can explain the relay-race metaphor without notes.
  • [ ] I can tell VAD, endpointing, and barge-in apart in one minute.
  • [ ] I know my preferred ASR, LLM, and TTS default stack and why.
  • [ ] I can defend a p95 latency budget with stage names.
  • [ ] I know when an end-to-end voice model is the better abstraction.