01. Week 18 — Voice & Realtime AI

Key concepts to master

  • Streaming ASR (automatic speech recognition) vs batch ASR: partial transcripts, stable prefixes, and finalization delay.
  • Whisper architecture: spectrogram → encoder-decoder → text and timestamps.
  • Word-level timestamps: alignment, debugging, captions, and interruption analysis.
  • VAD (voice activity detection): frame-level detection of whether speech is present.
  • Endpointing: deciding the user actually finished the turn.
  • Neural TTS: natural speech, streaming synthesis, and first-audio latency.
  • Voice cloning: consent, disclosure, abuse prevention, and policy checks.
  • Prosody: rhythm, stress, and emotional tone as product behavior.
  • WebSocket architecture: persistent bidirectional transport for chunks and events.
  • Latency budgeting: stage targets, p95 thinking, and instrumentation discipline.
  • Barge-in: cancel speaking fast and resume listening faster.
  • Speech-to-speech models: when native voice models beat modular stacks.
  • Telephony constraints: 8 kHz narrowband audio, line noise, and extra network latency that leaves less room in the budget.
  • Foundation audit: streaming, WebSockets, latency math, and audio basics.
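
To make "stable prefixes" concrete, here is a minimal sketch of one possible heuristic: treat the words shared by the last few partial hypotheses as safe to show or act on. The function name `stable_prefix` and the idea of comparing whole words are assumptions for illustration; production streaming ASR systems may stabilize on tokens, confidences, or time instead.

```python
from typing import List


def stable_prefix(partials: List[str]) -> str:
    """Return the longest word-level prefix shared by every partial hypothesis.

    Hypothetical helper: a real stabilizer would typically look only at the
    most recent N partials and may also use timestamps or confidences.
    """
    if not partials:
        return ""
    tokenized = [p.split() for p in partials]
    prefix = []
    # zip stops at the shortest hypothesis, so we never read past its end.
    for words in zip(*tokenized):
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return " ".join(prefix)


recent = ["turn on the", "turn on the light", "turn on the lights in"]
print(stable_prefix(recent))
```

Words after the shared prefix ("light" versus "lights") are still in flux, which is why acting on raw partials tends to produce flicker or wrong early actions.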

🧠 Mental models

  • Streaming ASR vs batch ASR: "a live captioner versus a finished transcript after the meeting ends"
  • VAD: "a motion sensor for speech energy"
  • Endpointing: "the referee deciding the speaker is truly done"
  • Barge-in: "walkie-talkie etiquette—stop talking the instant the other side cuts in"
  • Realtime transport: "a relay race passing audio chunks and events between runners"
  • Latency budgeting: "a pit-stop clock where every stage gets milliseconds, not vibes"
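
The "motion sensor" and "referee" models can be separated in code. The sketch below assumes a naive energy threshold for VAD and a fixed silence hangover for endpointing; the names `frame_is_speech` and `find_endpoint`, the threshold, and the hangover length are all illustrative choices, not a recommended configuration. Real endpointers often also consider transcript content and prosody.

```python
from typing import Optional, Sequence


def frame_is_speech(frame: Sequence[float], threshold: float = 0.01) -> bool:
    """VAD (the motion sensor): is mean squared energy above a threshold?"""
    if not frame:
        return False
    energy = sum(sample * sample for sample in frame) / len(frame)
    return energy > threshold


def find_endpoint(vad_flags: Sequence[bool], hangover_frames: int = 3) -> Optional[int]:
    """Endpointing (the referee): declare the turn finished only after speech
    has been heard and then `hangover_frames` consecutive silent frames follow.

    Returns the index of the frame where the turn is declared over, or None.
    """
    heard_speech = False
    silent_run = 0
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            heard_speech = True
            silent_run = 0  # a short pause was not the end of the turn
        elif heard_speech:
            silent_run += 1
            if silent_run >= hangover_frames:
                return i
    return None


flags = [False, True, True, False, True, False, False, False]
print(find_endpoint(flags, hangover_frames=3))
```

The single silent frame in the middle of `flags` flips the VAD output but does not end the turn. A shorter hangover clips users mid-pause; a longer one adds dead air before the reply.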

⚠️ Common traps

  • Treating Whisper as streaming-native when chunking, stabilization, and finalization logic still matter.
  • Confusing VAD with endpointing and causing either clipped users or awkward dead air.
  • Optimizing average latency while p95 turn latency still makes the product feel broken.
  • Failing to log first-token, first-audio, partial-transcript stability, and interruption timing.
  • Ignoring telephony realities like 8 kHz audio, packet loss, echo, and noisy channels.
  • Using voice cloning without consent, disclosure, and abuse-prevention controls.
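
The average-versus-p95 trap is easy to see with numbers. This sketch uses made-up turn latencies and a nearest-rank percentile; the sample values and the `percentile` helper are assumptions for illustration only.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of
    samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical end-to-end turn latencies in milliseconds; one turn stalled.
turn_latencies_ms = [420, 450, 430, 440, 460, 410, 435, 445, 455, 2400]

print("mean:", mean(turn_latencies_ms))
print("p50: ", percentile(turn_latencies_ms, 50))
print("p95: ", percentile(turn_latencies_ms, 95))
```

For this sample the mean is 634.5 ms and the p50 is 440 ms, while the p95 is 2400 ms. The tail, not the average, is the turn users remember.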

🔗 Prerequisites & connections

Builds on: Module 17 (serving, monitoring, rollout, and production discipline), plus the model-selection and evaluation habits from earlier in the track.

Feeds into: system-design modules where realtime architectures, QoS trade-offs, and human-perceived latency become first-class design constraints.

💬 Interview phrasing

  • Why does a slow voice turn feel worse than an equally slow text reply?
  • What is the practical difference between VAD and endpointing in a voice agent?
  • How would you design a low-latency voice pipeline over WebSockets or WebRTC?
  • When would you choose cascaded ASR → LLM → TTS instead of a speech-to-speech model?
  • Which metrics do you inspect first when users complain about interruptions or sluggishness?
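
When answering the interruption questions, it can help to picture barge-in as task cancellation. The sketch below uses Python's `asyncio`; `speak`, `run_turn`, and the timer that stands in for a "user started speaking" event are hypothetical stand-ins, not a real TTS or VAD API.

```python
import asyncio
from typing import List, Tuple


async def speak(chunks: List[str], played: List[str], chunk_s: float) -> None:
    """Stand-in for TTS playback: 'plays' one chunk, then waits chunk_s seconds."""
    for chunk in chunks:
        played.append(chunk)
        await asyncio.sleep(chunk_s)


async def run_turn(
    chunks: List[str], barge_in_after_s: float, chunk_s: float = 0.05
) -> Tuple[List[str], bool]:
    """Play chunks until playback finishes or the user barges in.

    Returns (chunks actually played, whether playback was interrupted).
    """
    played: List[str] = []
    playback = asyncio.create_task(speak(chunks, played, chunk_s))
    # Assumption: a timer simulates the VAD event that signals user speech.
    barge_in = asyncio.create_task(asyncio.sleep(barge_in_after_s))
    done, _ = await asyncio.wait(
        {playback, barge_in}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = playback not in done
    if interrupted:
        playback.cancel()  # stop speaking as soon as the user cuts in
        try:
            await playback
        except asyncio.CancelledError:
            pass
    else:
        barge_in.cancel()
    return played, interrupted


chunks = [f"chunk-{i}" for i in range(10)]
played, interrupted = asyncio.run(run_turn(chunks, barge_in_after_s=0.12))
print(len(played), interrupted)
```

The timestamp worth logging in a real system is the gap between the user's speech onset and the moment audio output actually stops, since buffered audio downstream can keep playing after the task is cancelled.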

⏱️ Difficulty markers

  • 🟢 VAD basics
  • 🟡 streaming ASR vs batch ASR
  • 🟡 endpointing and barge-in
  • 🟡 latency budgeting
  • 🔴 WebRTC / realtime orchestration
  • 🔴 speech-to-speech model selection

Self-check questions

  1. Why does a five-second voice turn feel worse than a five-second text turn? See 02_explainer.md §1-§2.
  2. Whisper is strong, but why is it not automatically a streaming-native solution? See 02_explainer.md §3.
  3. VAD and endpointing sound similar. What is the difference? See 02_explainer.md §3 and 03_study_material.md §4.
  4. What does word-level timestamping buy you in production? See 02_explainer.md §3.
  5. Why do voice teams care about first-audio latency more than total synthesis time? See 02_explainer.md §4.
  6. Describe a relay-race pipeline over WebSockets in six lines. See 02_explainer.md §5.
  7. What metrics must be logged for stage-by-stage latency debugging? See 02_explainer.md §5 and 05_hands_on_lab.md.
  8. When would you pick GPT-4o Realtime or another end-to-end model? See 02_explainer.md §6.
  9. What are the honest limitations of voice AI today? See 02_explainer.md §8.
  10. Which foundation gap is still yours: streaming, WebSockets, latency budgeting, or audio basics? See 02_explainer.md §9.

Health check

By end of Week 18 you should have:

- [ ] A working voice agent or a faithful mocked pipeline with real instrumentation
- [ ] A written latency budget with measured p50 and p95 values
- [ ] A confident explanation of VAD, endpointing, and barge-in
- [ ] A clear view of when to choose cascaded versus end-to-end voice models
- [ ] A final AI engineering module closure note pointing you back to learning/README.md
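
A written latency budget can be as small as a table of per-stage targets checked against measurements. The sketch below is one possible shape; the stage names and millisecond targets in `BUDGET_MS` are hypothetical placeholders, not recommended values, and should be replaced with your own measured numbers.

```python
from typing import Dict, List

# Hypothetical per-stage p95 targets in milliseconds. Replace with your budget.
BUDGET_MS: Dict[str, float] = {
    "endpointing": 300,
    "asr_final": 200,
    "llm_first_token": 400,
    "tts_first_audio": 300,
}


def over_budget(
    measured_p95_ms: Dict[str, float], budget_ms: Dict[str, float] = BUDGET_MS
) -> List[str]:
    """Return the stages whose measured p95 exceeds the target.

    A stage with no measurement counts as over budget, because an
    uninstrumented stage cannot be shown to meet its target.
    """
    return [
        stage
        for stage, target in budget_ms.items()
        if measured_p95_ms.get(stage, float("inf")) > target
    ]


measured = {"endpointing": 280, "asr_final": 250, "llm_first_token": 390}
print(over_budget(measured))
```

Here `asr_final` is flagged for exceeding its target and `tts_first_audio` is flagged because it was never measured.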