01. Week 18: Voice & Realtime AI
Key concepts to master
- Streaming ASR vs batch ASR: partial transcripts, stable prefixes, finalization delay.
- Whisper architecture: spectrogram → encoder-decoder → text and timestamps.
- Word-level timestamps: alignment, debugging, captions, and interruption analysis.
- VAD (voice activity detection): frame-level detection of whether speech is present.
- Endpointing: deciding the user actually finished the turn.
- Neural TTS: natural speech, streaming synthesis, and first-audio latency.
- Voice cloning: consent, disclosure, abuse prevention, and policy checks.
- Prosody: rhythm, stress, and emotional tone as product behavior.
- WebSocket architecture: persistent bidirectional transport for chunks and events.
- Latency budgeting: stage targets, p95 thinking, and instrumentation discipline.
- Barge-in: cancel speaking fast and resume listening faster.
- Speech-to-speech models: when native voice models beat modular stacks.
- Telephony constraints: 8 kHz narrowband audio, line noise, and a less forgiving latency budget.
- Foundation audit: streaming, WebSockets, latency math, and audio basics.
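
The split between VAD and endpointing in the list above can be sketched in a few lines. This is a minimal illustration, not a production detector: the energy threshold, the 20 ms frame length, and the 600 ms trailing-silence window are assumed values, and `frame_is_speech` and `find_endpoint` are hypothetical names. Real systems typically use a trained VAD model rather than raw RMS energy.

```python
import math
from typing import Iterable, List, Optional

FRAME_MS = 20  # assumed frame length in milliseconds


def frame_is_speech(frame: List[float], threshold: float = 0.02) -> bool:
    """Frame-level VAD: treat a frame as speech if its RMS energy meets the threshold."""
    if not frame:
        return False
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms >= threshold


def find_endpoint(
    frames: Iterable[List[float]],
    trailing_silence_ms: int = 600,
) -> Optional[int]:
    """Endpointing: return the index of the frame where the turn is declared
    finished, or None if the user never finished within these frames.

    The turn ends only after speech has been heard AND enough continuous
    silence has followed it. A short pause resets nothing permanently; new
    speech simply restarts the silence counter.
    """
    heard_speech = False
    silence_ms = 0
    for i, frame in enumerate(frames):
        if frame_is_speech(frame):
            heard_speech = True
            silence_ms = 0
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= trailing_silence_ms:
                return i
    return None


loud = [0.1] * 160   # synthetic "speech" frame
quiet = [0.0] * 160  # synthetic silence frame

# Leading silence, speech, a 200 ms pause, more speech, then a long silence.
frames = [quiet] * 5 + [loud] * 10 + [quiet] * 10 + [loud] * 5 + [quiet] * 40
print(find_endpoint(frames))  # 59
```

The 200 ms pause in the middle does not end the turn, which is the behavior that separates endpointing from frame-level VAD. Shortening `trailing_silence_ms` makes the agent respond sooner but clips users who pause mid-sentence.
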
🧠 Mental models
- Streaming ASR vs batch ASR: "a live captioner versus a finished transcript after the meeting ends"
- VAD: "a motion sensor for speech energy"
- Endpointing: "the referee deciding the speaker is truly done"
- Barge-in: "walkie-talkie etiquette—stop talking the instant the other side cuts in"
- Realtime transport: "a relay race passing audio chunks and events between runners"
- Latency budgeting: "a pit-stop clock where every stage gets milliseconds, not vibes"
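
The barge-in model above maps naturally onto task cancellation. The sketch below uses `asyncio` from the standard library; `speak`, `run_turn`, and `demo` are hypothetical names, and the sleep stands in for real audio playout. It assumes some other component (for example a VAD on the incoming audio) sets the `barge_in` event.

```python
import asyncio
from typing import List


async def speak(chunks: List[str], played: List[str], chunk_seconds: float) -> None:
    """Stand-in for TTS playback: 'plays' one chunk at a time."""
    for chunk in chunks:
        await asyncio.sleep(chunk_seconds)  # pretend this is audio playout
        played.append(chunk)


async def run_turn(
    chunks: List[str],
    barge_in: asyncio.Event,
    played: List[str],
    chunk_seconds: float = 0.05,
) -> bool:
    """Play chunks until they finish or barge_in is set. Returns True if interrupted."""
    speak_task = asyncio.create_task(speak(chunks, played, chunk_seconds))
    barge_task = asyncio.create_task(barge_in.wait())
    done, _pending = await asyncio.wait(
        {speak_task, barge_task}, return_when=asyncio.FIRST_COMPLETED
    )
    interrupted = barge_task in done and not speak_task.done()
    # Cancel whichever task is still running, then wait for both to settle.
    for task in (speak_task, barge_task):
        if not task.done():
            task.cancel()
    await asyncio.gather(speak_task, barge_task, return_exceptions=True)
    return interrupted


async def demo():
    played: List[str] = []
    barge_in = asyncio.Event()
    # Simulate the user cutting in 120 ms after the agent starts speaking.
    asyncio.get_running_loop().call_later(0.12, barge_in.set)
    chunks = [f"chunk-{i}" for i in range(10)]
    interrupted = await run_turn(chunks, barge_in, played)
    return interrupted, played


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real agent the cancellation would also need to flush any audio already buffered on the client, since stopping the producer does not silence chunks that were sent earlier.
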
⚠️ Common traps
- Treating Whisper as streaming-native when chunking, stabilization, and finalization logic still matter.
- Confusing VAD with endpointing and causing either clipped users or awkward dead air.
- Optimizing average latency while p95 turn latency still makes the product feel broken.
- Failing to log first-token latency, first-audio latency, partial-transcript stability, and interruption timing.
- Ignoring telephony realities like 8 kHz audio, packet loss, echo, and noisy channels.
- Using voice cloning without consent, disclosure, and abuse-prevention controls.
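
The first trap is easier to see with a concrete stabilization rule. One simple heuristic, assumed here purely for illustration, is to treat as stable only the words on which the most recent partial hypotheses agree. `stable_prefix` is a hypothetical helper and the transcripts are made up; production systems often layer timing and confidence signals on top of something like this.

```python
from typing import List


def stable_prefix(partials: List[str], agree: int = 2) -> List[str]:
    """Return the leading words shared by the last `agree` partial hypotheses.

    Words beyond this prefix may still be revised by the recognizer, so a
    caller would avoid acting on them until they stabilize.
    """
    if agree < 1 or len(partials) < agree:
        return []
    recent = [p.split() for p in partials[-agree:]]
    prefix: List[str] = []
    for words in zip(*recent):  # stops at the shortest hypothesis
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return prefix


partials = ["turn on", "turn on the", "turn on the lights"]
print(stable_prefix(partials))  # ['turn', 'on', 'the']

# A later hypothesis revises an early word, so agreement shrinks.
partials.append("turn off the lights in")
print(stable_prefix(partials))  # ['turn']
```

The second call shows why finalization logic still matters: this sketch does not prevent the agreed prefix from shrinking, so a real pipeline needs a policy for text it has already committed.
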
🔗 Prerequisites & connections
Builds on: Module 17 (serving, monitoring, rollout, and production discipline), plus the model-selection and evaluation habits from earlier in the track.
Feeds into: system-design modules where realtime architectures, QoS trade-offs, and human-perceived latency become first-class design constraints.
💬 Interview phrasing
- Why does a slow voice turn feel worse than an equally slow text reply?
- What is the practical difference between VAD and endpointing in a voice agent?
- How would you design a low-latency voice pipeline over WebSockets or WebRTC?
- When would you choose cascaded ASR → LLM → TTS instead of a speech-to-speech model?
- Which metrics do you inspect first when users complain about interruptions or sluggishness?
⏱️ Difficulty markers
- 🟢 VAD basics
- 🟡 streaming ASR vs batch ASR
- 🟡 endpointing and barge-in
- 🟡 latency budgeting
- 🔴 WebRTC / realtime orchestration
- 🔴 speech-to-speech model selection
Self-check questions
- Why does a five-second voice turn feel worse than a five-second text turn? See 02_explainer.md §1-§2.
- Whisper is strong, but why is it not automatically a streaming-native solution? See 02_explainer.md §3.
- VAD and endpointing sound similar. What is the difference? See 02_explainer.md §3 and 03_study_material.md §4.
- What does word-level timestamping buy you in production? See 02_explainer.md §3.
- Why do voice teams care about first-audio latency more than total synthesis time? See 02_explainer.md §4.
- Describe a relay-race pipeline over WebSockets in six lines. See 02_explainer.md §5.
- What metrics must be logged for stage-by-stage latency debugging? See 02_explainer.md §5 and 05_hands_on_lab.md.
- When would you pick GPT-4o Realtime or another end-to-end model? See 02_explainer.md §6.
- What are the honest limitations of voice AI today? See 02_explainer.md §8.
- Which foundation gap is still yours: streaming, WebSockets, latency budgeting, or audio basics? See 02_explainer.md §9.
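
For the question on stage-by-stage latency logging, one possible shape is a per-turn timeline of timestamped events. The event names below (`user_speech_end`, `asr_final`, `llm_first_token`, `tts_first_audio`) and the `TurnTimeline` class are assumptions for illustration, not names taken from the lab, and the timestamps are invented.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical event names, in the order they occur within one turn.
EVENTS = ["user_speech_end", "asr_final", "llm_first_token", "tts_first_audio"]


@dataclass
class TurnTimeline:
    marks: Dict[str, float] = field(default_factory=dict)

    def mark(self, event: str, at: Optional[float] = None) -> None:
        """Record when an event happened (monotonic seconds unless `at` is given)."""
        self.marks[event] = time.monotonic() if at is None else at

    def _seen(self) -> List[str]:
        return [e for e in EVENTS if e in self.marks]

    def stage_ms(self) -> Dict[str, float]:
        """Milliseconds between each pair of consecutive recorded events."""
        seen = self._seen()
        return {
            f"{a}->{b}": round((self.marks[b] - self.marks[a]) * 1000, 1)
            for a, b in zip(seen, seen[1:])
        }

    def total_ms(self) -> float:
        """Milliseconds from the first recorded event to the last one."""
        seen = self._seen()
        if len(seen) < 2:
            return 0.0
        return round((self.marks[seen[-1]] - self.marks[seen[0]]) * 1000, 1)


timeline = TurnTimeline()
timeline.mark("user_speech_end", at=0.00)
timeline.mark("asr_final", at=0.30)
timeline.mark("llm_first_token", at=0.65)
timeline.mark("tts_first_audio", at=0.85)

print(timeline.stage_ms())
print(timeline.total_ms())  # 850.0
```

Logging one such record per turn makes it possible to attribute a slow turn to a specific stage instead of guessing from the end-to-end number.
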
Health check
By end of Week 18 you should have:
- [ ] A working voice agent or a faithful mocked pipeline with real instrumentation
- [ ] A written latency budget with measured p50 and p95 values
- [ ] A confident explanation of VAD, endpointing, and barge-in
- [ ] A clear view of when to choose cascaded versus end-to-end voice models
- [ ] A closing note for this final AI engineering module that points you back to learning/README.md
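
For the latency-budget item in the checklist above, the sketch below shows why p95 is checked separately from the average. The samples and the 800 ms budget are invented for illustration, `percentile` uses the nearest-rank definition (one of several common definitions), and `check_budget` is a hypothetical helper.

```python
import math
from statistics import mean
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_budget(samples_ms: List[float], p95_budget_ms: float) -> Dict[str, object]:
    """Summarize turn latencies and compare p95 against the budget."""
    p95 = percentile(samples_ms, 95)
    return {
        "mean_ms": mean(samples_ms),
        "p50_ms": percentile(samples_ms, 50),
        "p95_ms": p95,
        "within_budget": p95 <= p95_budget_ms,
    }


# Invented data: 18 fast turns and 2 slow ones.
turn_latencies_ms = [600.0] * 18 + [2500.0] * 2
report = check_budget(turn_latencies_ms, p95_budget_ms=800.0)
print(report)
# mean_ms is 790.0 and p50_ms is 600.0, yet p95_ms is 2500.0,
# so within_budget is False.
```

The mean sits just under the assumed budget while one turn in ten takes 2.5 seconds, which is the situation the p95 check is meant to catch.
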