01. Week 18 — Voice & Realtime AI

Key concepts to master

Streaming ASR vs batch ASR: partial transcripts, stable prefixes, finalization delay.
Whisper architecture: spectrogram → encoder-decoder → text and timestamps.
Word-level timestamps: alignment, debugging, captions, and interruption analysis.
VAD (voice activity detection): frame-level detection of whether speech is present.
Endpointing: deciding the user actually finished the turn.
Neural TTS: natural speech, streaming synthesis, and first-audio latency.
Voice cloning: consent, disclosure, abuse prevention, and policy checks.
Prosody: rhythm, stress, and emotional tone as product behavior.
WebSocket architecture: persistent bidirectional transport for chunks and events.
Latency budgeting: stage targets, p95 thinking, and instrumentation discipline.
Barge-in: cancel speaking fast and resume listening faster.
Speech-to-speech models: when native voice models beat modular stacks.
Telephony constraints: 8 kHz audio, noise, and more punishing latency.
Foundation audit: streaming, WebSockets, latency math, and audio basics.
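
The split between VAD and endpointing in the list above is easier to see in code. The sketch below is a minimal illustration, not a production detector: it assumes 20 ms frames of float samples, uses a plain energy threshold as the VAD, and treats a fixed run of trailing silence as the endpoint. The names `is_speech` and `find_endpoint` and the values `0.01` and `600` are hypothetical choices for illustration.

```python
from typing import List, Optional, Sequence

FRAME_MS = 20  # assumed frame duration; real systems often use 10-30 ms


def frame_energy(frame: Sequence[float]) -> float:
    """Mean squared amplitude of one frame (0.0 for an empty frame)."""
    return sum(s * s for s in frame) / len(frame) if frame else 0.0


def is_speech(frame: Sequence[float], threshold: float = 0.01) -> bool:
    """VAD: a per-frame yes/no answer, with no notion of turns."""
    return frame_energy(frame) >= threshold


def find_endpoint(
    frames: List[Sequence[float]],
    threshold: float = 0.01,
    min_silence_ms: int = 600,
) -> Optional[int]:
    """Endpointing: return the frame index where the turn is declared over.

    The turn ends only after speech has been heard and then at least
    `min_silence_ms` of continuous silence follows. Returns None if that
    never happens within `frames`.
    """
    heard_speech = False
    silence_ms = 0
    for i, frame in enumerate(frames):
        if is_speech(frame, threshold):
            heard_speech = True
            silence_ms = 0  # any speech resets the silence timer
        elif heard_speech:
            silence_ms += FRAME_MS
            if silence_ms >= min_silence_ms:
                return i
    return None
```

With these defaults, a 200 ms mid-sentence pause does not end the turn, while 600 ms of trailing silence does. Lowering `min_silence_ms` trades dead air for a higher risk of clipping the user; production endpointers usually combine this kind of timer with other signals.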

🧠 Mental models

Streaming ASR vs batch ASR: "a live captioner versus a finished transcript after the meeting ends"
VAD: "a motion sensor for speech energy"
Endpointing: "the referee deciding the speaker is truly done"
Barge-in: "walkie-talkie etiquette—stop talking the instant the other side cuts in"
Realtime transport: "a relay race passing audio chunks and events between runners"
Latency budgeting: "a pit-stop clock where every stage gets milliseconds, not vibes"

⚠️ Common traps

Treating Whisper as streaming-native when chunking, stabilization, and finalization logic still matter.
Confusing VAD with endpointing and causing either clipped users or awkward dead air.
Optimizing average latency while p95 turn latency still makes the product feel broken.
Failing to log first-token, first-audio, partial-transcript stability, and interruption timing.
Ignoring telephony realities like 8 kHz audio, packet loss, echo, and noisy channels.
Using voice cloning without consent, disclosure, and abuse-prevention controls.
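
The average-versus-p95 trap above is simple to demonstrate. The sketch below uses a nearest-rank percentile and made-up turn latencies; the sample values and the 1000 ms budget are assumptions chosen only to show how a healthy-looking mean can hide a slow tail.

```python
import math
from statistics import mean
from typing import Dict, List, Union


def percentile(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(turn_ms: List[float], budget_ms: float) -> Dict[str, Union[float, bool]]:
    """Summarize turn latencies and check the p95 against a budget."""
    p95 = percentile(turn_ms, 95)
    return {
        "mean": mean(turn_ms),
        "p50": percentile(turn_ms, 50),
        "p95": p95,
        "p95_within_budget": p95 <= budget_ms,
    }


# Hypothetical data: 18 fast turns and 2 slow ones.
samples = [600] * 18 + [3000] * 2
report = latency_report(samples, budget_ms=1000)
```

For these samples the mean is 840 ms and the p50 is 600 ms, both under the assumed budget, while the p95 is 3000 ms. One turn in ten takes three seconds, which is the part users remember.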

🔗 Prerequisites & connections

Builds on: Module 17 (serving, monitoring, rollout, and production discipline), plus the model-selection and evaluation habits from earlier in the track.

Feeds into: system-design modules where realtime architectures, QoS trade-offs, and human-perceived latency become first-class design constraints.

💬 Interview phrasing

Why does a slow voice turn feel worse than an equally slow text reply?
What is the practical difference between VAD and endpointing in a voice agent?
How would you design a low-latency voice pipeline over WebSockets or WebRTC?
When would you choose cascaded ASR → LLM → TTS instead of a speech-to-speech model?
Which metrics do you inspect first when users complain about interruptions or sluggishness?

⏱️ Difficulty markers

🟢 VAD basics
🟡 streaming ASR vs batch ASR
🟡 endpointing and barge-in
🟡 latency budgeting
🔴 WebRTC / realtime orchestration
🔴 speech-to-speech model selection

Self-check questions

Why does a five-second voice turn feel worse than a five-second text turn? See 02_explainer.md §1-§2.
Whisper is strong, but why is it not automatically a streaming-native solution? See 02_explainer.md §3.
VAD and endpointing sound similar. What is the difference? See 02_explainer.md §3 and 03_study_material.md §4.
What does word-level timestamping buy you in production? See 02_explainer.md §3.
Why do voice teams care about first-audio latency more than total synthesis time? See 02_explainer.md §4.
Describe a relay-race pipeline over WebSockets in six lines. See 02_explainer.md §5.
What metrics must be logged for stage-by-stage latency debugging? See 02_explainer.md §5 and 05_hands_on_lab.md.
When would you pick GPT-4o Realtime or another end-to-end model? See 02_explainer.md §6.
What are the honest limitations of voice AI today? See 02_explainer.md §8.
Which foundation gap is still yours: streaming, WebSockets, latency budgeting, or audio basics? See 02_explainer.md §9.
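
As a warm-up for the streaming questions above, here is one simple way to think about partial-transcript stability: treat a word as stable once the most recent partial hypotheses agree on it as a prefix. This is an illustrative agreement heuristic under that assumption, not how Whisper or any particular ASR service finalizes text; `stable_prefix` and the `agree` parameter are hypothetical names.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Longest shared word prefix of two token lists."""
    out: List[str] = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


def stable_prefix(partials: List[str], agree: int = 2) -> str:
    """Words shared as a prefix by the last `agree` partial hypotheses.

    Returns an empty string until at least `agree` partials have arrived.
    """
    if agree < 1 or len(partials) < agree:
        return ""
    recent = [p.split() for p in partials[-agree:]]
    prefix = recent[0]
    for words in recent[1:]:
        prefix = common_prefix(prefix, words)
    return " ".join(prefix)


# Hypothetical partials from a streaming recognizer that revises one word.
partials = [
    "book a",
    "book a fight",
    "book a flight to",
    "book a flight to Boston",
]
```

After the third partial the stable prefix is still `book a`, because the recognizer changed "fight" to "flight". Only after the fourth does `book a flight to` become safe to act on. Raising `agree` makes the prefix more trustworthy but adds finalization delay.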

Health check

By end of Week 18 you should have:

- [ ] A working voice agent or a faithful mocked pipeline with real instrumentation
- [ ] A written latency budget with measured p50 and p95 values
- [ ] A confident explanation of VAD, endpointing, and barge-in
- [ ] A clear view of when to choose cascaded versus end-to-end voice models
- [ ] A final AI engineering module closure note pointing you back to learning/README.md