06. Module 18 Review: Voice & Realtime AI
Focus: ASR, TTS, VAD, endpointing, WebSockets, latency budgets, barge-in, speech-to-speech models, and closing out the AI engineering curriculum.
Review loop
- Re-answer the self-checks in 01_weekly_plan.md from memory.
- Use 04_daily_recall.md to explain the ear, the brain, the voice, the relay race, and the awkward pause without notes.
- Re-read 02_explainer.md §5-§10 and restate your latency budget aloud.
- Re-open 05_hands_on_lab.md and say what your slowest stage was, why, and what you would optimize next.
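If your lab did not record per-stage timings, a small timer like the one below is one way to find the slowest stage. It is a minimal sketch, not the lab's actual instrumentation: `StageTimer`, the stage names, and the injectable `clock` parameter are all hypothetical names introduced here for illustration.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage (hypothetical helper)."""

    def __init__(self, clock=time.perf_counter):
        # The clock is injectable so the timer can be tested without sleeping.
        self.clock = clock
        self.durations_ms = {}

    @contextmanager
    def stage(self, name):
        start = self.clock()
        try:
            yield
        finally:
            # Record elapsed time even if the stage raised an exception.
            elapsed_ms = (self.clock() - start) * 1000.0
            self.durations_ms[name] = self.durations_ms.get(name, 0.0) + elapsed_ms

    def slowest(self):
        """Return the name of the stage with the largest accumulated time."""
        return max(self.durations_ms, key=self.durations_ms.get)


if __name__ == "__main__":
    timer = StageTimer()
    with timer.stage("asr"):
        time.sleep(0.01)  # stand-in for real ASR work
    with timer.stage("llm"):
        time.sleep(0.03)  # stand-in for real LLM work
    with timer.stage("tts"):
        time.sleep(0.01)  # stand-in for real TTS work
    print(timer.durations_ms, "slowest:", timer.slowest())
```

A single run only shows one sample per stage; to talk about p50 and p95 you would collect these durations across many turns.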
Reflection prompts
- Which voice-system decision still feels fuzzy: ASR choice, endpointing, TTS, barge-in, or end-to-end model choice?
- Which failure mode surprised you most once you imagined or built the pipeline?
- Where did this module borrow from earlier modules like agents, evals, and MLOps?
- If you had one more week, would you improve quality, speed, or observability first?
- What would you say differently now in a “design a voice agent” interview answer?
Embedded checkpoint
Conceptual
- Why does voice latency feel socially harsher than text latency? See 02_explainer.md §1-§2.
- Explain Whisper at a high level and say why streaming wrappers still need careful engineering. See 02_explainer.md §3.
- Separate VAD, endpointing, and word-level timestamps cleanly. See 02_explainer.md §3.
- Why does first-audio latency matter more than total TTS time? See 02_explainer.md §4.
- Draw the relay-race pipeline and explain TTFT, TTFA, and barge-in. See 02_explainer.md §5.
- When would you choose a native speech-to-speech model over a cascaded pipeline? See 02_explainer.md §6.
- Name two honest limitations of present-day voice AI. See 02_explainer.md §8.
- Which foundation gap still needs work for you, and how will you patch it? See 02_explainer.md §9.
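To make the VAD versus endpointing distinction concrete, here is a toy sketch. It assumes 20 ms frames and uses a bare energy threshold as a stand-in for a real VAD (production systems typically use a trained model). `is_speech`, `Endpointer`, `hangover_ms`, and `end_of_turn_frame` are hypothetical names, not from any library. The VAD answers "is this frame speech?"; the endpointer answers "has the speaker finished their turn?".

```python
from dataclasses import dataclass

FRAME_MS = 20  # assumed frame length in milliseconds


def is_speech(frame_energy: float, threshold: float = 0.01) -> bool:
    """Toy VAD: a per-frame speech/non-speech decision from energy alone."""
    return frame_energy >= threshold


@dataclass
class Endpointer:
    """Turn-level decision: end of turn after enough trailing silence."""

    hangover_ms: int = 500
    heard_speech: bool = False
    silence_ms: int = 0

    def update(self, speech: bool) -> bool:
        if speech:
            self.heard_speech = True
            self.silence_ms = 0  # any speech resets the silence counter
            return False
        if not self.heard_speech:
            return False  # leading silence is not an end of turn
        self.silence_ms += FRAME_MS
        return self.silence_ms >= self.hangover_ms


def end_of_turn_frame(energies, hangover_ms):
    """Return the index of the frame where end of turn fires, or None."""
    endpointer = Endpointer(hangover_ms=hangover_ms)
    for index, energy in enumerate(energies):
        if endpointer.update(is_speech(energy)):
            return index
    return None


if __name__ == "__main__":
    # Speech, a 400 ms thinking pause, more speech, then 800 ms of silence.
    energies = [0.0] * 5 + [0.5] * 10 + [0.0] * 20 + [0.5] * 10 + [0.0] * 40
    print(end_of_turn_frame(energies, hangover_ms=300))  # fires inside the pause
    print(end_of_turn_frame(energies, hangover_ms=600))  # waits for the real end
```

The same VAD output gives different turn boundaries depending on the hangover: a short hangover cuts in during the thinking pause, while a longer one waits but adds that silence to every response.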
Applied
- Design a browser-based support voice agent and assign a p95 budget to each stage.
- Your agent interrupts thoughtful speakers too early. What changes do you test first?
- Your users say the agent feels slow, but STT accuracy is good. Where do you look next?
- A founder wants voice cloning in production next week. What must be true before launch?
- Telephony rollout doubled complaints. What changed technically, and how do you debug it?
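For the budget question above, one way to check measurements against a per-stage p95 budget is sketched below. The stage names and millisecond figures are placeholders invented for illustration, not recommended targets, and `percentile` uses the simple nearest-rank definition rather than any particular library's interpolation.

```python
import math

# Hypothetical p95 budget per stage, in milliseconds (placeholder numbers).
BUDGET_P95_MS = {"asr": 300, "llm_ttft": 400, "tts_ttfa": 200}


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def report(samples_by_stage, budget=BUDGET_P95_MS):
    """Summarize p50 and p95 per stage and flag stages over their p95 budget."""
    rows = []
    for stage, samples in samples_by_stage.items():
        p95 = percentile(samples, 95)
        rows.append({
            "stage": stage,
            "p50": percentile(samples, 50),
            "p95": p95,
            "over_budget": p95 > budget[stage],
        })
    return rows


def worst_stage(rows, budget=BUDGET_P95_MS):
    """Stage whose p95 exceeds (or comes closest to exceeding) its budget."""
    return max(rows, key=lambda row: row["p95"] - budget[row["stage"]])["stage"]


if __name__ == "__main__":
    # Made-up latency samples in milliseconds, one list per stage.
    samples = {
        "asr": [100, 120, 150, 180, 250],
        "llm_ttft": [200, 250, 300, 350, 600],
        "tts_ttfa": [80, 90, 100, 110, 150],
    }
    rows = report(samples)
    for row in rows:
        print(row)
    print("look here first:", worst_stage(rows))
```

Per-stage p95 values do not add up to the end-to-end p95, because slow stages rarely all coincide on the same turn, so it is worth measuring the whole turn separately as well.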
Self-evaluation
| Section | Score | Out of |
|---|---|---|
| Conceptual | __ | 16 |
| Applied | __ | 10 |
| Total | __ | 26 |
Completion gate
- [ ] 01_weekly_plan.md completed
- [ ] 02_explainer.md read end to end
- [ ] 03_study_material.md reviewed with notes
- [ ] 05_hands_on_lab.md shipped or faithfully mocked with instrumentation
- [ ] Stage-level latency budget written down with p50 and p95
- [ ] Can explain VAD, endpointing, barge-in, and speech-to-speech tradeoffs cold
- [ ] Ready to return to learning/README.md for the system design track and coding exercises
Final bridge
This completes the AI engineering curriculum. Return to learning/README.md for the system design track and coding exercises, then keep the voice-latency mindset with you when you study larger distributed systems.