
06. Module 18 Review — Voice & Realtime AI

Focus: ASR, TTS, VAD, endpointing, WebSockets, latency budgets, barge-in, speech-to-speech models, and closing out the AI engineering curriculum.

Review loop

  1. Re-answer the self-checks in 01_weekly_plan.md from memory.
  2. Use 04_daily_recall.md to explain the ear, the brain, the voice, the relay race, and the awkward pause without notes.
  3. Re-read 02_explainer.md §5-§10 and restate your latency budget aloud.
  4. Re-open 05_hands_on_lab.md and say what your slowest stage was, why, and what you would optimize next.

Reflection prompts

  • Which voice-system decision still feels fuzzy: ASR choice, endpointing, TTS, barge-in, or end-to-end model choice?
  • Which failure mode surprised you most once you imagined or built the pipeline?
  • Where did this module borrow from earlier modules like agents, evals, and MLOps?
  • If you had one more week, would you improve quality, speed, or observability first?
  • What would you say differently now in a “design a voice agent” interview answer?

Embedded checkpoint

Conceptual

  1. Why does voice latency feel socially harsher than text latency? See 02_explainer.md §1-§2.
  2. Explain Whisper at a high level and say why streaming wrappers still need careful engineering. See 02_explainer.md §3.
  3. Separate VAD, endpointing, and word-level timestamps cleanly. See 02_explainer.md §3.
  4. Why does first-audio latency matter more than total TTS time? See 02_explainer.md §4.
  5. Draw the relay-race pipeline and explain TTFT, TTFA, and barge-in. See 02_explainer.md §5.
  6. When would you choose a native speech-to-speech model over a cascaded pipeline? See 02_explainer.md §6.
  7. Name two honest limitations of present-day voice AI. See 02_explainer.md §8.
  8. Which foundation gap still needs work for you, and how will you patch it? See 02_explainer.md §9.
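For Conceptual question 3, a toy sketch may help keep VAD and endpointing apart. It assumes frames are plain lists of float samples and uses a fixed energy threshold as a stand-in for a real VAD model; the names `frame_is_speech` and `find_endpoint`, the 20 ms frame size, and the thresholds are all hypothetical choices for illustration, not something taken from 02_explainer.md.

```python
def frame_is_speech(frame, energy_threshold=0.01):
    """Toy VAD: is the mean squared amplitude of this one frame above a threshold?"""
    if not frame:
        return False
    energy = sum(x * x for x in frame) / len(frame)
    return energy > energy_threshold


def find_endpoint(frames, frame_ms=20, min_silence_ms=600, energy_threshold=0.01):
    """Toy endpointing: return the frame index where the turn is declared over.

    VAD answers "is this frame speech?". Endpointing is a separate policy on
    top of it: here, the turn ends once speech has been heard and then
    min_silence_ms of consecutive non-speech frames have passed.
    Returns None if no endpoint is reached.
    """
    needed = min_silence_ms // frame_ms
    heard_speech = False
    silent_run = 0
    for i, frame in enumerate(frames):
        if frame_is_speech(frame, energy_threshold):
            heard_speech = True
            silent_run = 0
        elif heard_speech:
            silent_run += 1
            if silent_run >= needed:
                return i
    return None


# Synthetic audio: a speaker who pauses for 200 ms mid-thought, then continues.
SPEECH = [0.5] * 160
SILENCE = [0.0] * 160
FRAMES = [SILENCE] * 5 + [SPEECH] * 10 + [SILENCE] * 10 + [SPEECH] * 5 + [SILENCE] * 40
```

With these synthetic frames, a 200 ms silence timeout would cut the speaker off during the mid-thought pause, while a 600 ms timeout waits for the real end of the turn at the cost of a longer gap before the agent replies. That tradeoff is the same one Applied question 2 asks about.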

Applied

  1. Design a browser-based support voice agent and assign a p95 budget to each stage.
  2. Your agent interrupts thoughtful speakers too early. What changes do you test first?
  3. Your users say the agent feels slow, but STT accuracy is good. Where do you look next?
  4. A founder wants voice cloning in production next week. What must be true before launch?
  5. Telephony rollout doubled complaints. What changed technically, and how do you debug it?
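For Applied question 1, one possible way to write a per-stage budget down is sketched below. The stage names and millisecond values are illustrative assumptions, not recommendations or measurements; replace them with numbers from your own lab run. Summing the stages also assumes they run strictly one after another, and per-stage p95 values do not add up to the true end-to-end p95, so treat the total as a planning figure only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageBudget:
    name: str
    p95_ms: int  # hypothetical target, not a measured value


def total_budget_ms(stages):
    """Sum per-stage budgets; assumes stages run strictly in sequence."""
    return sum(s.p95_ms for s in stages)


def over_budget(stages, measured_p95_ms):
    """Return names of stages whose measured p95 exceeds the assigned budget.

    Stages missing from measured_p95_ms are treated as within budget.
    """
    return [s.name for s in stages if measured_p95_ms.get(s.name, 0) > s.p95_ms]


# Illustrative numbers only.
BUDGET = [
    StageBudget("network_uplink", 50),
    StageBudget("endpointing", 300),
    StageBudget("asr_final", 150),
    StageBudget("llm_ttft", 350),
    StageBudget("tts_ttfa", 200),
    StageBudget("network_downlink", 50),
]

if __name__ == "__main__":
    print("planned total (ms):", total_budget_ms(BUDGET))
    # Hypothetical measurements from a lab run.
    measured = {"llm_ttft": 500, "tts_ttfa": 180}
    print("over budget:", over_budget(BUDGET, measured))
```

Keeping the budget as data makes it easy to compare against measured stage timings and to see which stage to optimize first.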

Self-evaluation

Section      Score   Out of
Conceptual   __      16
Applied      __      10
Total        __      26

Completion gate

Final bridge

This completes the AI engineering curriculum. Return to learning/README.md for the system design track and coding exercises, then keep the voice-latency mindset with you when you study larger distributed systems.