Exercise 13 — Voice Agent Loop

Timebox: 60-90 minutes

Goal

Implement the core control loop of a voice agent: VAD → streaming STT → LLM → streaming TTS, with end-of-turn detection and barge-in. Mock the providers: the point of the exercise is the pipeline structure and the timing logic, not provider integration.

Work in

  • voice_agent.py

Tasks

  1. Async pipeline with four stages running concurrently (asyncio.Queue between them).
  2. Mock vad, stt, llm, tts coroutines that produce/consume frames with realistic per-stage latency (e.g., 50ms VAD chunks, 100ms STT chunks, 200ms LLM TTFT, 150ms TTS first audio).
  3. End-of-turn detection: a hybrid of a silence threshold and a stub semantic check.
  4. Barge-in: when VAD detects user speech while TTS is producing audio, drop the TTS buffer and cancel the in-flight LLM task.
  5. Per-turn metrics: t0..t5 (see assignment.md for Module 18) logged to a JSON list and aggregated.
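One possible shape for the barge-in step (task 4) is sketched below. It is a minimal illustration, not a reference solution: `mock_llm`, `drain`, `barge_in`, and `demo` are hypothetical names, the token list is made up, and the 200 ms first-token delay simply reuses the example latency from task 2. The idea is to cancel the LLM task first, wait for it to unwind, and only then flush whatever is still queued for TTS, so that no late token lands in the buffer after it has been cleared.

```python
import asyncio


async def mock_llm(prompt: str, out: asyncio.Queue) -> None:
    """Hypothetical mock LLM: ~200 ms to first token, then one token per 50 ms."""
    await asyncio.sleep(0.2)
    for token in ("Sure,", "here", "is", "a", "long", "answer"):
        await out.put(token)
        await asyncio.sleep(0.05)


def drain(queue: asyncio.Queue) -> int:
    """Drop everything buffered for TTS and return how many items were dropped."""
    dropped = 0
    while True:
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            return dropped
        dropped += 1


async def barge_in(llm_task: asyncio.Task, tts_buffer: asyncio.Queue) -> int:
    """Cancel the in-flight LLM task, wait for it to finish unwinding, flush TTS."""
    llm_task.cancel()
    try:
        await llm_task
    except asyncio.CancelledError:
        pass  # expected: the task was cancelled on purpose
    return drain(tts_buffer)


async def demo() -> dict:
    tts_buffer: asyncio.Queue = asyncio.Queue()
    llm_task = asyncio.create_task(mock_llm("hello", tts_buffer))
    await asyncio.sleep(0.33)  # simulated: VAD reports user speech mid-response
    dropped = await barge_in(llm_task, tts_buffer)
    return {
        "cancelled": llm_task.cancelled(),
        "dropped": dropped,
        "buffer_empty": tts_buffer.empty(),
    }


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a full pipeline the `dropped` count and the cancellation time are natural candidates for the per-turn metrics, since they are what shows that the TTS was cut off.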

Done when

  • A simulated 5-turn conversation runs end-to-end
  • One turn includes a barge-in scenario; the metrics show the TTS was cut off
  • One turn includes a slow speaker (long pause mid-utterance) that doesn't trigger early end-of-turn
  • The latency report prints stage-level and end-to-end p50/p95
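The latency report in the last criterion could be computed along these lines. This is a sketch under several assumptions: timestamps are integer milliseconds stored under the keys `t0` to `t5`, each "stage" is taken to be the gap between consecutive timestamps (the actual meaning of each timestamp is defined in assignment.md and is not assumed here), percentiles use the nearest-rank method, and a turn that was cut short may simply lack its later timestamps. `percentile` and `latency_report` are hypothetical helper names, and the sample data is synthetic.

```python
import json
import math

STAMPS = ["t0", "t1", "t2", "t3", "t4", "t5"]


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_report(turns):
    """Aggregate p50/p95 for each consecutive-timestamp gap and for t0->t5."""
    spans = list(zip(STAMPS, STAMPS[1:])) + [("t0", "t5")]
    report = {}
    for start, end in spans:
        # Skip turns that never reached this timestamp (e.g. an interrupted turn).
        deltas = [t[end] - t[start] for t in turns if start in t and end in t]
        if deltas:
            report[f"{start}->{end}"] = {
                "n": len(deltas),
                "p50": percentile(deltas, 50),
                "p95": percentile(deltas, 95),
            }
    return report


# Synthetic example data: five turns, timestamps in ms relative to t0.
turns = [
    {"t0": 0, "t1": 100 + 10 * i, "t2": 300 + 10 * i, "t3": 520 + 10 * i,
     "t4": 700 + 10 * i, "t5": 1500 + 10 * i}
    for i in range(5)
]
turns[2].pop("t5")  # assumption: an interrupted turn has no final timestamp

if __name__ == "__main__":
    print(json.dumps(latency_report(turns), indent=2))
```

With only five turns, p95 under nearest-rank is just the slowest turn; the percentiles become more informative once the scenario is replayed many times.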

Stretch

  • Replace one mock with a real provider (Deepgram or Cartesia) and re-run
  • Add OpenTelemetry spans per stage
  • Add a "test scenario" runner that replays scripted user utterances
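A scripted-replay runner can also exercise the hybrid end-of-turn rule from task 3 without real audio. The sketch below runs in simulated time (no sleeping) and is only one way to combine the two signals: the 400 ms soft threshold, the 1500 ms hard timeout, the filler-word list, and the names `looks_complete`, `end_of_turn`, and `replay` are all assumptions made for illustration. The turn ends after a short silence only if the stub semantic check thinks the utterance is complete; otherwise it waits for the hard timeout.

```python
FRAME_MS = 50            # VAD frame size, matching the example in task 2
SOFT_SILENCE_MS = 400    # assumed: silence needed when the utterance looks complete
HARD_SILENCE_MS = 1500   # assumed: silence that always ends the turn
TRAILING_FILLERS = {"and", "but", "so", "um", "uh", "the", "to"}


def looks_complete(transcript: str) -> bool:
    """Stub semantic check: an utterance ending on a filler word is unfinished."""
    words = transcript.lower().rstrip(".?!").split()
    return bool(words) and words[-1] not in TRAILING_FILLERS


def end_of_turn(silence_ms: int, transcript: str) -> bool:
    """Hybrid rule: hard timeout, or soft threshold plus the semantic stub."""
    if silence_ms >= HARD_SILENCE_MS:
        return True
    return silence_ms >= SOFT_SILENCE_MS and looks_complete(transcript)


def replay(script):
    """Replay (text, speech_ms, pause_ms) segments in simulated time.

    Returns (elapsed_ms, transcript) when end-of-turn fires, or None if it never does.
    """
    clock = 0
    transcript = ""
    for text, speech_ms, pause_ms in script:
        clock += speech_ms
        transcript = (transcript + " " + text).strip()
        silence = 0
        while silence < pause_ms:
            silence += FRAME_MS
            clock += FRAME_MS
            if end_of_turn(silence, transcript):
                return clock, transcript
    return None


# Slow speaker: a 900 ms pause mid-utterance, then a long pause at the end.
slow_speaker = [
    ("I would like to book a flight to", 1200, 900),
    ("Boston tomorrow morning", 1000, 2000),
]

if __name__ == "__main__":
    print(replay(slow_speaker))
```

Here the mid-utterance pause does not end the turn because the partial transcript ends on "to". The stub is easy to fool: a pause after a phrase that merely looks complete would still trigger early, which is exactly the kind of case a scripted scenario can pin down.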