Exercise 13: Voice Agent Loop

Timebox: 60-90 minutes

Goal

Implement the core control loop of a voice agent: VAD → streaming STT → LLM → streaming TTS, with end-of-turn detection and barge-in. Mock the providers; the point of the exercise is the pipeline structure and the timing logic.
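As a starting point, the four-stage chain described above might be wired up as in the minimal sketch below. It is not a reference solution: the `stage` and `run_pipeline` helpers, the `STOP` sentinel, and the scaled-down `LATENCY` values are all assumptions made for illustration, and each mock stage simply tags the item it receives so the flow through the queues is visible.

```python
import asyncio

# Hypothetical per-stage latencies in seconds, scaled down so the demo runs fast.
LATENCY = {"vad": 0.005, "stt": 0.010, "llm": 0.020, "tts": 0.015}
STOP = None  # sentinel that tells the next stage to shut down


async def stage(name, inbox, outbox):
    """Generic mock stage: wait, tag the item, pass it downstream."""
    while True:
        item = await inbox.get()
        if item is STOP:
            await outbox.put(STOP)  # propagate shutdown to the next stage
            return
        await asyncio.sleep(LATENCY[name])  # simulated processing time
        await outbox.put(f"{name}({item})")


async def run_pipeline(frames):
    # One queue per hop: mic -> vad -> stt -> llm -> tts -> speaker
    queues = [asyncio.Queue() for _ in range(5)]
    names = ["vad", "stt", "llm", "tts"]
    tasks = [
        asyncio.create_task(stage(name, queues[i], queues[i + 1]))
        for i, name in enumerate(names)
    ]

    # Feed the mock microphone frames, then signal end of input.
    for frame in frames:
        await queues[0].put(frame)
    await queues[0].put(STOP)

    # Collect whatever reaches the mock speaker.
    out = []
    while (item := await queues[-1].get()) is not STOP:
        out.append(item)
    await asyncio.gather(*tasks)
    return out


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["f0", "f1"])))
```

Because every stage runs as its own task, a later frame can be in the VAD stage while an earlier one is still in the LLM stage. A real solution would replace the generic `stage` with distinct coroutines per provider.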

Work in

voice_agent.py

Tasks

Async pipeline with four stages running concurrently, connected by asyncio.Queue instances.
Mock vad, stt, llm, and tts coroutines that produce and consume frames with realistic per-stage latency (e.g., 50 ms VAD chunks, 100 ms STT chunks, 200 ms LLM time to first token, 150 ms TTS time to first audio).
End-of-turn detection: a hybrid of a silence threshold and a stub semantic check.
Barge-in: when VAD detects user speech while TTS is producing audio, drop the TTS buffer and cancel the in-flight LLM task.
Per-turn metrics: t0..t5 (see assignment.md for Module 18), logged to a JSON list and aggregated across turns.
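The end-of-turn and barge-in tasks are mostly timing logic, so a small sketch may help before wiring them into the pipeline. Everything here is an assumption for illustration: the thresholds `SILENCE_MS` and `EXTENDED_MS`, the filler-word list in `looks_complete`, and the `mock_llm`, `barge_in`, and `demo` helpers. The semantic check is deliberately a stub, and the hard cap is one possible way to keep the agent from waiting forever.

```python
import asyncio

SILENCE_MS = 700     # hypothetical base silence threshold
EXTENDED_MS = 1500   # hypothetical hard cap when the utterance looks unfinished
FILLERS = {"and", "but", "so", "um", "uh", "because"}


def looks_complete(transcript: str) -> bool:
    """Stub semantic check: a trailing filler or conjunction means 'still talking'."""
    words = transcript.lower().rstrip(".?! ").split()
    return bool(words) and words[-1] not in FILLERS


def end_of_turn(silence_ms: int, transcript: str) -> bool:
    """Hybrid rule: a short silence ends the turn only if the text looks complete."""
    if silence_ms >= EXTENDED_MS:
        return True  # hard cap so the agent never waits forever
    return silence_ms >= SILENCE_MS and looks_complete(transcript)


async def mock_llm(tts_queue: asyncio.Queue) -> None:
    """Mock LLM that streams tokens into the TTS input queue."""
    for i in range(100):
        await asyncio.sleep(0.005)
        await tts_queue.put(f"tok{i}")


async def barge_in(llm_task: asyncio.Task, tts_queue: asyncio.Queue) -> int:
    """Cancel the in-flight LLM task and drop buffered TTS input.

    Returns the number of dropped items so it can be logged in the turn metrics.
    """
    llm_task.cancel()
    try:
        await llm_task
    except asyncio.CancelledError:
        pass  # expected: the task was cancelled on purpose
    dropped = 0
    while not tts_queue.empty():
        tts_queue.get_nowait()
        dropped += 1
    return dropped


async def demo():
    tts_queue: asyncio.Queue = asyncio.Queue()
    llm_task = asyncio.create_task(mock_llm(tts_queue))
    await asyncio.sleep(0.03)  # the user starts speaking mid-response
    dropped = await barge_in(llm_task, tts_queue)
    return llm_task.cancelled(), dropped


if __name__ == "__main__":
    print(end_of_turn(1000, "I want to book a flight and"))  # pause mid-utterance
    print(asyncio.run(demo()))
```

With these assumed thresholds, a 1000 ms pause after "I want to book a flight and" does not end the turn, while the same pause after a complete sentence does. In the full pipeline, the VAD stage would call something like `barge_in` when it sees speech while TTS is active.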

Done when

A simulated 5-turn conversation runs end-to-end
One turn includes a barge-in scenario; the metrics show the TTS was cut off
One turn includes a slow speaker (long pause mid-utterance) that does not trigger an early end-of-turn
The latency report prints stage-level and end-to-end p50/p95
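One way to produce the p50/p95 report is sketched below. The exact meaning of t0..t5 is defined in assignment.md and is not reproduced here, so this sketch only assumes that each turn is a dict of six increasing timestamps in milliseconds and treats consecutive differences as stage latencies. The `percentile` and `latency_report` helpers and the nearest-rank method are illustrative choices, not requirements of the exercise.

```python
import json
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    return {"p50": percentile(samples, 50), "p95": percentile(samples, 95)}


def latency_report(turns):
    """Aggregate stage-level and end-to-end latency across turns.

    Assumes every turn has keys t0..t5 (ms). Turns cut short by barge-in may
    lack later timestamps and would need to be filtered or handled separately.
    """
    report = {}
    for i in range(5):
        deltas = [turn[f"t{i + 1}"] - turn[f"t{i}"] for turn in turns]
        report[f"t{i}_t{i + 1}"] = summarize(deltas)
    report["end_to_end"] = summarize([turn["t5"] - turn["t0"] for turn in turns])
    return report


if __name__ == "__main__":
    # Made-up timestamps purely to exercise the report.
    turns = [
        {"t0": 0, "t1": 50, "t2": 160, "t3": 370, "t4": 520, "t5": 900},
        {"t0": 0, "t1": 55, "t2": 150, "t3": 340, "t4": 500, "t5": 870},
        {"t0": 0, "t1": 48, "t2": 170, "t3": 400, "t4": 560, "t5": 990},
    ]
    print(json.dumps(latency_report(turns), indent=2))
```

With only five turns, nearest-rank p95 is simply the slowest turn, so the percentile figures become meaningful only once the scenario is replayed many times.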

Stretch

Replace one mock with a real provider (Deepgram or Cartesia) and re-run
Add OpenTelemetry spans per stage
Add a "test scenario" runner that replays scripted user utterances