Exercise 13: Voice Agent Loop
Timebox: 60-90 minutes
Goal
Implement the core control loop of a voice agent: VAD → streaming STT → LLM → streaming TTS, with end-of-turn detection and barge-in. Mock the providers: the point of the exercise is the pipeline structure and the timing logic, not the provider integrations.
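As a starting point, the four-stage structure can be sketched with one generic mock stage wired together by asyncio.Queue objects. This is a minimal sketch, not a reference solution: the stage function, the None sentinel, and the string "frames" are all hypothetical, and each stage simply sleeps for its latency instead of streaming partial results.

```python
import asyncio

SENTINEL = None  # hypothetical end-of-stream marker


async def stage(name, delay_s, inbox, outbox):
    """Generic mock stage: wait for an item, simulate latency, pass it on."""
    while True:
        item = await inbox.get()
        if item is SENTINEL:
            await outbox.put(SENTINEL)  # propagate shutdown downstream
            return
        await asyncio.sleep(delay_s)    # stand-in for real provider latency
        await outbox.put(f"{name}({item})")


async def run_pipeline(frames, delays=(0.05, 0.1, 0.2, 0.15)):
    """Run vad -> stt -> llm -> tts concurrently, one queue between each pair."""
    names = ("vad", "stt", "llm", "tts")
    queues = [asyncio.Queue() for _ in range(len(names) + 1)]
    tasks = [
        asyncio.create_task(stage(n, d, queues[i], queues[i + 1]))
        for i, (n, d) in enumerate(zip(names, delays))
    ]
    for frame in frames:
        await queues[0].put(frame)
    await queues[0].put(SENTINEL)

    out = []
    while (item := await queues[-1].get()) is not SENTINEL:
        out.append(item)
    await asyncio.gather(*tasks)
    return out
```

Calling asyncio.run(run_pipeline(["frame-0", "frame-1"])) pushes two frames through all four stages. Because the stages run as separate tasks, the second frame is already in VAD while the first is in STT, which is the overlap the exercise asks for.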
Work in
voice_agent.py
Tasks
- Async pipeline with four stages running concurrently (asyncio.Queue between them).
- Mock vad, stt, llm, tts coroutines that produce/consume frames with realistic per-stage latency (e.g., 50 ms VAD chunks, 100 ms STT chunks, 200 ms LLM TTFT, 150 ms TTS first audio).
- End-of-turn detection: hybrid silence threshold + a stub semantic check.
- Barge-in: when VAD detects user speech while TTS is producing audio, drop the TTS buffer and cancel the in-flight LLM task.
- Per-turn metrics: t0..t5 (see assignment.md for Module 18) logged to a JSON list and aggregated.
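The barge-in task is mostly about task cancellation and queue hygiene, so a small sketch may help. Everything here is an assumption made for illustration: the LLM and TTS are folded into one hypothetical producer (mock_llm_tts), playback is a second mock, and a plain sleep stands in for "VAD detected user speech". The returned field names are also made up, not the t0..t5 metrics from assignment.md.

```python
import asyncio


async def mock_llm_tts(tts_q, n_chunks, ttft_s, gap_s):
    """Hypothetical combined LLM+TTS producer: emits audio chunks into tts_q."""
    await asyncio.sleep(ttft_s)          # time to first audio
    for i in range(n_chunks):
        await tts_q.put(f"audio-{i}")
        await asyncio.sleep(gap_s)


async def mock_playback(tts_q, played, chunk_s):
    """Mock speaker: 'plays' one chunk at a time, recording what finished."""
    while True:
        chunk = await tts_q.get()
        await asyncio.sleep(chunk_s)
        played.append(chunk)


def drop_buffer(q):
    """Discard everything still queued; return how many items were dropped."""
    dropped = 0
    while not q.empty():
        q.get_nowait()
        dropped += 1
    return dropped


async def turn_with_barge_in(barge_in_after_s, n_chunks=20,
                             ttft_s=0.2, gap_s=0.05, chunk_s=0.15):
    tts_q = asyncio.Queue()
    played = []
    llm_task = asyncio.create_task(mock_llm_tts(tts_q, n_chunks, ttft_s, gap_s))
    play_task = asyncio.create_task(mock_playback(tts_q, played, chunk_s))

    await asyncio.sleep(barge_in_after_s)  # stand-in for VAD seeing user speech

    # Barge-in: stop generation and playback, then drop buffered audio.
    llm_task.cancel()
    play_task.cancel()
    await asyncio.gather(llm_task, play_task, return_exceptions=True)
    dropped = drop_buffer(tts_q)

    return {
        "chunks_total": n_chunks,
        "chunks_played": len(played),
        "chunks_dropped": dropped,
        "llm_cancelled": llm_task.cancelled(),
        "tts_cut_off": len(played) < n_chunks,
    }
```

Awaiting the cancelled tasks with return_exceptions=True before draining matters: it guarantees the producer cannot enqueue another chunk after the buffer has been emptied.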
Done when
- A simulated 5-turn conversation runs end-to-end
- One turn includes a barge-in scenario; the metrics show the TTS was cut off
- One turn includes a slow speaker (long pause mid-utterance) that doesn't trigger early end-of-turn
- The latency report prints stage-level and end-to-end p50/p95
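For the latency report, one possible shape is sketched below. The meaning of each timestamp is defined in assignment.md and is not reproduced here; this sketch only assumes that every turn record carries numeric t0..t5 values in increasing order, treats each consecutive pair as a "stage", and treats t0 → t5 as end-to-end. The nearest-rank percentile is one of several valid definitions and was picked because it behaves sensibly with only five turns.

```python
import math

KEYS = ["t0", "t1", "t2", "t3", "t4", "t5"]


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_report(turns):
    """Aggregate per-turn timestamps into p50/p95 per span.

    turns: list of dicts with keys t0..t5 (assumed layout, same unit throughout).
    """
    spans = {f"{a}->{b}": [] for a, b in zip(KEYS, KEYS[1:])}
    spans["t0->t5"] = []  # end-to-end
    for turn in turns:
        for a, b in zip(KEYS, KEYS[1:]):
            spans[f"{a}->{b}"].append(turn[b] - turn[a])
        spans["t0->t5"].append(turn["t5"] - turn["t0"])
    return {
        name: {"p50": percentile(vals, 50), "p95": percentile(vals, 95)}
        for name, vals in spans.items()
    }
```

A barge-in turn may not have every timestamp (for example, TTS never finishes), so a real report would need to decide whether to skip such turns or record partial spans; this sketch assumes complete records.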
Stretch
- Replace one mock with a real provider (Deepgram or Cartesia) and re-run
- Add OpenTelemetry spans per stage
- Add a "test scenario" runner that replays scripted user utterances
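A scenario runner can start very small: replay a scripted sequence of speech/silence frames through the end-of-turn logic and check where the turn closes. The sketch below is one way to do that under stated assumptions. The 300 ms and 1200 ms thresholds, the punctuation-based semantic stub, and the (is_speech, transcript_so_far) frame layout are all hypothetical choices for illustration; only the 50 ms frame size comes from the task list.

```python
FRAME_MS = 50  # VAD chunk size suggested in the task list


def semantic_complete(partial_text):
    """Stub semantic check (assumption): text ending in . ? or ! looks finished."""
    return partial_text.rstrip().endswith((".", "?", "!"))


def end_of_turn_frame(frames, short_ms=300, long_ms=1200):
    """Return the index of the frame where the turn ends, or None.

    Hybrid rule: end after short_ms of silence if the transcript looks
    complete, otherwise only after long_ms of silence.
    """
    silence_ms = 0
    heard_speech = False
    for i, (is_speech, text) in enumerate(frames):
        if is_speech:
            heard_speech = True
            silence_ms = 0
            continue
        if not heard_speech:
            continue  # leading silence never ends a turn
        silence_ms += FRAME_MS
        if silence_ms >= long_ms:
            return i
        if silence_ms >= short_ms and semantic_complete(text):
            return i
    return None


def script(*segments):
    """Build frames from (is_speech, n_frames, transcript_so_far) segments."""
    frames = []
    for is_speech, n_frames, text in segments:
        frames.extend([(is_speech, text)] * n_frames)
    return frames


SCENARIOS = {
    # Finished question followed by silence: ends after 300 ms.
    "quick": script((True, 10, "What time is it"),
                    (False, 10, "What time is it?")),
    # Slow speaker: an 800 ms mid-utterance pause must not end the turn.
    "slow": script((True, 10, "I would like to"),
                   (False, 16, "I would like to"),
                   (True, 10, "I would like to book a flight."),
                   (False, 10, "I would like to book a flight.")),
}
```

Running end_of_turn_frame over each entry in SCENARIOS gives a frame index that can be turned into a timestamp by multiplying by FRAME_MS. The slow-speaker script is the interesting one, since a pure 300 ms silence threshold would cut it off during the pause.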