Voice Agent Loop — Analysis
The pipeline
mic_q ──→ VAD ──→ STT ──→ (audio out)
           ↓
           └─→ end_of_turn ──→ LLM ──→ TTS ──→ audio_out_q
                                ↑       ↑
                                └── cancel_event (barge-in)
Five stages running concurrently as asyncio tasks, communicating via asyncio.Queue:
- VAD. Receives raw audio frames; classifies as `USER_AUDIO` (speech) or `USER_SILENCE`; forwards to STT and end-of-turn.
- STT. Consumes audio frames; emits `USER_PARTIAL` (streaming partials).
- End-of-turn. Hybrid silence-threshold + buffered-text detector; emits `USER_FINAL` when both signals agree.
- LLM. Consumes `USER_FINAL`; streams `AGENT_TOKEN` frames. Cancellable via `cancel_event`.
- TTS. Consumes `AGENT_TOKEN`; emits `AGENT_AUDIO` frames. Cancellable.
The pipeline is the structural answer to the "voice loop" question. Each stage has a clear input, output, and cancellation behaviour.
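As a rough illustration of the queue-connected stage pattern, here is a minimal sketch of the first two stages only. It is not the original file: the `Frame` dataclass, the `END` marker, the `run_front_half` helper, and the use of text chunks in place of audio are all assumptions made for the example.

```python
import asyncio
from dataclasses import dataclass

END = None  # hypothetical end-of-stream marker, not from the original file


@dataclass
class Frame:
    kind: str          # "USER_AUDIO", "USER_SILENCE", "USER_PARTIAL", ...
    payload: str = ""


async def vad_stage(mic_q, stt_q, eot_q):
    """Classify each raw chunk and fan it out to STT and end-of-turn."""
    while True:
        raw = await mic_q.get()
        if raw is END:
            await stt_q.put(END)
            await eot_q.put(END)
            return
        # Toy classifier (assumption): an empty chunk counts as silence.
        frame = Frame("USER_AUDIO" if raw else "USER_SILENCE", raw)
        await stt_q.put(frame)
        await eot_q.put(frame)


async def stt_stage(stt_q, partial_q):
    """Emit a growing transcript as USER_PARTIAL frames."""
    words = []
    while True:
        frame = await stt_q.get()
        if frame is END:
            await partial_q.put(END)
            return
        if frame.kind == "USER_AUDIO":
            words.append(frame.payload)
            await partial_q.put(Frame("USER_PARTIAL", " ".join(words)))


async def run_front_half(raw_chunks):
    """Run VAD and STT concurrently over scripted chunks; return the partials."""
    mic_q, stt_q, eot_q, partial_q = (asyncio.Queue() for _ in range(4))
    for chunk in raw_chunks:
        mic_q.put_nowait(chunk)
    mic_q.put_nowait(END)
    await asyncio.gather(vad_stage(mic_q, stt_q, eot_q), stt_stage(stt_q, partial_q))
    partials = []
    while (frame := partial_q.get_nowait()) is not END:
        partials.append(frame.payload)
    return partials
```

Calling `asyncio.run(run_front_half(["what", "time", "", "is", "it", ""]))` yields one partial per speech chunk. The remaining stages would follow the same shape: read from one queue, write to the next, and forward the end marker.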
The six timing points
| Marker | What it records |
|---|---|
| t0 | end-of-user-speech detected (silence threshold crossed) |
| t1 | STT finalised the transcript |
| t2 | LLM produced first token |
| t3 | LLM produced last token |
| t4 | TTS produced first audio frame |
| t5 | TTS finished outputting audio |
Derived metrics:
- STT latency = t1 − t0
- Time to first token (TTFT) = t2 − t1
- First-audio latency = t4 − t1 (user-perceived "how long until the agent starts speaking", measured from the finalised transcript; t4 − t0 additionally includes STT latency)
- End-to-end = t5 − t0 ("total time from when I stopped talking to when the agent stopped talking")
Production voice agents publish these as histograms; SLOs are typically set on first-audio latency (the user-perceived metric).
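The derived metrics can be computed directly from the six markers. The sketch below assumes timestamps are recorded in milliseconds on one shared clock; the `TurnTimings` class and its `derived` method are hypothetical names for illustration, not taken from the original file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TurnTimings:
    """The six timing points for one turn, in milliseconds (assumed unit)."""
    t0: float  # end-of-user-speech detected
    t1: float  # STT finalised the transcript
    t2: float  # LLM produced first token
    t3: float  # LLM produced last token
    t4: float  # TTS produced first audio frame
    t5: float  # TTS finished outputting audio

    def derived(self) -> dict:
        """Derived metrics, using the definitions listed above."""
        return {
            "stt": self.t1 - self.t0,
            "ttft": self.t2 - self.t1,
            "first_audio": self.t4 - self.t1,
            "end_to_end": self.t5 - self.t0,
        }
```

Note that t4 can precede t3: TTS starts as soon as the first tokens arrive, so it does not wait for the LLM to finish.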
End-of-turn detection
The trickiest part of voice agents. Too eager → cut user off mid-sentence (a "premature finalisation"). Too patient → awkward silence before the agent responds.
This implementation uses a silence threshold (300 ms, accelerated to 30 ms in the test). Production systems combine:
- Silence threshold. Hard signal; user stopped making sound.
- Semantic completeness. A small classifier or lightweight LLM asks "is this utterance complete?"
- Energy threshold. Audio energy below ambient noise.
Hybrid systems require BOTH silence AND semantic confidence to fire USER_FINAL. Naive silence-only detection is a common source of voice-agent UX complaints.
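A hybrid detector of this kind might look like the sketch below. Everything here is an assumption for illustration: the `EndOfTurnDetector` class, the `on_frame` signature, and especially `looks_complete`, which is a toy trailing-word heuristic standing in for a real classifier or lightweight LLM.

```python
# Words that suggest the user is mid-sentence (toy list, purely illustrative).
TRAILING_WORDS = {"and", "but", "or", "the", "a", "to", "so", "because"}


def looks_complete(text: str) -> bool:
    """Toy stand-in for a semantic-completeness model."""
    words = text.strip().lower().split()
    return bool(words) and words[-1] not in TRAILING_WORDS


class EndOfTurnDetector:
    """Fires only when silence is long enough AND the text looks complete."""

    def __init__(self, silence_threshold_ms=300, is_complete=looks_complete):
        self.silence_threshold_ms = silence_threshold_ms
        self.is_complete = is_complete
        self._silence_ms = 0

    def on_frame(self, kind: str, frame_ms: int, transcript: str) -> bool:
        """Feed one VAD frame; return True when USER_FINAL should be emitted."""
        if kind == "USER_AUDIO":
            self._silence_ms = 0          # any speech resets the silence timer
            return False
        self._silence_ms += frame_ms      # accumulate consecutive silence
        silent_enough = self._silence_ms >= self.silence_threshold_ms
        return silent_enough and self.is_complete(transcript)
```

With a 300 ms threshold and 50 ms frames, "what time is it" fires on the sixth silent frame, while "tell me the weather and" never fires however long the pause. A production system would likely add a maximum-wait fallback so an incomplete-looking utterance cannot stall the turn forever.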
Barge-in handling
The agent is speaking; the user starts talking. The agent should stop, listen, and respond to the new input.
This implementation uses a shared cancel_event:
- A `barge_in_monitor` (sketched) watches for `USER_AUDIO` arriving while `AGENT_AUDIO` is flowing.
- On detection, `cancel_event.set()`.
- LLM and TTS check the event between tokens; abort if set.
In production, this needs:
- Acoustic echo cancellation so the agent's own output doesn't trigger barge-in.
- A latency budget — too-eager barge-in detects the user's "uh huh" as interruption.
- Half-duplex vs. full-duplex mode (full duplex is harder).
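The shared-event mechanism can be sketched with `asyncio.Event`. This is a simplified illustration rather than the original code: the `user_audio` and `agent_speaking` events, the `run_turn` helper, and the timer that simulates the user interrupting are all assumptions, and there is no echo cancellation or debounce.

```python
import asyncio


async def llm_stage(reply, token_q, cancel_event, token_delay_s=0.01):
    """Mock LLM: one token per word, checking cancel_event between tokens."""
    for token in reply.split():
        await asyncio.sleep(token_delay_s)   # mock per-token generation time
        if cancel_event.is_set():
            break
        await token_q.put(token)
    await token_q.put(None)                  # end-of-stream marker


async def tts_stage(token_q, spoken, cancel_event):
    """Mock TTS: 'speaks' tokens until the stream ends or the turn is cancelled."""
    while True:
        token = await token_q.get()
        if token is None or cancel_event.is_set():
            return
        spoken.append(token)


async def barge_in_monitor(user_audio, agent_speaking, cancel_event):
    """Set cancel_event if user audio arrives while the agent is speaking."""
    await user_audio.wait()
    if agent_speaking.is_set():
        cancel_event.set()


async def run_turn(reply, user_interrupts_after_s=None):
    cancel_event, user_audio, agent_speaking = (asyncio.Event() for _ in range(3))
    token_q, spoken = asyncio.Queue(), []
    agent_speaking.set()
    monitor = asyncio.create_task(
        barge_in_monitor(user_audio, agent_speaking, cancel_event))
    if user_interrupts_after_s is not None:
        # Simulated user speech: fire the event after a fixed delay.
        asyncio.get_running_loop().call_later(user_interrupts_after_s, user_audio.set)
    await asyncio.gather(llm_stage(reply, token_q, cancel_event),
                         tts_stage(token_q, spoken, cancel_event))
    agent_speaking.clear()
    monitor.cancel()
    await asyncio.gather(monitor, return_exceptions=True)
    return spoken, cancel_event.is_set()
```

Running `asyncio.run(run_turn("one two three four five", user_interrupts_after_s=0.025))` returns only a prefix of the reply along with `True` for the cancelled flag. Checking the event between tokens keeps cancellation cooperative, at the cost of up to one token of extra latency before the agent stops.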
What the mock latencies represent
The constants in the file are roughly 10× accelerated from realistic production values:
| Constant | Code value (ms) | Real value (ms) |
|---|---|---|
| `VAD_CHUNK_MS` | 5 | ~50 |
| `STT_CHUNK_MS` | 10 | ~100 |
| `LLM_TTFT_MS` | 20 | ~200-500 |
| `LLM_TPS_MS` | 3 | ~30 (depending on model and TPS) |
| `TTS_FIRST_AUDIO_MS` | 15 | ~150 (Cartesia, Deepgram) |
Acceleration keeps the test suite under 3 seconds. To run with realistic values, change the constants.
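One way to avoid editing each constant by hand is to derive the test values from the realistic ones with a single speed-up factor. The sketch below is an assumption about how that could be arranged, not how the file is organised; `REAL_MS` and `scaled` are hypothetical names, and 200 ms is taken as the low end of the TTFT range.

```python
# Realistic values in milliseconds, from the "Real value" column above.
# LLM_TTFT_MS uses the low end of the ~200-500 range (assumption).
REAL_MS = {
    "VAD_CHUNK_MS": 50,
    "STT_CHUNK_MS": 100,
    "LLM_TTFT_MS": 200,
    "LLM_TPS_MS": 30,
    "TTS_FIRST_AUDIO_MS": 150,
}


def scaled(real_ms: dict, speedup: float) -> dict:
    """Divide every latency by `speedup`; speedup=1 keeps realistic timing."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return {name: value / speedup for name, value in real_ms.items()}
```

`scaled(REAL_MS, 10)` reproduces the code values in the table, and `scaled(REAL_MS, 1)` gives a realistic-timing run.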
Run output
Turn metrics:
turn 1: user='what time is it' agent='You said: what time is it. Here is a longer reply.' e2e_ms=...
turn 2: user='and the weather please' agent='' e2e_ms=...
turn 3: user='actually nevermind' agent='You said: actually nevermind. Here is a longer rep...' e2e_ms=...
Summary: {'turns': 3, 'bargein_turns': 0, 'stt': {...}, 'ttft': {...}, 'first_audio': {...}, 'end_to_end': {...}}
The summary shows the timing distribution (p50 and p95) for each stage. Production teams use these to set SLOs and to investigate per-stage regressions.
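A summary of this shape could be produced along the following lines. The run output elides the contents of each per-stage dictionary, so the `p50`/`p95` keys, the nearest-rank percentile method, and the `summarise` helper are all assumptions made for this sketch.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(turns):
    """turns: list of per-turn metric dicts in ms, keyed by stage name."""
    summary = {"turns": len(turns)}
    for metric in ("stt", "ttft", "first_audio", "end_to_end"):
        samples = [turn[metric] for turn in turns]
        summary[metric] = {
            "p50": percentile(samples, 50),
            "p95": percentile(samples, 95),
        }
    return summary
```

With only three turns, p95 is simply the maximum; percentiles become meaningful once many turns are aggregated, which is why production systems publish histograms rather than per-run summaries.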
What this implementation deliberately skips
- Real providers. Mocks throughout. Real integration would replace VAD with Silero or webrtcvad; STT with Deepgram or Whisper streaming; TTS with Cartesia or ElevenLabs.
- Acoustic echo cancellation. Required for full-duplex.
- Audio I/O. No actual microphone or speaker; the scenarios are scripted frames.
- Network resilience. Real providers are over WebSockets; retries, reconnects, partial-message handling all matter.
- Multi-language. The mock is English-only.
Interview probes
- "Walk through the stages of a voice agent and what each does."
- "How does end-of-turn detection work, and why is it hard?"
- "What is barge-in, and how do you handle it without false positives?"
- "What latency metrics matter for voice UX?"
- "How would you reduce time-to-first-audio?"
- "How does the pipeline structure differ from a chat agent?"
Each has a paragraph answer rooted in the structure of this implementation.