Voice Agent Loop — Analysis

The pipeline

mic_q  ──→ VAD  ──→  STT       ──→  (audio out)
              └─→  end_of_turn  ──→  LLM  ──→  TTS  ──→ audio_out_q
                                      ↑          ↑
                                      └── cancel_event (barge-in)

Five stages run concurrently as asyncio tasks and communicate via asyncio.Queue:

  1. VAD. Receives raw audio frames; classifies each as USER_AUDIO (speech) or USER_SILENCE; forwards them to STT and end-of-turn.
  2. STT. Consumes audio frames; emits USER_PARTIAL (streaming partials).
  3. End-of-turn. Hybrid silence-threshold + buffered-text detector; emits USER_FINAL when both signals agree.
  4. LLM. Consumes USER_FINAL; streams AGENT_TOKEN frames. Cancellable via cancel_event.
  5. TTS. Consumes AGENT_TOKEN; emits AGENT_AUDIO frames. Cancellable.

The pipeline is the structural answer to the "voice loop" question. Each stage has a clear input, output, and cancellation behaviour.
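
The queue wiring can be sketched as follows. This is a minimal illustration, not the file's actual code: it shows only the VAD and STT stages, and the Frame class, the STOP sentinel, and the function names are assumptions made for the example. An empty payload stands in for silence.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Frame:
    kind: str          # e.g. "RAW", "USER_AUDIO", "USER_SILENCE", "USER_PARTIAL"
    payload: str = ""  # mock audio: a word for speech, "" for silence


STOP = Frame("STOP")   # hypothetical sentinel that shuts each stage down


async def vad_stage(mic_q: asyncio.Queue, stt_q: asyncio.Queue) -> None:
    """Label each raw frame as speech or silence and forward it."""
    while True:
        frame = await mic_q.get()
        if frame is STOP:
            await stt_q.put(STOP)  # propagate shutdown downstream
            return
        kind = "USER_AUDIO" if frame.payload else "USER_SILENCE"
        await stt_q.put(Frame(kind, frame.payload))


async def stt_stage(stt_q: asyncio.Queue, out_q: asyncio.Queue) -> None:
    """Accumulate speech frames and emit a growing USER_PARTIAL transcript."""
    words: list[str] = []
    while True:
        frame = await stt_q.get()
        if frame is STOP:
            await out_q.put(STOP)
            return
        if frame.kind == "USER_AUDIO":
            words.append(frame.payload)
            await out_q.put(Frame("USER_PARTIAL", " ".join(words)))


async def run_demo(raw_words: list[str]) -> list[str]:
    """Run both stages as tasks and return the partial transcripts in order."""
    mic_q, stt_q, out_q = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    tasks = [
        asyncio.create_task(vad_stage(mic_q, stt_q)),
        asyncio.create_task(stt_stage(stt_q, out_q)),
    ]
    for word in raw_words:
        await mic_q.put(Frame("RAW", word))
    await mic_q.put(STOP)
    await asyncio.gather(*tasks)

    partials = []
    while not out_q.empty():
        frame = out_q.get_nowait()
        if frame is not STOP:
            partials.append(frame.payload)
    return partials
```

Calling asyncio.run(run_demo(["what", "", "time"])) yields the partials "what" and "what time"; the silent frame is classified by VAD and ignored by STT. The remaining stages would follow the same get/process/put shape.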

The six timing points

Marker  What it records
t0      end-of-user-speech detected (silence threshold crossed)
t1      STT finalised the transcript
t2      LLM produced first token
t3      LLM produced last token
t4      TTS produced first audio frame
t5      TTS finished outputting audio

Derived metrics:

  • STT latency = t1 − t0
  • Time to first token (TTFT) = t2 − t1
  • First-audio latency = t4 − t1 (user-perceived "how long until the agent starts speaking"; add STT latency to get the full wait from t0)
  • End-to-end = t5 − t0 ("total time from when I stopped talking to when the agent stopped talking")
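
As a worked example, the four formulas above can be computed from one turn's timestamps. The TurnTimings class and derive_metrics function are hypothetical names for this sketch; only the t0..t5 markers and the formulas come from the text.

```python
from dataclasses import dataclass


@dataclass
class TurnTimings:
    """Timestamps for one turn, all in milliseconds on the same clock."""
    t0: int  # end-of-user-speech detected
    t1: int  # STT finalised the transcript
    t2: int  # LLM produced first token
    t3: int  # LLM produced last token
    t4: int  # TTS produced first audio frame
    t5: int  # TTS finished outputting audio


def derive_metrics(t: TurnTimings) -> dict[str, int]:
    """Apply the derived-metric formulas to a single turn."""
    return {
        "stt_ms": t.t1 - t.t0,
        "ttft_ms": t.t2 - t.t1,
        "first_audio_ms": t.t4 - t.t1,
        "end_to_end_ms": t.t5 - t.t0,
    }


# Illustrative numbers only, not measured output.
example = TurnTimings(t0=0, t1=100, t2=350, t3=900, t4=500, t5=1400)
metrics = derive_metrics(example)
```

In this example t4 precedes t3: because TTS consumes tokens as they stream, the first audio frame can be produced before the LLM emits its last token.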

Production voice agents publish these as histograms; SLOs are typically set on first-audio latency (the user-perceived metric).

End-of-turn detection

The trickiest part of voice agents. Too eager → cut user off mid-sentence (a "premature finalisation"). Too patient → awkward silence before the agent responds.

This implementation uses a silence threshold (300 ms, accelerated to 30 ms in the test). Production systems combine:

  • Silence threshold. Hard signal; user stopped making sound.
  • Semantic completeness. A small classifier or lightweight LLM asks "is this utterance complete?"
  • Energy threshold. Audio energy below ambient noise.

Hybrid systems require BOTH silence AND semantic confidence before firing USER_FINAL. Naive silence-only detection is a common source of voice-agent UX complaints, because any mid-sentence pause longer than the threshold cuts the user off.
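
A minimal sketch of the hybrid rule, assuming a trivial word-list heuristic in place of a real semantic classifier. The EndOfTurnDetector class, its method names, and looks_complete are hypothetical; only the 300 ms threshold and the "both signals must agree" rule come from the text.

```python
from dataclasses import dataclass

SILENCE_THRESHOLD_MS = 300  # from the text; the test suite uses 30

# Assumption: a trailing conjunction or filler suggests the user is not done.
TRAILING_WORDS = {"and", "but", "or", "so", "because", "the", "a", "to", "um", "uh"}


def looks_complete(text: str) -> bool:
    """Hypothetical stand-in for a semantic-completeness classifier."""
    words = text.lower().rstrip(".?!").split()
    return bool(words) and words[-1] not in TRAILING_WORDS


@dataclass
class EndOfTurnDetector:
    silence_threshold_ms: int = SILENCE_THRESHOLD_MS
    silence_ms: int = 0  # length of the current run of silence
    text: str = ""       # latest buffered partial transcript

    def on_partial(self, text: str) -> None:
        """Record the most recent USER_PARTIAL text."""
        self.text = text

    def on_audio(self, is_speech: bool, chunk_ms: int) -> bool:
        """Return True when USER_FINAL should fire for this frame."""
        if is_speech:
            self.silence_ms = 0  # any speech resets the silence run
            return False
        self.silence_ms += chunk_ms
        silent_enough = self.silence_ms >= self.silence_threshold_ms
        return silent_enough and looks_complete(self.text)
```

With the buffered text "what time is it", six consecutive 50 ms silent frames fire on the sixth frame. With "what time is it and", the detector keeps waiting however long the silence lasts, so a production version would presumably add a maximum-wait timeout.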

Barge-in handling

The agent is speaking; the user starts talking. The agent should stop, listen, and respond to the new input.

This implementation uses a shared cancel_event:

  • A barge_in_monitor (sketched) watches for USER_AUDIO arriving while AGENT_AUDIO is flowing.
  • On detection, cancel_event.set().
  • LLM and TTS check the event between tokens; abort if set.
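
The three steps above can be sketched with asyncio.Event. This is an illustration under stated assumptions, not the file's code: the agent_speaking event, the string frame kinds on user_q, and run_demo are hypothetical, and only the LLM side is shown (TTS would check the same event between audio frames).

```python
import asyncio


async def llm_stage(tokens, out_q: asyncio.Queue, cancel_event: asyncio.Event):
    """Stream tokens, checking cancel_event between tokens."""
    emitted = []
    for tok in tokens:
        if cancel_event.is_set():
            break  # barge-in: abandon the rest of the reply
        await out_q.put(tok)
        emitted.append(tok)
        await asyncio.sleep(0)  # yield to the loop; stands in for per-token latency
    return emitted


async def barge_in_monitor(user_q, agent_speaking, cancel_event):
    """Set cancel_event when USER_AUDIO arrives while the agent is speaking."""
    while True:
        kind = await user_q.get()
        if kind == "USER_AUDIO" and agent_speaking.is_set():
            cancel_event.set()
            return


async def run_demo(total_tokens: int, interrupt_after):
    """Stream a reply; optionally inject user speech after N tokens were heard."""
    cancel_event, agent_speaking = asyncio.Event(), asyncio.Event()
    user_q, out_q = asyncio.Queue(), asyncio.Queue()
    monitor = asyncio.create_task(
        barge_in_monitor(user_q, agent_speaking, cancel_event)
    )
    agent_speaking.set()
    tokens = [f"tok{i}" for i in range(total_tokens)]
    llm = asyncio.create_task(llm_stage(tokens, out_q, cancel_event))

    if interrupt_after is not None:
        for _ in range(interrupt_after):
            await out_q.get()            # the "user" hears N tokens...
        user_q.put_nowait("USER_AUDIO")  # ...then starts talking

    emitted = await llm
    monitor.cancel()  # no-op if the monitor already returned
    return emitted, cancel_event.is_set()
```

Because the check happens only between tokens, a few tokens may still be emitted after the user starts speaking; the cancellation latency is bounded by one token interval plus scheduling delay. A real pipeline would also clear the event before the next turn.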

In production, this needs:

  • Acoustic echo cancellation so the agent's own output doesn't trigger barge-in.
  • A latency budget for detection: overly eager barge-in treats the user's "uh huh" as an interruption.
  • Half-duplex vs. full-duplex mode (full duplex is harder).

What the mock latencies represent

The constants in the file are 10× accelerated from realistic production values:

Constant            Code value (ms)  Real value (ms)
VAD_CHUNK_MS        5                ~50
STT_CHUNK_MS        10               ~100
LLM_TTFT_MS         20               ~200-500
LLM_TPS_MS          3                ~30 (depending on model and TPS)
TTS_FIRST_AUDIO_MS  15               ~150 (Cartesia, Deepgram)

Acceleration keeps the test suite under 3 seconds. To run with realistic values, change the constants.
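
One way to make the switch a single knob is to derive the code values from the realistic ones. The SPEEDUP constant, the REAL_MS dict, and the scaled helper are hypothetical; the file itself hard-codes the accelerated values listed in the table.

```python
SPEEDUP = 10  # hypothetical knob: 10 reproduces the test values, 1 gives realistic timing

# Realistic values from the table, in milliseconds.
REAL_MS = {
    "VAD_CHUNK_MS": 50,
    "STT_CHUNK_MS": 100,
    "LLM_TTFT_MS": 200,  # low end of the ~200-500 range
    "LLM_TPS_MS": 30,
    "TTS_FIRST_AUDIO_MS": 150,
}


def scaled(real_ms: dict[str, float], speedup: float = SPEEDUP) -> dict[str, float]:
    """Divide every latency by the speed-up factor."""
    return {name: value / speedup for name, value in real_ms.items()}


constants = scaled(REAL_MS)
```

With SPEEDUP = 10 this reproduces the code values in the table (5, 10, 20, 3, 15); with SPEEDUP = 1 the mock runs at approximately production timing, at the cost of a slower test suite.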

Run output

Turn metrics:
  turn 1: user='what time is it' agent='You said: what time is it. Here is a longer reply.' e2e_ms=...
  turn 2: user='and the weather please' agent='' e2e_ms=...
  turn 3: user='actually nevermind' agent='You said: actually nevermind. Here is a longer rep...' e2e_ms=...

Summary: {'turns': 3, 'bargein_turns': 0, 'stt': {...}, 'ttft': {...}, 'first_audio': {...}, 'end_to_end': {...}}

The summary shows the timing distribution as p50 and p95 per stage. Production teams use these to set SLOs and to investigate regressions stage by stage.
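
A per-stage p50/p95 can be computed with a nearest-rank percentile. The sketch below is an assumption about how such a summary might be built, not the file's actual aggregation; the function names and sample numbers are illustrative.

```python
import math


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("percentile of empty sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(samples_ms: list[float]) -> dict[str, float]:
    """Summarise one stage's latency samples as p50 and p95."""
    return {"p50": percentile(samples_ms, 50), "p95": percentile(samples_ms, 95)}


# Illustrative first-audio samples for three turns, in milliseconds.
first_audio = summarise([120, 80, 100])
```

With only three turns, p95 is simply the slowest turn; the percentiles become meaningful once many turns have been recorded.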

What this implementation deliberately skips

  • Real providers. Mocks throughout. Real integration would replace VAD with Silero or webrtcvad; STT with Deepgram or Whisper streaming; TTS with Cartesia or ElevenLabs.
  • Acoustic echo cancellation. Required for full-duplex.
  • Audio I/O. No actual microphone or speaker; the scenarios are scripted frames.
  • Network resilience. Real providers are over WebSockets; retries, reconnects, partial-message handling all matter.
  • Multi-language. The mock is English-only.

Interview probes

  • "Walk through the stages of a voice agent and what each does."
  • "How does end-of-turn detection work, and why is it hard?"
  • "What is barge-in, and how do you handle it without false positives?"
  • "What latency metrics matter for voice UX?"
  • "How would you reduce time-to-first-audio?"
  • "How does the pipeline structure differ from a chat agent?"

Each has a paragraph answer rooted in the structure of this implementation.