05. Assignment 9: End-to-End Voice Agent

Week 18. Build a streaming voice agent that feels conversational, not batchy. Use 02_explainer.md for the mental model and 03_study_material.md for vendor and protocol notes.

Goal

Be able to say, with evidence, “I built a voice agent, measured the latency budget, and can explain why it feels fast or slow.”

Before you build

Read these first:

1. 02_explainer.md §3-§6.
2. 03_study_material.md §2-§8.
3. 06_revision.md once, so you know the end-state you must defend.

Required system shape

mic_in
  -> VAD
  -> streaming STT
  -> endpointing
  -> LLM
  -> streaming TTS
  -> speaker_out

The design must support overlapping stages, not purely serial execution.
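One way to check for overlap is to confirm that TTS starts consuming tokens before the LLM has finished producing them. The sketch below is a minimal illustration with mocked stages connected by an `asyncio.Queue`. The names `mock_llm`, `mock_tts`, and `run_turn`, and the sleep durations, are assumptions made for the example, not part of any framework.

```python
import asyncio
import time


async def mock_llm(tokens_out: asyncio.Queue, events: list) -> None:
    """Hypothetical LLM mock: emits one token every 20 ms, then a None sentinel."""
    for token in ["Sure,", "here", "is", "the", "answer."]:
        await asyncio.sleep(0.02)
        events.append(("llm_token", time.monotonic()))
        await tokens_out.put(token)
    await tokens_out.put(None)


async def mock_tts(tokens_in: asyncio.Queue, events: list) -> None:
    """Hypothetical TTS mock: synthesizes each token as soon as it arrives."""
    while (token := await tokens_in.get()) is not None:
        events.append(("tts_chunk", time.monotonic()))
        await asyncio.sleep(0.01)  # pretend to synthesize audio for this token


async def run_turn() -> dict:
    tokens: asyncio.Queue = asyncio.Queue()
    events: list = []
    # Both stages run concurrently; the queue is the only coupling between them.
    await asyncio.gather(mock_llm(tokens, events), mock_tts(tokens, events))
    first_audio = min(t for kind, t in events if kind == "tts_chunk")
    llm_done = max(t for kind, t in events if kind == "llm_token")
    return {
        "first_audio": first_audio,
        "llm_done": llm_done,
        "overlapped": first_audio < llm_done,
    }
```

Calling `asyncio.run(run_turn())` should report `overlapped` as true with these mock timings. In a purely serial pipeline the first TTS chunk would only appear after the last LLM token.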

Implementation tracks

Option A: Framework-first

Use LiveKit Agents, Pipecat, or another voice-agent framework.

Option B: From-scratch orchestrator

Build the pipeline yourself with async tasks, WebSockets, and mock or real providers.

Both are acceptable as long as the latency numbers come from timestamps logged at runtime, not from estimates.

Required components

1. ASR / the ear

Streaming STT, not offline batch transcription.
Partial transcript handling with stable-prefix awareness.
Word-level timestamps if your provider exposes them.
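Stable-prefix awareness can be approximated by acting only on the words that consecutive partial transcripts agree on. The helper below is a minimal sketch under that assumption. `stable_prefix` and its `window` parameter are hypothetical names, and real providers may expose their own stability flags that should be preferred when available.

```python
def stable_prefix(partials: list[str], window: int = 2) -> str:
    """Return the leading words shared by the last `window` partial transcripts.

    Returns an empty string until at least `window` partials have arrived.
    """
    recent = [p.split() for p in partials[-window:]]
    if len(recent) < window:
        return ""
    prefix = []
    # zip stops at the shortest partial, so we never read past any of them.
    for words in zip(*recent):
        if all(w == words[0] for w in words):
            prefix.append(words[0])
        else:
            break
    return " ".join(prefix)
```

For example, with the partials `["book a", "book a fly", "book a flight to"]` the last two disagree at the third word, so only `"book a"` is treated as stable.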

2. Endpointing

Start with a silence threshold around 600-800 ms.
Add a lightweight semantic confirmation if possible.
Document one case where pure silence would fail.
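The combination of a silence threshold and a lightweight semantic check can be sketched as follows. This is an illustration, not a recommended production rule: the `TRAILING_HINTS` word list, the function names, and the 1500 ms extended timeout are all assumptions chosen for the example.

```python
# Hypothetical heuristic: words that usually signal an unfinished thought.
TRAILING_HINTS = {"and", "but", "so", "because", "to", "the", "um", "uh"}


def looks_complete(transcript: str) -> bool:
    """Cheap semantic check: the utterance does not end on a dangling word."""
    words = transcript.lower().split()
    if not words:
        return False
    return words[-1].strip(".,?!") not in TRAILING_HINTS


def end_of_turn(
    transcript: str,
    silence_ms: float,
    base_ms: float = 700,       # within the 600-800 ms starting range
    extended_ms: float = 1500,  # assumed fallback when the text looks unfinished
) -> bool:
    if silence_ms < base_ms:
        return False
    if looks_complete(transcript):
        return True
    # The text looks unfinished: wait longer before taking the turn.
    return silence_ms >= extended_ms
```

A case where pure silence would fail: the user says "I want to fly to" and pauses for 750 ms to recall the city. A silence-only rule ends the turn, while the check above keeps listening until the extended timeout.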

3. LLM / the brain

Use a fast model or a realistic mock.
Log time to first token.
Keep the first response clause short enough for early TTS.
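To hand the first clause to TTS early, the LLM token stream can be cut at clause boundaries rather than waiting for the full response. The generator below is a minimal sketch of that idea. The name `clauses`, the punctuation set, and the `max_chars` cap are assumptions, and the naive punctuation rule will split on decimals such as "3.5" if the tokenizer breaks them apart.

```python
from typing import Iterable, Iterator

CLAUSE_END = (",", ".", "?", "!", ";", ":")


def clauses(tokens: Iterable[str], max_chars: int = 80) -> Iterator[str]:
    """Group streamed tokens into clauses that can be sent to TTS one by one."""
    buf = ""
    for tok in tokens:
        buf += tok
        # Flush at punctuation, or when the buffer grows too long to keep waiting.
        if buf.rstrip().endswith(CLAUSE_END) or len(buf) >= max_chars:
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()
```

With the token stream `["Sure", ",", " I", " can", " help", ".", " What", " city", "?"]`, the first clause `"Sure,"` is available after two tokens, so synthesis can begin while the rest is still being generated.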

4. TTS / the voice

Use streaming TTS or a realistic streaming mock.
Measure time to first audio.
Support interruption when the user barges in.
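Time to first audio can be measured by timestamping the request and the arrival of the first streamed chunk. The sketch below uses a mocked stream; `mock_tts_stream`, its 50 ms warm-up, and `measure_first_audio` are hypothetical names and values, and a real provider's stream would take the mock's place.

```python
import asyncio
import time
from typing import AsyncIterator


async def mock_tts_stream(text: str) -> AsyncIterator[bytes]:
    """Hypothetical streaming TTS mock: 50 ms warm-up, then one chunk per word."""
    await asyncio.sleep(0.05)
    for word in text.split():
        yield word.encode()
        await asyncio.sleep(0.01)


async def measure_first_audio(text: str) -> dict:
    t_request = time.monotonic()
    t_first = None
    n_chunks = 0
    async for _chunk in mock_tts_stream(text):
        if t_first is None:
            t_first = time.monotonic()  # first audio bytes arrived
        n_chunks += 1
    t_done = time.monotonic()
    return {
        "ttfa_ms": None if t_first is None else (t_first - t_request) * 1000,
        "total_ms": (t_done - t_request) * 1000,
        "chunks": n_chunks,
    }
```

The gap between `ttfa_ms` and `total_ms` is the portion of synthesis that streaming hides from the user.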

5. Barge-in

When the user speaks over the assistant:

1. Detect the event quickly.
2. Stop playback immediately.
3. Cancel or deprioritize the in-flight response.
4. Resume listening and transcription.
5. Record whether the interrupted response was committed or abandoned.
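The stop, cancel, and record steps above can be sketched with task cancellation in `asyncio`. This is an illustration under stated assumptions: playback is a mocked loop, the barge-in signal is an `asyncio.Event` that a VAD would set in a real system, and the function names are hypothetical. Detection and resuming transcription are outside the sketch.

```python
import asyncio
from typing import Optional


async def play_response(chunks: list, played: list) -> None:
    """Hypothetical playback loop: one audio chunk per 20 ms."""
    for chunk in chunks:
        await asyncio.sleep(0.02)
        played.append(chunk)


async def speak_with_barge_in(chunks: list, barge_in: asyncio.Event) -> dict:
    played: list = []
    playback = asyncio.create_task(play_response(chunks, played))
    interrupt = asyncio.create_task(barge_in.wait())
    done, _ = await asyncio.wait(
        {playback, interrupt}, return_when=asyncio.FIRST_COMPLETED
    )
    if interrupt in done and not playback.done():
        playback.cancel()  # stop playback and drop the in-flight response
        try:
            await playback
        except asyncio.CancelledError:
            pass
        outcome = "abandoned"
    else:
        interrupt.cancel()  # response finished without interruption
        outcome = "committed"
    return {"outcome": outcome, "played": played}


async def demo(interrupt_after_s: Optional[float]) -> dict:
    barge_in = asyncio.Event()
    if interrupt_after_s is not None:
        # Stand-in for a VAD firing while the assistant is speaking.
        asyncio.get_running_loop().call_later(interrupt_after_s, barge_in.set)
    return await speak_with_barge_in(["a", "b", "c", "d", "e"], barge_in)
```

Running `asyncio.run(demo(0.05))` interrupts partway through and records the response as abandoned; `asyncio.run(demo(None))` plays all chunks and records it as committed. The `played` list shows how much of the response the user actually heard, which is useful when deciding what to keep in conversation history.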

Latency logging requirements

Log these timestamps per turn:

- t_end_of_turn
- t_stt_final
- t_llm_ttft
- t_llm_final
- t_tts_first_audio
- t_playback_start
- t_playback_end

Report p50 and p95 for each stage, plus the overall turn-to-first-audio latency, t_playback_start - t_end_of_turn.
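A minimal sketch of turning the logged timestamps into p50 and p95 figures follows. It assumes timestamps are recorded in milliseconds on a single monotonic clock and uses a nearest-rank percentile. The stage boundaries in `STAGES` are one possible assignment, not a prescribed one, and `percentile` and `report` are hypothetical helper names.

```python
import math

# Assumed mapping of stage name -> (start timestamp, end timestamp).
STAGES = {
    "stt_finalize": ("t_end_of_turn", "t_stt_final"),
    "llm_ttft": ("t_stt_final", "t_llm_ttft"),
    "tts_first_audio": ("t_llm_ttft", "t_tts_first_audio"),
    "playback_start": ("t_tts_first_audio", "t_playback_start"),
    "turn_to_first_audio": ("t_end_of_turn", "t_playback_start"),
}


def percentile(values: list, p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def report(turns: list) -> dict:
    """Compute p50 and p95 per stage from a list of per-turn timestamp dicts."""
    out = {}
    for stage, (start, end) in STAGES.items():
        durations = [turn[end] - turn[start] for turn in turns]
        out[stage] = {
            "p50": percentile(durations, 50),
            "p95": percentile(durations, 95),
        }
    return out
```

Reporting both percentiles matters because a few slow turns barely move the median but dominate how the agent feels in practice.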

Deliverables

run.sh or equivalent one-command startup.
pipeline.py or framework configuration.
metrics.py or logging layer.
README.md with architecture diagram and defended latency budget.
LATENCY.md or a latency section in the README.
Short demo recording or GIF if feasible.

Definition of done

[ ] The pipeline runs end to end with real or mocked providers.
[ ] Streaming stages overlap in a believable way.
[ ] One barge-in scenario is demonstrated.
[ ] p50 and p95 numbers are reported.
[ ] Endpointing strategy is explained, not hand-waved.
[ ] The README names the slowest stage and what you would optimize next.

Suggested target numbers

Good browser p95 turn-to-first-audio: 700-1200 ms.
Acceptable early project target: under 1500 ms.
Telephony or noisy-mobile flows may be worse, but explain why.

Common pitfalls

Building a serial pipeline and calling it streaming.
Treating every partial transcript as final truth.
Ignoring barge-in until the end.
Reporting only average latency.
Forgetting that telephony audio is harsher than browser audio.
Shipping voice cloning without policy and consent controls.

Writeup must answer

Why did you choose this ASR, LLM, and TTS combination?
Where did your latency budget go, stage by stage?
How did you detect turn completion?
What happens on barge-in?
When would you switch to an end-to-end voice model instead?

Interview retell template

“I built a voice agent with streaming STT, explicit endpointing, a fast LLM, and streaming TTS.

I measured p50 and p95 at each stage.

The hardest problem was not raw transcription.

It was turn-taking and interruption handling.

My slowest stage was [X], and the first optimization I would try next is [Y].”

Why this hands-on lab matters

This is the final AI engineering module. After shipping this, you should be able to discuss voice systems with the same confidence you now bring to RAG, agents, evals, and production serving.