07. Interruption and Barge-In — when the user talks over the assistant¶

~16 min read. Real conversations are messy, so your pipeline must stay polite under collision.

Built on the ELI5 in 00-eli5.md. The voice — the part currently speaking — must stop gracefully when the ear hears the user jump back into the relay race.

First picture: humans interrupt all the time¶

Picture a live interpreter at the United Nations. The translated voice is speaking. Then the delegate cuts in mid-sentence. A skilled interpreter stops, listens, and reorients immediately. A bad system keeps talking over the human. That feels rude, robotic, and unusable. So barge-in matters. Barge-in means the user speaks while the assistant is still talking.

Humans do this constantly. They correct themselves. They say, "No, wait." They change direction. They answer before the assistant finishes. Simple, no?

assistant audio playing
        │
        ▼
┌──────────────┐     user speech detected     ┌──────────────┐
│  the voice   │─────────────────────────────▶│   the ear    │
└──────┬───────┘                              └──────┬───────┘
       │                                             │
       ├── stop TTS now                              │
       ├── cancel brain response                     │
       └──────────── resume listening ◀──────────────┘

The relay race is no longer one clean lane. Now two runners collide. Your job is to resolve the collision fast. If you delay, the awkward pause becomes an awkward overlap. That is even worse.

What the pipeline should do immediately¶

When barge-in happens, the reaction must be ordered. Do not improvise with random callbacks. Use a consistent sequence.

Detect fresh user speech.
Stop TTS immediately.
Cancel the in-flight response if it is no longer useful.
Resume ASR with priority.
Track exactly what the user actually said.

Look. The first action is not reasoning. The first action is silence. If the voice keeps talking while the ear hears the user, your product sounds deaf. A practical event flow looks like this.

Step	Event	Why it happens
1	`client.vad.state = speech_started`	The ear or client VAD notices the human is back
2	`tts.stop`	The voice must stop audio output immediately
3	`llm.cancel`	The brain should stop spending tokens on stale intent
4	`asr.resume`	The ear returns to active hearing mode
5	`turn.capture`	The orchestrator records the new utterance for the next turn

Yes? This order protects user intent. Stopping TTS first preserves dignity. Cancelling the brain protects cost. Resuming the ear protects meaning. The relay race restarts from the correct baton.
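The five-step flow above can be sketched as one handler that always fires the steps in the same order. This is a minimal sketch, not a real SDK: the class name `BargeInHandler`, the `response_still_useful` flag, and the in-memory log are assumptions made for illustration. In a real pipeline each log line would be a call into the voice, the brain, or the ear.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BargeInHandler:
    """Hypothetical orchestrator hook: runs the reaction in a fixed order."""

    # Stand-in for real component calls; each entry is one emitted event.
    log: List[str] = field(default_factory=list)

    def on_speech_started(self, response_still_useful: bool = False) -> List[str]:
        # Step 1: the ear (or client VAD) noticed the human is back.
        self.log.append("client.vad.state = speech_started")
        # Step 2: silence comes before any reasoning.
        self.log.append("tts.stop")
        # Step 3: cancel the in-flight response unless it is still useful.
        if not response_still_useful:
            self.log.append("llm.cancel")
        # Step 4: the ear returns to active hearing mode.
        self.log.append("asr.resume")
        # Step 5: record the new utterance for the next turn.
        self.log.append("turn.capture")
        return self.log


if __name__ == "__main__":
    for event in BargeInHandler().on_speech_started():
        print(event)
```

Keeping the order in one function, rather than in scattered callbacks, means there is exactly one place to check when the voice stops late.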

Turn-taking choreography is not one-size-fits-all¶

Here is where many teams get surprised. The same threshold does not fit every speaker. Fast speakers leave tiny gaps. Slow speakers leave wider gaps. Some users think aloud while hesitating. Some users attack the question directly. So what to do? Tune turn-taking thresholds with real speech behavior in mind.

Fast speakers usually need shorter end-of-turn thresholds. Otherwise the system keeps waiting, and the awkward pause grows. Slow speakers usually need more patience. Otherwise the ear finalizes too early, and the brain answers before the user is done. That feels interruptive in the other direction.

fast speaker
word ─ word ─ word  gap  word
                 ▲
          short threshold works

slow speaker
word ───── gap ───── word
           ▲
     short threshold clips thought

A strong system can adapt. It can use speaker history, channel type, or interaction mode. A phone bot may need different timing than a browser demo. A tutoring app may tolerate longer reflection than a concierge bot. The ear, the brain, and the voice all benefit when turn-taking fits the human.
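One way to adapt is to derive the end-of-turn threshold from the speaker's own recent pauses. The sketch below is an assumption-heavy illustration: the function name, the 2x multiplier, the 700 ms default, and the 300 to 1500 ms clamp are all made-up starting values, not measured recommendations. Tune them against real recordings.

```python
from statistics import median
from typing import Sequence


def end_of_turn_threshold_ms(
    recent_gaps_ms: Sequence[float],
    default_ms: float = 700.0,   # assumed fallback when history is thin
    multiplier: float = 2.0,     # assumed: wait about twice the typical pause
    floor_ms: float = 300.0,     # assumed lower clamp
    ceiling_ms: float = 1500.0,  # assumed upper clamp
) -> float:
    """Pick a silence threshold from the speaker's mid-utterance gaps."""
    if len(recent_gaps_ms) < 3:
        # Too little history to trust; use the default.
        return default_ms
    candidate = median(recent_gaps_ms) * multiplier
    # Clamp so one odd session cannot push the threshold to an extreme.
    return max(floor_ms, min(ceiling_ms, candidate))


if __name__ == "__main__":
    print(end_of_turn_threshold_ms([120, 150, 180, 140]))  # fast speaker
    print(end_of_turn_threshold_ms([500, 600, 700]))       # slow speaker
```

The median is used instead of the mean so a single long hesitation does not stretch the threshold for the whole session.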

Use an explicit state machine or invite bugs¶

Now we reach the engineering heart. Race conditions hide inside ad hoc callbacks. One callback says, "stop audio." Another says, "flush queued chunks." A third says, "resume listening." A fourth still sends old tokens to TTS. Then chaos begins. So use explicit states. Name them. Guard transitions. A clean starter machine is this.

┌──────────────┐
│  LISTENING   │
└──────┬───────┘
       │ speech ends
       ▼
┌──────────────┐
│TRANSCRIBING  │
└──────┬───────┘
       │ transcript ready
       ▼
┌──────────────┐
│   THINKING   │
└──────┬───────┘
       │ first audio ready
       ▼
┌──────────────┐
│   SPEAKING   │
└──────┬───────┘
       │ user barges in
       ▼
┌──────────────┐
│ INTERRUPTED  │
└──────┬───────┘
       │ cleanup complete
       ▼
┌──────────────┐
│  LISTENING   │
└──────────────┘

Each state should answer two questions clearly. What events are allowed here? What resources are active here? For example, in SPEAKING, TTS playback is active, ASR may be in reduced or duplex mode, and new user speech must trigger interruption logic. In INTERRUPTED, old audio buffers should be draining or discarded, old LLM output should be cancelled, and the ear should regain priority.

See. Explicit states do not slow development. They prevent ghost behavior. Ghost behavior means old audio still playing, old tokens still arriving, or stale transcript fragments attaching to the new turn. That is how users lose trust.
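The starter machine in the diagram can be written as a transition table with a guard. This is a minimal sketch: the event names (`speech_ended`, `barge_in`, and so on) and the `TurnMachine` class are assumptions chosen to match the diagram labels. It covers only the transitions drawn above. A production machine would also need paths such as normal playback completion and barge-in during THINKING.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    TRANSCRIBING = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()


# (current state, event) -> next state. Any pair not listed is illegal.
TRANSITIONS = {
    (State.LISTENING, "speech_ended"): State.TRANSCRIBING,
    (State.TRANSCRIBING, "transcript_ready"): State.THINKING,
    (State.THINKING, "first_audio_ready"): State.SPEAKING,
    (State.SPEAKING, "barge_in"): State.INTERRUPTED,
    (State.INTERRUPTED, "cleanup_complete"): State.LISTENING,
}


class TurnMachine:
    """Hypothetical turn tracker that rejects events the state does not allow."""

    def __init__(self) -> None:
        self.state = State.LISTENING

    def handle(self, event: str) -> State:
        key = (self.state, event)
        if key not in TRANSITIONS:
            # Guard: a stale callback cannot silently move the machine.
            raise ValueError(f"illegal event {event!r} in state {self.state.name}")
        self.state = TRANSITIONS[key]
        return self.state


if __name__ == "__main__":
    machine = TurnMachine()
    for ev in ["speech_ended", "transcript_ready", "first_audio_ready",
               "barge_in", "cleanup_complete"]:
        print(ev, "->", machine.handle(ev).name)
```

Raising on an illegal event is a deliberate choice for the sketch. In production you might log and drop the event instead, but it should never be applied.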

Common bug patterns during barge-in¶

Let us name a few ugly failures.

Late stop bug: the voice receives stop, but two buffered chunks still play.
Zombie brain bug: the brain keeps generating after interruption, then writes stale text into logs or memory.
Split-turn bug: half of the new utterance attaches to the old turn, and the rest becomes a new turn.
Double-resume bug: ASR restarts twice, creating duplicate partial transcripts.
Phantom answer bug: cancelled content still reaches the voice after cleanup.

Simple, no? Most of these bugs happen because ownership is fuzzy. Who owns cancellation? Who owns queued audio? Who confirms the new listening state? The orchestrator should answer those questions, not your hope. The relay race needs a referee when runners collide.
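One common way to give the orchestrator ownership of cancellation is a response ID that every chunk must carry. The sketch below illustrates that idea against the phantom answer bug. The `Orchestrator` class and its method names are hypothetical, and the `played` list stands in for a real audio queue.

```python
from typing import List


class Orchestrator:
    """Hypothetical owner of cancellation: stale chunks are dropped by ID."""

    def __init__(self) -> None:
        self.active_response_id = 0
        self.played: List[str] = []  # stand-in for the real audio queue

    def start_response(self) -> int:
        # Each new response gets a fresh ID that its chunks must carry.
        self.active_response_id += 1
        return self.active_response_id

    def cancel_active(self) -> None:
        # Bumping the ID makes every in-flight chunk stale at once.
        self.active_response_id += 1

    def on_tts_chunk(self, response_id: int, chunk: str) -> bool:
        # Guard: only chunks from the active response reach the voice.
        if response_id != self.active_response_id:
            return False
        self.played.append(chunk)
        return True


if __name__ == "__main__":
    orch = Orchestrator()
    rid = orch.start_response()
    orch.on_tts_chunk(rid, "Your balance is")
    orch.cancel_active()                     # user barged in
    print(orch.on_tts_chunk(rid, "forty"))   # late chunk is rejected
    print(orch.played)
```

The same check can sit in front of logs and memory writes, which addresses the zombie brain bug with the same mechanism.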

A healthy interruption policy¶

A mature team writes a policy, not just code. For example:

Stop assistant audio within a tight latency budget, often under 100 milliseconds after speech detection.
Mark the active response as cancelled before generating more tokens.
Preserve whatever the ear actually heard, not what the assistant expected to hear.
Log interruption timestamps across the relay race.
Re-enter LISTENING only after cleanup is complete.

Look. The awkward pause is not the only latency problem. After barge-in, you also care about interruption responsiveness. Users feel this even more sharply than first response speed. A slightly slower but well-behaved system can beat a faster rude one.
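Interruption responsiveness becomes checkable once the timestamps from the policy are logged. Here is a minimal sketch, assuming all four timestamps come from one monotonic clock in milliseconds. The `InterruptionTrace` name, its fields, and the default 100 ms budget are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterruptionTrace:
    """Hypothetical log record for one barge-in, in milliseconds."""

    speech_detected_ms: float
    tts_stopped_ms: float
    llm_cancelled_ms: float
    listening_resumed_ms: float

    def stop_latency_ms(self) -> float:
        # How long the voice kept talking over the user.
        return self.tts_stopped_ms - self.speech_detected_ms

    def recovery_latency_ms(self) -> float:
        # How long until the pipeline was back in LISTENING.
        return self.listening_resumed_ms - self.speech_detected_ms

    def meets_budget(self, budget_ms: float = 100.0) -> bool:
        return self.stop_latency_ms() <= budget_ms


if __name__ == "__main__":
    trace = InterruptionTrace(1000.0, 1060.0, 1075.0, 1120.0)
    print(trace.stop_latency_ms(), trace.meets_budget())
```

Tracking stop latency separately from recovery latency shows whether a slow interruption comes from the voice or from cleanup.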

Where this lives in the wild¶

ChatGPT voice mode — realtime conversation engineer: must stop speaking instantly when the user cuts in with a correction.
Call-center billing bot — dialogue systems engineer: needs precise turn capture so an interrupted verification step does not corrupt records.
Language tutoring app — speech UX engineer: adapts thresholds because nervous learners pause differently from fluent speakers.
In-car copilot — embedded voice engineer: handles barge-in aggressively because safety conversations cannot wait for polite completion.
Hospital triage voice assistant — reliability engineer: uses explicit states so interruptions do not mix symptoms across turns.

Pause and recall¶

What should happen first when the user barges in during assistant speech?
Why do fast and slow speakers need different turn-taking thresholds?
How does an explicit state machine reduce race conditions?
Which bug appears if stale audio or stale tokens survive after interruption?

Interview Q&A¶

Q: What is barge-in in a voice system?
A: It is the case where the user starts speaking while the assistant is still speaking, forcing immediate interruption handling and turn recovery.
Common wrong answer to avoid: "It just means the user pressed a stop button."

Q: Why should TTS stop before deeper reasoning happens?
A: Because the first job is to stop talking over the user, which protects conversational quality before backend cleanup finishes.
Common wrong answer to avoid: "Finish the current sentence, then see whether the user still matters."

Q: Why is a state machine better than scattered callbacks for barge-in?
A: Because state machines make legal transitions and active resources explicit, which prevents stale audio, duplicate ASR, and zombie generation.
Common wrong answer to avoid: "Callbacks are fine as long as there are enough of them."

Q: How should threshold tuning differ across speakers?
A: Fast speakers usually need shorter patience windows, while slow reflective speakers need longer ones so the system does not cut them off early.
Common wrong answer to avoid: "Use one universal silence threshold for everyone."

Apply now (5 min)¶

Exercise. Write the five-step interruption reaction for a voice bot. Then add one failure mode if step two happens late.

Sketch from memory. Draw the state machine from LISTENING to INTERRUPTED and back. Label where the ear, the brain, and the voice lose or regain control. Mark where the awkward pause turns into overlap pain.

Bridge. Cascaded pipelines handle barge-in with explicit coordination. End-to-end voice models handle the same problem with a different abstraction. → 08-end-to-end-voice-models.md