02. Getting audio off a phone network and back without stepping on the caller

~17 min read. The model is ready to listen. But the call is 8 kHz μ-law audio arriving in 20 ms chunks over a protocol designed in 1996, and the moment the bot speaks, the caller's "actually, wait—" has to be able to cut it off.

Built on 00-first-principles.md. This chapter opens the first integration seam — telephony media — and starts spending the turn budget. Barge-in here is the same turn-detection pressure that endpointing (chapter 03) and orchestration (chapter 04) inherit.

Note: the voice-agents module covers audio codecs, sample rates, and barge-in mechanics in depth. This chapter assumes that and focuses on the contact-center seam: forking a carrier-grade phone call into and out of an AI, the protocols that carry it, and how jitter and transfer mechanics constrain the bot.


What the stack map left unsaid

Chapter 01 drew the floor plan and marked where the AI inserts. It treated "the bot hears the caller and speaks back" as a single arrow. That arrow is a lie of omission. Between the caller's mouth and the model's ear sits a carrier network: a SIP session negotiating who-talks-to-whom, RTP packets carrying compressed audio across the public internet, jitter buffers smoothing out late packets, and codecs squeezing voice into 64 kbps or less. None of it was built for a machine listener.

To insert the AI you must fork the call — tap the live audio, send a copy to your ASR, and inject the bot's synthesized speech back into the same call — without the caller hearing a seam, and without breaking the carrier's expectations about how a call behaves. This chapter is that plumbing. By the end you can name how audio reaches your model (Twilio Media Streams, Amazon Connect KVS, Genesys AudioConnect, or a SIPREC fork), what it costs in latency before the model has done anything, and why barge-in is a media problem before it is a dialog problem.


What this file solves

A bot can have a perfect ASR and LLM and still feel broken because the audio path adds 150–300 ms before the model hears a word, drops packets that mangle a digit in an account number, or talks over a caller who is trying to interrupt. This file shows how a phone call is actually carried (SIP/RTP/WebRTC), how you fork it to the AI and inject speech back, and how to handle barge-in and jitter — so the bot's first ~200 ms of turn budget is spent on purpose, not by accident.


Why telephony refuses to behave like a microphone

When you build a voice demo on a laptop, audio is easy: a clean 16 kHz or 48 kHz PCM stream straight from a local mic, near-zero latency, no packet loss. A phone call is none of that. The audio originates from a carrier, often as 8 kHz μ-law (G.711) — telephone-quality, band-limited to ~300–3400 Hz, the reason voices sound "phone-ish." It travels as RTP (Real-time Transport Protocol) packets, typically one packet every 20 ms, over UDP, where packets can arrive late, out of order, or not at all. The session that sets all this up — codecs, ports, who streams to whom — is negotiated by SIP (Session Initiation Protocol). On the browser/app side, the same media may ride WebRTC with the Opus codec at higher quality.
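The numbers in that paragraph fit together with simple arithmetic. A minimal sketch, assuming plain G.711 μ-law (one byte per sample) and the 20 ms packetization described above; the function name is illustrative only:

```python
SAMPLE_RATE_HZ = 8000   # narrowband telephony sample rate
BITS_PER_SAMPLE = 8     # G.711 mu-law stores one byte per sample
FRAME_MS = 20           # typical RTP packetization interval


def g711_frame_stats(sample_rate_hz=SAMPLE_RATE_HZ,
                     bits_per_sample=BITS_PER_SAMPLE,
                     frame_ms=FRAME_MS):
    """Return the basic sizes of one RTP audio frame (payload only, no headers)."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return {
        "samples_per_frame": samples_per_frame,
        "payload_bytes": samples_per_frame * bits_per_sample // 8,
        "packets_per_second": 1000 // frame_ms,
        "codec_bitrate_bps": sample_rate_hz * bits_per_sample,
    }


stats = g711_frame_stats()
print(stats)
```

So each packet carries 160 bytes of audio, 50 packets arrive per second, and the codec runs at 64 kbps before RTP/UDP/IP headers are counted.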

Three consequences fall out of this immediately. The narrow 8 kHz band and μ-law compression cost you ASR accuracy — a model trained on wideband audio degrades on phone audio, which is why ASR vendors ship phone-tuned models. The 20 ms packet cadence plus a jitter buffer means audio reaches your model in small, slightly delayed chunks, not a continuous river. And because it's UDP, a lost packet is just gone — a 20 ms hole that can clip the difference between "fifteen" and "fifty" in a card amount.

Teacher voice. The first thing a senior engineer asks about a voice-AI stack is not "which LLM" — it is "how does the audio get to the model, and how many milliseconds does that path cost before the model has done anything?" Telephony is the part of the latency budget you cannot optimize away with a faster GPU.


The naive fork that adds half a second

The obvious way to get audio to the AI: have the CCaaS platform record the call, drop the file in storage, and send it to ASR. A smart person tries this first because it reuses the recording pipeline that already exists. It works perfectly — for analytics. For a live conversation it is catastrophic: you only get the audio after the caller stops, after the file is written, after it uploads. The bot can't respond until the whole turn is captured and shipped. That's the serial, "polite pipeline" failure — seconds of dead air.

The visible break: the caller finishes, then waits. And waits. The latency floor is the entire utterance length plus file write plus upload — easily 2–4 seconds before ASR even starts.

So the real problem is not "ASR is slow" and not "the LLM is slow." It is that the audio is being delivered in one big batch after the turn ends, instead of streaming during the turn. How do we get audio to the model while the caller is still speaking, packet by packet?

The answer is a live media stream, or fork: the platform taps the RTP audio in real time and pushes it to your endpoint over a WebSocket (Twilio Media Streams; Genesys AudioConnect's AudioHook protocol is also WebSocket-based), into a managed stream (Amazon Connect Kinesis Video Streams), over a SIP-based recording fork (SIPREC, used by carrier-grade recorders and many bridges), or via a WebRTC track (LiveKit). The model receives 20 ms frames as they arrive and starts transcribing immediately.


Rule: live audio must be forked and streamed, never batched

The load-bearing rule of the media layer: for any real-time bot, audio must reach the model as a live stream of small frames during the turn, and the bot's speech must inject back into the live call — batching either direction destroys the turn budget. Recording is for analytics (chapter 06); forking is for conversation.

Why this rule exists. The primitive is that conversation latency is bounded by when the model first sees audio, and a batch delivery sets that floor at the full utterance length. Streaming moves the floor down to one frame (~20 ms) plus transport, letting ASR work in parallel with the caller still talking. The constraint that forces it: a turn budget under one second cannot absorb a multi-second batch delay.
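The two latency floors can be compared directly. This is a back-of-the-envelope sketch: the utterance length, file-write, upload, and transport figures are assumed values chosen to sit inside the ranges quoted in this chapter, not measurements.

```python
def batch_floor_ms(utterance_ms, file_write_ms, upload_ms):
    """Earliest moment ASR can start when audio is delivered as a finished file."""
    return utterance_ms + file_write_ms + upload_ms


def streaming_floor_ms(frame_ms, transport_ms):
    """Earliest moment ASR can start when audio is forked frame by frame."""
    return frame_ms + transport_ms


# Assumed example values (hypothetical, for illustration only).
batch = batch_floor_ms(utterance_ms=2000, file_write_ms=300, upload_ms=500)
stream = streaming_floor_ms(frame_ms=20, transport_ms=100)

print(f"batch floor:     {batch} ms")
print(f"streaming floor: {stream} ms")
```

With these inputs the batch path cannot start transcribing until 2.8 s after the caller began speaking, while the streaming path starts after about 120 ms and works in parallel with the rest of the utterance.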


1) How the fork actually works — the round trip in milliseconds

Trace the audio round trip for our billing bot when the caller says "what's my balance?"

                  THE MEDIA ROUND TRIP (one turn)

 Caller mouth
    │ speech
┌──────────┐  G.711 μ-law, RTP 20ms frames over UDP
│ Carrier  │───────────────────────────────────────────┐
└──────────┘                                            │
    ▲                                                    ▼
    │ inject bot audio                          ┌──────────────────┐
    │ back into call                            │  CCaaS platform  │
    │                                           │  forks the media │
┌───┴──────────┐                                └────────┬─────────┘
│ Carrier      │                                         │ WebSocket / KVS / SIPREC
│ (mixes in)   │                                         ▼   (~50–150ms transport)
└──────────────┘                              ┌─────────────────────┐
        ▲                                      │  YOUR MEDIA BRIDGE  │
        │ TTS audio frames                     │  decode → resample  │
        │ (streamed back)                      │  → push to ASR      │
┌───────┴────────┐                             └──────────┬──────────┘
│  TTS engine    │◀── text ──┐                            │ frames
└────────────────┘           │                            ▼
                       ┌──────┴──────┐              ┌────────────┐
                       │ Orchestrator│◀── transcript│  STREAMING │
                       │  + LLM      │              │    ASR     │
                       └─────────────┘              └────────────┘

The media path alone, before ASR, LLM, or TTS do any thinking, typically spends 50–150 ms each way in transport and buffering. On Twilio Media Streams, audio forks over a WebSocket as base64-encoded μ-law frames; you decode them to linear PCM and resample to whatever your ASR wants (often 16 kHz). On Amazon Connect, customer audio streams via Kinesis Video Streams; historically that stream carried the customer side, with agent or bot audio handled separately. The injection path back (getting TTS audio into the live call) runs through the same platform, either by streaming media back over the same channel or by having the platform play your synthesized audio.
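The decode-and-resample step on a WebSocket fork might look like the sketch below. It assumes a Twilio-style JSON message with `event` set to `media` and a base64 `media.payload` of μ-law bytes. The μ-law expansion follows the standard G.711 table logic, written out by hand because Python's `audioop` module is no longer in recent standard libraries. The 2x upsampler is a deliberately naive linear interpolation standing in for a real resampler, and the helper names are hypothetical.

```python
import base64
import json
from typing import List, Optional


def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~u & 0xFF                       # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_media_message(raw: str) -> Optional[List[int]]:
    """Return PCM16 samples for a 'media' message, or None for other events."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        return None
    ulaw = base64.b64decode(msg["media"]["payload"])
    return [ulaw_byte_to_pcm16(b) for b in ulaw]


def upsample_2x_linear(pcm: List[int]) -> List[int]:
    """Naive 8 kHz -> 16 kHz: insert the midpoint between neighbouring samples."""
    out = []
    for i, sample in enumerate(pcm):
        nxt = pcm[i + 1] if i + 1 < len(pcm) else sample
        out.append(sample)
        out.append((sample + nxt) // 2)
    return out


# One 20 ms frame of mu-law silence (0xFF), wrapped the way the fork delivers it.
example = json.dumps({
    "event": "media",
    "media": {"payload": base64.b64encode(bytes([0xFF]) * 160).decode("ascii")},
})
pcm_8k = decode_media_message(example)
pcm_16k = upsample_2x_linear(pcm_8k)
print(len(pcm_8k), len(pcm_16k))
```

A production bridge would replace `upsample_2x_linear` with a filtered resampler, since plain interpolation leaves imaging artifacts that can hurt ASR.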

For the billing bot, this means the turn budget is already ~150–300 ms in the hole on a typical transport round trip (two one-way trips of 50–150 ms each, so as little as ~100 ms on the best paths) before the ASR emits its first interim word. Chapter 04 budgets the rest; this is the tax you pay just to be on the call.


2) Picture: the bot as a third party tapped into a two-party call

The mental model that keeps barge-in and transfer straight: a phone call is fundamentally a two-leg bridge (caller-leg and the other-leg), and the bot is a tap-and-inject sitting on the bridge.

        ┌─────────────────── THE CALL BRIDGE ───────────────────┐
        │                                                        │
   Caller leg  ════════════════════════════════════════  Bot/agent leg
        │           ▲ tap (copy out)        ▼ inject            │
        │           │                       (play in)           │
        │       ┌───┴───────────────────────┴───┐               │
        │       │   AI media bridge             │               │
        │       │   listens + speaks            │               │
        │       └───────────────────────────────┘               │
        └────────────────────────────────────────────────────────┘

   Transfer = swap the bot leg for a human leg, KEEP the caller leg up
              (and pass the baton — ch 07)

Two things become obvious from this picture. First, barge-in is about the bridge: when the caller talks while the bot is playing, both audio streams exist on the bridge simultaneously, and the bot must detect the caller's energy and stop its own playback fast. Second, transfer is a leg swap: you don't end the call, you replace the bot leg with a human leg while keeping the caller leg up — which is exactly why the baton (chapter 07) must move at that instant. The recorder, meanwhile, taps the same bridge (chapter 08), which is the seam where a card number can leak.
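The leg-swap idea can be written down as a tiny data model. This is an illustrative sketch only: `Leg`, `CallBridge`, and `transfer_to_human` are hypothetical names, and a real platform performs the swap through its own call-control API.

```python
from dataclasses import dataclass, field


@dataclass
class Leg:
    party: str              # e.g. "caller", "bot", "human:<agent id>"
    connected: bool = True


@dataclass
class CallBridge:
    caller: Leg
    other: Leg
    context: dict = field(default_factory=dict)   # the baton that must move

    def transfer_to_human(self, agent_id: str) -> Leg:
        """Swap the non-caller leg; the caller leg is never torn down."""
        old = self.other
        old.connected = False
        self.other = Leg(party=f"human:{agent_id}")
        self.context["transferred_from"] = old.party
        return old


bridge = CallBridge(caller=Leg("caller"), other=Leg("bot"),
                    context={"intent": "pay_balance"})
released = bridge.transfer_to_human("agent-42")
print(bridge.caller.connected, bridge.other.party, bridge.context)
```

The caller leg object is untouched by the transfer, and the context dictionary travels with the bridge rather than with the bot leg that was released.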


3) The running example: barge-in on the billing bot

Thread the billing call. The bot starts reading a long balance breakdown: "Your current balance is fifty-nine dollars, which includes a monthly service charge of—" and the caller cuts in: "no no, I just want to pay it." A human would stop instantly. The bot must too.

Attempt A — wait for the bot to finish (no barge-in)

The bot keeps reading the full breakdown while the caller is talking over it. By the time it finishes, the caller has repeated themselves twice and is irritated. Worse, the caller's "I just want to pay it" landed during the bot's playback, and if the ASR was muted during TTS (a common naive choice to avoid transcribing the bot's own voice), the bot never even heard it. Handle time inflates; the call feels robotic.

Attempt B — barge-in: detect caller energy, stop playback, keep listening

The bridge keeps the caller's audio flowing to a fast voice-activity / energy detector even while the bot is speaking. The instant the caller's speech crosses threshold for ~100–200 ms, the bridge (1) stops the bot's TTS playback immediately, (2) flushes the queued unspoken TTS audio, and (3) treats the caller's incoming speech as a new turn for ASR. The bot says "...service char—" and goes quiet; the caller's "I just want to pay it" is captured and acted on.
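The three steps above can be sketched as a small detector plus a playback queue. This is a minimal sketch under stated assumptions: a plain RMS energy threshold stands in for a real voice-activity detector, the threshold of 500 and the 120 ms hold time are hypothetical tuning values, and `Playback` is a stand-in for whatever queue feeds TTS frames to the call.

```python
import math
from collections import deque

FRAME_MS = 20


def rms(frame):
    """Root-mean-square energy of one PCM16 frame."""
    return math.sqrt(sum(s * s for s in frame) / len(frame)) if frame else 0.0


class BargeInDetector:
    """Fires once caller energy stays above threshold for hold_ms."""

    def __init__(self, threshold_rms=500.0, hold_ms=120):
        self.threshold = threshold_rms
        self.frames_needed = hold_ms // FRAME_MS
        self.run = 0

    def feed(self, frame) -> bool:
        self.run = self.run + 1 if rms(frame) >= self.threshold else 0
        return self.run >= self.frames_needed


class Playback:
    """Queue of TTS frames waiting to be injected into the call."""

    def __init__(self):
        self.queue = deque()
        self.playing = False

    def enqueue(self, frames):
        self.queue.extend(frames)
        self.playing = True

    def stop_and_flush(self) -> int:
        dropped = len(self.queue)
        self.queue.clear()          # discard unspoken audio, not just new output
        self.playing = False
        return dropped


playback = Playback()
playback.enqueue([b"tts-frame"] * 50)            # one second of queued speech
detector = BargeInDetector()
caller_frames = [[0] * 160] * 3 + [[2000] * 160] * 10   # silence, then speech

fired_at, dropped = None, 0
for i, frame in enumerate(caller_frames):
    if playback.playing and detector.feed(frame):
        dropped = playback.stop_and_flush()      # steps (1) and (2)
        fired_at = i                             # step (3): new ASR turn starts here
        break

print(fired_at, dropped, playback.playing)
```

The detector fires on the sixth consecutive loud frame, which is 120 ms of sustained caller speech, and the flush drops every queued frame so no buffered tail keeps playing.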

The hard part hiding here: echo. The bot's own voice, played into the caller leg, can bleed back and trip the energy detector — the bot interrupts itself. So the bridge needs echo cancellation or it must know which audio is its own. This is a media-layer problem, not a dialog-layer one, and it's why barge-in lives in this chapter.


4) Why fork the media instead of using the platform's bot node — choosing under a control workload

CCaaS platforms offer a built-in "bot" or "Lex/Dialogflow" node you can drop into the IVR flow. Why fork raw media to your own bridge instead?

  • Use the platform bot node — fastest to wire, the platform handles media and barge-in for you. But you're locked to the platform's ASR/TTS choices, its barge-in tuning, and its latency. You cannot bring Deepgram + your LLM + Cartesia and own the turn budget.
  • Fork raw media to your own bridge — you decode, run your own ASR/endpointing, your own barge-in policy, your own TTS, and inject back. Full control of the turn budget and component choice, at the cost of building and operating the bridge.

For the billing bot under a control-and-latency workload — where you've decided (chapter 04) to own a sub-800 ms turn and need a specific phone-tuned ASR (chapter 03) — forking wins. If you just need a simple FAQ deflection and don't care which ASR, the platform node is fine. The deciding question: do you need to own the turn budget? If yes, fork.


5) The property that changes everything: jitter and packet loss are not optional edge cases

The dimension people underestimate is network jitter — the variance in packet arrival time. Telephony audio is 20 ms frames; if frames arrive at 18, 24, 19, 35, 16 ms intervals, the jitter buffer must hold a few frames to smooth playback, which adds latency (more buffer = smoother but slower). And UDP packet loss means some 20 ms frames simply vanish.

  Sent:    [f1][f2][f3][f4][f5][f6]   (every 20ms)
  Arrived: [f1]   [f3][f2][f4]   [f6]  (f5 lost, f2/f3 reordered, gaps)
  Buffer:  hold 2-3 frames → reorder → conceal loss → steady 20ms out
           ↑ smoother playback                ↑ but +40–60ms latency

For ASR this matters twice. A lost frame in a digit sequence — "account ending four-four-one-seven" — can turn 4417 into 417 or 4_17, a verification failure (chapter 07). And a bigger jitter buffer steals from the turn budget. The honest engineer treats a target loss/jitter profile as a design input: phone audio is lossy and band-limited, so the ASR must be robust to it and the verification logic must confirm digits, not trust a single pass.
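The reorder-and-conceal behaviour in the diagram above can be sketched in a few lines. This is a simplified model: it releases one frame per arrival instead of running on a 20 ms playout clock, it ignores 16-bit sequence-number wraparound, and it conceals a loss by repeating the previous frame. The class and function names are hypothetical.

```python
class JitterBuffer:
    """Holds `depth` frames so late or reordered frames can be slotted in."""

    def __init__(self, depth=2):
        self.depth = depth
        self.pending = {}       # sequence number -> payload
        self.next_seq = None    # next sequence number owed to the consumer
        self.last = None        # last payload played, used for concealment

    def push(self, seq, payload):
        if self.next_seq is None:
            self.next_seq = seq
        if seq >= self.next_seq:            # drop frames whose slot already played
            self.pending[seq] = payload

    def pop(self, flush=False):
        """Return (seq, payload, concealed), or None while still filling."""
        if self.next_seq is None or not self.pending:
            return None
        if len(self.pending) < self.depth and not flush:
            return None
        seq = self.next_seq
        self.next_seq += 1
        if seq in self.pending:
            self.last = self.pending.pop(seq)
            return (seq, self.last, False)
        return (seq, self.last, True)       # lost frame: repeat the previous one


def play_out(arrivals, depth=2):
    jb, out = JitterBuffer(depth), []
    for seq, payload in arrivals:
        jb.push(seq, payload)
        item = jb.pop()
        if item:
            out.append(item)
    while jb.pending:                       # drain at end of stream
        out.append(jb.pop(flush=True))
    return out


# Same pattern as the diagram: f2/f3 reordered, f5 never arrives.
arrivals = [(1, "f1"), (3, "f3"), (2, "f2"), (4, "f4"), (6, "f6")]
played = play_out(arrivals)
print(played)
```

The output is back in order with frame 5 filled by a repeat of frame 4. That repeat is exactly the kind of concealment that keeps audio smooth for a human ear while still leaving a digit ambiguous for ASR.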


6) One failure walked through: the half-second of dead air from a fat jitter buffer

Incident: the billing bot feels sluggish only for callers on certain mobile carriers. ASR latency looks fine in the dashboard. LLM latency is fine. TTS is fine. Yet those callers experience a clear extra pause every turn.

The chain: those carriers had higher jitter. The media bridge's jitter buffer was configured conservatively — 5 frames (100 ms) — to avoid audio glitches. For low-jitter callers that buffer mostly passed through. For high-jitter callers it stayed full, adding ~100 ms every turn in each direction — ~200 ms round trip — on top of the transport latency. None of the component dashboards showed it because the latency was in the media layer, between the components everyone was watching.

The fix was not a faster model. It was an adaptive jitter buffer that shrinks when the network is calm, plus measuring latency end-to-end (caller-stops → bot-starts) instead of per-component. The lesson, again: the media layer is part of the turn budget and it is invisible if you only watch component metrics. This is the same blind spot as the cold-transfer baton in chapter 01 — the failure lives between the boxes everyone instruments.
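One way to make the buffer adaptive is to size it from a running jitter estimate. The sketch below uses the smoothed interarrival-jitter estimator from RFC 3550 (each new deviation moves the estimate by one sixteenth). The policy that maps jitter to a frame count, including the safety factor of 2 and the 1 to 5 frame clamp, is an assumption for illustration and not the fix the incident team necessarily shipped.

```python
import math


def interarrival_jitter_ms(arrival_gaps_ms, frame_ms=20.0):
    """RFC 3550-style running jitter estimate from packet arrival gaps."""
    jitter = 0.0
    for gap in arrival_gaps_ms:
        deviation = abs(gap - frame_ms)       # how far off the 20 ms cadence
        jitter += (deviation - jitter) / 16.0
    return jitter


def target_depth_frames(jitter_ms, frame_ms=20.0,
                        min_frames=1, max_frames=5, safety=2.0):
    """Hypothetical policy: buffer about `safety` times the measured jitter."""
    needed = math.ceil(safety * jitter_ms / frame_ms)
    return max(min_frames, min(max_frames, needed))


calm_gaps = [20.0] * 200                      # steady carrier
rough_gaps = [5.0, 35.0] * 100                # frames arriving in bursts
awful_gaps = [120.0] * 200                    # badly delayed link

calm_depth = target_depth_frames(interarrival_jitter_ms(calm_gaps))
rough_depth = target_depth_frames(interarrival_jitter_ms(rough_gaps))
awful_depth = target_depth_frames(interarrival_jitter_ms(awful_gaps))
print(calm_depth, rough_depth, awful_depth)
```

A calm network shrinks the buffer to a single frame (20 ms), while the fixed five-frame setting from the incident only appears when the link is genuinely bad.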


7) Cost and latency movement: where the milliseconds and pennies go in the media layer

Per-minute media costs and one-way latency contributions (illustrative; varies by region and contract):

Media path                             | One-way latency added | Cost driver                | What it buys
---------------------------------------+-----------------------+----------------------------+--------------------------------------
Twilio Media Streams (WebSocket fork)  | ~50–120 ms transport  | per-minute voice + streams | full control, easy fork
Amazon Connect + KVS                   | ~80–150 ms            | per-minute + KVS           | AWS-native, deep Connect integration
SIPREC fork (Genesys/recorders)        | ~40–100 ms            | trunk + recorder           | carrier-grade, standards-based
WebRTC / LiveKit (browser leg)         | ~20–60 ms             | media server               | lowest latency, wideband Opus
Jitter buffer                          | +20–100 ms (tunable)  | none                       | smooths loss/reorder, steals budget

The pressure evolution: forking relieves the batch-delay pressure (you hear audio during the turn) but creates a new pressure — a live media bridge you must operate, scale, and keep low-jitter — absorbed by your media/infra team. And the jitter buffer is a direct latency-vs-quality knob: every 20 ms you add for smoothness is 20 ms you spend from the turn budget chapter 04 has to fit ASR, LLM, and TTS into.


8) Signals that the media layer is the problem

Healthy: stable per-turn round-trip latency across carriers, near-zero perceived audio glitches, barge-in stop within ~200 ms of caller speech.

First metric to degrade: per-turn end-to-end latency variance across carrier/region segments. Media problems show up as variance — some callers fine, some slow — long before any component's average moves.

Misleading metric people watch: per-component ASR/LLM/TTS latency dashboards. They can all be green while the media bridge silently adds 200 ms of jitter-buffer delay between them.

First graph an expert opens: end-to-end turn latency (caller-stops → bot-starts) segmented by carrier and codec, plus packet-loss and jitter rate. A latency cliff on one carrier with elevated jitter is the media layer confessing.
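That first graph is a group-by over per-turn timestamps. A minimal sketch, assuming each turn record carries a carrier, a codec, and the two timestamps named above; the field names and the sample data are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    return ordered[max(0, math.ceil(0.95 * len(ordered)) - 1)]


def latency_by_segment(turns):
    """Group end-to-end turn latency (caller-stops -> bot-starts) by carrier and codec."""
    groups = defaultdict(list)
    for t in turns:
        groups[(t["carrier"], t["codec"])].append(
            t["bot_start_ms"] - t["caller_stop_ms"])
    return {key: {"n": len(v), "p50": median(v), "p95": p95(v)}
            for key, v in groups.items()}


# Hypothetical turn log: carrier-b shows the extra pause, carrier-a does not.
turns = [
    {"carrier": "carrier-a", "codec": "PCMU", "caller_stop_ms": 0, "bot_start_ms": 700},
    {"carrier": "carrier-a", "codec": "PCMU", "caller_stop_ms": 0, "bot_start_ms": 720},
    {"carrier": "carrier-a", "codec": "PCMU", "caller_stop_ms": 0, "bot_start_ms": 710},
    {"carrier": "carrier-b", "codec": "PCMU", "caller_stop_ms": 0, "bot_start_ms": 700},
    {"carrier": "carrier-b", "codec": "PCMU", "caller_stop_ms": 0, "bot_start_ms": 950},
    {"carrier": "carrier-b", "codec": "PCMU", "caller_stop_ms": 0, "bot_start_ms": 980},
]
report = latency_by_segment(turns)
for segment, stats in sorted(report.items()):
    print(segment, stats)
```

Averaged together these six turns look unremarkable; split by carrier, one segment's median and tail stand out immediately.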


9) Boundary: where forking shines, where it breaks down

Forking + your own bridge shines on high-volume voice lines where you need a specific ASR/TTS and a tight turn budget — exactly the billing bot. It also shines when you need barge-in tuned for your callers' speech patterns.

It becomes pathological when network conditions are poor and uncontrolled (international PSTN legs with high loss, satellite links, bad VoIP), where no jitter buffer setting wins and ASR accuracy collapses on the lossy band-limited audio. The scale limit: a single media bridge process handles a bounded number of concurrent 20 ms streams, because each call costs CPU for decode, resample, and echo cancellation. At thousands of concurrent calls you must horizontally scale bridges and pin each call's media to one bridge; otherwise one call's frames scatter across processes, arrive further out of order, and lose their per-call state.
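Pinning a call to one bridge needs a routing function that always gives the same answer for the same call. One option, sketched here as an assumption and not as what any particular platform does, is rendezvous (highest-random-weight) hashing on the call ID; the bridge names and call ID are made up.

```python
import hashlib


def pick_bridge(call_id: str, bridges):
    """Rendezvous hashing: every router picks the same bridge for a given call."""
    def score(bridge: str) -> int:
        digest = hashlib.sha256(f"{bridge}|{call_id}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")
    return max(bridges, key=score)


bridges = ["bridge-a", "bridge-b", "bridge-c"]
chosen = pick_bridge("CA-example-call", bridges)

# Removing a bridge the call was NOT on must not move the call.
others = [b for b in bridges if b != chosen]
survivors = [chosen, others[1]]
still = pick_bridge("CA-example-call", survivors)
print(chosen, still)
```

Because the choice depends only on the call ID and the set of bridges, every frame of a call lands on the same instance, and scaling the pool only moves calls whose own bridge went away. Calls already in progress are usually also recorded in a session table so that a pool change never re-routes live media.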


10) Wrong assumption: "audio quality only affects how nice it sounds"

The seductive idea: phone audio is a cosmetic concern — it sounds a bit tinny but the words are there. Wrong. The 8 kHz μ-law band and packet loss directly degrade ASR accuracy on exactly the highest-stakes tokens — digits, names, account numbers — because those carry little redundancy and a single lost frame changes them. Audio quality is a correctness input, not an aesthetic one.

Replace it with: phone audio is a lossy, band-limited channel, so the system must verify high-stakes tokens (digits, IDs) rather than trust one ASR pass. This is why chapter 07's authentication confirms the account number back to the caller instead of trusting a single transcription — the media layer guarantees it sometimes hears wrong.
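A read-back check of that kind can be sketched as follows. The function names, the prompt wording, and the `confirm` callback (which stands in for a full ask-and-listen dialog turn) are hypothetical; the point is only that a transcript with the wrong digit count is rejected and a correct-looking one is still confirmed with the caller.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def spoken_readback(digits: str) -> str:
    """Spell the digits out one at a time for TTS."""
    return " ".join(DIGIT_WORDS[int(d)] for d in digits)


def verify_digits(first_pass: str, expected_len: int, confirm):
    """Return confirmed digits, or None if the bot must ask again.

    confirm: callable that plays the prompt and returns True on a caller "yes".
    """
    digits = re.sub(r"\D", "", first_pass)
    if len(digits) != expected_len:
        return None                 # a dropped or doubled digit: re-ask, don't guess
    prompt = f"I heard {spoken_readback(digits)}. Is that right?"
    return digits if confirm(prompt) else None


ok = verify_digits("4 4 1 7", 4, confirm=lambda prompt: True)
short = verify_digits("417", 4, confirm=lambda prompt: True)
denied = verify_digits("4 4 1 7", 4, confirm=lambda prompt: False)
print(ok, short, denied)
```

The length check catches the 4417-to-417 case from a lost frame; the read-back catches substitutions that keep the length intact.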


11) Other ways the media layer bites

  • TTS not stopping on barge-in — playback flush is missed, so the bot keeps talking 1–2 seconds after the caller interrupts.
  • Self-echo barge-in — no echo cancellation; the bot's own voice trips its barge-in detector and it interrupts itself.
  • Codec mismatch — ASR fed μ-law as if it were linear PCM, or at the wrong sample rate; transcription is garbage until decoding and resampling are fixed.
  • DTMF in the audio band — keypad tones ride in-band; if the bridge doesn't extract them as events, "press 1" is heard as noise (and a card-DTMF leak risk, chapter 08).
  • One-way audio on transfer — leg swap done wrong; caller can hear the human but not vice versa.
  • Recording the wrong mix — single-channel recording can't separate caller from bot, breaking diarization (chapter 03) and QA (chapter 06).
  • Media bridge crash mid-call — call drops or goes silent; no graceful re-bridge.
  • Clock drift — long calls slowly desync send/receive timing, accumulating latency or glitches.
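On the DTMF item: when the carrier negotiates RTP telephone-events (RFC 4733), keypad presses arrive as small event payloads instead of audible tones, and the bridge can lift them out before audio reaches ASR or the recorder. The sketch below parses one such 4-byte payload; truly in-band tones would need a tone detector, which is not shown. The function name and the returned dictionary keys are illustrative.

```python
import struct

DTMF_EVENTS = "0123456789*#ABCD"     # event codes 0-15 in RFC 4733


def parse_telephone_event(payload: bytes, clock_hz: int = 8000):
    """Parse a telephone-event payload: event, E/R/volume byte, 16-bit duration."""
    if len(payload) < 4:
        raise ValueError("telephone-event payload must be at least 4 bytes")
    event, flags, duration = struct.unpack("!BBH", payload[:4])
    if event >= len(DTMF_EVENTS):
        return None                  # not a DTMF digit (other named event)
    return {
        "digit": DTMF_EVENTS[event],
        "end": bool(flags & 0x80),           # E bit: key released
        "volume_dbm0": -(flags & 0x3F),      # 6-bit level, sign dropped on the wire
        "duration_ms": duration * 1000 // clock_hz,
    }


# Key "1", end-of-event set, -10 dBm0, 800 timestamp units at 8 kHz.
event = parse_telephone_event(bytes([1, 0x80 | 10, 0x03, 0x20]))
print(event)
```

Treating these as events rather than audio also lets the bridge keep card digits out of the forked stream and the recording.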

12) Pattern transfer

  • Stream vs batch — the same locality/latency tradeoff as log-structured streaming vs nightly batch jobs: streaming pays per-event overhead to slash end-to-end latency. Forking is "process the event now"; recording is "process the batch later." Same pressure, different layer.
  • The jitter buffer is a backpressure/smoothing buffer — structurally identical to a network smoothing buffer or a Kafka consumer's prefetch: it trades latency for tolerance of arrival-rate variance. Bigger buffer, smoother, slower — the universal buffering tradeoff.
  • Barge-in is preemption — same shape as an OS preempting a running task for a higher-priority one: the caller's speech preempts the bot's playback, and the bot must save/discard its in-flight work (queued TTS) cleanly. Botched preemption leaves the bot talking over itself, like a missed context switch.

13) Design test

  1. Does your bot hear audio during the turn (fork) or only after it (batch)? If batch, the turn budget is already blown.
  2. On barge-in, do you flush queued TTS within ~200 ms, or does the bot keep talking?
  3. Do you measure end-to-end turn latency segmented by carrier, or only per-component averages?
  4. Are high-stakes tokens (digits, account numbers) verified, given the channel can drop frames?
  5. Does your media bridge scale horizontally with each call pinned to one bridge instance?

Where this appears in production

  • Twilio Media Streams — forks raw call audio over WebSocket so your AI hears Programmable Voice calls live (SIPREC forking is a separate Twilio connector).
  • Amazon Connect + Kinesis Video Streams — streams customer audio to AWS for live ASR and analytics.
  • Genesys AudioConnect / Audiohook — forks live media from Architect flows to an external bot.
  • SIPREC — the standards-based session-recording fork used by carrier-grade recorders and many bridges.
  • LiveKit Agents — WebRTC media transport + agent runtime; lowest-latency wideband path for app/browser legs.
  • Pipecat — open-source pipeline that wires media frames through VAD, ASR, LLM, TTS with barge-in handling.
  • Daily — WebRTC infrastructure used under voice-agent stacks for media transport.
  • Asterisk / FreeSWITCH — open-source telephony engines where the bridge, codecs, and DTMF events are managed.
  • Vapi / Bland — managed voice-agent platforms that own the media bridge and barge-in so you don't build it.
  • Deepgram — phone-tuned streaming models that accept μ-law to survive the 8 kHz band-limited channel.
  • Krisp — noise and echo suppression on the media path, reducing self-echo barge-in and background noise.
  • Cartesia Sonic — streaming TTS with ~75–90 ms time-to-first-audio so injected speech starts fast on the bridge.
  • Webex Contact Center media — enterprise media handling with bot insertion points.
  • Jambonz — open-source CPaaS that exposes media forking and call control for custom voice bots.

Recall

  1. Why does delivering audio as a recorded file destroy a live conversation?
  2. What is the typical packet cadence of RTP telephony audio, and why does a lost packet matter for an account number?
  3. What does the jitter buffer trade, and how does it steal from the turn budget?
  4. Why is barge-in a media-layer problem before it's a dialog problem?
  5. What is self-echo barge-in and what fixes it?
  6. Why does a transfer keep the caller leg up and swap only the other leg?
  7. Why must high-stakes tokens be verified rather than trusted from one ASR pass?

Interview Q&A

Q1. Your voice bot feels slow on some carriers but every component dashboard is green. Where do you look?
The media layer between the components, specifically the jitter buffer and transport. High-jitter carriers keep the buffer full, adding ~100 ms each way that no component metric shows. Measure end-to-end turn latency (caller-stops → bot-starts) segmented by carrier, and consider an adaptive jitter buffer.
Common wrong answer to avoid: "swap to a faster LLM". The LLM was never the bottleneck; the latency is in media you weren't measuring.

Q2. Why not just use the CCaaS platform's built-in bot node instead of forking media?
Because the built-in node locks you to the platform's ASR/TTS, barge-in tuning, and latency. If you need a specific phone-tuned ASR and to own a sub-800 ms turn budget, you fork raw media to your own bridge and run your own pipeline. If you just need simple FAQ deflection, the node is fine. The deciding question is whether you must own the turn budget.
Common wrong answer to avoid: "always fork, it's more flexible". For a low-stakes FAQ deflection, forking is wasted engineering and operational burden.

Q3. The bot keeps talking for a full second after the caller interrupts. What's broken?
The barge-in path isn't flushing queued TTS. Detecting caller energy isn't enough: you must immediately stop playback and discard the unspoken TTS frames already queued to the media bridge. If you only stop generating new audio, the buffered tail keeps playing. Also check for missing echo cancellation causing false or delayed triggers.
Common wrong answer to avoid: "lower the barge-in threshold". The detection may be firing fine; the bug is the un-flushed playback queue.

Q4. Account-number verification fails intermittently for no clear reason. Could the media layer cause this?
Yes. Phone audio is 8 kHz μ-law over lossy UDP; a single dropped 20 ms frame can turn 4417 into 417. Digits carry little redundancy, so packet loss hits them hardest. The fix isn't only a better ASR; it's verifying high-stakes tokens (read the number back, confirm) rather than trusting one pass, plus a phone-tuned ASR robust to the band.
Common wrong answer to avoid: "the ASR model is bad, switch vendors". Even a perfect model can't recover audio that never arrived; design for verification.

Q5. How does a warm transfer work at the media layer, and what must happen at that instant?
A call is a two-leg bridge. Transfer swaps the bot leg for a human leg while keeping the caller leg up, so the caller never drops. At that instant the context baton (identity, transcript, work done) must attach to the call so the human's screen pops it (chapter 07). Media-wise you keep audio continuity; data-wise you propagate context.
Common wrong answer to avoid: "end the call and have the human call back". That drops the caller and guarantees a cold restart.

Q6. At 5,000 concurrent calls your media bridge starts glitching. What's the scaling fix?
Each call's 20 ms frames must be processed in order by one bridge instance (decode/resample/echo are stateful per call). Horizontally scale bridge instances and pin each call's media to a single instance via consistent routing. If frames for one call scatter across processes they reorder and glitch. CPU per call (decode + echo cancellation) sets the per-instance ceiling.
Common wrong answer to avoid: "add a load balancer that round-robins frames". Round-robining a single call's frames across instances destroys ordering and per-call state.

Q7. (Cumulative) Is a digit that gets misheard a chapter 2 media problem or a chapter 3 ASR problem?
It can be either, and distinguishing them is the skill. If packets were lost/jittered (media, ch 02), even a perfect ASR fails; check loss/jitter on that call's segment. If audio was clean but the model still misheard (ASR, ch 03), it's a model/endpointing issue. The diagnostic is to inspect the media-layer loss/jitter for that call first; clean audio points to ASR, lossy audio points to media.
Common wrong answer to avoid: "always an ASR problem". That ignores that lossy audio defeats any model and sends you tuning the wrong layer.


Design/debug exercise (10 min)

Step 1 — Modeled example. Walk the barge-in sequence for "no no, I just want to pay it" (section 3, Attempt B): caller energy crosses threshold → stop TTS playback → flush queued TTS frames → start new ASR turn → handle echo. For each step, write the one failure that occurs if that step is skipped.

Step 2 — Your turn. The billing bot must collect the account number "4-4-1-7" over a lossy mobile leg. Design the verification: how does the bot guard against a dropped-frame digit error? Specify what it reads back, what it asks, and how it ties into the chapter-07 CRM lookup. Note where you'd measure packet loss for that call.

Step 3 — Reproduce from memory. Redraw the two-leg call bridge with the AI tap-and-inject (section 2) cold, then mark on it: where barge-in detection sits, where transfer swaps a leg, and where the recorder taps (the chapter-08 PCI-leak seam). Connect it back to chapter 01: which box on the stack does the bridge sit inside?


Operational memory

This chapter explained why a bot with a perfect ASR and LLM can still feel broken: the audio path adds latency before the model hears anything, drops frames that mangle digits, and can let the bot talk over a caller who's interrupting. The important idea is that audio must be forked and streamed live during the turn, not batched after it, and that barge-in and jitter are media-layer problems that live between the components everyone instruments.

You learned to read a call as a two-leg bridge with the AI tapped in to listen and inject, to spend the first ~150–300 ms of turn budget knowingly on transport and jitter, and to handle barge-in by flushing queued TTS — not just stopping generation. That solves the opening dead-air and talk-over failures because both were media problems masquerading as model problems.

Carry this diagnostic forward: when latency varies by caller, measure end-to-end turn latency segmented by carrier before touching the model. When a digit is misheard, check packet loss for that call's segment before blaming the ASR. The failure usually lives between the green dashboards.

Remember:

  • Live audio must be forked and streamed during the turn; recording is for analytics, not conversation.
  • The media path costs ~150–300 ms round trip before any model thinks — that's turn budget you spend on purpose or by accident.
  • Barge-in = stop playback and flush queued TTS within ~200 ms; handle self-echo or the bot interrupts itself.
  • Phone audio is lossy and band-limited, so verify high-stakes tokens (digits, IDs) instead of trusting one ASR pass.
  • A transfer keeps the caller leg up and swaps the other leg; the context baton must move at that instant.

Bridge. We can now get clean-as-possible audio frames to the model and inject speech back without stepping on the caller. But raw frames are not words — the bot needs to know what was said and, harder, when the caller is actually done talking so it can take its turn. Turning a lossy stream into reliable transcripts, and detecting the end of a turn without cutting the caller off, is the next seam. → 03-realtime-asr-and-endpointing.md