04. Spending a sub-second turn across think, fetch, and speak — without the call going dead
~18 min read. The bot has the words and knows the caller is done. Now it has roughly 800 milliseconds — minus what ASR and the media path already spent — to decide what to do, fetch what it needs, generate a reply, and start speaking. Run those steps in a straight line and the caller hears a second of silence and hangs up.
Built on 03-realtime-asr-and-endpointing.md. This is where the turn budget is actually spent. It inherits endpointing from chapter 03 and the media tax from chapter 02, and it decides when to give up and trigger the human fallback, a warm or cold transfer. Get this wrong and every other layer's work is wasted on a call that feels dead.
Note: the voice-agents module covers LLM serving, KV cache, and decoding internals. This chapter assumes that and focuses on the contact-center seam: orchestrating NLU/dialog/LLM/tools under a hard wall-clock turn budget, streaming to hide latency, and deciding when to fall back to a human.
What perception gave us, and the clock it left running
Chapter 03 delivered a stable final transcript and a confident "the caller is done." But "done" started a clock. From the instant the caller stops talking, a human expects a reply in roughly 300–500 ms to feel natural, and tolerates up to ~800 ms before the conversation feels laggy. Past a second of silence, callers assume the line dropped, repeat themselves, or hang up.
Inside that window the bot must do real work: figure out what the caller wants (intent / NLU), decide what to do next (dialog policy), often call a tool or API (CRM lookup, balance fetch), generate language (the LLM), and turn it into audio (TTS). Five steps. And chapters 02 and 03 already spent ~300–500 ms of the budget on transport, jitter, and endpointing before this chapter gets the floor. The orchestration job is to fit the rest into what's left.
By the end you can lay out the end-to-end latency budget in concrete milliseconds, name which steps can overlap, see how streaming the LLM into TTS hides most of the latency, and know when the right move is to stop trying and hand the caller to a human with the baton.
What this file solves
A bot with great ASR, a great LLM, and great TTS can still feel dead because the steps run serially and the budget is gone before the first word comes out. This file shows the turn-budget math (what each component costs, what's already spent), how to overlap and stream so the caller hears speech ~500 ms after they stop instead of ~2 s, and when to abandon the turn and fall back to a human — so the bot feels alive and knows its own limits.
Why the obvious pipeline blows the budget
The natural way to build the turn: wait for the final transcript, classify the intent, run the dialog policy, call the CRM, send everything to the LLM, wait for the full response, send that text to TTS, wait for the full audio, play it. Each step finishes before the next begins. It's correct and easy to reason about.
Add up one turn of "what's my balance" on the billing line, serially:
media/endpoint already spent    ~400 ms    (ch 02 + ch 03)
NLU / intent classify           ~80 ms
CRM balance lookup              ~250 ms
LLM full response               ~900 ms    (TTFT 350 ms + decode to completion; generate all ~40 tokens)
TTS full audio synthesis        ~400 ms    (synthesize whole sentence)
────────────────────────────────────────
caller hears first word at      ~2030 ms   ← dead. they've repeated themselves.
Two seconds of silence. The caller said "what's my balance," waited, heard nothing, and said "hello? are you there?" The bot is now processing a turn that the caller already abandoned.
So the real problem is not "the LLM is too slow" and not "TTS is too slow." It is that the steps run in series when most of them could overlap, and the bot waits for complete outputs when it only needs the first chunk to start speaking. How can the bot start talking before it has finished thinking?
That question is the whole orchestration discipline. The LLM streams tokens; the moment the first sentence is ready, send it to a streaming TTS, which emits its first audio in ~75–95 ms (Cartesia Sonic, ElevenLabs Flash). Meanwhile the CRM lookup ran in parallel during ASR, not after it. The caller hears speech while the bot is still generating the rest.
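A minimal sketch of the "speak on the first sentence" idea, assuming a token stream and a TTS call that are both hypothetical stand-ins (`fake_llm_tokens` and `fake_tts` are not any vendor's API). It only shows the chunking step: tokens are buffered until a sentence closes, and that sentence is handed to synthesis while later tokens are still arriving.

```python
import asyncio
import re

# A sentence is considered closed when the buffer ends in ., ! or ?
SENTENCE_END = re.compile(r"[.!?]\s*$")


async def fake_llm_tokens():
    """Hypothetical stand-in for a streaming LLM client."""
    for token in ["Your", " balance", " is", " fifty-nine", " dollars.",
                  " It", " is", " due", " on", " the", " fifteenth."]:
        await asyncio.sleep(0)  # yield control, as a network stream would
        yield token


async def sentences(tokens):
    """Group streamed tokens into sentences as soon as each one closes."""
    buffer = ""
    async for token in tokens:
        buffer += token
        if SENTENCE_END.search(buffer):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush a trailing fragment


async def fake_tts(text):
    """Hypothetical TTS call; returns a label instead of audio bytes."""
    return f"<audio:{text}>"


async def speak(tokens, synthesize):
    """Hand each sentence to TTS the moment it is complete."""
    spoken = []
    async for sentence in sentences(tokens):
        spoken.append(await synthesize(sentence))
    return spoken


result = asyncio.run(speak(fake_llm_tokens(), fake_tts))
print(result[0])
```

The punctuation-based boundary is deliberately naive: a production chunker would need to avoid splitting on decimals and abbreviations, and would likely synthesize and play concurrently rather than awaiting each sentence in turn.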
Rule: start speaking on the first chunk, and overlap everything that can overlap
The load-bearing rule of orchestration: never wait for a complete output when a partial one lets the next stage start — stream the LLM into the TTS, run tool calls in parallel with perception, and measure latency end-to-end (caller-stops → first-audio-byte), not per component. The felt latency is time-to-first-audio, not time-to-complete-answer.
Why this rule exists. Conversation latency is bounded by when the caller hears the first word, not by when the bot finishes the sentence. The constraint is the ~800 ms wall clock, and serial composition sums every component's latency. Streaming and overlap turn that sum into a max-plus composition: stages that run in parallel cost only as much as the slowest of them, and each stage on the critical path contributes the time to its first useful chunk rather than its time to completion. That's the only way the arithmetic fits under a second.
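The sum-versus-max-plus distinction can be sketched with the chapter's illustrative numbers. This is a toy model, not a measurement: time 0 is the moment the final transcript lands, the stage names and the `finish_times` helper are invented for illustration, and the 250 ms head start for the CRM lookup is an assumption standing in for "fired on an interim".

```python
# Each stage: (duration_ms, dependencies, earliest_start_ms).
# Time 0 is the moment the final transcript lands.
SERIAL = {
    "nlu": (80, [], 0),
    "crm": (250, ["nlu"], 0),
    "llm_full": (900, ["crm"], 0),
    "tts_full": (400, ["llm_full"], 0),
}

OVERLAPPED = {
    # Assumption: the lookup fired on an interim 250 ms before the final.
    "crm": (250, [], -250),
    # NLU is fused into the LLM call; only time-to-first-token is on the path.
    "llm_first_token": (300, ["crm"], 0),
    "tts_first_byte": (90, ["llm_first_token"], 0),
}


def finish_times(stages):
    """Max-plus composition: a stage starts at the max of its dependencies'
    finish times (and its own earliest start), then adds its duration."""
    done = {}

    def finish(name):
        if name not in done:
            duration, deps, earliest = stages[name]
            start = max([earliest] + [finish(dep) for dep in deps])
            done[name] = start + duration
        return done[name]

    for name in stages:
        finish(name)
    return done


serial_first_audio = finish_times(SERIAL)["tts_full"]
overlapped_first_audio = finish_times(OVERLAPPED)["tts_first_byte"]
print(serial_first_audio, overlapped_first_audio)
```

Both figures exclude the media and endpointing time spent before the final transcript. Adding the ~400 ms already spent to the serial result gives the ~2030 ms in the serial tally.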
1) The latency budget, laid out in milliseconds
Here is the same "what's my balance" turn, orchestrated — overlapped and streamed.
THE TURN BUDGET (overlapped + streamed)
caller stops
│
├─ media transport in ~80 ms ┐
├─ endpointing confirm ~250 ms ┘ (ch 02+03, partly overlaps audio)
│
│ ── CRM lookup fires on interim, runs in PARALLEL ──┐ ~250 ms
│ │ (hidden under ASR)
├─ NLU/intent (often fused into the LLM call) │
│ │
├─ LLM time-to-first-token (TTFT) ~300–350 ms ◀─────┘ uses lookup result
│ └─ first sentence ready
│ └─ stream to TTS ─ first audio byte ~90 ms
▼
caller hears first word at ~520–600 ms ← natural. feels alive.
... LLM keeps generating + TTS keeps synthesizing WHILE caller listens ...
The arithmetic that matters: serial was ~2030 ms; overlapped+streamed is ~520–600 ms. The CRM lookup vanished (it ran under ASR). The LLM's full generation and the rest of TTS happen while the caller is already hearing the first words. The headline number a senior engineer tracks is time-to-first-audio after end-of-turn, and the target is under ~800 ms, ideally ~500.
This matches the field: components are individually fast (Deepgram STT ~150 ms, ElevenLabs/Cartesia TTS 75–95 ms first byte, LLM TTFT ~300 ms), yet most agents land at 800 ms–2 s because stack latency compounds when stages run serially. The win is structural, not a faster GPU.
Teacher voice. The question that separates a working voice agent from a demo is not "what's your model's latency" — it's "what's your time-to-first-audio after end-of-turn, and which stages overlap?" Every serial dependency you can break is hundreds of milliseconds back in the caller's pocket.
2) Picture: the turn as a budget being drawn down
The mental model that keeps the budget honest: a turn is a fixed pot of ~800 ms, and every stage draws from the same pot. The only ways to win are to spend less per stage or to make stages spend concurrently from the pot instead of one after another.
THE 800ms POT — serial drains it; overlap shares it
SERIAL (drains):    [media][endpoint][NLU][CRM][LLM full][TTS full]   → 2000ms ✗
                    ──────────────── one after another ────────────
OVERLAP (shares):   [media+endpoint]
                    [CRM lookup .....]   (parallel)
                                      [LLM TTFT...][stream→TTS first byte]   → 550ms ✓
                    ▲ stages run concurrently, first-audio wins
Spend less:           faster TTFT, faster TTS first byte, phone-tuned ASR
Spend concurrently:   fire tools on interims, stream LLM→TTS
Spend nothing:        cache/precompute (greeting, common answers)
Three levers, in order of leverage: overlap (free, structural), spend-less (faster components), spend-nothing (precompute the greeting, cache common answers). The billing bot's greeting — "Thanks for calling, this is the billing assistant" — should be pre-synthesized audio, not generated live; it costs zero budget and the budget should be saved for the parts that vary.
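The "spend nothing" lever can be sketched as a small phrase cache. The `PhraseCache` class and the `fake_tts` synthesizer are hypothetical; a real system would store audio produced by its TTS vendor ahead of time, for example at deploy.

```python
class PhraseCache:
    """Cache synthesized audio for fixed phrases so they cost no turn budget."""

    def __init__(self, synthesize):
        self._synthesize = synthesize  # hypothetical TTS callable: str -> bytes
        self._audio = {}
        self.live_calls = 0  # how many phrases had to be synthesized live

    def warm(self, phrases):
        """Synthesize fixed phrases ahead of time, outside any live turn."""
        for phrase in phrases:
            self._audio[phrase] = self._synthesize(phrase)

    def get(self, phrase):
        """Return cached audio, or synthesize live and count the miss."""
        if phrase not in self._audio:
            self.live_calls += 1
            self._audio[phrase] = self._synthesize(phrase)
        return self._audio[phrase]


def fake_tts(text):
    """Hypothetical synthesizer; a real one would return audio bytes."""
    return f"<audio:{text}>".encode()


GREETING = "Thanks for calling, this is the billing assistant"
cache = PhraseCache(fake_tts)
cache.warm([GREETING])

greeting_audio = cache.get(GREETING)  # served from the cache
balance_audio = cache.get("Your balance is fifty-nine dollars")  # live
print(cache.live_calls)
```

Only the variable reply counts as a live synthesis; the greeting never touches the turn budget.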
3) The running example: answering the balance question under budget
Thread the billing call. The caller, now authenticated (chapter 07), asks "what's my current balance and when is it due?"
Attempt A — serial orchestration
NLU classifies intent (~80 ms). Then the dialog policy decides to fetch balance. Then it calls the CRM (~250 ms). Then it builds a prompt and calls the LLM, waiting for the complete answer (~900 ms). Then it sends the full text to TTS and waits for the complete audio (~400 ms). First audio at ~2 s after the question. The caller said "hello?" at 1.2 s. The bot's eventual answer arrives into a caller who's already confused. Handle time inflates; the call feels broken.
Attempt B — overlapped, streamed orchestration
The moment the interim transcript shows "balance," the orchestrator speculatively fires the CRM balance lookup — it runs during endpointing, hidden under the ~250 ms the turn clock spends confirming end-of-turn. By the time the final transcript and endpoint land, the balance ($59, due the 15th) is already in hand. The LLM call includes the fetched balance; its first token arrives at ~300 ms; the first sentence ("Your balance is fifty-nine dollars") streams straight into TTS, whose first audio byte comes ~90 ms later. The caller hears "Your balance is..." at ~550 ms — natural. The LLM finishes "...due on the fifteenth" while the caller is still hearing the first half.
The hard part hiding here: speculative tool calls can be wrong. If the interim said "balance" but the caller actually asked "cancel my plan," the bot fired a balance lookup it didn't need. That's usually fine — a wasted read is cheap and the result is just discarded (same squash-on-misprediction shape as interim transcripts in chapter 03). But a speculative write (charging a card, changing a plan) must never fire on an interim. Reads can be speculative; writes wait for the confirmed final.
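One way to encode "reads can be speculative; writes wait for the confirmed final" is to tag each tool as a read or a write and check the tag before acting on an interim. The sketch below is synchronous for brevity, and every name in it (`Tool`, `TurnOrchestrator`, the intent keys, `make_tool`) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    name: str
    is_write: bool
    run: Callable[[], str]


@dataclass
class TurnOrchestrator:
    tools: Dict[str, Tool]  # keyed by intent (hypothetical mapping)
    speculative: Dict[str, str] = field(default_factory=dict)

    def on_interim(self, intent: str) -> None:
        """Fire the tool for a guessed intent only if it is a read."""
        tool = self.tools.get(intent)
        if tool is None or tool.is_write:
            return  # writes never fire on an unconfirmed transcript
        if intent not in self.speculative:
            self.speculative[intent] = tool.run()

    def on_final(self, intent: str) -> str:
        """Use a matching speculative result; discard mispredicted reads."""
        result = self.speculative.pop(intent, None)
        self.speculative.clear()  # squash anything that did not match
        if result is None:
            result = self.tools[intent].run()
        return result


calls: List[str] = []


def make_tool(name: str, is_write: bool, value: str) -> Tool:
    def run() -> str:
        calls.append(name)  # record every real tool execution
        return value
    return Tool(name, is_write, run)


orch = TurnOrchestrator({
    "balance": make_tool("get_balance", False, "$59, due the 15th"),
    "pay": make_tool("charge_card", True, "charged"),
})

# Turn 1: the interim guess is right, so the read is already in hand.
orch.on_interim("pay")      # ignored: a write cannot be speculative
orch.on_interim("balance")  # read fires early
answer = orch.on_final("balance")

# Turn 2: the interim guess is wrong; the read is discarded.
orch.on_interim("balance")
outcome = orch.on_final("pay")  # the write runs only on the final
print(calls)
```

In a real orchestrator the speculative read would run as a background task, and a write would typically also wait for an explicit caller confirmation before executing.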
4) Why a cascading STT→LLM→TTS pipeline instead of a single speech-to-speech model — choosing under a contact-center workload
A tempting alternative in 2026 is a single end-to-end speech-to-speech model (audio in, audio out, no separate ASR/LLM/TTS). It can be lower latency and more natural, with no transcription step in the middle.
- Speech-to-speech (one model) — lowest latency, most natural prosody, handles interruptions fluidly. But you lose the intermediate transcript — and a contact center needs that transcript for CRM logging (chapter 07), QA and compliance (chapter 06), PII redaction (chapter 08), and for confirming entities (chapter 03). You also lose fine control over tool-calling and the ability to swap a phone-tuned ASR. It's a black box where you need auditability.
- Cascading STT→LLM→TTS — higher latency to manage, but you get the transcript at every stage, full control of tool calls and entity confirmation, swappable components, and a clear audit trail. The latency gap closes with overlap and streaming.
For a regulated, audited billing line where every call must produce a transcript, a disposition, and a redactable record, cascading wins despite the extra latency work. The deciding question: does the business need the intermediate transcript for logging, compliance, and tool control? In a contact center, yes — almost always.
5) The property that changes the design: not every turn deserves the LLM
The dimension people miss is that turns are not uniform. Some are pure transactions ("what's my balance") that need a tool call and a templated reply — no reasoning. Some are genuinely ambiguous ("I'm confused about this charge and I'm not sure I even want the service anymore") that need the LLM's judgment. Routing every turn through a full LLM call wastes budget and money on the easy ones and risks the LLM improvising on a transaction where determinism matters.
| Turn type | Path | Budget | Risk |
|---|---|---|---|
| "what's my balance" | intent → tool → template | ~500 ms | low (deterministic) |
| "pay $59" | confirm → payment tool | ~400 ms | low (but write-gated) |
| "I'm confused..." | LLM reasoning + tools | ~700 ms | medium (needs judgment) |
| "this is fraud!" | classify → escalate to human | fast | high → fall back |
The design move: a fast intent classifier in front of the LLM that routes deterministic turns to templated tool-and-respond paths and reserves the LLM for the turns that need reasoning. This cuts cost and latency and keeps the LLM out of the deterministic transaction path — the same "keep probabilistic components out of the deterministic decision path" pressure as chapter 01's defer-to-ACD rule, now inside a single turn.
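A toy version of that front-door router, assuming keyword matching as a stand-in for a real intent classifier. The path names and keyword lists are invented for illustration; the point is the shape: ordered rules, escalation checked first, and the LLM as the default for anything the rules do not recognize.

```python
# Ordered rules: the first match wins. Escalation is checked before anything
# else, and ambiguity markers are checked before transactional keywords.
RULES = [
    ("escalate_human", ("fraud", "stolen", "unauthorized")),
    ("llm", ("confused", "not sure", "dispute")),
    ("payment_tool", ("pay",)),
    ("balance_template", ("balance", "due")),
]


def route_turn(transcript: str) -> str:
    """Pick a handling path for one caller turn."""
    text = transcript.lower()
    for path, keywords in RULES:
        if any(keyword in text for keyword in keywords):
            return path
    return "llm"  # unknown turns get the model's judgment


for utterance in ["what's my balance", "pay $59", "this is fraud!"]:
    print(utterance, "->", route_turn(utterance))
```

Defaulting unknown turns to the LLM keeps a misrouted hard turn from receiving a templated answer, at the price of spending LLM budget on anything the classifier fails to recognize.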
6) One failure walked through: the LLM that hung the turn and never recovered
Incident: the billing bot mostly feels snappy, but ~3% of turns go completely dead — 5+ seconds of silence, then either a delayed answer or the bot saying something stale. It correlates with nothing obvious in the LLM dashboard, whose average TTFT looks fine.
The chain: those turns hit the LLM's tail latency. The model's TTFT is ~300 ms at p50 but ~4 s at p99 under load — a queueing spike, a long context, a slow tool call inside an agentic loop. The orchestrator had no timeout: it waited for the LLM however long it took. With no fallback, the caller got dead air, and because the orchestrator was blocked, even barge-in (chapter 02) couldn't cleanly recover — the bot wasn't listening, it was waiting.
The root cause is not "the LLM is slow on average"; it's that the turn had no deadline and no fallback. The fix: every turn gets a hard latency budget (say 1.2 s to first audio). If the LLM hasn't produced a first token by a threshold, the orchestrator plays a filler ("let me pull that up for you") to buy time and keep the line alive, and if the deadline blows entirely, it falls back — retry, a templated safe answer, or a warm transfer to a human. A voice turn is a real-time deadline, not a best-effort request. This is the same tail-latency-over-average lesson as chapter 03's entity accuracy, now on response time: the p99 turn is the one that loses the caller.
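The deadline, filler, and fallback logic can be sketched with `asyncio.wait` and a timeout. This is an illustration under assumptions: `first_sentence` is a hypothetical coroutine returning the LLM's first speakable sentence, `play` is a hypothetical audio sink, and the demo thresholds are scaled down so it runs quickly (the defaults mirror the 1.2 s example above).

```python
import asyncio


async def run_turn(first_sentence, play, filler_after_s=0.4, deadline_s=1.2):
    """Race the LLM's first sentence against a filler threshold and a deadline."""
    task = asyncio.create_task(first_sentence())
    done, _ = await asyncio.wait({task}, timeout=filler_after_s)
    if not done:
        play("Let me pull that up for you.")  # keep the line alive
        done, _ = await asyncio.wait({task}, timeout=deadline_s - filler_after_s)
    if not done:
        task.cancel()  # stop waiting on the hung call
        await asyncio.gather(task, return_exceptions=True)
        play("I'm connecting you with someone who can help.")
        return "warm_transfer"
    play(task.result())
    return "answered"


def make_llm(delay_s):
    """Hypothetical LLM stub that answers after delay_s seconds."""
    async def first_sentence():
        await asyncio.sleep(delay_s)
        return "Your balance is fifty-nine dollars."
    return first_sentence


async def demo():
    outcomes = {}
    for label, delay in [("fast", 0.0), ("slow", 0.1), ("hung", 5.0)]:
        heard = []  # everything the caller would have heard this turn
        outcome = await run_turn(make_llm(delay), heard.append,
                                 filler_after_s=0.05, deadline_s=0.3)
        outcomes[label] = (outcome, len(heard))
    return outcomes


outcomes = asyncio.run(demo())
print(outcomes)
```

Because the orchestrator awaits with a timeout instead of awaiting the LLM directly, it regains control at each threshold and the call cannot be held hostage by one slow response.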
7) Cost and latency movement: where the budget and the dollars go
End-to-end turn budget contributions and rough per-minute cost (illustrative; varies by model, vendor, region):
| Stage | Latency contribution (overlapped) | Cost driver | Lever to cut it |
|---|---|---|---|
| Media transport + jitter | ~80–150 ms | telephony/min | adaptive jitter buffer (ch 02) |
| Endpointing | ~250 ms | ASR/min | semantic turn detection (ch 03) |
| Tool/CRM lookup | ~0 ms if parallel, ~250 ms if serial | API calls | fire on interim, run parallel |
| LLM TTFT | ~300–350 ms | per-token | smaller/faster model for easy turns; prompt caching |
| TTS first byte | ~75–95 ms | per-char | streaming TTS, pre-synth greetings |
| Time-to-first-audio | ~500–600 ms target | sum of above | overlap + stream |
Per-minute economics roughly: telephony ~$0.01 + ASR ~$0.01–0.02 + LLM ~$0.01–0.05 + TTS ~$0.01–0.03 ≈ $0.04–0.11/min unbundled (chapter 01). The pressure evolution: overlapping and streaming relieve the latency pressure but create complexity pressure (speculative tool calls that might be discarded, partial outputs that must be coherent, deadline/fallback logic), absorbed by the orchestration code and its on-call team. Routing easy turns around the LLM relieves cost and latency but creates a routing-correctness burden (a misrouted hard turn gets a dumb templated answer), absorbed by the intent classifier's quality.
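The per-minute range is simple addition, shown here as a quick check. The figures are the chapter's illustrative ranges, not vendor quotes, and the dictionary layout is just one convenient way to hold them.

```python
# (low, high) cost in USD per call-minute, from the illustrative ranges above.
COST_PER_MIN_USD = {
    "telephony": (0.01, 0.01),
    "asr": (0.01, 0.02),
    "llm": (0.01, 0.05),
    "tts": (0.01, 0.03),
}

low = round(sum(lo for lo, _ in COST_PER_MIN_USD.values()), 2)
high = round(sum(hi for _, hi in COST_PER_MIN_USD.values()), 2)

# Share of the high-end total that the LLM accounts for.
llm_share_high = round(COST_PER_MIN_USD["llm"][1] / high, 2)

print(f"${low:.2f}-${high:.2f} per minute; LLM share at the high end: {llm_share_high:.0%}")
```

At the high end the LLM is the largest single line item, which is part of why routing easy turns around it moves cost as well as latency.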
8) Signals that orchestration is the problem
Healthy: time-to-first-audio after end-of-turn consistently under ~800 ms (ideally ~500), low filler-phrase rate, low fallback-to-human rate on turns the bot should handle.
First metric to degrade: p95/p99 time-to-first-audio, not the average. Orchestration problems live in the tail — the queued LLM call, the slow tool, the serial path that only some turns hit. The average can look healthy while 3% of turns go dead.
Misleading metric people watch: average LLM TTFT or average end-to-end latency. Averages hide the tail turns that actually lose callers, and per-component averages hide serial composition (each stage fine, the sum fatal).
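A small numeric illustration of how an average hides the tail. The sample is synthetic: 97 turns at 550 ms and 3 dead turns at 5000 ms, chosen to mirror the "3% of turns go dead" incident. `percentile` is a simple nearest-rank helper written for this sketch, not a library function.

```python
def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


# Synthetic time-to-first-audio samples in ms: 97 healthy turns, 3 dead ones.
ttfa_ms = [550] * 97 + [5000] * 3

mean_ms = sum(ttfa_ms) / len(ttfa_ms)
p50_ms = percentile(ttfa_ms, 50)
p99_ms = percentile(ttfa_ms, 99)
over_budget = sum(1 for t in ttfa_ms if t > 800) / len(ttfa_ms)

print(mean_ms, p50_ms, p99_ms, over_budget)
```

The mean stays under the ~800 ms target even though three turns in a hundred sat in five seconds of silence; only the tail percentile and the over-budget fraction expose them.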
First graph an expert opens: the distribution (not average) of time-to-first-audio after end-of-turn, plus a per-turn waterfall showing each stage's start/end so you can see what ran serially that should have overlapped. The second graph: fallback-to-human rate split by reason (deadline blown vs genuinely-too-hard) — deadline-driven fallbacks are an orchestration bug, not a capability limit.
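A per-turn waterfall can also be checked mechanically. The sketch below assumes each stage is logged as a `(start_ms, end_ms)` span and that someone has declared which stage pairs are independent; the stage names, timings, and helper functions are all hypothetical.

```python
def overlap_ms(a, b):
    """Milliseconds during which two (start_ms, end_ms) spans both ran."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def missed_overlaps(spans, independent_pairs):
    """Independent stage pairs that ran back to back instead of concurrently."""
    return [(x, y) for x, y in independent_pairs
            if overlap_ms(spans[x], spans[y]) == 0]


# Stages that do not depend on each other and therefore could overlap.
INDEPENDENT = [("endpointing", "crm_lookup")]

serial_turn = {
    "endpointing": (80, 330),
    "crm_lookup": (410, 660),   # started only after NLU finished
    "llm_ttft": (660, 1010),
}
overlapped_turn = {
    "endpointing": (80, 330),
    "crm_lookup": (100, 350),   # fired on an interim
    "llm_ttft": (350, 650),
}

print(missed_overlaps(serial_turn, INDEPENDENT))
print(missed_overlaps(overlapped_turn, INDEPENDENT))
```

Run over every turn, a check like this turns "find what ran serially that should have overlapped" into a countable signal rather than something spotted by eye.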
9) Boundary: where tight orchestration shines, where it can't save you
Tight overlapped orchestration shines on transactional, low-to-medium-reasoning turns — balances, due dates, payments, simple disputes — where tool calls dominate and the LLM's job is small and streamable. Here the budget fits comfortably.
It can't save you when a turn genuinely requires multi-step reasoning or multiple sequential tool calls — an agentic loop that must look up the account, then the dispute history, then policy, then decide. Sequential tool calls can't all overlap (each depends on the last), and the budget blows. The right move there is not to fight the budget but to cover it: play a filler, set expectations ("this'll take a moment"), or fall back to a human. The scale limit that invalidates intuition: an orchestration that's snappy at low concurrency degrades as the shared LLM endpoint queues under load — the p99 budget you validated at 10 concurrent calls is not the p99 at 5,000, because the tail grows with contention.
10) Wrong assumption: "use the best, biggest LLM and the bot will be best"
The seductive idea: the smartest model makes the best voice agent. For a real-time phone call, often the opposite. A bigger model has higher TTFT and tail latency, and for a billing line where most turns are transactional, its extra reasoning is wasted while its extra latency is felt on every turn. A smaller, faster model (or a routed mix — small model for easy turns, big for hard) usually makes a better voice agent.
Replace it with: for real-time voice, response speed and predictability matter as much as raw capability; match the model to the turn, not the hardest imaginable case. This reorders model selection: validate on time-to-first-audio and p99, not just on answer quality — and it's exactly why chapter 06's offline analytics, with no real-time deadline, can use bigger, slower, more capable models than this live layer.
11) Other ways orchestration bites
- No deadline on the LLM — one slow turn hangs the call with dead air and no recovery (the section-6 failure).
- Speculative write fired on an interim — the bot charges a card based on a transcript that was later revised; reads can be speculative, writes cannot.
- Filler never played — the bot goes silent during a legitimately slow tool call instead of saying "one moment."
- Greeting generated live — the fixed greeting wastes budget and money; it should be pre-synthesized.
- TTS waits for full LLM output — streaming not wired, so first audio waits for the last token.
- Context window bloat — the whole transcript stuffed into every LLM call, inflating TTFT turn after turn; summarize/trim.
- Tool-call storms — an agentic loop makes five sequential API calls inside one turn and blows the budget with no filler.
- No backpressure on the shared LLM — under load, every turn's tail grows; concurrency isn't capped or routed.
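For the last item, one simple form of backpressure is an admission cap: when the shared LLM already has its maximum number of calls in flight, a new turn is shed to a templated path instead of joining the queue. The class, the cap value, and the path labels below are illustrative assumptions, and the counter is not thread-safe.

```python
class LlmAdmission:
    """Cap in-flight LLM calls; shed excess turns instead of queueing them."""

    def __init__(self, max_in_flight):
        self.max_in_flight = max_in_flight
        self.in_flight = 0

    def try_acquire(self):
        """Claim a slot if one is free; never block."""
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def release(self):
        """Free a slot when an LLM call completes."""
        self.in_flight = max(0, self.in_flight - 1)


def handle_turn(admission, turn_id):
    """Route a turn to the LLM if there is capacity, else to a fallback."""
    if admission.try_acquire():
        return f"{turn_id}:llm"
    return f"{turn_id}:template_fallback"


admission = LlmAdmission(max_in_flight=2)
results = [handle_turn(admission, turn_id) for turn_id in range(3)]
admission.release()  # one LLM call completes
results.append(handle_turn(admission, 3))
print(results)
```

Shedding keeps the tail bounded for the turns that are admitted. Whether the shed turn gets a template, a filler, or a transfer is a product decision this sketch does not make.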
12) Pattern transfer
- Time-to-first-audio is time-to-first-byte — same shape as streaming an HTTP response or a video: the user-perceived latency is when the first chunk arrives, not when the whole payload completes. Stream the answer; don't buffer it. The shared pressure: perceived latency is a first-chunk problem.
- Overlap is pipelining — structurally identical to CPU instruction pipelining or a build system running independent steps concurrently: independent stages (tool call, endpointing) run at once, dependent stages (LLM needs the lookup) serialize. The win comes from finding what's independent.
- Deadline + fallback is a circuit breaker — same failure geometry as a service call with a timeout and a fallback path: never wait unbounded on a downstream that can hang. A voice turn with no deadline is an un-timed RPC that can dead-air the whole call.
13) Design test
- Do you measure time-to-first-audio after end-of-turn, or only per-component averages?
- Do tool reads fire speculatively on interims and run parallel to perception, while writes wait for the confirmed final?
- Does the LLM stream into a streaming TTS, so first audio comes on the first sentence, not the last token?
- Does every turn have a hard deadline with a filler and a human fallback when it blows?
- Are easy transactional turns routed around the full LLM, keeping it out of the deterministic path?
Where this appears in production
- Pipecat — open-source orchestration that streams STT→LLM→TTS concurrently under a ~300 ms budget, with turn detection, interruptions, and per-stage metrics frames to see where the budget goes.
- LiveKit Agents — WebRTC media + agent runtime that overlaps perception, reasoning, and synthesis.
- Twilio ConversationRelay — packages streaming ASR + LLM + TTS with barge-in for Programmable Voice.
- Vapi / Bland — managed voice-agent platforms that wire the overlapped pipeline and expose latency knobs.
- AssemblyAI pre-emptive LLM generation — starts the LLM before end-of-turn to claw back 200–500 ms.
- Cartesia Sonic — streaming TTS with sub-100 ms first byte so the first sentence speaks fast.
- ElevenLabs Flash v2.5 — ~75 ms first-audio TTS for low-latency turns.
- OpenAI / Anthropic streaming token APIs — token-by-token streaming so TTS starts on sentence one.
- Amazon Bedrock AgentCore + Pipecat — deploying overlapped voice agents with managed tool calling.
- Speech-to-speech models (GPT realtime, Gemini Live) — the lower-latency alternative you trade auditability for.
- Prompt caching (Anthropic/OpenAI) — caches the static system prompt/policy so per-turn TTFT drops.
- Semantic/router models — a fast classifier in front of the LLM that routes easy turns to templated paths.
- NICE Enlighten / Genesys built-in orchestration — turnkey dialog orchestration when you don't own the budget.
- Amazon Lex — intent/slot dialog management used as the deterministic front of a hybrid bot.
Recall
- Why does a serial STT→LLM→TTS pipeline blow the turn budget even with fast components?
- What is time-to-first-audio, and why is it the metric to track instead of per-component latency?
- How does streaming the LLM into TTS hide most of the latency?
- Which tool calls can fire speculatively on an interim, and which must wait for the confirmed final?
- Why does a cascading pipeline beat a single speech-to-speech model in a contact center?
- What two things must every turn have so one slow LLM call doesn't dead-air the call?
- Why might a smaller LLM make a better voice agent than a bigger one?
Interview Q&A
Q1. Your voice bot has a fast ASR, a fast LLM, and a fast TTS, but it feels laggy. What's the likely cause? The stages are running serially, so their latencies sum to ~2 s even though each is fast. Fix it structurally: fire tool reads in parallel with perception, stream the LLM token-by-token into a streaming TTS so the first audio comes on the first sentence, and measure time-to-first-audio after end-of-turn, not per-component averages. The win is overlap, not a faster model. Common wrong answer to avoid: "upgrade to a faster GPU/model" — the components are already fast; the latency is in serial composition, and a faster model barely moves a summed pipeline.
Q2. Why not use a single speech-to-speech model — it's lower latency and more natural? Because a contact center needs the intermediate transcript for CRM logging, QA and compliance, PII redaction, and entity confirmation, plus fine control over tool calls and swappable phone-tuned ASR. Speech-to-speech is a black box where you need an audit trail. The cascading pipeline's latency gap closes with overlap and streaming; the auditability gap of speech-to-speech doesn't close at all. Common wrong answer to avoid: "speech-to-speech is the future, always use it" — in a regulated, audited line, losing the transcript loses logging, compliance, and redaction you're legally required to have.
Q3. About 3% of turns go completely dead for several seconds. Diagnose it. LLM tail latency with no deadline. The p99 TTFT spikes under load or long context to several seconds, and the orchestrator waits unbounded, so the caller gets dead air and barge-in can't even recover because the bot is blocked. Fix: a hard per-turn deadline, a filler phrase to keep the line alive past a threshold, and a fallback (retry, safe template, or warm transfer) when the deadline blows. Common wrong answer to avoid: "the average LLM latency is fine, it's noise" — the average hides the p99 turns, which are exactly the ones losing callers; a voice turn is a real-time deadline.
Q4. Should every turn go through the LLM? No. Transactional turns ("what's my balance") need a tool call and a templated reply, not reasoning — routing them through the LLM wastes budget and money and risks improvisation on a deterministic transaction. Put a fast intent classifier in front that routes easy turns to templated tool-and-respond paths and reserves the LLM for turns needing judgment. It's the same "keep the probabilistic component out of the deterministic path" rule as deferring to the ACD, inside one turn. Common wrong answer to avoid: "yes, the LLM should handle everything for consistency" — that adds latency and cost to every easy turn and lets the model improvise where determinism matters.
Q5. The bot fires a CRM lookup on the interim transcript and sometimes it's the wrong lookup. Is that a bug? Not for a read — a speculative read on an interim is the standard latency-hiding move, and a wrong one is just discarded (squash-on-misprediction, like a revised interim transcript in chapter 03). It is a bug if a speculative write fired — charging a card or changing a plan on an unconfirmed interim. The rule: reads can be speculative, writes wait for the confirmed final plus a confident endpoint. Common wrong answer to avoid: "never act on interims, it's unsafe" — that throws away the main latency lever; the discipline is speculate reads, gate writes.
Q6. The bot works great in testing at low load but goes laggy in production. Same code. What changed? Concurrency and contention. The shared LLM endpoint queues under production load, so the p99 TTFT you validated at 10 concurrent calls is far worse at 5,000 — the tail grows with contention. Cap and route concurrency, add prompt caching to cut per-turn TTFT, and validate latency at production concurrency, not in isolation. The per-turn logic is unchanged; the tail grew. Common wrong answer to avoid: "the code regressed" — the code is identical; the latency distribution changed under load, which only shows up when you test at real concurrency.
Q7. (Cumulative) A turn feels slow and you can't tell if it's chapter 2 media, chapter 3 endpointing, or chapter 4 orchestration. How do you localize it? Open the per-turn waterfall: media transport, endpointing confirm, tool call, LLM TTFT, TTS first byte, each with start/end timestamps. Variance by carrier points to media (ch 02); a long gap between caller-stop and the transcript settling points to endpointing (ch 03); a long gap between final transcript and first token, or stages that ran serially, points to orchestration (ch 04). The waterfall makes the layer confess. Common wrong answer to avoid: "look at the average end-to-end latency" — the average can't localize the layer; you need the per-stage, per-turn waterfall to see what ran serially and where the time went.
Design/debug exercise (10 min)
Step 1 — Modeled example. Lay out the "what's my balance and when's it due" turn as a waterfall (section 1): media+endpoint, parallel CRM lookup, LLM TTFT, stream to TTS first byte. Mark which stages overlap and compute time-to-first-audio. Then write the serial version's number and the gap.
Step 2 — Your turn. Take a harder billing turn: "I want to dispute this charge and I'm thinking of cancelling." This needs sequential tool calls (dispute history, then account, then policy) — they can't all overlap. Design the orchestration: what's the deadline, what filler do you play, when do you decide it's too hard and fall back to a human, and what goes in the baton (chapter 07) at that handoff?
Step 3 — Reproduce from memory. Redraw the 800 ms pot diagram (section 2) cold: serial draining it vs overlap sharing it, with the three levers (spend less / spend concurrently / spend nothing) labeled. Connect it back to chapter 03: show where the speculative CRM read hides under endpointing, and to chapter 02: where the media tax is already drawn from the pot before this layer starts.
Operational memory
This chapter explained why a bot built from fast components can still feel dead: the steps run serially and sum to two seconds, and the bot waits for complete outputs when it only needs the first chunk to start speaking. The important idea is that the felt latency is time-to-first-audio after end-of-turn, and the only way to fit under ~800 ms is to overlap independent stages and stream the LLM into the TTS — plus give every turn a deadline and a human fallback.
You learned to fire tool reads speculatively on interims and run them parallel to perception, stream tokens straight into a streaming TTS so the first sentence speaks at ~550 ms, route easy transactional turns around the LLM, and treat a turn as a real-time deadline with a filler and a fallback. That solves the opening dead-air failure because it was never a slow-component problem — it was a serial-composition and no-deadline problem.
Carry this diagnostic forward: when a bot feels laggy, open the per-turn waterfall and find what ran serially that should have overlapped, and look at p99 time-to-first-audio, not the average. When turns go dead, check for a missing deadline and fallback before blaming the model.
Remember:
- Felt latency is time-to-first-audio after end-of-turn; overlap turns a sum into a max-plus.
- Stream the LLM into a streaming TTS; start speaking on the first sentence, not the last token.
- Reads can be speculative on interims; writes wait for the confirmed final.
- Every turn needs a hard deadline, a filler to cover slow tools, and a warm-transfer fallback.
- Match the model to the turn; for real-time voice, speed and p99 predictability rival raw capability.
Bridge. We can run an autonomous bot that thinks and speaks under budget and falls back to a human when it's beaten. But the most reliable place to put AI is not alone with the caller — it's beside a human agent, listening to a call it never has to drive, suggesting answers in real time. That removes the turn-budget pressure entirely and changes the failure modes, which is the next chapter. → 05-agent-assist-realtime-guidance.md