
02. Voice & Realtime AI — Narrative Explainer

Voice systems rarely fail because the model is weak. They usually fail because the pauses feel unnatural. This module explains how to remove those painful pauses. Read it slowly, as if a senior engineer were walking you through a launch review.

Table of Contents

  • §1. ELI5 — The live interpreter at the UN
  • §2. Chapter 1 — Opening failure: the five-second assistant
  • §3. Chapter 2 — Automatic speech recognition, the ear
  • §4. Chapter 3 — Text-to-speech, the voice
  • §5. Chapter 4 — The streaming pipeline, the relay race
  • §6. Chapter 5 — End-to-end voice models
  • §7. Retrieval prompts
  • §8. Honest admission
  • §9. Foundation-gap audit
  • §10. Chapter 6 — Recap, interview close, and exercises
  • §11. Bridge forward

§1. ELI5 — The live interpreter at the UN

Imagine a live interpreter sitting inside a United Nations booth. A diplomat speaks in Spanish, and another diplomat waits to hear it in English. The interpreter must listen, understand, respond, and keep pace. If the interpreter pauses for two seconds after every sentence, everyone notices. The room feels awkward, mechanical, and slightly broken. That is exactly how voice AI feels when latency becomes visible.

In this story, we will use simple placeholders.

| Placeholder | Real system | What it does |
|---|---|---|
| the ear | ASR / speech-to-text | Hears raw audio and converts it into words |
| the brain | LLM processing | Understands intent, reasons, and plans a response |
| the voice | TTS / speech synthesis | Converts response text back into natural audio |
| the relay race | Streaming pipeline | Moves partial work forward before the full turn ends |
| the awkward pause | Latency budget | The visible delay that makes the assistant feel fake |

Now see the flow carefully.

speaker
  -> the ear
  -> the brain
  -> the voice
  -> listener

A child might say, “Why is this difficult?” Because each step is individually hard, and the total must still feel instant. The ear must hear messy accents, room noise, and clipped words. The brain must understand context without taking forever. The voice must sound human without buffering too long. The relay race must move partial results without dropping the baton.

Let us put simple numbers on the awkward pause.

the ear        : 300 ms
the brain      : 350 ms
the voice      : 250 ms
network/jitter : 150 ms
------------------------
total          : 1050 ms

One second can still feel acceptable in many settings. Five seconds feels like you lost the call. Text chat hides delay more politely. Voice conversation exposes delay immediately. That is the entire game. We are not only building intelligence. We are building timing.

Keep one sentence in your head throughout this module. Voice AI is a latency engineering problem wrapped around model choices.

§2. Chapter 1 — Opening failure: the five-second assistant

Suppose you build a voice assistant in the most obvious way. You use Whisper for ASR, GPT-4 for thinking, and ElevenLabs for TTS. Each component is excellent in isolation. The demo looks brilliant on a whiteboard. Then the first real call happens.

The user asks a question. Whisper takes one to two seconds. GPT-4 takes one to two seconds. ElevenLabs takes one to two seconds. Stacked back to back, that is three to six seconds per turn, and the total easily crosses five seconds. Users hang up before the assistant finishes clearing its throat.

This failure is common because engineers accidentally compose great tools serially. Serial composition is deadly in voice.

Naive serial turn

User stops speaking
0ms    1500ms    3000ms    4500ms    5200ms
|------ASR------|------LLM------|------TTS------| play

Now compare that with a streaming design.

Streaming turn

User still speaking
0ms  200ms  450ms  700ms  900ms
|VAD|partial STT|LLM TTFT|TTS first audio|speech

The second system may not be smarter. But it feels dramatically better. Perceived intelligence often follows perceived responsiveness.

Why do users punish delay so harshly in voice? Because conversation has social expectations. Humans naturally overlap breathing, backchannels, and short pauses. A long dead gap feels rude or incompetent. Voice interfaces inherit those social rules immediately.

Now look at the stakes. Voice AI is exploding across customer service, healthcare, assistants, sales, and education. Call centers want lower cost and better routing. Healthcare wants ambient documentation and triage support. Assistants want hands-free interaction in homes, cars, and workplaces. All of these categories punish awkward delay.

A customer will forgive one factual error sooner than repeated silence. A doctor will tolerate an imperfect draft note sooner than a laggy intake flow. A driver cannot stare at a spinner. Realtime changes the product bar.

This is also why averages are misleading. Voice teams obsess over p95 and p99 latency. The bad tail is what callers remember. One laggy turn can poison the next five turns emotionally.
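A small calculation makes the point about averages concrete. This is a minimal sketch using made-up latencies and a nearest-rank percentile; the helper name `percentile` and the sample data are illustrative assumptions, not measurements from a real system.

```python
import math


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Made-up sample: 18 quick turns and 2 painful ones, in milliseconds.
latencies = [700] * 18 + [3000, 5000]

mean_ms = sum(latencies) / len(latencies)
print("mean:", mean_ms)                    # 1030.0
print("p50 :", percentile(latencies, 50))  # 700
print("p95 :", percentile(latencies, 95))  # 3000
print("p99 :", percentile(latencies, 99))  # 5000
```

The mean looks close to one second, yet two turns in twenty were three and five seconds long. Those are the turns the caller remembers.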

So chapter one gives us the first production law. Do not judge a voice system only by component quality. Judge it by end-to-end turn timing.

And chapter one gives us the second law. Never add the latency of every stage if overlap is possible. Start the next stage before the previous stage is fully complete. That relay-race mindset will keep returning.

§3. Chapter 2 — Automatic speech recognition, the ear

ASR means Automatic Speech Recognition. It converts sound waves into text tokens. In voice systems, the ear is the first gate of trust. If the ear mishears the user, the brain starts from the wrong reality.

What raw audio looks like

Before models, you need simple audio literacy. Audio is just a sequence of numeric samples. Sample rate tells you how many measurements arrive every second. Sixteen kilohertz means sixteen thousand samples per second. That is a common rate for modern speech pipelines. Telephony often drops to eight kilohertz, which hurts clarity.

Chunk size matters because streaming systems do not wait for full files. They send short frames repeatedly. Twenty milliseconds is a common frame size. Fifty milliseconds is still common. Large chunks reduce protocol overhead but increase delay. Tiny chunks reduce delay but increase CPU and network chatter.

Example at 16 kHz mono PCM

20 ms chunk  = 0.02 seconds
samples      = 16,000 * 0.02 = 320 samples
50 ms chunk  = 800 samples
100 ms chunk = 1,600 samples

This is the first audio processing basic you must own. Sample rate changes quality, model compatibility, and bandwidth. Chunk size changes smoothness, buffering, and latency.
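The chunk arithmetic above can be checked with a small helper. This is a minimal sketch; the function names are illustrative, and the two-bytes-per-sample default assumes 16-bit PCM, which is common but not stated above.

```python
def samples_per_chunk(sample_rate_hz: int, chunk_ms: int) -> int:
    """Number of samples in one chunk of mono audio."""
    return sample_rate_hz * chunk_ms // 1000


def bytes_per_chunk(sample_rate_hz: int, chunk_ms: int, bytes_per_sample: int = 2) -> int:
    """Payload size of one mono chunk; 2 bytes per sample assumes 16-bit PCM."""
    return samples_per_chunk(sample_rate_hz, chunk_ms) * bytes_per_sample


for ms in (20, 50, 100):
    print(f"{ms} ms -> {samples_per_chunk(16_000, ms)} samples, "
          f"{bytes_per_chunk(16_000, ms)} bytes")

# Telephony at 8 kHz carries half as many samples in the same 20 ms.
print(samples_per_chunk(8_000, 20))  # 160
```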

Whisper at a high level

Whisper changed the ASR conversation because it generalized very well. At a high level, Whisper takes audio, turns it into spectrogram features, and decodes text. You do not need every architectural detail for interviews. But you should know the shape.

audio waveform
  -> log-Mel spectrogram
  -> encoder
  -> decoder
  -> text tokens + timestamps

Whisper is robust because it was trained broadly and multitask-style. It can handle multilingual input, translation, and timestamp prediction. It is a wonderful baseline. It is also not magically optimized for every realtime situation.

Many teams misuse Whisper by treating a batch-friendly model as a streaming-native system. That is possible, but not free. You need chunking, incremental decoding, and careful buffering around it. Otherwise the ear waits too long before passing anything on.

Batch ASR versus streaming ASR

Batch ASR waits for the utterance to finish, then transcribes. Streaming ASR consumes audio continuously and emits partial guesses early. For a voice assistant, batch ASR is usually too slow. Streaming ASR is the normal production choice.

| Mode | Behavior | Best use |
|---|---|---|
| Batch | Wait for full clip, then decode | Offline transcription, meeting notes, post-call summaries |
| Streaming | Decode rolling chunks and revise partials | Realtime agents, captions, phone assistants |

Hosted streaming providers like Deepgram and AssemblyAI exist because latency work is hard. They optimize endpointing, partial stability, timestamps, and infrastructure placement. That is why production teams often buy this piece first.

Partial transcripts and final transcripts

A streaming ASR system usually emits partial hypotheses first. These are provisional words. They may change as more audio arrives. Later, the system emits final words that should no longer change.

Do not treat every partial as final truth. If you do, the LLM will chase unstable text. The assistant may answer a sentence the user never actually said.

A better pattern is this. Maintain a stable prefix and a volatile suffix. Only commit stable text downstream. Display provisional text visually if needed, but tag it as provisional.
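One way to sketch the stable-prefix idea is to commit only the words that stayed identical across consecutive partials. The class name `TranscriptStabilizer` and the agreement rule are assumptions for illustration; real vendors usually expose their own stability or finality flags, which you should prefer when available.

```python
def common_prefix(a: list[str], b: list[str]) -> list[str]:
    """Longest shared leading run of words."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


class TranscriptStabilizer:
    """Commits words that were unchanged across `required_agreement` consecutive partials."""

    def __init__(self, required_agreement: int = 2) -> None:
        self.required_agreement = required_agreement
        self.history: list[list[str]] = []
        self.committed: list[str] = []

    def update(self, partial_text: str) -> tuple[str, str]:
        """Returns (stable prefix, volatile suffix) for the latest partial."""
        words = partial_text.split()
        self.history = (self.history + [words])[-self.required_agreement:]
        if len(self.history) == self.required_agreement:
            agreed = self.history[0]
            for other in self.history[1:]:
                agreed = common_prefix(agreed, other)
            # Never shrink text that was already sent downstream.
            if len(agreed) > len(self.committed):
                self.committed = agreed
        # Simplification: assumes the ASR did not revise already committed words.
        volatile = words[len(self.committed):]
        return " ".join(self.committed), " ".join(volatile)


stabilizer = TranscriptStabilizer()
for partial in ["I need", "I need to", "I need to re", "I need to reschedule"]:
    print(stabilizer.update(partial))
```

Only the first element of each pair should flow to the LLM. The second element is safe to display, tagged as provisional.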

Word-level timestamps

Word-level timestamps are deeply useful. They tell you when each decoded word likely occurred. That helps subtitle alignment, search, analytics, and interruption logic. It also helps debug whether the ear or the endpointing logic caused delay.

Suppose the model recognized the final word quickly. But your system still waited seven hundred milliseconds before acting. That points to endpointing, not recognition quality.

VAD — Voice Activity Detection

VAD asks a narrow question. Is someone speaking in this frame, yes or no? That sounds simple. In noisy environments, it is not simple at all.

Background fans, keyboard taps, car noise, and cross-talk all cause pain. A weak VAD fires on noise and causes false starts. An overly strict VAD misses quiet speakers and clipped syllables. Both errors degrade trust.

Popular choices include WebRTC VAD and Silero VAD. WebRTC VAD is old, lightweight, and widely available. Silero is popular because it is accurate and practical. Some hosted vendors also bundle VAD into their streaming stack.
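To see why the yes-or-no question is harder than it sounds, consider the simplest possible detector: an energy threshold. This toy sketch is not how WebRTC VAD or Silero work internally; the threshold value and the synthetic frames are assumptions chosen only to show the failure mode.

```python
import math


def frame_rms(samples: list[int]) -> float:
    """Root-mean-square energy of one frame of 16-bit PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def is_speech(samples: list[int], threshold: float = 500.0) -> bool:
    """Naive VAD: any frame louder than the threshold counts as speech."""
    return frame_rms(samples) >= threshold


# Three synthetic 20 ms frames at 16 kHz (320 samples each).
silence = [0] * 320
tone = [int(8000 * math.sin(2 * math.pi * 220 * n / 16_000)) for n in range(320)]
fan_noise = [600 if n % 2 == 0 else -600 for n in range(320)]

print(is_speech(silence))    # False
print(is_speech(tone))       # True
print(is_speech(fan_noise))  # True, which is a false start
```

Raising the threshold silences the fan but starts missing quiet speakers. That tradeoff is why production systems use trained VAD models instead of raw energy.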

Endpointing — deciding that the user finished

VAD alone does not solve turn taking. VAD tells you speech is present or absent. Endpointing tells you the turn is actually over. Those are related but different decisions.

A classic endpointing rule is silence-based. If silence exceeds some threshold, the turn ends. Six hundred to eight hundred milliseconds is a common starting range. But this rule fails on thoughtful speakers who pause mid-sentence.

A semantic rule is smarter but slower. You ask a tiny model, or the main model, whether the utterance feels complete. This can reduce accidental cutoffs. It can also add latency if done carelessly.

The practical answer is often hybrid endpointing. Use short silence as a candidate signal. Then confirm with semantics, user history, or punctuation stability. This hybrid is one of the highest-leverage decisions in production voice UX.

user: "I need to reschedule my appointment ... tomorrow morning"

wrong endpointing:
"I need to reschedule my appointment"  -> agent interrupts too early

better endpointing:
wait for the rest of the phrase before committing the turn
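Hybrid endpointing can be sketched as a two-threshold rule with a pluggable completeness check. Everything here is an assumption for illustration: the threshold values, the function names, and especially the word-list heuristic, which stands in for the small semantic model described above.

```python
from dataclasses import dataclass
from typing import Callable

# Toy stand-in for a semantic model: trailing words that suggest a mid-thought pause.
TRAILING_CONTINUATION_WORDS = {
    "and", "but", "or", "so", "because", "to", "for", "the", "a", "my", "um", "uh",
}


@dataclass(frozen=True)
class EndpointConfig:
    candidate_silence_ms: int = 300   # short silence: the turn might be over
    hard_silence_ms: int = 1200       # long silence: end the turn regardless


def looks_complete(transcript: str) -> bool:
    """Crude completeness check based on the last word only."""
    words = transcript.lower().rstrip(".?! ").split()
    return bool(words) and words[-1] not in TRAILING_CONTINUATION_WORDS


def should_end_turn(
    transcript: str,
    silence_ms: int,
    cfg: EndpointConfig = EndpointConfig(),
    is_complete: Callable[[str], bool] = looks_complete,
) -> bool:
    if silence_ms >= cfg.hard_silence_ms:
        return True                     # safety net for trailing-off speakers
    if silence_ms >= cfg.candidate_silence_ms:
        return is_complete(transcript)  # silence is only a candidate signal
    return False


print(should_end_turn("I need to reschedule my", 400))                 # False
print(should_end_turn("I need to reschedule my appointment tomorrow morning", 400))  # True
```

Note the limit of the toy heuristic: "I need to reschedule my appointment" also looks complete by its last word. Catching that case needs a real semantic check passed in through `is_complete`, or user history.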

Accents, jargon, and code-switching

ASR quality is never evenly distributed. Accents, regional speech, background noise, and domain jargon expose the cracks. Medical vocabulary, Indian English, Hinglish, and product names are common failure zones. You must test the populations you actually serve.

This is where humble engineering matters. Do not claim universal quality because a benchmark looked good. Record representative audio. Measure word error rate on your real mix of users. Then measure user-visible latency on top of that.

Latency tactics for the ear

Here are practical levers that usually help.

  • Keep the streaming connection warm instead of reconnecting per turn.
  • Use short, regular audio chunks instead of giant buffers.
  • Co-locate the client, ASR endpoint, and orchestrator region whenever possible.
  • Prefer streaming-native vendors when latency matters more than self-hosting purity.
  • Normalize sample rate once, not repeatedly across the stack.
  • Keep transcripts incremental and stable-prefix aware.
  • Instrument partial-to-final delay, not only final latency.
  • Evaluate noisy rooms, bad microphones, and accented speech separately.

A useful production number for streaming ASR finalization is roughly one hundred to three hundred milliseconds after speech ends. The exact number varies by vendor, network, and endpointing strategy. But if you are regularly above five hundred milliseconds here, investigate hard.

Interview answer you should be able to give

If asked about ASR, say this clearly. Whisper is a strong quality baseline. Deepgram and AssemblyAI are common streaming choices. VAD detects speech presence. Endpointing decides turn completion. Word-level timestamps help alignment and debugging. And realtime success depends on partial stability, not only final accuracy.

§4. Chapter 3 — Text-to-speech, the voice

TTS means Text-to-Speech. It converts response text into audible speech. In a voice assistant, this is the personality layer the user actually hears. A smart answer with a delayed or robotic voice still feels bad.

What modern neural TTS changed

Older TTS sounded stitched together and obviously synthetic. Modern neural TTS can sound smooth, expressive, and surprisingly human. The stack details vary by vendor, but the product effect is clear. We moved from intelligible speech to convincing speech.

That improvement changed user expectations. Once the voice sounds human, users expect human rhythm too. So latency matters even more.

Streaming synthesis

Batch TTS waits for the full text, then renders the whole waveform. Streaming TTS starts producing audio before the full sentence is complete. That is essential for realtime agents. The user cares most about first audio, not total rendering time.

This is why you will hear the phrase first-byte latency or first-audio latency. First-byte latency means when the first audio bytes arrive. First-audio latency means when the user can actually hear playback begin. For voice UX, first-audio latency is the emotional truth.

| Metric | What it means | Why it matters |
|---|---|---|
| TTFB | Time to first audio byte | Good transport metric, but still incomplete |
| TTFA | Time to first audible playback | Best indicator of perceived responsiveness |
| Total synthesis time | Time for all generated audio | Matters less than TTFA for interaction feel |

Voice cloning

Voice cloning lets you synthesize a specific speaker identity. This is commercially attractive and ethically dangerous. The technical part is not the only part. Consent, disclosure, policy, and abuse handling are mandatory.

A cloned voice can improve continuity for creators, brands, and accessibility use cases. It can also enable fraud, impersonation, and manipulation. So every voice cloning discussion must include governance.

Emotion and prosody

Prosody is the rhythm, stress, pitch, and pacing of speech. Emotion control affects whether the voice sounds calm, cheerful, serious, or urgent. This matters more than people expect. A correct answer in the wrong tone can still feel wrong.

Healthcare triage needs calm clarity. Sales follow-up may want warmth and speed. Navigation prompts need crisp timing and limited flourish. Prosody is part of product design, not just model capability.

Latency tactics for the voice

  • Keep response text concise, especially for the first sentence.
  • Prefer models and vendors optimized for streaming first audio.
  • Start synthesis at clause boundaries instead of waiting for full essays.
  • Cancel playback aggressively on barge-in instead of finishing old audio.
  • Avoid heavy post-processing that delays audible output.
  • Precompute static prompts, greetings, and hold phrases when appropriate.

A good streaming TTS system often reaches first audio within roughly one hundred to three hundred milliseconds. Sub-hundred-millisecond performance exists, but it is not automatic. Network distance, text length, vendor choices, and playback buffers all matter.

Why punctuation matters more than people think

TTS does not only read words. It interprets structure. Punctuation influences pacing, pausing, and emphasis. Messy LLM output often becomes messy speech output.

Short, well-formed first sentences help two ways. They make synthesis easier. And they make first-audio latency smaller because the first chunk is ready earlier.
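Starting synthesis at clause boundaries can be sketched as a small generator that buffers LLM tokens and releases text whenever punctuation closes a clause. The function name, the punctuation set, and the `min_chars` guard are illustrative assumptions; a production chunker would also need to handle decimals, abbreviations, and vendor-specific limits.

```python
from typing import Iterable, Iterator

CLAUSE_END = (".", "!", "?", ",", ";", ":")


def clauses_from_tokens(tokens: Iterable[str], min_chars: int = 12) -> Iterator[str]:
    """Yield speakable clauses as soon as they close, instead of waiting for the full reply."""
    buffer = ""
    for token in tokens:
        buffer += token
        stripped = buffer.strip()
        # min_chars avoids sending tiny fragments like "Sure," on their own.
        if stripped.endswith(CLAUSE_END) and len(stripped) >= min_chars:
            yield stripped
            buffer = ""
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


llm_tokens = ["Sure", ",", " I", " can", " help", " with", " that", ".",
              " What", " day", " works", "?"]
for clause in clauses_from_tokens(llm_tokens):
    print(clause)
```

The first clause reaches the TTS while the model is still generating the second one. That overlap is where the first-audio savings come from.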

Example timing for one response

LLM first tokens available at 420 ms
TTS receives first clause at 470 ms
TTS first audio returns at 620 ms
Client jitter buffer adds 60 ms
User hears speech at 680 ms

That final number is what the user remembers. Not the internal elegance of your architecture.

Interview answer you should be able to give

If asked about TTS, say this calmly. Neural TTS gives natural speech and voice control. Streaming matters because first-audio latency dominates perceived quality. Voice cloning requires consent and governance. Prosody is product behavior, not a cosmetic extra. And barge-in requires fast playback cancellation, not polite waiting.

§5. Chapter 4 — The streaming pipeline, the relay race

Now we connect the ear, the brain, and the voice. This chapter is the heart of the module. A voice assistant is not three APIs stitched together. It is a streaming relay race with timing discipline.

Why the relay race metaphor matters

In a relay race, the next runner starts moving before the baton fully stops. That is how you reduce end-to-end delay. Streaming systems behave the same way. The LLM starts from stable partial text. The TTS starts from early response clauses. No stage waits for perfect completeness if useful partial work exists.

WebSocket basics

A WebSocket is a long-lived, bidirectional connection. Unlike basic request-response HTTP, both sides can send messages whenever ready. That makes it natural for audio chunks, partial transcripts, tokens, and control events.

In voice systems, WebSockets commonly carry events like these.

  • client audio chunk
  • VAD state change
  • partial transcript
  • final transcript
  • model token
  • TTS audio chunk
  • interrupt / cancel signal
  • heartbeat / keepalive

This is the second foundation concept you must own. WebSockets reduce reconnection cost and make streaming practical. They do not remove latency by magic. They simply make low-latency flow possible.
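The event list above maps naturally onto a small message envelope. This sketch covers only the JSON control messages and deliberately uses no WebSocket library; the field names (`session_id`, `seq`, `payload`) are hypothetical, and in practice audio chunks often travel as binary frames instead of JSON.

```python
import json
from dataclasses import dataclass, asdict
from enum import Enum


class EventType(str, Enum):
    AUDIO_CHUNK = "audio_chunk"
    VAD_STATE = "vad_state"
    PARTIAL_TRANSCRIPT = "partial_transcript"
    FINAL_TRANSCRIPT = "final_transcript"
    MODEL_TOKEN = "model_token"
    TTS_AUDIO_CHUNK = "tts_audio_chunk"
    INTERRUPT = "interrupt"
    HEARTBEAT = "heartbeat"


@dataclass
class VoiceEvent:
    type: EventType
    session_id: str
    seq: int          # monotonically increasing, lets the receiver detect gaps
    payload: dict

    def to_json(self) -> str:
        data = asdict(self)
        data["type"] = self.type.value
        return json.dumps(data)

    @classmethod
    def from_json(cls, raw: str) -> "VoiceEvent":
        data = json.loads(raw)
        # EventType(...) raises ValueError on unknown types, which is a cheap schema check.
        return cls(EventType(data["type"]), data["session_id"], data["seq"], data["payload"])


event = VoiceEvent(EventType.INTERRUPT, "session-1", 7, {"reason": "barge_in"})
wire = event.to_json()
print(wire)
print(VoiceEvent.from_json(wire) == event)  # True
```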

A common server architecture

browser / phone client
   <-> WebSocket gateway
   <-> voice orchestrator
   <-> streaming ASR
   <-> LLM
   <-> streaming TTS
   <-> audio output back to client

The orchestrator is the traffic controller. It tracks session state, user turn state, interruptions, and metrics. It should know which transcript is stable, which model call is active, and which audio is currently playing.

Chunked processing

Chunked processing means you move work in small pieces. Audio arrives in frames. Transcripts arrive in partials. Tokens arrive incrementally. Audio replies leave in chunks too.

If you wait for a complete file, paragraph, or waveform, you lose. The product becomes a batch system wearing a headset.

Time to first token and time to first audio

For the brain, an important metric is TTFT. That means Time To First Token. It measures how quickly the LLM begins responding. Short TTFT helps the voice start early.

Prompt length, model size, region choice, and provider load all affect TTFT. This is why prompt discipline matters in voice even more than text. A bloated system prompt may cost you conversational flow.

A practical latency budget

| Stage | Healthy p95 range | What dominates it |
|---|---|---|
| End-of-turn detection | 200-400 ms | Silence threshold and semantic confirmation |
| STT finalization | 100-250 ms | Vendor behavior, endpointing, network |
| LLM TTFT | 200-500 ms | Model size, prompt length, region, load |
| TTS first audio | 100-300 ms | Vendor streaming quality, text chunking |
| Playback / jitter buffer | 50-100 ms | Client smoothing and device behavior |
| Total turn-to-first-audio | 700-1200 ms | Sum of all weak decisions |

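These are not laws of physics. They are good engineering expectations for many production settings. The total row is not a straight sum of the per-stage upper bounds, because stages overlap and a single turn rarely hits the worst case in every stage at once. If your p95 is far above this band, the user will feel it.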
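A budget becomes useful once it is checked automatically. The sketch below takes the upper bound of each healthy range from the table as the per-stage limit; the stage keys and the measured values are made up for illustration.

```python
# Upper bounds of the healthy p95 ranges, in milliseconds.
BUDGET_MS = {
    "end_of_turn": 400,
    "stt_final": 250,
    "llm_ttft": 500,
    "tts_first_audio": 300,
    "playback": 100,
}


def over_budget(measured_p95_ms: dict[str, float],
                budget_ms: dict[str, float] = BUDGET_MS) -> dict[str, float]:
    """Return each stage that exceeds its budget, with the overshoot in milliseconds."""
    return {
        stage: measured_p95_ms[stage] - limit
        for stage, limit in budget_ms.items()
        if measured_p95_ms.get(stage, 0.0) > limit
    }


# Hypothetical measurements from one deployment.
measured = {
    "end_of_turn": 350,
    "stt_final": 220,
    "llm_ttft": 620,
    "tts_first_audio": 240,
    "playback": 80,
}
print(over_budget(measured))  # {'llm_ttft': 120}
```

The output names the stage to fix first. In this made-up case the prompt or model choice is the suspect, not the ear or the voice.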

Interruption handling and barge-in

Barge-in means the user starts speaking while the assistant is still talking. Humans do this constantly. A polished agent must support it.

The pipeline should react in this order.

  1. Detect new user speech through VAD or input energy.
  2. Stop TTS playback immediately, not after the sentence politely ends.
  3. Cancel or deprioritize the in-flight model response.
  4. Resume streaming ASR for the new utterance.
  5. Mark whether the previous assistant turn was partially heard or effectively abandoned.

That fifth step matters for memory and conversation state. If the user interrupted before hearing the key sentence, you may need to restate it later. If the user interrupted after hearing enough, you may continue naturally.
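The five steps can be sketched with `asyncio` task cancellation. This is a toy under loud assumptions: one task stands in for the whole LLM, TTS, and playback chain, `asyncio.sleep` stands in for playing a chunk, and the 0.5 cutoff for "partially heard" is arbitrary.

```python
import asyncio
from typing import Optional


class TurnController:
    """Toy orchestrator: one task represents the in-flight assistant reply."""

    def __init__(self) -> None:
        self.speaking_task: Optional[asyncio.Task] = None
        self.chunks_total = 0
        self.chunks_played = 0
        self.listening = True
        self.last_turn_status = "none"

    async def _play(self, chunks: list[bytes]) -> None:
        for _ in chunks:
            await asyncio.sleep(0.01)  # stands in for playing one audio chunk
            self.chunks_played += 1

    def start_speaking(self, chunks: list[bytes]) -> None:
        self.chunks_total = len(chunks)
        self.chunks_played = 0
        self.listening = False
        self.speaking_task = asyncio.create_task(self._play(chunks))

    async def on_user_speech(self) -> None:
        """Step 1 happened upstream: VAD reported new user speech."""
        task = self.speaking_task
        interrupted = task is not None and not task.done()
        if interrupted:
            task.cancel()              # steps 2 and 3: stop playback and in-flight work
            try:
                await task
            except asyncio.CancelledError:
                pass
        self.listening = True          # step 4: resume streaming ASR
        if interrupted:                # step 5: record how much the user heard
            heard = self.chunks_played / max(self.chunks_total, 1)
            self.last_turn_status = "partially_heard" if heard >= 0.5 else "abandoned"


async def demo() -> str:
    ctrl = TurnController()
    ctrl.start_speaking([b"\x00"] * 100)  # roughly one second of fake audio
    await asyncio.sleep(0.05)             # the user interrupts almost immediately
    await ctrl.on_user_speech()
    return ctrl.last_turn_status


print(asyncio.run(demo()))  # abandoned
```

The status recorded in step 5 is what lets the next turn decide whether to restate the interrupted sentence.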

Turn-taking

Turn-taking is the choreography of a voice conversation. It is not just an ASR problem. It combines VAD, endpointing, LLM readiness, TTS interruption, and product policy.

Bad turn-taking creates two visible failures. The agent cuts the user off too early. Or the agent waits too long and feels sleepy. The best systems balance both risks dynamically.

Fast speakers need shorter silence thresholds. Slow speakers need more patience. Question fragments need semantic continuation checks. Phone calls need more resilience to line noise and echo.

State machine thinking

LISTENING
  -> TRANSCRIBING
  -> ENDPOINT_CANDIDATE
  -> THINKING
  -> SPEAKING
  -> INTERRUPTED
  -> LISTENING

Engineers who model these states explicitly debug faster. Otherwise race conditions hide inside ad hoc callbacks. Voice systems are asynchronous by nature. State clarity is survival, not ceremony.
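Making the states explicit can be as simple as an enum plus a table of allowed transitions. The main chain follows the diagram above; the extra edges (a candidate endpoint falling back to TRANSCRIBING, SPEAKING finishing straight into LISTENING, THINKING being interrupted) are assumptions added for illustration.

```python
from enum import Enum, auto


class State(Enum):
    LISTENING = auto()
    TRANSCRIBING = auto()
    ENDPOINT_CANDIDATE = auto()
    THINKING = auto()
    SPEAKING = auto()
    INTERRUPTED = auto()


ALLOWED = {
    State.LISTENING: {State.TRANSCRIBING},
    State.TRANSCRIBING: {State.ENDPOINT_CANDIDATE},
    # Assumed edge: the user resumed speaking after a short pause.
    State.ENDPOINT_CANDIDATE: {State.THINKING, State.TRANSCRIBING},
    # Assumed edge: barge-in before the first audio is played.
    State.THINKING: {State.SPEAKING, State.INTERRUPTED},
    # Assumed edge: the reply finished without interruption.
    State.SPEAKING: {State.INTERRUPTED, State.LISTENING},
    State.INTERRUPTED: {State.LISTENING},
}


class VoiceSession:
    def __init__(self) -> None:
        self.state = State.LISTENING

    def transition(self, new_state: State) -> None:
        """Move to new_state, or fail loudly if a callback fired out of order."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state


session = VoiceSession()
for step in (State.TRANSCRIBING, State.ENDPOINT_CANDIDATE, State.THINKING,
             State.SPEAKING, State.INTERRUPTED, State.LISTENING):
    session.transition(step)
print(session.state.name)  # LISTENING
```

An illegal transition raising immediately is the point. A race condition then shows up as a named error instead of a confusing audio glitch.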

Timing example: serial versus overlapping

Serial path
end-of-turn 350 + STT 220 + LLM 480 + TTS 240 + playback 80 = 1370 ms

Overlapped path
end-of-turn 280 + STT 140 + LLM TTFT 260 + TTS 180 + playback 60 = 920 ms

Notice the difference. The second system is not magic. It just removes unnecessary waiting.

Debugging the pipeline

When a voice assistant feels slow, debug stage by stage. Do not complain vaguely about “latency.” Name the stage. Measure the stage. Then fix the stage.

  • Is endpointing waiting too long on silence?
  • Is STT finalization lagging because of region mismatch?
  • Is the prompt too large, hurting TTFT?
  • Is TTS waiting for long text chunks before starting?
  • Is the client jitter buffer overly conservative?
  • Is network jitter forcing retries or retransmission?

This is the third foundation concept you must own. Latency budgeting means assigning milliseconds to named stages. A budget is not a guess. It is an operating plan.

Production metrics worth logging

  • user speech end detected timestamp
  • ASR final transcript timestamp
  • LLM first token timestamp
  • LLM final token timestamp
  • TTS first audio timestamp
  • playback start timestamp
  • barge-in count per session
  • false endpoint count per session
  • word error rate sample slices by accent and channel

If you do not measure these, you are arguing by vibes. Voice systems punish vibes.
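Once the timestamps above are logged, per-stage durations fall out by subtraction. This is a minimal sketch; the dictionary keys are hypothetical names for the logged events, and the sample values are invented.

```python
def stage_durations_ms(ts: dict[str, float]) -> dict[str, float]:
    """Turn one turn's event timestamps (ms) into named stage durations."""
    return {
        "stt_final": ts["asr_final"] - ts["speech_end_detected"],
        "llm_ttft": ts["llm_first_token"] - ts["asr_final"],
        "tts_first_audio": ts["tts_first_audio"] - ts["llm_first_token"],
        "playback": ts["playback_start"] - ts["tts_first_audio"],
        "turn_to_first_audio": ts["playback_start"] - ts["speech_end_detected"],
    }


# Invented timestamps for a single turn, relative to detected end of speech.
turn = {
    "speech_end_detected": 0,
    "asr_final": 140,
    "llm_first_token": 420,
    "tts_first_audio": 580,
    "playback_start": 640,
}
for stage, ms in stage_durations_ms(turn).items():
    print(f"{stage:>20}: {ms} ms")
```

Collect these per turn, then report p50, p95, and p99 per stage. That gives each slow turn a named culprit.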

§6. Chapter 5 — End-to-end voice models

So far we discussed the classical cascaded pipeline. Now consider a different idea. What if one model handled listening, reasoning, and speaking together? That is the promise of end-to-end voice models.

Examples include GPT-4o Realtime style APIs and other native speech-to-speech systems. They accept audio directly and can return audio directly. Some also expose text transcripts on the side. But the central abstraction is speech in, speech out.

Why people like them

  • Fewer moving parts in the pipeline
  • Lower coordination overhead between ASR, LLM, and TTS
  • More natural prosody because the system reasons in a speech-native way
  • Better interruption handling in some implementations
  • Simpler developer onboarding for fast prototypes

Why people still hesitate

  • Less observability into where mistakes came from
  • Harder to swap only one layer of the stack
  • Vendor lock-in increases
  • Transcript control may be weaker than a dedicated ASR layer
  • Domain adaptation and compliance constraints may be harder
  • Latency can still be bad if region placement and prompts are poor

In other words, end-to-end does not mean effortless. It reduces one kind of complexity and introduces another kind.

When end-to-end usually wins

It often wins when speed of prototyping matters. It also wins when naturalness matters more than detailed controllability. Consumer assistants, live demos, and lightweight support flows are strong candidates.

When the cascaded pipeline still wins

The classical ASR to LLM to TTS pipeline still wins when you need visibility and modular control. Regulated environments may require explicit transcripts, storage policies, and model isolation. Enterprise teams may want separate vendors for recognition, reasoning, and synthesis. Evaluation teams often prefer modular components because root cause analysis is clearer.

| Decision lens | End-to-end voice model | Cascaded pipeline |
|---|---|---|
| Speed to prototype | Strong | Medium |
| Fine-grained control | Weaker | Strong |
| Observability | Often weaker | Stronger |
| Vendor independence | Weaker | Stronger |
| Natural speech feel | Often strong | Depends on integration quality |
| Compliance customizability | Depends on vendor | Often stronger |

Current limitations also deserve honesty. Accent coverage is still uneven. Low-resource languages remain underserved. Background noise can still derail the conversation. And production debugging is not magically solved.

So do not build a theology here. Use the right abstraction for the job. A senior engineer can defend both choices.

§7. Retrieval prompts

Use these prompts to test whether you actually internalized the module. Do not read the answer first. Answer aloud, then verify against the explainer.

  1. Explain a voice agent to a product manager using the terms the ear, the brain, the voice, the relay race, and the awkward pause.
  2. Your ASR quality is fine, but the agent feels laggy. Walk through a stage-by-stage latency budget and name the first three things you would measure.
  3. Compare streaming ASR with batch ASR for a healthcare triage line. Include word-level timestamps, endpointing, and p95 latency.
  4. When would you choose GPT-4o Realtime or another native speech-to-speech API instead of a cascaded pipeline? Give two reasons for and two against.
  5. Design barge-in handling for a browser voice assistant over WebSockets. Mention client events, server state, and playback cancellation.

§8. Honest admission

Latency is still hard. Even the best teams fight it every week. There is always a tradeoff between quality and speed. A larger model may reason better but miss the conversational moment. A faster model may respond quickly but sound shallow.

Accents and languages are not equally served. Many benchmarks overrepresent clean speech and mainstream accents. Real users do not speak benchmark English into studio microphones. They speak while walking, driving, multitasking, and worrying.

Telephony makes everything worse. Eight-kilohertz audio removes detail. Background noise, packet loss, and echo all become more visible. So a voice agent that feels good in a browser demo may fail on a phone line.

There is also no universal best stack. The right answer depends on channel, regulation, language mix, device power, and product tolerance for errors. That is why this module emphasizes reasoning, not memorization.

§9. Foundation-gap audit

This is the last module in the AI engineering track. So let us audit the foundation gaps explicitly. If any of these feel shaky, patch them now.

Gap 1 — Streaming concepts

You should understand why streaming beats batch for interaction. You should understand partial outputs, backpressure, and cancellation. You should understand that overlapped stages reduce user-visible delay. Mini-check: can you explain why first useful output matters more than total completion time?

Gap 2 — WebSocket basics

You should know that WebSockets are long-lived and bidirectional. You should know why audio chunks fit this model well. You should know that heartbeats, reconnects, and message schemas matter in production. Mini-check: can you name three event types your voice client would send?

Gap 3 — Latency budgeting

You should be able to break one second into named stages. You should know what p50, p95, and p99 mean. You should know that p95 often predicts perceived trust better than averages. Mini-check: can you assign target milliseconds to endpointing, STT, TTFT, TTS, and playback?

Gap 4 — Audio processing basics

You should know sample rate, mono versus stereo, chunk size, and common telephony limits. You should know that resampling too often adds work and can add artifacts. You should know why small chunks help latency but increase coordination cost. Mini-check: at 16 kHz, how many samples arrive in 20 milliseconds?

If you can answer those mini-checks comfortably, your foundation is healthy. If not, revisit this section before pushing deeper into system design interviews.

§10. Chapter 6 — Recap, interview close, and exercises

Now let us convert the story into interview-ready memory. We started with a broken five-second assistant. We end with a design discipline for voice systems.

Failure-to-fix recap

| Failure | Why it happens | Fix |
|---|---|---|
| Sequential ASR, LLM, and TTS calls | Each stage waits for the previous stage to finish | Stream and overlap stages wherever possible |
| Silence-only endpointing cuts users off | Thoughtful speakers pause mid-sentence | Use hybrid endpointing with semantic confirmation |
| Every partial transcript is treated as final | Streaming text is unstable early | Maintain stable prefix and volatile suffix |
| TTS waits for the full paragraph | The system optimizes total completion, not first audio | Start synthesis from early clauses |
| No barge-in support | Playback and listening are treated as separate worlds | Interrupt TTS immediately and resume listening fast |
| One average latency number is reported | Tail latency hides inside the average | Track p50, p95, and stage-level timings |
| Browser demo works, phone call fails | Telephony audio is worse and noisier | Test with 8 kHz, noise, and packet loss conditions |
| Benchmark speech looks good, real users struggle | Accents, jargon, and low-resource languages differ | Evaluate on your true user mix |
| Giant prompts slow the assistant | TTFT expands before speech can begin | Shrink prompts and cache stable prefixes |
| Cloned voice ships without governance | Product teams focus only on wow factor | Require consent, disclosure, and abuse controls |

Interview questions you should answer cold

  1. Design a voice agent for a customer support line. Where does your latency budget go?
  2. Whisper versus Deepgram for realtime usage: when would you choose each?
  3. Explain VAD, endpointing, and barge-in without mixing them up.
  4. When would you pick a native speech-to-speech model over a cascaded pipeline?
  5. Your users say the agent feels slow, but your average latency looks fine. What next?
  6. How would telephony constraints change your design?
  7. What governance checks are mandatory before shipping voice cloning?

Production experience numbers worth remembering

These are example numbers, not universal promises. But they are useful interview anchors.

| Scenario | End-of-turn | STT final | LLM TTFT | TTS first audio | Playback | Total p95 |
|---|---|---|---|---|---|---|
| Browser support bot, good network | 260 ms | 140 ms | 280 ms | 160 ms | 60 ms | 900 ms |
| Mobile assistant on mixed network | 320 ms | 180 ms | 340 ms | 190 ms | 80 ms | 1110 ms |
| PSTN call center bridge | 380 ms | 210 ms | 360 ms | 220 ms | 90 ms | 1260 ms |
| Native speech-to-speech prototype | 240 ms | included | included | included | 70 ms | 650-850 ms |

What matters is the pattern. Browser can be very fast. Mobile is more variable. Telephony is harsher. Native speech-to-speech may reduce latency, but it reduces modular control too.

Exercises

  1. Draw a voice pipeline for browser chat and label the millisecond budget at each stage.
  2. Explain to a junior engineer why VAD and endpointing are different decisions.
  3. Record yourself saying a sentence with a mid-thought pause and design endpointing for it.
  4. Compare a cascaded pipeline and an end-to-end voice model for a healthcare intake bot.
  5. Write the event schema for a WebSocket-based voice session.
  6. Decide how you would log and debug barge-in failures.
  7. Pick one underserved accent or language relevant to your users and design an evaluation slice.
  8. Create a one-minute answer to “why does voice AI feel harder than chat?”
  9. Explain why TTFT matters even when the final answer is short.
  10. Define a completion gate for a voice feature before it reaches production.

Final memory hook

If you remember nothing else, remember this chain. The ear hears. The brain interprets. The voice responds. The relay race overlaps them. The awkward pause decides whether the whole thing feels human.

§11. Bridge forward

This completes the AI engineering curriculum. Return to learning/README.md for the system design track and coding exercises. Carry forward the main lesson from this final module. Useful AI systems are not only smart. They are timed, measured, and shaped for the human loop they live inside.