09. Telephony constraints: phone calls humble shiny demos
~16 min read. Browser demos sound smooth until the phone network joins the meeting.
Built on the ELI5 in 00-eli5.md. The ear — our name for speech recognition — now meets the harsh, lossy reality of phone networks.
First picture: the phone path damages detail before intelligence begins
Look at the path first. A browser mic can send wider and cleaner audio. A phone call often cannot. The channel is narrower, noisier, and more delayed. So the stack starts with a handicap.
caller
  │
  ▼
┌──────────┐   ┌──────────┐   ┌────────────┐   ┌──────────┐
│ handset  │──▶│  codec   │──▶│  PSTN/SIP  │──▶│ AI stack │
└──────────┘   └──────────┘   │   bridge   │   └──────────┘
     ▲              ▲         └────────────┘
     │              │
     └── noise      └── compression, packet loss, jitter, echo
See. By the time audio reaches the ear, some clues are already missing. That is the first senior lesson.
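A common browser demo uses 16kHz or better capture. Many phone paths behave like 8kHz narrowband audio. Sampling at 8kHz can only represent frequencies up to about 4kHz, which is half the bandwidth of 16kHz capture. Simple, no? Much of the energy in sounds like "s" and "f" sits above that limit. So less bandwidth means weaker consonants, blurrier fricatives, and harder speaker separation.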
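To make the bandwidth gap concrete, here is a minimal sketch of the Nyquist limit. The function names are illustrative and not from any library. The aliasing helper assumes a naive capture with no anti-alias filter. Real telephony equipment filters first, so the high-frequency energy is simply removed rather than folded down.

```python
def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency a given sample rate can represent."""
    return sample_rate_hz / 2


def alias_frequency_hz(tone_hz: float, sample_rate_hz: int) -> float:
    """Where a pure tone lands if it is sampled without an anti-alias filter."""
    folded = tone_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)


if __name__ == "__main__":
    for rate in (16_000, 8_000):
        print(f"{rate} Hz sampling keeps content up to {nyquist_hz(rate):.0f} Hz")
    # A 6 kHz component fits under 16 kHz sampling.
    # At 8 kHz sampling it cannot be represented at its true frequency.
    for rate in (16_000, 8_000):
        print(f"6000 Hz tone at {rate} Hz sampling appears at "
              f"{alias_frequency_hz(6_000, rate):.0f} Hz")
```

Either way, the ear never receives the original 6kHz evidence on a narrowband leg.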
Now picture a caller saying, "I need to change my booking tomorrow morning." On a laptop mic, that sentence can sound open and crisp. On a phone line, parts of it flatten. The model stack hears less shape. That missing shape matters.
Accented speech increases the difficulty. Accent shifts vowel shape, stress, and timing. Phone noise removes more evidence exactly where ASR needs confidence. So accented speech plus phone noise becomes the worst regular production case. That is why cheerful browser demos collapse on real calls.
And this is not only an ASR problem. If the brain receives damaged text, reasoning starts from a weaker transcript. The LLM may stay smart, but it is now smart over shaky input. That is a very different task.
Codec compression changes what survives the trip
The phone line is not one neutral pipe. It is a chain of codec decisions. Those decisions trade bandwidth, cost, compatibility, and quality. Look at the product question first. Which quality survives real call volume? Only then discuss bitrates.
G.711 is common because it is simple and interoperable. It keeps telephony systems talking to each other. But it is still narrowband. Opus can sound much better on modern IP links. Yet the full path may downsample, transcode, or bridge back into older constraints. So what to do? Never assume the best codec survives end to end.
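A small sketch can show what an 8-bit narrowband codec does to individual samples. It uses the textbook continuous mu-law curve with mu = 255. That is an assumption for illustration: real G.711 uses a piecewise-linear approximation of this curve, so the code below is not bit-exact G.711. The helper names are hypothetical.

```python
import math

MU = 255  # companding constant of the mu-law curve


def mulaw_compress(x: float) -> float:
    """Map a sample in [-1, 1] onto the mu-law curve (result is also in [-1, 1])."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y: float) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(((1 + MU) ** abs(y) - 1) / MU, y)


def through_8bit_mulaw(x: float) -> float:
    """Compress, round to one of 256 codes, then expand again."""
    top_code = 255  # 8 bits give codes 0..255
    code = round((mulaw_compress(x) + 1) / 2 * top_code)
    return mulaw_expand(code / top_code * 2 - 1)


if __name__ == "__main__":
    # G.711 arithmetic: 8000 samples per second * 8 bits per sample.
    print("bitrate:", 8000 * 8, "bit/s per direction")
    for sample in (0.001, 0.01, 0.1, 0.9):
        restored = through_8bit_mulaw(sample)
        print(f"{sample:>6} -> {restored:.5f} (error {abs(sample - restored):.5f})")
```

The curve spends its codes on quiet samples, so loud samples carry larger absolute error. Each extra transcode on the path repeats this kind of rounding.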
browser mic 16kHz
     │
     ▼
┌──────────┐   transcode   ┌──────────┐   bridge  ┌──────────┐
│   Opus   │──────────────▶│ SIP leg  │──────────▶│  G.711   │
└──────────┘               └──────────┘           └──────────┘
     │                                                 │
     └──────── maybe clean here                        └── harsh here
Look. Your app may sound excellent on the first leg, then degrade at the bridge. That is why the relay race is a useful picture. One weak runner can lose the whole exchange.
Packet loss adds another problem. A missing packet can clip syllables. Jitter adds uneven arrival times. So systems add jitter buffers to smooth playback. Helpful, yes? But every extra buffer spends more of the awkward pause budget. You reduce choppiness, yet increase delay. That tradeoff appears everywhere in voice engineering.
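The buffer tradeoff can be sketched with a toy fixed jitter buffer. Everything here is an assumption for illustration: the 20 ms packet interval, the delay values, and the rule that playout starts a fixed time after the first packet arrives. Production buffers are usually adaptive.

```python
def late_packets(delays_ms: list[float], buffer_ms: float, packet_ms: float = 20.0) -> int:
    """Count packets that miss their playout slot for a fixed jitter buffer.

    Packet i is sent at i * packet_ms and arrives delays_ms[i] later.
    Playout starts buffer_ms after the first packet arrives.
    """
    if not delays_ms:
        return 0
    playout_start = delays_ms[0] + buffer_ms
    late = 0
    for i, delay in enumerate(delays_ms):
        arrival = i * packet_ms + delay
        deadline = playout_start + i * packet_ms
        if arrival > deadline:
            late += 1
    return late


if __name__ == "__main__":
    # Hypothetical one-way network delays for ten packets, in milliseconds.
    delays = [40, 42, 95, 41, 60, 44, 120, 43, 47, 41]
    for buffer_ms in (0, 20, 60, 100):
        missed = late_packets(delays, buffer_ms)
        print(f"buffer {buffer_ms:>3} ms: {missed} late packets, "
              f"{buffer_ms} ms added to every turn")
```

A deeper buffer rescues more packets, but the whole conversation pays for it on every turn.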
Echo matters too. If speaker output leaks back into the microphone, playback contaminates new input. Now endpointing gets confused. Barge-in gets messy. Users start speaking over their own delayed playback. Browser demos often hide this because headsets and quiet rooms behave better. Real phone calls do not behave so politely.
Also remember the return path. If the voice sounds warm inside your browser, it may sound flat, nasal, or clipped after telephony compression. So evaluation must include the reply path, not only the transcript path.
Endpointing and turn-taking get harder on phone audio
Now imagine the caller pauses for half a beat. Is the person finished? Or is the packet stream just jittery? That question becomes much harder on narrowband audio. The silence detector sees less detail. The acoustic cues arrive less cleanly. And double-talk makes everything worse.
caller speech ── short gap ── more speech
      │              │             │
      ▼              ▼             ▼
 maybe real    maybe jitter  maybe overlap
 ending        buffer gap    with playback
See. A false endpoint steals the turn too early. A missed endpoint leaves users waiting too long. Both outcomes feel broken, just in different ways. That is why telephony destroys naive turn-taking settings.
If the ear fires a final transcript too late, users feel lag. If it fires too early, words get chopped. Then the brain answers the wrong thing with full confidence. That is one of the most expensive failure patterns in production.
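Both failure modes show up in even the simplest endpointer. The sketch below assumes a silence-timeout rule over per-frame voice activity flags. The function name, the 100 ms frame size, and the example call are hypothetical. Real endpointers combine more signals than a single timer.

```python
from typing import Optional


def endpoint_time_ms(speech_flags: list[bool], frame_ms: int,
                     silence_timeout_ms: int) -> Optional[int]:
    """Return when a silence-timeout endpointer declares the turn finished.

    speech_flags holds one voice-activity decision per frame. The endpoint
    fires once speech has been heard and silence then lasts for
    silence_timeout_ms. Returns None if that never happens in the input.
    """
    heard_speech = False
    silent_ms = 0
    for index, is_speech in enumerate(speech_flags):
        if is_speech:
            heard_speech = True
            silent_ms = 0
        elif heard_speech:
            silent_ms += frame_ms
            if silent_ms >= silence_timeout_ms:
                return (index + 1) * frame_ms
    return None


if __name__ == "__main__":
    # Hypothetical call: 1.0 s speech, 0.4 s gap, 0.8 s speech, 1.5 s silence.
    flags = [True] * 10 + [False] * 4 + [True] * 8 + [False] * 15
    for timeout in (300, 800):
        print(f"timeout {timeout} ms -> endpoint at "
              f"{endpoint_time_ms(flags, 100, timeout)} ms")
```

With the short timeout, the endpoint fires inside the 0.4 s gap and the second half of the sentence is lost. With the long timeout, the caller waits in silence after the real ending. Jitter makes the gap lengths themselves unreliable, which is why phone audio needs its own tuning.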
And when callers interrupt, playback and capture overlap. Your echo canceller, VAD, and barge-in logic all need harsher tuning. So the phone path is not only lower fidelity. It is a more confusing conversational surface.
PSTN and SIP integration patterns decide latency and control
Now picture the AI system as a live interpreter booth. The caller speaks. The network carries that speech. The platform routes it through SIP or PSTN infrastructure. Then your models join the conversation. Each handoff changes control and delay.
A common pattern is carrier to SIP trunk to media gateway to realtime service. Another is carrier to hosted bot platform to your business logic. The first can give tighter control over the relay race. The second can speed up adoption, but may hide buffering and bridge delays.
caller
  │
  ▼
┌──────────┐   ┌──────────┐   ┌────────────┐   ┌────────────┐
│   PSTN   │──▶│ carrier  │──▶│ SIP bridge │──▶│ realtime AI│
└──────────┘   └──────────┘   └────────────┘   └────────────┘
                                                     │
                                                     ├── ASR
                                                     ├── LLM
                                                     └── TTS
Look. PSTN bridges add latency even before model work starts. Signaling hops take time. Media gateways buffer. Transcoding takes time. Recording systems and compliance layers may add more. So when someone says, "Our browser demo answered in 700 milliseconds," the senior reply is, "Yes, but what about the phone path?"
SIP-native calls are often friendlier than PSTN calls. You may keep wideband audio longer. You may preserve metadata better. You may control routing more directly. Still, carrier behavior varies. Enterprise telephony setups insert surprises. Never trust one clean office network as proof.
And remember this. The awkward pause accumulates across capture, transport, buffering, ASR, LLM, TTS, and playback. Latency is a stack, not one number.
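A per-stage budget makes the stack visible. The sketch below is only a bookkeeping helper. Every number in it is a hypothetical placeholder, not a measurement. The model stages are deliberately identical in both paths so the difference comes only from the network side.

```python
STAGES = ("capture", "transport", "buffering", "asr", "llm", "tts", "playback")

# Hypothetical per-stage latencies in milliseconds. Replace with measurements.
browser_ms = {"capture": 20, "transport": 30, "buffering": 20,
              "asr": 170, "llm": 320, "tts": 120, "playback": 20}
phone_ms = {"capture": 20, "transport": 120, "buffering": 80,
            "asr": 170, "llm": 320, "tts": 120, "playback": 100}


def total_latency_ms(stage_ms: dict[str, int]) -> int:
    """Sum all stages. Raise if one is missing so gaps stay visible."""
    missing = [stage for stage in STAGES if stage not in stage_ms]
    if missing:
        raise KeyError(f"no measurement for: {missing}")
    return sum(stage_ms[stage] for stage in STAGES)


if __name__ == "__main__":
    for name, budget in (("browser", browser_ms), ("phone", phone_ms)):
        print(f"{name}: {total_latency_ms(budget)} ms")
    # Show which stages differ between the two paths.
    for stage in STAGES:
        extra = phone_ms[stage] - browser_ms[stage]
        if extra:
            print(f"  {stage}: +{extra} ms on the phone path")
```

Keeping one budget per path stops a browser number from standing in for a phone number.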
Also remember this. The voice must travel back through the same constrained channel. If synthesis sounds polished in the browser, telephony may crush the prosody. That is why playback evaluation belongs in every telephony review.
Testing must recreate telephony pain, not avoid it
This is where many teams lose honesty. They test on office Wi-Fi, clean USB headsets, and one friendly accent. Then they deploy to call centers. Now the real world attacks.
Phone testing should include:
- 8kHz or narrowband captures
- G.711 and Opus legs
- packet loss scenarios
- jitter and buffer variation
- echo on speakerphone devices
- accented speech with background noise
- barge-in and double-talk cases
- real carrier or sandbox PSTN paths
Simple, no? If you skip these, you are not testing voice AI. You are testing a lab toy.
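One way to keep that list honest is to turn it into an explicit test matrix. This is a minimal sketch. The condition names and values are hypothetical, so pick the ones your carriers and callers actually produce. Accent coverage is left to the recorded utterance set that runs under each condition.

```python
from itertools import product

# Hypothetical condition values for illustration only.
CONDITIONS = {
    "codec": ["g711", "opus"],
    "packet_loss_pct": [0, 1, 5],
    "jitter_ms": [0, 30, 80],
    "device": ["handset", "speakerphone"],
    "background": ["quiet", "noisy"],
    "barge_in": [False, True],
}


def build_test_matrix(conditions: dict[str, list]) -> list[dict]:
    """Return one dict per combination of telephony conditions."""
    names = list(conditions)
    return [dict(zip(names, combo)) for combo in product(*conditions.values())]


if __name__ == "__main__":
    matrix = build_test_matrix(CONDITIONS)
    print(len(matrix), "test cases")
    print("first:", matrix[0])
    print("last: ", matrix[-1])
```

A full cross product grows quickly, so teams often sample from it. The point is that skipped conditions become a visible decision, not an accident.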
A strong workflow is staged. Start in the browser for developer speed. Move to SIP test numbers next. Then run through carrier and PSTN bridges. Finally, test with noisy environments, different accents, and realistic hold-music leakage. Compare results side by side. Do not argue from memory.
Call-center economics drive adoption because automation can save serious operating cost. But those same economics punish bad quality fast. Every failed minute costs money, agent time, and customer trust. So the business case pulls voice AI forward. The phone network pushes quality backward. That tension is the whole game.
So what to do? Treat telephony as its own product surface. Make a separate latency budget for phone calls. Measure ASR error by accent slice. Review recordings with headphones. Compare browser, mobile, and PSTN paths side by side. And keep repeating the core lesson. The relay race only works when every stage survives real network stress. The brain cannot rescue everything lost upstream. The voice cannot impress anyone if the channel crushes the reply.
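Measuring ASR error by accent slice can be as simple as grouping word error rate by a label. The sketch below uses the standard word-level edit distance. The slice labels, the sample transcripts, and the function names are hypothetical.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping one row at a time.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1], len(ref)


def wer_by_slice(samples) -> dict[str, float]:
    """samples: iterable of (slice_label, reference, hypothesis) tuples."""
    errors, words = defaultdict(int), defaultdict(int)
    for label, reference, hypothesis in samples:
        edit_count, word_count = word_errors(reference, hypothesis)
        errors[label] += edit_count
        words[label] += word_count
    return {label: errors[label] / words[label] for label in words if words[label]}


if __name__ == "__main__":
    # Hypothetical transcripts for two slices of the same sentence.
    reference = "i need to change my booking tomorrow morning"
    samples = [
        ("accent_a_browser", reference, "i need to change my booking tomorrow morning"),
        ("accent_a_pstn", reference, "i need to chain my booking tomorrow"),
    ]
    for label, wer in wer_by_slice(samples).items():
        print(f"{label}: WER {wer:.2f}")
```

An average over all callers can look healthy while one slice fails badly, so report the slices side by side.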
Where this lives in the wild
- Genesys Cloud voice bot deployment — solutions engineer: narrowband call audio decides whether containment targets are realistic.
- Twilio Flex assistant rollout — voice AI engineer: SIP trunks and PSTN bridges change measured latency versus browser prototypes.
- Airline IVR modernization — telephony architect: accented callers on noisy lines expose ASR weaknesses immediately.
- Bank collections bot program — platform engineer: call-center cost pressure pushes automation while compliance layers add delay.
- Healthcare appointment hotline — conversation designer: phone playback quality shapes prompt pacing and retry wording.
Pause and recall
- Why does 8kHz telephony remove useful clues for speech recognition?
- Why can a great Opus demo still sound mediocre in a real phone call?
- Where does latency grow before model inference even starts?
- Why must telephony evaluation include accented speech under noisy conditions?
Interview Q&A
Q: Why is browser voice performance a bad proxy for phone-call performance?
A: Browser tests often use cleaner audio, wider sampling, lower echo, and fewer bridge hops than real telephony paths.
Common wrong answer to avoid: "Because browsers have better models than phones."
Q: Why do PSTN bridges often hurt realtime latency?
A: They add signaling, transcoding, buffering, and carrier handoffs before and after model stages.
Common wrong answer to avoid: "Only the LLM matters for latency, so bridges are minor."
Q: Why are accented callers especially vulnerable on telephony channels?
A: Narrowband audio and noise remove phonetic detail that ASR already needs for robust accent coverage.
Common wrong answer to avoid: "A strong ASR model makes accent problems disappear automatically."
Q: How should a team test a voice assistant meant for call centers?
A: Test through real SIP and PSTN paths with packet loss, jitter, echo, noisy backgrounds, and diverse accents.
Common wrong answer to avoid: "If it works on laptop microphones in the office, deployment risk is low."
Apply now (5 min)
Exercise. Take one browser voice demo you know. List three ways a phone call would make it worse. Then mark which problem hits ASR, which hits latency, and which hits playback quality.
Sketch from memory. Draw the call path from caller to carrier to bridge to AI stack. Mark one place where audio quality drops. Mark one place where the model still cannot recover lost information.
Bridge. Telephony exposes failures very quickly. Next, learn how to measure each stage and debug those failures systematically. → 10-evaluation-debugging.md