02. Audio Processing Basics — Learn the raw signal first

~15 min read. Before fixing latency, understand what the pipeline is actually moving.

Builds on the ELI5 in 00-eli5.md. The ear, the part that hears sound and turns it into words, only works well when the raw audio is shaped correctly.


1) Audio is just pressure turned into numbers

Look. Your microphone hears changing air pressure. The computer cannot store air pressure directly. So it stores snapshots of that wave as numbers. Each snapshot is called a sample. A sample rate tells us how many snapshots arrive each second. At 16kHz, we take 16,000 samples every second. That is a common choice for speech systems. It is usually enough for clear voice recognition. It is also light enough for realtime work. Simple, no?

The first helpful picture is this.

sound in room
┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│ microphone   │─→│ PCM samples  │─→│ the ear      │
│ air pressure │   │ numbers over │   │ speech model │
└──────────────┘   │ time         │   └──────────────┘
                   └──────────────┘

Picture first. Math second. If one second contains 16,000 samples, then 20 milliseconds contains 320 samples. That little fact appears everywhere in voice engineering. Why 320? Because 16,000 multiplied by 0.02 equals 320. So a 20ms chunk at 16kHz holds 320 samples. When you hear engineers discuss frame size, this is often what they mean. The relay race depends on these small chunks moving steadily. If the chunks are too large, the ear starts late. If the chunks are too tiny, overhead starts eating your gains. So what to do? Understand the units first. Then tune them against the awkward pause.
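The arithmetic above fits in one small function. This is a minimal sketch; the name samples_per_chunk is only an illustration, not part of any library.

```python
def samples_per_chunk(sample_rate_hz: int, chunk_ms: int) -> int:
    """Return how many samples fit in one chunk of the given duration."""
    # Integer math (rate * ms / 1000) avoids float rounding surprises.
    return sample_rate_hz * chunk_ms // 1000


# The 20ms case from the text, plus the same chunk on an 8kHz phone line.
print(samples_per_chunk(16_000, 20))
print(samples_per_chunk(8_000, 20))
```

Keep the units in the parameter names. Most chunk-size bugs are really unit bugs.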


2) The format choices hiding under every API call

Speech systems often prefer mono audio. Mono means one channel. Stereo means two channels, left and right. For music, stereo matters a lot. For voice assistants, mono is usually enough. Why carry two channels if the ear only needs one speaker stream? Mono reduces bandwidth and processing cost immediately. That is why phone audio and many ASR pipelines stay mono.
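If the source hands you stereo anyway, one common way to get mono is to average left and right. The sketch below assumes interleaved 16-bit little-endian PCM (L, R, L, R, ...). The function name is hypothetical. Averaging is only one option, since some pipelines simply keep one channel.

```python
import struct


def stereo_to_mono_pcm16le(stereo_bytes: bytes) -> bytes:
    """Downmix interleaved 16-bit little-endian stereo PCM to mono."""
    count = len(stereo_bytes) // 2          # number of 16-bit values
    count -= count % 2                      # drop a dangling half-frame, if any
    values = struct.unpack("<%dh" % count, stereo_bytes[: count * 2])
    # Average each left/right pair; the result always fits in 16 bits.
    mono = [(values[i] + values[i + 1]) // 2 for i in range(0, count, 2)]
    return struct.pack("<%dh" % len(mono), *mono)


stereo = struct.pack("<4h", 100, 300, -100, 100)   # two stereo frames
mono = stereo_to_mono_pcm16le(stereo)
print(len(stereo), "bytes in,", len(mono), "bytes out")
```

The byte count halves. That is the bandwidth saving the paragraph above describes.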

Now look at PCM. PCM means pulse-code modulation. That sounds fancy. The practical meaning is simple. We store raw sample values directly, with no compression applied to them. A common setup is 16-bit PCM. That means each sample uses 16 bits, or 2 bytes. So raw 16kHz mono audio costs 16,000 × 2 = 32,000 bytes, about 32KB, per second. That is not free. But it is still manageable for many realtime paths. If you switch to stereo, the cost doubles. If you raise the sample rate, the cost rises in proportion. The voice may sound richer, but the relay race gets heavier.
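The cost rule is plain multiplication: sample rate times channels times bytes per sample. A small sketch, with a hypothetical function name:

```python
def pcm_bytes_per_second(sample_rate_hz: int, channels: int, bits_per_sample: int) -> int:
    """Raw (uncompressed) PCM data rate in bytes per second."""
    return sample_rate_hz * channels * (bits_per_sample // 8)


# Compare the speech setup from the text with heavier choices.
for rate, channels in [(16_000, 1), (16_000, 2), (48_000, 2)]:
    print(rate, "Hz,", channels, "ch:", pcm_bytes_per_second(rate, channels, 16), "bytes/s")
```

Run it once for your own format before you size any buffer or estimate any network bill.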

See this ladder.

┌────────────────────────────┐
│ sample rate: 8000 or 16000 │
├────────────────────────────┤
│ channels: mono or stereo   │
├────────────────────────────┤
│ sample width: 8 or 16 bit  │
├────────────────────────────┤
│ container: raw PCM / WAV   │
└────────────────────────────┘

Every rung changes cost, quality, or compatibility. WAV often wraps PCM with headers. Raw PCM skips headers and stays simple for streams. Some APIs want little-endian PCM bytes specifically. Some browsers give you float samples first. Some telephony gateways hand you compressed μ-law instead. So engineers must know the format contract, not only the model name. The ear cannot fix garbage framing later. The brain cannot reason over broken packets. The awkward pause can start from one wrong format choice.
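Two of those format contracts can be shown with the Python standard library alone. The sketch below converts float samples in the range -1.0 to 1.0 into little-endian signed 16-bit PCM, then wraps the bytes in a WAV container. The function names are hypothetical, and scaling by 32767 is one common convention, not the only one.

```python
import io
import struct
import wave


def floats_to_pcm16le(samples) -> bytes:
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM."""
    ints = []
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clamp so bad input cannot overflow
        ints.append(int(round(s * 32767)))
    return struct.pack("<%dh" % len(ints), *ints)


def wrap_as_wav(pcm_bytes: bytes, sample_rate_hz: int = 16_000, channels: int = 1) -> bytes:
    """Put a WAV header in front of raw 16-bit PCM bytes."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as writer:
        writer.setnchannels(channels)
        writer.setsampwidth(2)              # 2 bytes = 16-bit samples
        writer.setframerate(sample_rate_hz)
        writer.writeframes(pcm_bytes)
    return buf.getvalue()


pcm = floats_to_pcm16le([0.0, 1.0, -1.0, 2.0])   # 2.0 is out of range and gets clamped
wav = wrap_as_wav(pcm)
print(len(pcm), "raw bytes,", len(wav), "bytes with WAV header")
```

The payload is identical in both. WAV only adds the header that tells a reader the rate, width, and channel count. A raw stream must agree on those out of band.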


3) Chunk size, telephony limits, and resampling pain

Chunk size is one of the most practical knobs. A 20ms chunk at 16kHz contains 320 samples. A 40ms chunk contains 640 samples. A 100ms chunk contains 1,600 samples. Small chunks help latency because the next stage sees data sooner. That helps the ear emit partial text sooner. That helps the relay race begin sooner. So smaller looks better at first glance.

But small chunks are not a free lunch. More chunks mean more packets. More packets mean more framing overhead. More callbacks mean more CPU scheduling work. More network events mean more chances for jitter. So there is a sweet spot. Many systems live around 20ms or 40ms chunks. That keeps the awkward pause low without drowning in overhead.
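You can put rough numbers on that sweet spot. The sketch below assumes 16kHz mono 16-bit audio and a fixed 40-byte header per packet. The 40 bytes is an illustrative placeholder, not a measurement of any real transport. The function name is hypothetical.

```python
def chunk_stats(chunk_ms: int, sample_rate_hz: int = 16_000,
                bytes_per_sample: int = 2, header_bytes: int = 40):
    """Return (chunks per second, payload bytes per chunk, header share of each packet)."""
    chunks_per_second = 1000 / chunk_ms
    payload_bytes = sample_rate_hz * chunk_ms // 1000 * bytes_per_sample
    overhead_share = header_bytes / (header_bytes + payload_bytes)
    return chunks_per_second, payload_bytes, overhead_share


for ms in (5, 20, 40, 100):
    per_second, payload, share = chunk_stats(ms)
    print(f"{ms:>3} ms: {per_second:>5.0f} chunks/s, {payload:>4} payload bytes, "
          f"{share:.1%} of each packet is header")
```

Smaller chunks mean more events per second and a larger header share. Larger chunks mean the ear waits longer for its first data. Plug in your real header size before trusting the percentages.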

Now meet telephony reality. Classic phone audio often arrives at 8kHz. That means only 8,000 samples each second. A sample rate can only capture frequencies up to half its value, so 8kHz audio tops out near 4kHz. The signal is narrower. Many higher frequencies disappear. Human conversation still works. But ASR quality often drops on accents, soft consonants, and noisy lines. Yes? Phone lines are harsh teachers. They expose model weakness fast. If your product must work on calls, test on 8kHz early. Do not polish only studio audio.

Resampling is the bridge between incompatible worlds. Suppose your upstream audio arrives at 48kHz. Your ASR model expects 16kHz. Now you must downsample cleanly. Suppose your phone provider sends 8kHz μ-law. Your model wants 16kHz PCM. Now you decode and upsample. Upsampling does not create missing detail magically. It only makes the format acceptable to the model. That subtle point matters. Bad resampling adds artifacts, delay, or both. Good resampling is boring, invisible, and worth caring about.
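Here is the phone path in miniature: decode μ-law bytes, then upsample 8kHz to 16kHz. A crude 48kHz to 16kHz downsampler is included for the other direction. This is a teaching sketch with hypothetical function names. The decoder follows the standard G.711 μ-law expansion. The resamplers are deliberately naive (linear interpolation and block averaging), while production systems normally use a properly filtered resampler.

```python
def ulaw_decode(byte: int) -> int:
    """Expand one G.711 mu-law byte to a signed linear sample."""
    u = ~byte & 0xFF                         # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def upsample_2x_linear(samples):
    """8kHz -> 16kHz: insert the midpoint between neighbours. Adds no new detail."""
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)
    return out


def downsample_3x_average(samples):
    """48kHz -> 16kHz: average each group of 3 (a crude low-pass), keep one value."""
    usable = len(samples) - len(samples) % 3
    return [sum(samples[i:i + 3]) // 3 for i in range(0, usable, 3)]


phone_bytes = bytes([0xFF, 0x80, 0x00])      # silence, loudest positive, loudest negative
linear_8k = [ulaw_decode(b) for b in phone_bytes]
linear_16k = upsample_2x_linear(linear_8k)
print(linear_8k)
print(len(linear_8k), "samples in,", len(linear_16k), "samples out")
```

Notice what the upsampler does. It doubles the sample count, but every new value is just a blend of its neighbours. The model gets the format it wants, not the frequencies the phone line threw away.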


4) Spectrograms, bandwidth, and debugging intuition

Engineers should know one visual tool well. That tool is the spectrogram. A waveform shows amplitude over time. A spectrogram shows energy across time and frequency. Picture it like a heat map for sound. Bright bands mean stronger energy. Dark gaps mean silence or missing detail. You do not need heavy math here. You need the picture.

frequency ▲
          │  ████      ███
          │ ██████    ████
          │ ███████  █████
          │   ██        ██
          └──────────────────▶ time

Look at a noisy file on a spectrogram. You may see constant broadband fuzz. Look at clipped audio. You may see vertical streaks of extra harmonics at the moments where the waveform peaks were flattened. Look at telephony audio. You may see less high-frequency content. That picture helps explain why the ear misses certain words. It also helps explain why the voice sounds thin later. Picture before math again.
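To see where that picture comes from, here is a tiny spectrogram built with only the standard library. It is a slow, naive sketch for intuition: it cuts the signal into overlapping frames, applies a Hann window, and takes a direct DFT of each frame. All names, the frame length, and the hop size are illustrative assumptions. Real tools use an FFT library.

```python
import cmath
import math


def frame_spectrum(frame):
    """Magnitude of each frequency bin (0 .. N/2) via a direct DFT."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags


def spectrogram(samples, frame_len=64, hop=32):
    """List of per-frame spectra: rows are time, columns are frequency."""
    rows = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Hann window softens frame edges so energy leaks less between bins.
        windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (frame_len - 1)))
                    for i, x in enumerate(frame)]
        rows.append(frame_spectrum(windowed))
    return rows


def peak_frequency_hz(mags, sample_rate_hz, frame_len):
    """Frequency of the strongest bin in one frame."""
    peak_bin = max(range(len(mags)), key=mags.__getitem__)
    return peak_bin * sample_rate_hz / frame_len


# A pure 2kHz tone sampled at 16kHz should light up one horizontal band.
rate = 16_000
tone = [math.sin(2 * math.pi * 2_000 * t / rate) for t in range(256)]
rows = spectrogram(tone)
print(len(rows), "frames,", len(rows[0]), "frequency bins each")
print("strongest band:", peak_frequency_hz(rows[0], rate, 64), "Hz")
```

Each row is one column of the heat map. With 8kHz phone audio the top bin sits near 4kHz, so anything above that simply has no row to land in.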

Bandwidth and format decisions live here too. Compressed transport saves network cost. Raw PCM saves decoding complexity. Browser capture may start as float arrays. Server inference may want signed 16-bit integers. Storage may prefer FLAC for archival quality. Live streaming may prefer small binary frames. None of these choices is abstract. They change cost, latency, and failure modes. The ear cares. The relay race cares. The awkward pause definitely cares. So what to do? Know the source format, know the target format, and count conversion steps honestly. Every conversion is a place where delay or distortion can enter.


Where this lives in the wild

  • Contact-center voice bot — backend engineer: normalize many audio formats before sending one clean stream into ASR.
  • Language tutor app — mobile engineer: choose chunk size that feels fast without burning battery badly.
  • Telehealth triage line — applied scientist: measure how 8kHz phone audio hurts symptom recognition.
  • Meeting assistant — platform engineer: mix or split stereo feeds depending on diarization needs.
  • In-store kiosk assistant — edge engineer: resample noisy microphone input consistently on cheap hardware.

Pause and recall

  1. What does 16kHz actually mean in simple terms?
  2. Why does a 20ms chunk at 16kHz contain 320 samples?
  3. Why do smaller chunks help latency but also create overhead?
  4. Why does upsampling 8kHz audio not magically restore lost detail?

Interview Q&A

Q: Why is mono audio usually enough for voice assistants?
A: Because speech understanding often needs one clear channel, while mono cuts bandwidth and compute compared with stereo.
Common wrong answer to avoid: Stereo is always better, so use it everywhere.

Q: What practical trade-off does chunk size control?
A: It balances responsiveness against overhead. Small chunks help the ear and the relay race start sooner, but they create more events and framing cost.
Common wrong answer to avoid: Smaller chunks are always strictly better.

Q: Why do telephony systems often feel harsher than browser demos?
A: Phone audio is often 8kHz and compressed, so detail is missing and noise handling becomes harder.
Common wrong answer to avoid: The model became worse only because production traffic is larger.

Q: What is the engineer-friendly use of a spectrogram?
A: It gives a picture of where energy sits over time and frequency, which helps diagnose noise, clipping, and narrow-band audio.
Common wrong answer to avoid: Spectrograms are only for researchers doing advanced DSP.


Apply now (5 min)

Exercise: Take one second of 16kHz mono 16-bit PCM. Compute the sample count and the raw byte count. Then repeat for 20ms. Write both answers without looking back.

Sketch from memory: Draw the path from microphone to PCM chunks to the ear. Label sample rate, chunk size, mono, and the awkward pause. If your sketch misses units, fix that first.


Bridge. Now that the raw signal is clear, learn how the ear turns it into live text. → 03-streaming-asr.md