05. Per-tensor vs per-channel: one ruler for all, or one ruler per row

~12 min read. The thing that decides whether quiet channels keep their detail.

Built on the ELI5 in 00-eli5.md. The field notes (the compact quantized weights) can use one ruler for the whole page, or a local ruler for each channel when fine detail would otherwise disappear.

1) Start with the ruler picture

See.

Per-tensor quantization uses one scale for the whole tensor. Per-channel quantization uses a separate scale for each row or channel. That is the whole contrast. If all channels live at similar magnitudes, one ruler may be fine. If one channel is loud and another is quiet, one ruler becomes unfair. The loud channel chooses the scale. The quiet channel pays the price. Think of shared field notes. One ruler for the whole notebook erases delicate sketches. One ruler per chapter protects them. Here is the picture.

whole tensor  ── one scale ──→ simple, cheap, rough
one channel   ── own scale ─→ more metadata, much better local fit
groups        ── few scales ─→ middle ground

Simple, no?

2) Concrete example with two very different rows

Take this 2 × 4 weight matrix. Row A is small. Row B is large.

Row A = [ 0.05, -0.08,  0.12, -0.15 ]
Row B = [ 2.40, -3.10,  1.80, -2.70 ]

Use symmetric int4 with bucket range [-7, 7]. First try per-tensor. The largest magnitude in the whole matrix is 3.10. So the shared scale is 3.10 / 7 = 0.4429. Now quantize Row A with that shared scale.

 0.05 / 0.4429 =  0.11 → 0
-0.08 / 0.4429 = -0.18 → 0
 0.12 / 0.4429 =  0.27 → 0
-0.15 / 0.4429 = -0.34 → 0

Ouch.

The entire small row collapses to zero. Row B survives much better because it set the ruler. That is the failure mode. The quiet row did nothing wrong. It simply lost the scale fight.
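The collapse is easy to reproduce. The sketch below is a minimal pure-Python version of the per-tensor step, assuming round-to-nearest and clamping to [-7, 7]; the helper name `quantize_symmetric` is illustrative and not taken from any library.

```python
def quantize_symmetric(values, scale, qmax=7):
    """Round each value to the nearest integer bucket and clamp to [-qmax, qmax]."""
    return [max(-qmax, min(qmax, round(v / scale))) for v in values]

row_a = [0.05, -0.08, 0.12, -0.15]
row_b = [2.40, -3.10, 1.80, -2.70]

# Per-tensor: one scale taken from the largest magnitude across both rows.
shared_scale = max(abs(v) for v in row_a + row_b) / 7

q_a = quantize_symmetric(row_a, shared_scale)
q_b = quantize_symmetric(row_b, shared_scale)

print(round(shared_scale, 4))  # 0.4429
print(q_a)                     # [0, 0, 0, 0]
print(q_b)                     # [5, -7, 4, -6]
```

Row B uses five distinct buckets out of fifteen. Row A uses exactly one.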

3) Now give each row its own scale

Row A max magnitude is 0.15. So Row A scale is 0.15 / 7 = 0.0214. Row B max magnitude is 3.10. So Row B scale stays 3.10 / 7 = 0.4429. Now quantize Row A again.

 0.05 / 0.0214 =  2.33 → 2
-0.08 / 0.0214 = -3.73 → -4
 0.12 / 0.0214 =  5.60 → 6
-0.15 / 0.0214 = -7.00 → -7

Reconstruct Row A.

[0.0429, -0.0857, 0.1286, -0.1500]

Now the quiet row lives. The values are not perfect. But they are recognizably correct. That is the whole win. Local ruler. Local survival. Yes?
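The same steps in code, as a rough sketch. The helper names `quantize_row` and `dequantize` are made up for illustration, and the sketch assumes the row has at least one non-zero value (an all-zero row would give a scale of 0).

```python
def quantize_row(row, qmax=7):
    """Symmetric quantization with a scale taken from this row's own max magnitude."""
    scale = max(abs(v) for v in row) / qmax  # assumes the row is not all zeros
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return codes, scale

def dequantize(codes, scale):
    """Map integer buckets back to approximate float values."""
    return [c * scale for c in codes]

row_a = [0.05, -0.08, 0.12, -0.15]
codes_a, scale_a = quantize_row(row_a)
recon_a = dequantize(codes_a, scale_a)

print(codes_a)                         # [2, -4, 6, -7]
print([round(v, 4) for v in recon_a])  # [0.0429, -0.0857, 0.1286, -0.15]
```

Storing `codes_a` plus one `scale_a` per row is all the extra bookkeeping per-channel needs.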

4) Put the error comparison in one table

| Method | Row A reconstructed | Row A MAE | Row B reconstructed | Row B MAE |
|---|---|---|---|---|
| Per-tensor | `[0, 0, 0, 0]` | 0.10 | `[2.21, -3.10, 1.77, -2.66]` | 0.06 |
| Per-channel | `[0.04, -0.09, 0.13, -0.15]` | 0.005 | `[2.21, -3.10, 1.77, -2.66]` | 0.06 |

Look at Row A carefully. Per-tensor MAE is about `0.10`. Per-channel MAE is about `0.005`. That is a dramatic difference, close to 20× smaller. And Row B did not get worse: its scale is 3.10 / 7 either way, so its reconstruction and its MAE of about `0.06` are identical in both rows of the table. This is why per-channel usually wins for weight quantization. Different channels live at different natural scales. One shared ruler is often too blunt. The shared field notes erase the quiet chapters first. See the mental sketch.

shared scale
Row A ── tiny values ──► 0 0 0 0
Row B ── large values ─► survives
local scales
Row A ── own ruler ────► survives
Row B ── own ruler ────► survives
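A quick way to check the MAE figures is to recompute them. The sketch below does that in plain Python; `fake_quantize` and `mae` are illustrative helper names, and round-to-nearest with clamping to [-7, 7] is assumed.

```python
def fake_quantize(row, scale, qmax=7):
    """Quantize then dequantize, returning the reconstructed floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in row]

def mae(original, reconstructed):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(o - r) for o, r in zip(original, reconstructed)) / len(original)

rows = {"A": [0.05, -0.08, 0.12, -0.15], "B": [2.40, -3.10, 1.80, -2.70]}
shared_scale = max(abs(v) for row in rows.values() for v in row) / 7

results = {}
for name, row in rows.items():
    own_scale = max(abs(v) for v in row) / 7
    results[name] = (
        mae(row, fake_quantize(row, shared_scale)),  # per-tensor
        mae(row, fake_quantize(row, own_scale)),     # per-channel
    )
    print(name, [round(e, 4) for e in results[name]])
# A [0.1, 0.0054]
# B [0.0643, 0.0643]
```

Row B's two numbers match exactly because Row B owns the largest magnitude, so its own scale and the shared scale are the same value.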

5) Group-wise quantization is the compromise move

Now the practical question. Why not give every tiny slice its own scale? Because scales themselves cost metadata. Because kernels like regular structure. Because memory access patterns matter. Because engineering is always a trade-off. So many systems choose a middle path: group-wise quantization. Instead of one scale for the whole tensor, you use one scale per small block. Maybe 32 columns share a scale. Maybe 64 weights do. Maybe one whole output channel does, which is just per-channel seen as one large group. This is the compromise. Better local fit than per-tensor. Less metadata than extremely fine granularity. Easier kernels than fully custom scaling everywhere. That is why group-wise schemes keep showing up in real tools.
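Here is a minimal sketch of the group-wise idea, assuming fixed-size groups along a flat list of weights and one symmetric scale per group. The function name `quantize_groupwise` and the 8-weight example row are hypothetical, chosen only to reuse the numbers from the earlier example.

```python
def quantize_groupwise(weights, group_size, qmax=7):
    """Split weights into fixed-size groups and give each group its own symmetric scale."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Fall back to 1.0 for an all-zero group, which would otherwise give scale 0.
        scale = max(abs(v) for v in group) / qmax or 1.0
        scales.append(scale)
        codes.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return codes, scales

# Hypothetical 8-weight row: a quiet half followed by a loud half.
row = [0.05, -0.08, 0.12, -0.15, 2.40, -3.10, 1.80, -2.70]

codes_1, scales_1 = quantize_groupwise(row, group_size=8)  # one scale for the row
codes_2, scales_2 = quantize_groupwise(row, group_size=4)  # two scales for the row

print(codes_1)  # [0, 0, 0, 0, 5, -7, 4, -6]
print(codes_2)  # [2, -4, 6, -7, 5, -7, 4, -6]
print(len(scales_1), len(scales_2))  # 1 2
```

Halving the group size doubles the number of stored scales. In exchange, the quiet half of the row stops collapsing to zero.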

6) Why this matters in real serving systems

Quantization is not only about shrinking bytes. It is about shrinking bytes without deleting useful structure. If small channels hold important corrections, you do not want them flattened. If outlier rows dominate the scale, you want local rescue. That is why better quantizers keep rediscovering the same truth. Channels are not naturally identical. They learn different feature strengths. They live at different magnitudes. So what to do? Start with per-channel or group-wise for weights unless you have a strong reason not to. Then benchmark quality, memory, and latency together. Not one by one. See.
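The memory side of that benchmark can be estimated on paper before running anything. The sketch below assumes 4-bit codes, 16-bit scales, and a hypothetical 4096 × 4096 weight matrix; real formats may store scales differently, so treat the numbers as a rough guide.

```python
def bits_per_weight(num_weights, num_scales, code_bits=4, scale_bits=16):
    """Average storage per weight once scale metadata is included."""
    return code_bits + scale_bits * num_scales / num_weights

rows, cols = 4096, 4096  # hypothetical layer shape
n = rows * cols

# Number of scales each layout has to store for this matrix.
layouts = {
    "per-tensor": 1,
    "per-channel": rows,       # one scale per output row
    "group of 128": n // 128,
    "group of 32": n // 32,
}
costs = {name: bits_per_weight(n, scales) for name, scales in layouts.items()}
for name, cost in costs.items():
    print(f"{name:>12}: {cost:.4f} bits/weight")
```

Under these assumptions per-channel costs about 4.004 bits per weight, groups of 128 cost 4.125, and groups of 32 cost 4.5. Quality and latency still have to be measured on the real model.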

Where this lives in the wild

PyTorch per-channel weight observers: linear and convolution layers often keep a separate scale for each output channel.
TensorRT-LLM int8 pipelines: local scales preserve accuracy while shrinking production checkpoints.
ONNX Runtime per-channel quantization: deployment graphs use channel-wise scales for better weight reconstruction.
llama.cpp group-wise GGUF formats: small blocks of weights share a scale to balance quality and compact storage.
vLLM AWQ checkpoints: activation-aware weight quantization uses activation statistics to protect the most important weight channels before group-wise scaling.

Pause and recall

What is the core difference between per-tensor and per-channel quantization?
Why did Row A collapse to zeros under the shared scale?
Why did per-channel rescue Row A without hurting Row B much?
What trade-off does group-wise quantization try to balance?

Interview Q&A

Q1. Why prefer per-channel quantization over per-tensor quantization for many weight matrices? A1. Because different channels often live at different magnitudes, and one shared scale can erase small but important channels.

Common wrong answer to avoid: "Because per-channel removes quantization error entirely."

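Q2. Why use group-wise quantization instead of pure per-channel everywhere? A2. Because group size is a tunable knob. Groups smaller than a full row protect quiet weights from outliers inside the same channel, groups larger than a row save metadata, and a fixed block size keeps kernels regular either way.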

Common wrong answer to avoid: "Because groups are mathematically exact and channels are not."

Q3. Why accept extra scale metadata instead of saving every possible byte? A3. Because a small amount of metadata can protect a much larger amount of model quality.

Common wrong answer to avoid: "Scale metadata is always wasted memory."

Q4. Why do channels naturally need different scales instead of sharing one global ruler? A4. Because learned rows and channels often represent very different feature strengths and therefore very different value ranges.

Common wrong answer to avoid: "All channels should have similar magnitudes in a well-trained model."

Apply now (5 min)

Quick exercise. Write two short rows with very different magnitudes. Compute one shared int4 scale for both rows. Then compute one scale per row. Compare the reconstructed values by eye. Sketch from memory this ladder:

one tensor scale  → small row collapses
one channel scale → small row survives

Under it, write one sentence on why shared field notes can erase quiet details.

Bridge. Good. Even per-channel is still mostly clever rounding. Next we study smarter rounding that asks which weights matter most. → 06-gptq.md