04. Quantization core: snapping rich numbers into tiny buckets

~13 min read. The thing that turns the detailed blueprint into compact field notes.

Built on the ELI5 in 00-eli5.md. The field notes — the compact quantized weights — rewrite the blueprint with fewer marks, so some rounding error is inevitable.

1) Picture first, formula second

Look.

Quantization is controlled snapping. You start with a float weight. You choose a scale. You divide by that scale. You round to a small integer bucket. Later you multiply back by the same scale. That is the whole trick. The detailed blueprint becomes lighter field notes. Useful. Compact. A little blurry. Here is the picture.

float weight ──divide by scale──→ w / s ──round──→ q (bucket id)
      q ──multiply by scale──→ reconstructed weight

So what gets lost? The difference between the original float and the reconstructed float. That gap is the rounding error. Simple, no?

2) The tiny formula that runs the whole game

For one weight w and one scale s:

q = round(w / s)
w_hat = q × s

q is the stored integer bucket. w_hat is the reconstructed float.

If we use symmetric int4, a common bucket set is [-7, 7]. So the scale usually comes from the biggest magnitude value. For simple symmetric scaling:

s = max(|w|) / 7

Why 7? Because 7 is the largest positive bucket. So the biggest value gets mapped to the edge. Cheap rule. Easy rule. Also a dangerous rule.

Why dangerous? Because one ruler now covers everything. Tiny values must share the same coarse step as the largest value. That is where trouble starts.
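The three formulas above can be sketched in a few lines of Python. This is a minimal illustration, not any library's API: the function names, the clamp, the zero-scale fallback, and the sample weights are all assumptions made for the sketch.

```python
def symmetric_scale(weights: list[float], qmax: int = 7) -> float:
    """Simple symmetric rule: s = max(|w|) / qmax."""
    biggest = max(abs(w) for w in weights)
    # Assumption: fall back to 1.0 so an all-zero input does not divide by zero.
    return biggest / qmax if biggest > 0 else 1.0


def quantize(w: float, s: float, qmax: int = 7) -> int:
    """Snap one float to an integer bucket: q = round(w / s)."""
    # Note: Python's round() sends exact ties to the nearest even integer.
    q = round(w / s)
    # Clamp so the bucket always stays inside [-qmax, qmax].
    return max(-qmax, min(qmax, q))


def dequantize(q: int, s: float) -> float:
    """Rebuild an approximate float: w_hat = q * s."""
    return q * s


weights = [0.6, -1.0, 3.5]           # hypothetical values, not from the lesson
s = symmetric_scale(weights)          # 3.5 / 7 = 0.5
q = [quantize(w, s) for w in weights]
w_hat = [dequantize(qi, s) for qi in q]

print("scale:", s)
print("buckets:", q)
print("reconstructed:", w_hat)
```

Only the buckets and the single scale need to be stored. The float 0.6 does not survive exactly, because it sits between two bucket edges.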

3) Worked example with symmetric int4

Quantize this vector:

[0.18, -0.91, 1.62, 2.94]

Use bucket range [-7, 7].

Step one is the scale. Largest magnitude is 2.94. So s = 2.94 / 7 = 0.42.

Step two is division.

0.18 / 0.42 = 0.4286
-0.91 / 0.42 = -2.1667
1.62 / 0.42 = 3.8571
2.94 / 0.42 = 7.0000

Step three is rounding.

q = [0, -2, 4, 7]

Step four is reconstruction.

w_hat = [0.00, -0.84, 1.68, 2.94]

Now put everything in one table.

| original w | scaled w/s | stored q | reconstructed w_hat | absolute error |
|---:|---:|---:|---:|---:|
| 0.18 | 0.4286 | 0 | 0.00 | 0.18 |
| -0.91 | -2.1667 | -2 | -0.84 | 0.07 |
| 1.62 | 3.8571 | 4 | 1.68 | 0.06 |
| 2.94 | 7.0000 | 7 | 2.94 | 0.00 |

See the pattern. The largest value survives perfectly. The smallest value suffers most. That is not an accident. That is the geometry of one shared ruler.
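The four steps of this worked example can be replayed as a short script. It is a sketch for checking the arithmetic by machine; the variable names and the print layout are illustrative choices, not part of any toolkit.

```python
weights = [0.18, -0.91, 1.62, 2.94]
QMAX = 7  # largest bucket in the symmetric range [-7, 7]

# Step one: the scale comes from the largest magnitude.
scale = max(abs(w) for w in weights) / QMAX

rows = []
for w in weights:
    scaled = w / scale                            # step two: divide
    q = max(-QMAX, min(QMAX, round(scaled)))      # step three: round (and clamp)
    w_hat = q * scale                             # step four: reconstruct
    rows.append((w, scaled, q, w_hat, abs(w - w_hat)))

print(f"scale = {scale:.2f}")
print("     w   scaled   q  w_hat    err")
for w, scaled, q, w_hat, err in rows:
    print(f"{w:6.2f} {scaled:8.4f} {q:3d} {w_hat:6.2f} {err:6.2f}")
```

The printed rows should line up with the table. Changing `weights` is a quick way to see how a single large value stretches the step for everything else.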

4) Why small values get crushed first

The scale had to protect 2.94. So one bucket step became 0.42. That step is huge compared with 0.18. The value 0.18 is less than half of one step (0.21). So it rounds to zero. That is the heart of the problem. One global ruler must cover the giant value. Quiet values lose detail first. The field notes stay compact. But delicate strokes disappear. See the bucket view.

int4 buckets with s = 0.42
┌─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┬─────┐
│ -7  │ ... │ -2  │ -1  │  0  │  1  │  2  │ ... │  7  │
└─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┴─────┘
   ↓           ↓     ↓     ↓     ↓     ↓           ↓
 -2.94       -0.84 -0.42  0.00  0.42  0.84        2.94

Now the pain is visible. Anything small enough near zero gets absorbed. That absorbed detail becomes the rounding error. Not mystical. Just bucket spacing. Yes?
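To make the absorbed zone concrete, here is a small probe using the scale from the worked example. The probe values and the helper name `snap` are hypothetical, chosen only to sit on both sides of the half-step mark at 0.21.

```python
s = 0.42  # scale from the worked example


def snap(w: float, s: float, qmax: int = 7) -> float:
    """Quantize then dequantize one weight with a shared scale."""
    q = max(-qmax, min(qmax, round(w / s)))
    return q * s


probes = [0.05, 0.10, 0.18, 0.20, 0.22, 0.40]  # hypothetical small weights

for w in probes:
    w_hat = snap(w, s)
    print(f"w={w:.2f} -> w_hat={w_hat:.2f}  error={abs(w - w_hat):.2f}")

# Everything below half a step (s / 2 = 0.21) collapses to zero.
absorbed = [w for w in probes if snap(w, s) == 0.0]
print("absorbed into bucket 0:", absorbed)
```

For values inside the bucket range, the error never exceeds half a step. That bound is the same for every weight, so it is small next to 2.94 and enormous next to 0.18.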

5) Low-bit storage is not low-bit compute

A common confusion appears here. Do we keep the whole model in tiny integers for every operation? Often, no. Many systems store weights in low bits. Then they dequantize blocks during matrix math. Why? Because scales must be applied somewhere. Because accumulations need wider working precision. Because kernels still like stable arithmetic.

So low-bit storage is one decision. Low-bit compute is another decision. They are cousins. Not twins. This matters in practice. A checkpoint can fit because storage shrank. The matmul can still use wider accumulators. That is normal.
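A toy sketch of that split: the buckets from the worked example are packed two per byte for storage, then unpacked and dequantized before a float dot product. The nibble layout, the helper names, and the activation vector are assumptions for illustration. Real kernels use their own layouts and fuse these steps.

```python
def pack_int4(q: list[int]) -> bytes:
    """Pack signed 4-bit buckets two per byte, low nibble first."""
    if len(q) % 2:
        q = q + [0]  # pad to an even count
    out = bytearray()
    for lo, hi in zip(q[0::2], q[1::2]):
        # Masking with 0xF keeps the 4-bit two's complement pattern.
        out.append((lo & 0xF) | ((hi & 0xF) << 4))
    return bytes(out)


def unpack_int4(packed: bytes, count: int) -> list[int]:
    """Recover signed buckets from the packed bytes."""
    q = []
    for byte in packed:
        for nibble in (byte & 0xF, byte >> 4):
            # Sign-extend: nibbles 8..15 represent -8..-1.
            q.append(nibble - 16 if nibble >= 8 else nibble)
    return q[:count]


q = [0, -2, 4, 7]       # buckets from the worked example
scale = 0.42
packed = pack_int4(q)   # storage: 2 bytes for 4 weights

x = [1.0, 2.0, 3.0, 4.0]  # hypothetical activations

# Compute: dequantize to floats first, then accumulate in float precision.
w_hat = [qi * scale for qi in unpack_int4(packed, len(q))]
y = sum(wi * xi for wi, xi in zip(w_hat, x))

print("stored bytes:", len(packed))
print("dot product:", round(y, 2))
```

The storage side touches only integers and one scale. The arithmetic side never sees int4 at all.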

6) The whole field is really about choosing the ruler better

Once you see 0.18 collapse to 0.00, the next question becomes obvious. Why use one scale for everything? Exactly. That leads to per-channel quantization. That leads to group-wise quantization. That leads to GPTQ and AWQ. All of them still depend on the same engine.

q = round(w / s)
w_hat = q × s

The advanced tricks mostly change how s is chosen, where s is shared, and how the rounding error is distributed. So own the tiny formula. If you own that, the rest becomes variations. See.
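As a taste of "where s is shared", here is a minimal group-wise sketch on the same four weights, with one scale per group of two. The group size and function names are assumptions for illustration. This is plain round-to-nearest with local scales, not GPTQ or AWQ.

```python
def quantize_groups(weights, group_size, qmax=7):
    """Quantize with one symmetric scale per contiguous group."""
    q, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        s = max(abs(w) for w in group) / qmax
        if s == 0.0:
            s = 1.0  # assumption: any scale works for an all-zero group
        scales.append(s)
        q.extend(max(-qmax, min(qmax, round(w / s))) for w in group)
    return q, scales


def dequantize_groups(q, scales, group_size):
    """Multiply each bucket by the scale of the group it belongs to."""
    return [qi * scales[i // group_size] for i, qi in enumerate(q)]


weights = [0.18, -0.91, 1.62, 2.94]
q, scales = quantize_groups(weights, group_size=2)
w_hat = dequantize_groups(q, scales, group_size=2)
errors = [abs(w - wh) for w, wh in zip(weights, w_hat)]

print("scales:", [round(s, 2) for s in scales])
print("buckets:", q)
print("errors:", [round(e, 2) for e in errors])
```

With its own ruler, the first group no longer has to protect 2.94, so 0.18 lands in bucket 1 instead of bucket 0. The price is one extra scale to store.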

Where this lives in the wild

bitsandbytes 4-bit QLoRA loaders — frozen base weights are stored compactly while adapters stay trainable.
TensorRT-LLM quantized serving — low-bit checkpoint storage reduces memory pressure for production inference.
llama.cpp GGUF Q4_K_M — laptop inference becomes practical by snapping weights into compact buckets.
vLLM AWQ and GPTQ checkpoints — larger models fit fixed GPU budgets by storing weights in fewer bits.
Apple MLX quantized models — local Mac inference benefits from aggressive weight compression.
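A back-of-envelope sketch of why fewer bits help a model fit. Every number here is an assumption for illustration: the parameter count, the group size of 128, and the 16-bit scale per group. Real formats such as Q4_K_M carry their own metadata and will not match this exactly.

```python
from typing import Optional


def weight_bytes(n_params: int, bits_per_weight: int,
                 group_size: Optional[int] = None,
                 scale_bits: int = 16) -> float:
    """Estimate weight storage in bytes, optionally adding one scale per group."""
    total_bits = n_params * bits_per_weight
    if group_size is not None:
        n_groups = -(-n_params // group_size)  # ceiling division
        total_bits += n_groups * scale_bits
    return total_bits / 8


n_params = 7_000_000_000  # hypothetical model size

fp16 = weight_bytes(n_params, 16)
int4 = weight_bytes(n_params, 4, group_size=128)

print(f"16-bit weights: {fp16 / 1e9:.2f} GB")
print(f"4-bit weights + scales: {int4 / 1e9:.2f} GB")
print(f"ratio: {fp16 / int4:.2f}x")
```

Under these assumptions the scales add an eighth of a bit per weight, so the saving is a little under 4x rather than exactly 4x.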

Pause and recall

What do q = round(w / s) and w_hat = q × s mean in plain language?
Why does the largest value decide the scale in simple symmetric quantization?
In the worked example, why did 0.18 become 0.00 after reconstruction?
What exactly is the rounding error?

Interview Q&A

Q1. Why store a quantized bucket plus scale instead of storing the raw float directly?

A1. Because the bucket uses far fewer bits, and the scale lets you reconstruct an approximate float when needed.

Common wrong answer to avoid: "Because integers are exact replacements for real numbers."

Q2. Why use symmetric int4 buckets for this teaching example instead of arbitrary custom bucket values?

A2. Because symmetric buckets are simple, hardware-friendly, and match the centered shape of many weight distributions.

Common wrong answer to avoid: "Because symmetry removes quantization error."

Q3. Why does one global scale hurt small values more than large values?

A3. Because the step size is chosen to protect the largest magnitude, so smaller values must live on a coarse ruler.

Common wrong answer to avoid: "Because small values never matter to the model."

Q4. Why dequantize during computation instead of keeping every operation in int4?

A4. Because scales, accumulations, and kernel behavior often need wider working precision even when storage is low-bit.

Common wrong answer to avoid: "Once the checkpoint is int4, every useful computation should also stay int4."

Apply now (5 min)

Quick exercise. Take the vector [0.30, -1.10, 2.20]. Use symmetric int4 with range [-7, 7].

Compute the scale.
Compute the buckets.
Compute the reconstructed values.

Watch for a value that lands exactly halfway between two buckets, and say which way you round it.

Sketch from memory the four-step flow:

float → divide by scale → round to bucket → multiply back

Under it, write one sentence on how the field notes save memory but introduce the rounding error.

Bridge. Good. One scale for everything is easy, but it crushes quiet values. The next fix is local scales. → 05-per-tensor-vs-per-channel.md