02. Number formats — how the same weight wears different clothing¶
~12 min read. The thing that decides how much ink and paper each number gets.
Built on the ELI5 in 00-eli5.md. The blueprint — the full-precision weights — is the original drawing, and each format decides how much paper it uses and how much detail it keeps.
A number format is a storage budget¶
See. A format answers two questions. How much detail can I keep? How wide a range can I cover? That is it. If a format keeps tiny differences well, it has strong precision. If a format survives very small and very large values, it has strong range. The original blueprint is often saved in fp32 or fp16. Those formats spend many bits on flexible values. Integer formats are stricter. They use fixed buckets. Then a scale tells you what each bucket means. So floating point is dynamic. Integer is rigid. Floating point says, "Move the decimal idea around." Integer says, "Pick a ruler and snap to marks." Both are useful. They simply solve different problems.
Floating point keeps a moving decimal point¶
Floating point splits bits into parts. One sign bit says positive or negative. Exponent bits control range. Mantissa bits control local detail. Here is the picture.
fp32
┌─sign─┬────exponent────┬────────────mantissa────────────┐
│ 1 b  │      8 b       │              23 b              │
└──────┴────────────────┴────────────────────────────────┘
fp16
┌─sign─┬──exponent──┬────mantissa────┐
│ 1 b  │    5 b     │      10 b      │
└──────┴────────────┴────────────────┘
bf16
┌─sign─┬────exponent────┬─mantissa─┐
│ 1 b  │      8 b       │   7 b    │
└──────┴────────────────┴──────────┘
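To make the three layouts concrete, here is a minimal sketch that pulls the sign, exponent, and mantissa fields out of one value. It uses only Python's standard `struct` module. The function names are made up for this illustration. One assumption to note: the bf16 version simply keeps the top 16 bits of the fp32 pattern, which is truncation, while real hardware usually rounds.

```python
import struct

def fp32_fields(x):
    """Return (sign, exponent, mantissa) bit fields of x stored as fp32."""
    # Pack as a 4-byte float, then reread the same bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def fp16_fields(x):
    """Return (sign, exponent, mantissa) bit fields of x stored as fp16."""
    # The "e" format code is IEEE half precision (2 bytes).
    (bits,) = struct.unpack(">H", struct.pack(">e", x))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF

def bf16_fields(x):
    """Return (sign, exponent, mantissa) of x as bf16 (assumed: truncated fp32)."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    bits = bits32 >> 16  # keep sign, all 8 exponent bits, top 7 mantissa bits
    return bits >> 15, (bits >> 7) & 0xFF, bits & 0x7F

for name, fields in [("fp32", fp32_fields), ("fp16", fp16_fields), ("bf16", bf16_fields)]:
    sign, exponent, mantissa = fields(-1.5)
    print(f"{name}: sign={sign} exponent={exponent} mantissa={mantissa:b}")
```

Notice that fp32 and bf16 report the same exponent field for the same value. Only the mantissa width changes.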
Look at fp16 and bf16 carefully.
They both use 16 bits total.
But they spend those bits differently.
fp16 spends more on mantissa.
bf16 spends more on exponent.
So fp16 gives finer nearby detail.
bf16 gives much wider travel distance.
Same suitcase size.
Different packing plan.
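The packing plan can be turned into numbers. The sketch below derives the largest finite value and the step size just above 1.0 from nothing but the exponent and mantissa bit counts. It assumes the usual IEEE-style rules: a bias of 2^(e-1) - 1 and the all-ones exponent reserved for infinity and NaN. The function name is hypothetical.

```python
def float_format_stats(exp_bits, man_bits):
    """Return (max_finite, min_normal, step_near_one) for an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1
    # The all-ones exponent is reserved, so the largest usable exponent equals the bias.
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    # Gap between 1.0 and the next representable value.
    step_near_one = 2.0 ** -man_bits
    return max_finite, min_normal, step_near_one

for name, exp_bits, man_bits in [("fp32", 8, 23), ("fp16", 5, 10), ("bf16", 8, 7)]:
    max_finite, min_normal, step = float_format_stats(exp_bits, man_bits)
    print(f"{name}: max={max_finite:.4g} min_normal={min_normal:.4g} step_near_1={step:.4g}")
```

For fp16 the largest finite value comes out to 65504, while bf16 reaches roughly the same ceiling as fp32. In exchange, the bf16 step near 1.0 is 8 times coarser than the fp16 step.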
Integer formats use fixed buckets plus a scale¶
Integer formats are simpler to picture.
The bits mostly encode bucket IDs.
Then a scale maps bucket back to a real value.
With signed int8, the raw bucket range is usually -128 to 127.
With signed int4, the raw bucket range is often -8 to 7 or -7 to 7.
Here is the mental model.
real number ──divide by scale, then round──→ integer bucket ──multiply by scale──→ reconstructed value
int8 buckets: -128 ... -2 -1 0 1 2 ... 127
int4 buckets: -8 -7 ... -1 0 1 ... 6 7
Integer is cheap in memory. Integer is often good for inference. But integer is not self-scaling. If one tensor has huge values and tiny values together, a single ruler hurts someone. That is why quantization schemes keep talking about scale granularity. We will hit that soon. For now, remember the contrast. Floating point carries its own zoom system. Integer needs an external ruler.
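Here is a minimal sketch of the "external ruler" idea, with one outlier in the tensor. The helper names are hypothetical, and the scale rule used here (largest absolute value divided by the largest bucket, with a symmetric -127 to 127 range) is one common choice picked for illustration, not the only one.

```python
def absmax_scale(values, qmax):
    """Pick one ruler for the whole list: the largest magnitude maps to qmax."""
    return max(abs(v) for v in values) / qmax

def quantize(values, scale, qmin, qmax):
    """Snap each value to the nearest bucket, clamped to the bucket range."""
    # Python's round() sends exact ties to the even bucket; no ties occur here.
    return [max(qmin, min(qmax, round(v / scale))) for v in values]

def dequantize(buckets, scale):
    """Map bucket IDs back to real values."""
    return [b * scale for b in buckets]

weights = [0.02, -0.05, 0.11, 8.0]           # three small weights and one outlier
scale = absmax_scale(weights, 127)            # the outlier sets the ruler
buckets = quantize(weights, scale, -127, 127)
restored = dequantize(buckets, scale)
print("shared ruler:", buckets, [round(r, 4) for r in restored])

small = weights[:3]                           # the same small weights with their own ruler
small_scale = absmax_scale(small, 127)
small_restored = dequantize(quantize(small, small_scale, -127, 127), small_scale)
print("own ruler:   ", [round(r, 4) for r in small_restored])
```

With the shared ruler, the outlier survives but 0.02 collapses to bucket 0. Give the small weights their own ruler and they come back almost intact. That is the scale granularity question in miniature.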
A simple table to hold in your head¶
| Format | Bits | Structure | Strength | Weakness | Typical use |
|---|---|---|---|---|---|
| fp32 | 32 | sign + 8 exp + 23 mantissa | high precision and wide range | large memory | reference weights, stable training paths |
| fp16 | 16 | sign + 5 exp + 10 mantissa | good local precision, half the memory of fp32 | narrower range | mixed precision inference and some training |
| bf16 | 16 | sign + 8 exp + 7 mantissa | wide range close to fp32 | coarser nearby detail | modern LLM training |
| int8 | 8 | integer bucket + scale | small memory, fast inference | needs calibration and scale handling | post-training quantized serving |
| int4 | 4 | tiny bucket set + scale | very small memory | more rounding error | aggressive serving, QLoRA base weights |

That table is not trivia.
It is deployment language.
When someone says, "Use int4," ask immediately,
"For storage, for compute, or both?"
When someone says, "Use bf16," ask,
"Because of range, or because the hardware path is best?"
Simple, no?
One worked numerical example shows the difference¶
Suppose a tensor has 2,000,000 weights.
How much raw storage do different formats need?
fp32 uses 4 bytes each.
So 2,000,000 × 4 = 8,000,000 bytes.
That is 8 MB in decimal units (about 7.63 MiB).
fp16 or bf16 use 2 bytes each.
So the same tensor becomes about 4 MB.
int8 uses 1 byte each.
So it becomes about 2 MB.
int4 uses half a byte each.
So it becomes about 1 MB.
Same tensor.
Different storage budget.
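The same arithmetic as a small sketch you can rerun with any weight count. The table and function names are made up for this example. It counts raw weight bits only. Real quantized files also store scales and other metadata, so treat the int8 and int4 numbers as lower bounds.

```python
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "bf16": 16, "int8": 8, "int4": 4}

def raw_storage_bytes(num_weights, fmt):
    """Raw bytes for the weights alone, rounded up to a whole byte."""
    total_bits = num_weights * BITS_PER_WEIGHT[fmt]
    return (total_bits + 7) // 8  # ceiling division, so an odd int4 count still fits

for fmt in BITS_PER_WEIGHT:
    size = raw_storage_bytes(2_000_000, fmt)
    print(f"{fmt}: {size:,} bytes = {size / 1e6:.1f} MB")
```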
Now one value example.
Say the real weight is 1.30.
If an int8 scale is 0.10, the stored bucket is 13.
Reconstructed value is 13 × 0.10 = 1.30.
Nice.
But int4 cannot store bucket 13.
So int4 needs a coarser scale, like 0.20.
Then the bucket becomes round(1.30 / 0.20) = round(6.5) = 7, rounding the tie upward.
Reconstructed value is 7 × 0.20 = 1.40.
Smaller format.
More distortion.
This is why shrinking the blueprint saves paper but can blur the lines.
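The value example can be checked with a short sketch. It uses `decimal.Decimal` so that 1.30 / 0.20 is exactly 6.5, and it rounds ties upward to match the arithmetic in the text. That tie rule is an assumption of this sketch: many libraries round ties to the even bucket instead, which would give bucket 6 and 1.20, the same 0.10 distance from the true weight. The helper names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

def quantize_one(value, scale, qmin, qmax):
    """Divide by the scale, round ties upward, then clamp to the bucket range."""
    bucket = int((value / scale).quantize(Decimal("1"), rounding=ROUND_HALF_UP))
    return max(qmin, min(qmax, bucket))

def reconstruct(bucket, scale):
    """Multiply the bucket ID by the scale to get the stored value back."""
    return bucket * scale

weight = Decimal("1.30")
cases = [
    ("int8, scale 0.10", Decimal("0.10"), -128, 127),
    ("int4, scale 0.20", Decimal("0.20"), -8, 7),
    ("int4, scale 0.10", Decimal("0.10"), -8, 7),  # bucket 13 does not exist, so it clamps
]
for name, scale, qmin, qmax in cases:
    bucket = quantize_one(weight, scale, qmin, qmax)
    restored = reconstruct(bucket, scale)
    print(f"{name}: bucket={bucket} restored={restored} error={abs(restored - weight)}")
```

The third case shows why int4 needs the coarser scale at all. Keeping scale 0.10 would clamp the bucket to 7 and reconstruct 0.70, which is far worse than 1.40.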
Choosing the format is really choosing the failure mode¶
fp32 fails by being heavy. fp16 fails by having limited range. bf16 fails by having coarser local precision. int8 fails by needing good scales and calibration. int4 fails by introducing even more rounding pressure. So what to do? Match the format to the stage. Training usually values safe range. Serving usually values smaller memory and faster movement. Debugging often wants fp32 reference paths. The important move is not memorizing names. The important move is seeing trade-offs. Yes?
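Two of those failure modes are easy to see with the standard library alone. This is a sketch under stated assumptions, not how any framework does it: fp16 is round-tripped through `struct`'s half-precision code, and bf16 is emulated by truncating an fp32 bit pattern (real hardware usually rounds). The function names are made up for this example.

```python
import struct

def to_fp16(x):
    """Round x to fp16 and back. Raises OverflowError if x is too large for fp16."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16_truncated(x):
    """Emulate bf16 by zeroing the low 16 bits of the fp32 pattern (assumed truncation)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

# Range: 70000 is past the fp16 ceiling of 65504, but bf16 has room to spare.
try:
    to_fp16(70000.0)
    fp16_overflowed = False
except OverflowError:
    fp16_overflowed = True
print("fp16 overflow on 70000:", fp16_overflowed)
print("bf16 keeps 70000 as:", to_bf16_truncated(70000.0))

# Local detail: fp16 still sees the small nudge above 1.0, bf16 loses it.
print("fp16(1.001) =", to_fp16(1.001))
print("bf16(1.001) =", to_bf16_truncated(1.001))
```

So fp16 loses the large value entirely, while bf16 keeps it but only approximately. Near 1.0 the roles flip.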
Where this lives in the wild¶
- NVIDIA H100 Tensor Cores: teams choose fp16 or bf16 execution paths based on training stability and throughput.
- Google TPU training stacks: bf16 is common because wide range matters for large-model training.
- PyTorch AMP autocast: developers switch between fp16 and bf16 mixed precision without changing model logic.
- llama.cpp GGUF Q4_K_M: local inference stores weights in very small formats to fit laptop memory.
- bitsandbytes 8-bit and 4-bit loaders: serving and finetuning workflows shrink checkpoints before they touch the GPU.
Pause and recall¶
- What two questions does every number format answer?
- Why is floating point called dynamic and integer called rigid?
- How do fp16 and bf16 spend the same 16 bits differently?
- Why can int4 save memory but distort values more than int8?
Interview Q&A¶
Q1. Why use bf16 instead of fp16 for many training jobs?
A1. Because bf16 keeps a much wider exponent range, so large and tiny training values are less likely to overflow or vanish.
Common wrong answer to avoid: "Because bf16 is always more precise than fp16."

Q2. Why use int8 or int4 for serving instead of leaving everything in fp32?
A2. Because serving is often memory-bound and bandwidth-bound, so smaller formats can unlock fit, throughput, and cost improvements.
Common wrong answer to avoid: "Because lower bits automatically make the model smarter and faster in every way."

Q3. Why keep floating point around instead of making the whole stack integer?
A3. Because activations, gradients, and scaling behavior often need the dynamic range that floating point provides.
Common wrong answer to avoid: "Integer math is always enough if the model is large enough."

Q4. Why choose int8 instead of int4 in some deployments?
A4. Because int8 usually keeps more quality margin and simpler kernel behavior while still cutting memory sharply.
Common wrong answer to avoid: "If int4 is smaller, it is automatically the best production choice."
Apply now (5 min)¶
Quick exercise.
Take a fake tensor with 10,000,000 weights.
Compute the raw storage for fp32, fp16, bf16, int8, and int4.
Then write one sentence on which format you would pick for training and which for serving.
Sketch from memory the bit layouts for fp32, fp16, and bf16.
Under the sketch, explain why shrinking the blueprint changes both paper usage and detail.
Bridge. Good. We now know the suitcases. Next we study the most confusing pair: two formats with the same 16 bits but very different trade-offs. → 03-precision-vs-range.md