03. Precision vs range: why bf16 survives the rough road
~12 min read. The thing that explains why the safer copy is not always the sharper copy.
Built on the ELI5 in 00-eli5.md. The blueprint — the original full-precision weights — can be copied with different ink densities, and range decides whether the copy survives transport at all.
1) Same 16 bits, two different promises
Look.
People hear fp16 and bf16 and think, "Both are 16-bit, so almost same."
Not same.
The suitcase size is same.
The packing plan is different.
Precision means this.
How finely can you separate nearby values?
Range means this.
How far outward can you travel before overflow or underflow hurts you?
fp16 spends more of its budget on mantissa detail: 1 sign bit, 5 exponent bits, 10 mantissa bits.
bf16 spends more of its budget on exponent range: 1 sign bit, 8 exponent bits, 7 mantissa bits.
So fp16 is often finer near one local neighborhood.
bf16 is much safer across wild scales.
That one contrast removes many interview mistakes.
Do not say, "bf16 is more accurate."
Say, "bf16 is usually safer."
That is the point.
Training cares about survival.
Activations jump.
Gradients shrink.
Loss scaling changes the weather again.
The copied blueprint must survive that trip.
Simple, no?
2) First picture: what happens near 1.0
Near 1.0, the question is local spacing.
If the buckets are close together, you keep fine detail.
If the buckets are far apart, tiny differences blur.
See the cartoon.
near 1.0
int4 : ─────0.86────────1.14────────1.43─────
bf16 : ──0.98────1.00────1.02────1.04───────
fp16 : ─0.99─1.00─1.01─1.02─1.03─1.04──────
Do not worship the exact ticks.
Worship the shape.
int4 is sparse.
bf16 is moderate.
fp16 is finer nearby than bf16: three extra mantissa bits make its steps 8 times smaller.
So if the only task were,
"Please distinguish tiny local differences around one ordinary value,"
fp16 would often look better.
That is why people get confused.
They see the local sharpness.
They forget the long journey.
A model does not live only near 1.0 during training.
It travels all over the map.
Look again.
Local precision is not the whole story.
Yes?
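Want real ticks instead of cartoon ticks? Here is a minimal sketch, assuming the standard layouts (10 stored mantissa bits for fp16, 7 for bf16) and values in the normal range. The helper names `step_near` and `grid_from` are made up for illustration; they are not from any library.

```python
import math

# Assumed layouts: fp16 stores 10 mantissa bits, bf16 stores 7.
MANTISSA_BITS = {"fp16": 10, "bf16": 7}


def step_near(x: float, mantissa_bits: int) -> float:
    """Gap between neighbouring representable values around x (normal range only)."""
    _, exponent = math.frexp(x)  # x = f * 2**exponent with 0.5 <= |f| < 1
    return 2.0 ** (exponent - 1 - mantissa_bits)


def grid_from(x: float, mantissa_bits: int, count: int) -> list[float]:
    """First `count` grid points starting at x.

    Assumes x is itself representable and the run does not cross a power of two.
    """
    step = step_near(x, mantissa_bits)
    return [x + k * step for k in range(count)]


for name, bits in MANTISSA_BITS.items():
    print(name, "step near 1.0:", step_near(1.0, bits))
    print(name, "first ticks  :", grid_from(1.0, bits, 4))
```

Near 1.0 the bf16 step is 2^-7 (0.0078125) and the fp16 step is 2^-10 (about 0.00098). Same shape as the cartoon, just with honest numbers.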
3) Second picture: what happens at the edges
Now ask the harder question. What if the value is huge? What if the value is tiny? Then exponent bits become the boss. Here is the rough picture.
dynamic range
int4 : ├────narrow────┤
fp16 : ├──────────────medium──────────────┤
bf16 : ├────────────────────────────────────────────very wide────────────────────────────────────────────┤
fp32 : ├────────────────────────────────────────────very wide────────────────────────────────────────────┤
Now put some anchor numbers in your head.
| Format | Approx max magnitude | Rough takeaway |
|---|---:|---|
| fp16 | 6.55e4 | can overflow on large training spikes |
| bf16 | 3.39e38 | range close to fp32 |
| fp32 | 3.40e38 | wide reference range |
See the contrast.
fp16 has stronger local detail.
bf16 has dramatically wider travel distance.
Training usually rewards travel distance more.
Because overflow is loud and destructive.
Underflow is quieter, but still bad.
It can erase weak signals.
So the question is not,
"Which one looks elegant on a number line near one?"
The real question is,
"Which one survives messy optimizer life?"
That is where bf16 wins.
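The anchor numbers in the table are not magic. They fall out of the bit layout. A minimal sketch, assuming IEEE-style layouts with 5 exponent bits and 10 mantissa bits for fp16, 8 and 7 for bf16, 8 and 23 for fp32. The function name `limits` is hypothetical.

```python
def limits(exp_bits: int, mantissa_bits: int) -> dict[str, float]:
    """Range limits implied by an IEEE-style layout (top exponent code reserved for inf/NaN)."""
    bias = 2 ** (exp_bits - 1) - 1
    return {
        # Largest finite value: all mantissa bits set, largest usable exponent.
        "max_finite": (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias,
        # Smallest value that still has full mantissa precision.
        "min_normal": 2.0 ** (1 - bias),
        # Smallest nonzero value, reached by giving up mantissa bits.
        "min_subnormal": 2.0 ** (1 - bias - mantissa_bits),
    }


# (exponent bits, mantissa bits) per format, as assumed above.
LAYOUTS = {"fp16": (5, 10), "bf16": (8, 7), "fp32": (8, 23)}

for name, (exp_bits, mantissa_bits) in LAYOUTS.items():
    stats = limits(exp_bits, mantissa_bits)
    print(
        f"{name}: max={stats['max_finite']:.3e}  "
        f"min_normal={stats['min_normal']:.3e}  "
        f"min_subnormal={stats['min_subnormal']:.3e}"
    )
```

Notice which input moves the range. Exponent bits. bf16 and fp32 share 8 of them, so their limits sit almost on top of each other.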
4) Worked numerical example from one ugly training step
Suppose one step produces these values.
large_activation = 80,000
normal_weight = 1.25
tiny_gradient = 0.00000001 (1e-8)
Now test them mentally.
For fp16, the largest finite value is 65,504.
So 80,000 is out of range.
Converted to fp16, it overflows to infinity.
The tiny gradient is in trouble on the low side too.
The smallest fp16 subnormal is about 6e-8.
1e-8 sits below even that, so without loss scaling it rounds to zero.
Either way, fp16 drops both ends.
Now test bf16.
bf16 keeps fp32-style exponent range.
So 80,000 stays in range: it rounds to 79,872, but it does not overflow.
1e-8 is also much safer to carry.
But near 1.25, bf16 has coarser local detail than fp16.
That is the exact trade-off.
fp16 is sharper nearby.
bf16 is safer globally.
So what does training choose?
Usually the safer road.
Because one overflow event can poison a step.
One vanished gradient can slow learning.
The stable copy of the blueprint matters more than the prettier copy.
See.
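You can run the three values through both formats with only the standard library. This is a sketch, not how any framework does it: `to_fp16` leans on the half-precision `"e"` format in Python's `struct`, and `to_bf16` is a hand-rolled emulation that keeps the top 16 bits of the fp32 pattern with round-to-nearest-even. Both helper names are invented here, and `to_bf16` assumes a finite input that fits in fp32.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses out-of-range values; IEEE rounding would give infinity.
        return math.copysign(math.inf, x)


def to_bf16(x: float) -> float:
    """Emulate bf16: round the fp32 bit pattern to its top 16 bits.

    Sketch only: assumes x is finite and within fp32 range.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


values = {
    "large_activation": 80_000.0,
    "normal_weight": 1.25,
    "tiny_gradient": 1e-8,
}

for name, value in values.items():
    print(f"{name:17s} fp16={to_fp16(value)!r:>14}  bf16={to_bf16(value)!r}")
```

In fp16 the large activation becomes `inf` and the tiny gradient becomes `0.0`. In bf16 both survive, with 80,000 landing on 79,872. The weight 1.25 happens to be exact in both formats; try 1.251 to see fp16 keep 1.2509765625 while bf16 falls back to 1.25.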
5) Why engineers say "safer," not "more accurate"
This phrasing matters in senior interviews. If you say, "bf16 is more accurate," you sound careless. Why? Because local precision and global safety are not the same thing. fp16 often has finer precision near ordinary values. bf16 often has better range across extreme values. Training teams choose bf16 because they hate numeric accidents. They hate surprise overflows. They hate silent underflows. They hate babysitting loss scaling more than necessary. That is the whole mood. bf16 is the format with better shock absorbers. Not the format with magically better eyesight. Look at the summary box.
fp16 → finer nearby detail, narrower safe range
bf16 → coarser nearby detail, much wider safe range
training → usually pick the wider safe range
Keep one line in your head. Range answers, "Will the number survive?" Precision answers, "How finely can I describe it if it survives?" Training first needs survival. Only then does fine detail matter. That is why bf16 is usually the default answer.
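What does "babysitting loss scaling" look like? Roughly this. A minimal sketch, assuming a fixed scale of 2^16 chosen only for illustration; real training setups typically adjust the scale as they go, and the name `LOSS_SCALE` and the sample gradients are hypothetical.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value (infinity when out of range)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


LOSS_SCALE = 2.0 ** 16  # hypothetical fixed scale, for illustration only

tiny_gradient = 1e-8
unscaled = to_fp16(tiny_gradient)             # lost: below the fp16 grid
scaled = to_fp16(tiny_gradient * LOSS_SCALE)  # lifted into fp16's normal range
recovered = scaled / LOSS_SCALE               # unscaled again in higher precision

big_gradient = 2.0
overflowed = to_fp16(big_gradient * LOSS_SCALE)  # the same scale pushes this past 65,504

print("tiny, no scaling :", unscaled)
print("tiny, scaled     :", scaled, "-> recovered", recovered)
print("big, scaled      :", overflowed)
```

The scale rescues the small gradient and breaks the large one. Someone has to keep tuning it. With bf16's wider range, that job mostly goes away.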
Where this lives in the wild
- NVIDIA H100 Transformer Engine: teams pick bf16 training paths because large activations and gradients need wider safety margins.
- Google TPU JAX pipelines: bf16 is common because it keeps fp32-like range while lowering memory traffic.
- PyTorch autocast with bf16: many fine-tuning jobs switch from fp16 to reduce overflow headaches.
- DeepSpeed ZeRO training: large distributed runs often prefer bf16 when optimizer behavior gets numerically rough.
- AWS Trainium LLM jobs: training recipes weigh bf16 range safety against raw memory pressure.
Pause and recall
- What is the difference between precision and range?
- Why can fp16 be locally sharper but still riskier for training?
- Why is bf16 usually described as safer, not more accurate?
- What kinds of values stress range during training?
Interview Q&A
Q1. Why use bf16 instead of fp16 for large-model training?
A1. Because bf16 keeps much wider exponent range, so activations and gradients are less likely to overflow or vanish.
Common wrong answer to avoid: "Because bf16 has more mantissa precision than fp16."
Q2. Why say bf16 is safer instead of saying it is more accurate?
A2. Because its main advantage is wider numeric survival range, not finer local value spacing.
Common wrong answer to avoid: "Safety and accuracy mean the same thing here."
Q3. Why does training care more about range than a simple frozen inference pass?
A3. Because training includes unstable activations, gradients, and optimizer updates that sweep across many scales.
Common wrong answer to avoid: "Inference never uses large or tiny numbers."
Q4. Why not keep fp32 everywhere and avoid the whole discussion?
A4. Because fp32 costs more memory and bandwidth, while bf16 preserves much of the safe range at lower cost.
Common wrong answer to avoid: "fp32 gives no value at all once bf16 exists."
Apply now (5 min)
Quick exercise. Write three numbers on paper. One huge. One ordinary. One tiny. Now explain which one threatens fp16 first and why. Sketch from memory two bars. One bar shows nearby precision. One bar shows overall range. Label where fp16 wins and where bf16 wins. Under the sketch, write one sentence on why the copied blueprint must survive before it can be detailed.
Bridge. Good. We now understand the storage trade-off. Next we compress the blueprint itself into compact field notes. → 04-quantization-core.md