00. Quantization & Fine-Tuning — The Five-Year-Old Version

Module 05 trained the model. This module makes it fit on real hardware and adapt to your task.


Imagine a sculptor carrying the blueprint for a giant temple statue. The blueprint is perfect. Every curve and angle is marked. But the road to the site is narrow. The truck is small. That small truck is the site constraint. So what to do? The sculptor copies the important parts into the field notes. The notes are lighter and fit the truck. Some tiny details get rounded away. That lost detail is the rounding error. Simple, no?

Now see the first job of this module. We learn how to shrink the blueprint without ruining the statue. One method uses one rough ruler for everything. Another gives each statue part its own ruler. Another tests a sample first and protects fragile pieces. Same goal every time: fit the truck, keep the shape. A little memory math helps here. If the full blueprint needs about 140 GB and your truck carries 80 GB, it will not move. If the field notes shrink it to about 35 GB, a quarter of the size, now it can travel. That is the whole opening problem.
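The truck arithmetic can be checked in a few lines. This is a rough sketch, not the module's own code: it assumes the 140 GB blueprint is a 70-billion-parameter model stored at 16 bits per weight, that the field notes use 4 bits per weight, and that 1 GB means 10^9 bytes. It ignores the small extra cost of scales and other bookkeeping. The names `weight_memory_gb` and `fits` are made up for illustration.

```python
# Approximate storage cost per weight, in bytes, for common number formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, fmt: str) -> float:
    """Weights-only memory in GB (1 GB = 1e9 bytes), ignoring quantization overhead."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


def fits(num_params: float, fmt: str, gpu_gb: float) -> bool:
    """True if the weights alone fit in the given GPU memory."""
    return weight_memory_gb(num_params, fmt) <= gpu_gb


PARAMS_70B = 70e9   # assumption: the "giant statue" is a 70B-parameter model
TRUCK_GB = 80.0     # the site constraint from the text

for fmt in ("fp16", "int8", "int4"):
    size = weight_memory_gb(PARAMS_70B, fmt)
    verdict = "fits" if fits(PARAMS_70B, fmt, TRUCK_GB) else "does not fit"
    print(f"{fmt}: about {size:.0f} GB, {verdict} in {TRUCK_GB:.0f} GB")
```

Under these assumptions the 16-bit blueprint comes to about 140 GB and the 4-bit field notes to about 35 GB, matching the numbers above. Weights are not the only thing on the truck, so a model that "fits" here can still run out of room while working.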

But the plans are only half the load. While the sculptor works, the site keeps fresh reference cards. Those cards remind the team what happened one moment ago. As the work grows longer, the card pile grows too. So even after the field notes fit, the truck can still choke later. That is why this module also studies working memory, not just storage at rest. Fitting the statue and running the statue are different jobs. This is where many teams get surprised.
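The growing card pile can be estimated too. A minimal sketch, assuming the cards are a standard key/value cache: one key vector and one value vector per layer, per key/value head, per token. The function name `kv_cache_gb` and the model shape below (80 layers, 8 key/value heads, head size 128, 16-bit values) are hypothetical values chosen for illustration, not figures from this module.

```python
def kv_cache_gb(num_layers: int, num_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int = 1, bytes_per_value: int = 2) -> float:
    """Approximate key/value cache size in GB (1 GB = 1e9 bytes)."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    total_bytes = (2 * num_layers * num_kv_heads * head_dim
                   * seq_len * batch_size * bytes_per_value)
    return total_bytes / 1e9


# Hypothetical model shape, for illustration only.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for seq_len, batch in [(4_096, 1), (32_768, 1), (32_768, 8)]:
    size = kv_cache_gb(LAYERS, KV_HEADS, HEAD_DIM, seq_len, batch)
    print(f"seq_len={seq_len:>6}, batch={batch}: about {size:.1f} GB of cards")
```

The pile grows in direct proportion to both the length of the work and the number of jobs running at once, which is why a model whose weights fit comfortably can still exhaust memory during long or batched runs.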

Then comes customization. Suppose the old statue is mostly right, but this city wants a new crown and hand pose. You do not redraw the whole blueprint. Wasteful. You place the overlay sketch on top and mark only the changes. Sometimes the main plan stays at full weight. Sometimes even the main plan arrives as field notes. Either way, only the overlay learns the task. So the final choice is simple: shrink, patch, look up outside facts, or leave it alone.
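How thin is the overlay? A small sketch, assuming the overlay is a low-rank adapter: instead of retraining a full weight matrix of shape `d_out x d_in`, it trains two narrow matrices of shapes `d_out x rank` and `rank x d_in`. The function names and the example sizes (a 4096 by 4096 layer, rank 8) are illustrative assumptions.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Trainable values if the whole weight matrix is redrawn."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable values in a low-rank overlay: (d_out x rank) plus (rank x d_in)."""
    return rank * (d_in + d_out)


# Hypothetical layer size and rank, for illustration only.
D_IN, D_OUT, RANK = 4096, 4096, 8

full = full_params(D_IN, D_OUT)
overlay = lora_params(D_IN, D_OUT, RANK)
print(f"full redraw: {full:,} values")
print(f"overlay:     {overlay:,} values ({100 * overlay / full:.2f}% of the full matrix)")
```

The main plan stays frozen in this picture, so the training cost follows the overlay size, not the blueprint size. That holds whether the frozen plan is kept at full precision or stored as field notes.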


The placeholders you will see again

| Placeholder | Meaning |
| --- | --- |
| the blueprint | Full-precision weights: the detailed original model parameters. |
| the field notes | Quantized weights: a compressed version that fits the hardware. |
| the rounding error | Quantization noise: detail lost during compression. |
| the overlay sketch | LoRA adapter: a thin trainable layer for task-specific changes. |
| the site constraint | GPU memory: the physical limit that forces all these trade-offs. |

What's coming

  1. 01-opening-failure.md — 70B model that doesn't fit
  2. 02-number-formats.md — fp32, fp16, bf16, int8, int4
  3. 03-precision-vs-range.md — why bf16 training beats fp16
  4. 04-quantization-core.md — the rounding trick with real numbers
  5. 05-per-tensor-vs-per-channel.md — channel scale mismatch
  6. 06-gptq.md — calibration-based error-aware quantization
  7. 07-awq.md — activation-aware weight quantization
  8. 08-kv-cache-memory.md — weights are only half the story
  9. 09-mqa-gqa.md — sharing K/V across heads
  10. 10-paged-attention-serving.md — memory management for inference
  11. 11-lora.md — low-rank adaptation
  12. 12-qlora.md — quantized base + trainable adapters
  13. 13-peft-vs-rag-decision.md — choosing the right lever
  14. 14-honest-admission.md — what we don't fully understand

Bridge. The blueprint is too heavy for the truck — that is the opening failure we solve first. → 01-opening-failure.md