00. Quantization & Fine-Tuning — The Five-Year-Old Version
Module 05 trained the model. This module makes it fit on real hardware and adapt to your task.
Imagine a sculptor carrying the blueprint for a giant temple statue. The blueprint is perfect: every curve and angle is marked. But the road to the site is narrow and the truck is small. That small truck is the site constraint. So what can the sculptor do? Copy the important parts into the field notes. The notes are lighter and fit the truck, but some tiny details get rounded away. That lost detail is the rounding error. Simple, no?
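To make the rounding error concrete, here is a minimal sketch of copying a few numbers into field notes and reading them back. The weights are made up, the function name `quantize_dequantize` is hypothetical, and the single shared scale is the simplest possible choice, not the method any particular library uses.

```python
def quantize_dequantize(values, bits=8):
    """Round floats onto a signed integer grid, then map them back to floats."""
    qmax = 2 ** (bits - 1) - 1                       # 7 for 4 bits, 127 for 8 bits
    # One shared scale for the whole list; fall back to 1.0 if every value is zero.
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    codes = [round(v / scale) for v in values]       # the field notes (small integers)
    restored = [c * scale for c in codes]            # what the site reads back
    return codes, restored, scale

blueprint = [0.12, -0.53, 0.98, -0.05, 0.31]         # made-up weights for illustration
codes, field_notes, scale = quantize_dequantize(blueprint, bits=4)
rounding_error = [abs(a - b) for a, b in zip(blueprint, field_notes)]

print("codes:", codes)
print("largest rounding error:", max(rounding_error))
```

Each value can be off by at most half a grid step (`scale / 2`), so fewer bits means a coarser grid and a larger worst-case error.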
Now for the first job of this module: learning how to shrink the blueprint without ruining the statue. One method uses one rough ruler for everything. Another gives each statue part its own ruler. Another tests a sample first and protects fragile pieces. The goal is the same every time: fit the truck, keep the shape. A little memory math helps here. If the full blueprint needs about 140 GB and your truck carries 80 GB, it will not move. If the field notes shrink it to about 35 GB, it can travel. That is the whole opening problem.
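The memory math above can be reproduced in a few lines. This sketch assumes the blueprint is the 70B-parameter model named in the module outline, stored at 16 bits per weight, and it counts weights only. Real quantized formats also store scales and other metadata, so actual files come out somewhat larger than these figures.

```python
BYTES_PER_GB = 1e9

def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, ignoring scales, activations, and caches."""
    return num_params * bits_per_weight / 8 / BYTES_PER_GB

num_params = 70e9      # assumed: the 70B model from the outline
truck_gb = 80          # the site constraint from the example above

for name, bits in [("16-bit blueprint", 16), ("8-bit field notes", 8), ("4-bit field notes", 4)]:
    size_gb = weight_memory_gb(num_params, bits)
    verdict = "fits" if size_gb <= truck_gb else "does not fit"
    print(f"{name}: {size_gb:.0f} GB, {verdict} in {truck_gb} GB")
```

Under these assumptions the 16-bit blueprint comes to 140 GB and the 4-bit field notes to 35 GB, matching the numbers in the paragraph.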
But the plans are only half the load. While the sculptor works, the site keeps fresh reference cards that remind the team what happened a moment ago. As the work grows longer, the card pile grows too. So even after the field notes fit, the truck can still choke later. That is why this module also studies working memory, not just storage at rest. Fitting the statue and running the statue are different jobs, and this is where many teams get surprised.
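The growing card pile corresponds to the KV cache, which a later chapter covers in detail. As a rough sketch, its size scales linearly with the number of tokens kept in context. The layer count, KV-head count, and head size below are hypothetical values chosen for illustration, not the configuration of any specific model.

```python
def kv_cache_gb(seq_len, batch_size, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV-cache size: one key and one value vector per token, layer, and KV head."""
    total_bytes = (
        2                      # keys and values
        * num_layers
        * num_kv_heads
        * head_dim
        * seq_len
        * batch_size
        * bytes_per_value      # 2 bytes assumes 16-bit cache entries
    )
    return total_bytes / 1e9

# Hypothetical model shape, for illustration only.
shape = dict(num_layers=80, num_kv_heads=8, head_dim=128)

for seq_len in (1_000, 8_000, 32_000):
    size_gb = kv_cache_gb(seq_len=seq_len, batch_size=1, **shape)
    print(f"{seq_len:>6} tokens: {size_gb:.2f} GB of reference cards")
```

Doubling the context length or the batch size doubles the pile, which is why a model whose weights fit at load time can still run out of memory mid-conversation.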
Then comes customization. Suppose the old statue is mostly right, but this city wants a new crown and hand pose. You do not redraw the whole blueprint; that would be wasteful. You place the overlay sketch on top and mark only the changes. Sometimes the main plan stays heavy. Sometimes even the main plan arrives as field notes. Either way, only the overlay learns the task. So the final choice is simple: shrink, patch, use outside facts, or leave it alone.
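The overlay idea can be sketched without any libraries. The frozen matrix `W` plays the blueprint, and two thin matrices `A` and `B` play the overlay sketch. The helper names, the toy sizes, and the 4096-wide layer used for the parameter count are all assumptions for illustration. The `alpha / rank` scaling and the all-zero starting value for `B` follow the LoRA paper.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Frozen base output plus the scaled low-rank update B @ (A @ x)."""
    rank = len(A)                        # A is rank x d_in, B is d_out x rank
    base = matvec(W, x)                  # the blueprint, never updated
    delta = matvec(B, matvec(A, x))      # the overlay, the only trainable part
    return [b + (alpha / rank) * d for b, d in zip(base, delta)]

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in the overlay for one weight matrix."""
    return rank * (d_in + d_out)

random.seed(0)
d_out, d_in, rank = 4, 6, 2              # toy sizes
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero start: the overlay begins blank
x = [1.0] * d_in

print("overlay changes nothing yet:", lora_forward(W, A, B, x) == matvec(W, x))
print("full matrix params:", 4096 * 4096)
print("rank-8 overlay params:", lora_param_count(4096, 4096, 8))
```

For a 4096 by 4096 matrix, a rank-8 overlay trains 65,536 numbers instead of 16,777,216, which is the sense in which you mark only the changes.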
The placeholders you will see called back
| Placeholder | Meaning |
|---|---|
| the blueprint | Full-precision weights — the detailed original model parameters. |
| the field notes | Quantized weights — compressed version that fits the hardware. |
| the rounding error | Quantization noise — detail lost during compression. |
| the overlay sketch | LoRA adapter — thin trainable layer for task-specific changes. |
| the site constraint | GPU memory — the physical limit that forces all these trade-offs. |
Top resources
- LoRA: Low-Rank Adaptation of Large Language Models — the original paper for the overlay-sketch idea.
- QLoRA: Efficient Finetuning of Quantized LLMs — the key paper for 4-bit bases plus trainable adapters.
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — calibration-based weight compression with error control.
- AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration — protects the weights attached to important activations.
- vLLM and PagedAttention — the clearest practical explanation of KV-cache memory management.
- Hugging Face Transformers Quantization Docs — a good implementation map across common toolchains.
- Sebastian Raschka on LoRA and DoRA — intuitive engineering-first explanation of adapter tuning.
What's coming
- 01-opening-failure.md — 70B model that doesn't fit
- 02-number-formats.md — fp32, fp16, bf16, int8, int4
- 03-precision-vs-range.md — why bf16 training beats fp16
- 04-quantization-core.md — the rounding trick with real numbers
- 05-per-tensor-vs-per-channel.md — channel scale mismatch
- 06-gptq.md — calibration-based error-aware quantization
- 07-awq.md — activation-aware weight quantization
- 08-kv-cache-memory.md — weights are only half the story
- 09-mqa-gqa.md — sharing K/V across heads
- 10-paged-attention-serving.md — memory management for inference
- 11-lora.md — low-rank adaptation
- 12-qlora.md — quantized base + trainable adapters
- 13-peft-vs-rag-decision.md — choosing the right lever
- 14-honest-admission.md — what we don't fully understand
Bridge. The blueprint is too heavy for the truck — that is the opening failure we solve first. → 01-opening-failure.md