03. Week 6: Quantization & Fine-Tuning
For deep understanding, see 02_explainer.md: narrative, diagrams, worked memory math, retrieval prompts. This file is the quick-reference glossary: formulas, tables, definitions, and decision frameworks.
Section 1: Deployment failure and raw memory math
The first question is boring and essential:
Can the model fit?
Raw weight memory
| Format | Bytes / parameter |
|---|---|
| fp32 | 4 |
| fp16 | 2 |
| bf16 | 2 |
| int8 | 1 |
| int4 | 0.5 |
Reference table
| Model size | fp32 | fp16 / bf16 | int8 | int4 raw |
|---|---|---|---|---|
| 7B | 28GB | 14GB | 7GB | 3.5GB |
| 13B | 52GB | 26GB | 13GB | 6.5GB |
| 70B | 280GB | 140GB | 70GB | 35GB |
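The reference table can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes decimal gigabytes (1 GB = 1e9 bytes) and counts weights only; the function name `raw_weight_gb` is illustrative, and quantization metadata such as scales is ignored.

```python
# Bytes per parameter, taken from the "Raw weight memory" table.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def raw_weight_gb(n_params: float, fmt: str) -> float:
    """Raw weight memory in decimal GB (1 GB = 1e9 bytes), weights only."""
    return n_params * BYTES_PER_PARAM[fmt] / 1e9


for name, n in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    row = ", ".join(f"{fmt}={raw_weight_gb(n, fmt):g} GB" for fmt in BYTES_PER_PARAM)
    print(f"{name}: {row}")
```

Whatever this returns is a lower bound on serving memory: activations, KV cache, and runtime overhead come on top.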
Cross-ref: 02_explainer.md §1.1-§1.2, §2.5.
Section 2: Number formats: fp32, fp16, bf16, int8, int4
| Format | Bits | Practical meaning | Best use |
|---|---|---|---|
| fp32 | 32 | high precision + wide range | reference math, some optimizer states |
| fp16 | 16 | more local precision than bf16, narrower range | inference, mixed precision |
| bf16 | 16 | fp32-like exponent range, less mantissa | training default |
| int8 | 8 | bucketized representation with scale | conservative inference quantization |
| int4 | 4 | aggressive bucketization | max compression / QLoRA base |
Bit structure intuition
| Format | Sign | Exponent | Mantissa |
|---|---|---|---|
| fp32 | 1 | 8 | 23 |
| fp16 | 1 | 5 | 10 |
| bf16 | 1 | 8 | 7 |
The key trade-off
- Precision = how finely nearby values can be separated
- Range = how small/large values can be represented without underflow/overflow
Why bf16 training is usually better than fp16
- Same 16-bit footprint
- fp32-like exponent range
- Fewer overflows/underflows in training
- Less dependence on loss scaling
One-line answer: bf16 sacrifices some local precision to preserve much safer dynamic range, and training usually benefits more from the range.
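The range-versus-precision trade-off can be read straight off the exponent and mantissa widths. The sketch below assumes IEEE-style binary formats with a standard exponent bias; `float_format_stats` is an illustrative helper, not a library function.

```python
def float_format_stats(exp_bits: int, man_bits: int) -> dict:
    """Largest finite value, smallest normal value, and spacing near 1.0."""
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -man_bits  # gap between 1.0 and the next representable value
    return {"max": max_finite, "min_normal": min_normal, "eps": epsilon}


# (exponent bits, mantissa bits) from the bit-structure table.
FORMATS = {"fp32": (8, 23), "fp16": (5, 10), "bf16": (8, 7)}

for name, (e, m) in FORMATS.items():
    s = float_format_stats(e, m)
    print(f"{name}: max={s['max']:.3e} min_normal={s['min_normal']:.3e} eps={s['eps']:.3e}")
```

fp16 tops out at 65504, while bf16 reaches roughly the same maximum as fp32; in exchange, bf16's spacing near 1.0 is eight times coarser than fp16's.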
Cross-ref: 02_explainer.md §2.2-§2.4.
Section 3: Quantization basics
Core equations
$$q = \mathrm{round}\left(\frac{w}{s}\right), \qquad \hat{w} = s \cdot q$$
Where:
- w = original floating-point weight
- s = scale
- q = integer bucket
- w_hat ($\hat{w}$) = dequantized approximation
Symmetric quantization intuition
- choose a max magnitude
- map it to max integer level
- round everything else onto that grid
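The three steps above can be sketched in plain Python. This assumes round-to-nearest with the scale set from the largest magnitude, so that q = round(w / s) and w_hat = s * q; the sample weights and function names are illustrative.

```python
def quantize_symmetric(weights, bits=8):
    """Symmetric quantization: returns (integer buckets, scale)."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round onto the grid and clamp to the symmetric integer range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer buckets back to approximate floating-point weights."""
    return [scale * qi for qi in q]


def max_abs_error(weights, bits):
    q, scale = quantize_symmetric(weights, bits)
    return max(abs(w - w_hat) for w, w_hat in zip(weights, dequantize(q, scale)))


weights = [0.023, -0.5, 0.314, 1.27, -0.9]  # hypothetical values
print("int8 max error:", max_abs_error(weights, 8))
print("int4 max error:", max_abs_error(weights, 4))
```

The worst-case error is half a grid step, so it grows quickly as the number of levels drops from 255 to 15.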
The inevitable cost
- rounding error
- information loss
- possible quality degradation on edge cases
Per-tensor vs per-channel
| Scheme | Scale granularity | Pros | Cons |
|---|---|---|---|
| per-tensor | one scale per tensor | simple, low metadata | small channels can get crushed |
| per-channel | one scale per channel / row / output dim | better quality | more metadata / complexity |
| group-wise | one scale per block | compromise | tuning/runtime complexity |
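A small numeric example shows how a single global scale can crush a low-magnitude channel. This is a sketch with made-up weights, where each row stands in for one output channel and quantization is int8 round-to-nearest.

```python
def quantize_rows(matrix, bits=8, per_channel=True):
    """Quantize then dequantize; one scale per row, or one for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    global_max = max(abs(w) for row in matrix for w in row)
    out = []
    for row in matrix:
        max_abs = max(abs(w) for w in row) if per_channel else global_max
        scale = max_abs / qmax if max_abs > 0 else 1.0
        out.append([scale * round(w / scale) for w in row])
    return out


# Hypothetical weights: row 0 is a large-magnitude channel, row 1 a small one.
W = [[10.0, -8.0, 6.0],
     [0.02, -0.03, 0.01]]

deq_tensor = quantize_rows(W, per_channel=False)
deq_channel = quantize_rows(W, per_channel=True)
print("per-tensor, small row: ", deq_tensor[1])
print("per-channel, small row:", deq_channel[1])
```

With one scale for the whole tensor, the grid step is about 0.079, so every value in the small row rounds to zero. With one scale per row, the same values survive with only small rounding error.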
Cross-ref: 02_explainer.md §3.1-§3.3.
Section 4: GPTQ, AWQ, and what they actually optimize
GPTQ
- post-training quantization
- calibration-based
- tries to minimize layer-output reconstruction error on representative inputs
- excellent for offline deployment artifacts
Memory hook: GPTQ = “protect the output behavior during rounding.”
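To make "layer-output reconstruction error" concrete, the sketch below only computes that objective for naive round-to-nearest weights. It is not GPTQ itself: the rounding procedure GPTQ uses to reduce this error is not shown, and the inputs and weights are hypothetical.

```python
def matmul(X, W):
    """Plain-Python matrix product of X (n x d) and W (d x k)."""
    return [[sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for x in X]


def round_to_grid(W, scale):
    """Naive round-to-nearest onto a uniform grid with the given step."""
    return [[scale * round(w / scale) for w in row] for row in W]


def output_error(X, W, W_hat):
    """Squared Frobenius error between X @ W and X @ W_hat."""
    Y, Y_hat = matmul(X, W), matmul(X, W_hat)
    return sum((a - b) ** 2 for ra, rb in zip(Y, Y_hat) for a, b in zip(ra, rb))


X = [[1.0, 0.1], [2.0, -0.1], [1.5, 0.0]]  # hypothetical calibration inputs
W = [[0.26, -0.74], [0.24, 0.76]]          # hypothetical layer weights
W_hat = round_to_grid(W, scale=0.5)        # deliberately coarse grid

print("output reconstruction error:", output_error(X, W, W_hat))
```

The error is measured on layer outputs rather than on the weights, so which weights matter depends on the calibration inputs.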
AWQ
- activation-aware weight quantization
- identifies salient channels/weights using representative activations
- protects weights that matter most under real activations
Memory hook: AWQ = “importance is weight × activation, not weight alone.”
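The "identify salient channels" step can be illustrated with a toy ranking by activation magnitude. This is only a sketch of the selection idea under assumed calibration data; how AWQ then protects the chosen channels is not shown, and the helper names and the 25% fraction are hypothetical.

```python
def channel_saliency(activations):
    """Mean absolute activation per input channel across calibration rows."""
    n = len(activations)
    n_channels = len(activations[0])
    return [sum(abs(row[c]) for row in activations) / n for c in range(n_channels)]


def top_salient(activations, fraction=0.25):
    """Indices of the most activation-heavy input channels, largest first."""
    sal = channel_saliency(activations)
    k = max(1, int(len(sal) * fraction))
    return sorted(range(len(sal)), key=lambda c: sal[c], reverse=True)[:k]


# Hypothetical calibration activations: channel 1 is consistently large.
acts = [[0.1, 5.0, 0.2, 0.1],
        [0.2, 4.0, 0.1, 0.3],
        [0.1, 6.0, 0.3, 0.2]]

print("salient channels:", top_salient(acts))
```

A weight feeding channel 1 contributes far more to the output than an equally sized weight feeding channel 0, which is why weight magnitude alone is a poor importance signal.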
Important distinction
GGUF is a model file format / ecosystem artifact, not the same thing as the quantization algorithm itself.
What tends to degrade first at int4
- exact formatting
- code quality
- multilingual edge cases
- small-label classification boundaries
- multimodal quality
Cross-ref: 02_explainer.md §3.4-§3.7.
Section 5: KV cache, MQA, GQA, PagedAttention
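KV cache memory
$$\text{KV bytes} = 2 \times n_{\text{layers}} \times n_{\text{kv heads}} \times d_{\text{head}} \times \text{seq len} \times \text{batch} \times \text{bytes per element}$$
The factor 2 is for K and V.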
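70B-style 8K example (GQA)
Assume:
- 80 layers
- 8 KV heads
- head dim 128
- 8K context (8,192 tokens)
- bf16 cache (2 bytes)
Then the cache per request is roughly:
$$2 \times 80 \times 8 \times 128 \times 8192 \times 2 = 2{,}684{,}354{,}560 \text{ bytes} \approx 2.5\ \text{GiB}$$
That is about 2.68 GB in decimal units.
So 8 concurrent 8K requests need:
≈ 20 GiB (about 21.5 GB) of KV cache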
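The same arithmetic as a reusable function, assuming the cache stores one K and one V vector per layer, KV head, and token. The 64-KV-head comparison at the end is a hypothetical MHA-style configuration, not a figure from this document.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """KV cache size: 2 (K and V) x layers x kv_heads x head_dim x seq_len x batch x bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


GIB = 1024 ** 3

per_request = kv_cache_bytes(layers=80, kv_heads=8, head_dim=128, seq_len=8192)
print("per request (GiB):", per_request / GIB)
print("8 requests (GiB): ", kv_cache_bytes(80, 8, 128, 8192, batch=8) / GIB)

# Hypothetical comparison: same shape, but 64 KV heads instead of 8.
print("64 KV heads (GiB):", kv_cache_bytes(80, 64, 128, 8192) / GIB)
```

Sequence length and batch size are plain multipliers, which is why long contexts and high concurrency can exhaust memory even when the weights are quantized.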
MQA vs GQA vs standard MHA
| Method | KV heads | Memory effect | Trade-off |
|---|---|---|---|
| MHA | one KV head per query head | baseline | highest flexibility |
| GQA | fewer KV heads than query heads | big KV savings | small quality trade-off |
| MQA | one shared KV head | massive KV savings | more aggressive sharing |
PagedAttention
- allocates KV cache in blocks/pages
- reduces fragmentation
- improves utilization under variable request lengths
- enables better serving throughput
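A toy accounting model shows why block allocation helps under variable request lengths. It compares reserving the maximum length for every request against handing out fixed-size blocks on demand. This is a sketch of the idea, not any serving engine's actual allocator; the block size of 16 and the request lengths are assumptions.

```python
import math


def contiguous_slots(request_lens, max_len):
    """Token slots reserved when every request preallocates max_len."""
    return len(request_lens) * max_len


def paged_slots(request_lens, block_size=16):
    """Token slots reserved when each request gets just enough fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lens)


lens = [100, 900, 37, 2048]  # hypothetical current lengths of four requests
used = sum(lens)

print("utilization, contiguous:", used / contiguous_slots(lens, max_len=2048))
print("utilization, paged:     ", used / paged_slots(lens, block_size=16))
```

In the paged case, waste is limited to the unused tail of each request's last block, so more requests fit in the same memory.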
Cross-ref: 02_explainer.md §4.1-§4.6.
Section 6: LoRA, QLoRA, and adapter methods
LoRA formula
$$W_{\text{eff}} = W_{\text{base}} + A B$$
Where:
- W_base is frozen (shape d x k)
- A has shape d x r (trainable)
- B has shape r x k (trainable)
- r is small (8, 16, 32, 64)
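A minimal forward pass with these shapes, assuming row-vector inputs so that y = x W_base + (x A) B. The tiny matrices are hypothetical, and B is set to zero here so the overlay starts as a no-op.

```python
def matmul(X, W):
    """Plain-Python matrix product of X (n x d) and W (d x k)."""
    return [[sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for x in X]


def lora_forward(x, W_base, A, B):
    """y = x @ W_base + (x @ A) @ B, without forming the d x k product A @ B."""
    base = matmul(x, W_base)
    delta = matmul(matmul(x, A), B)
    return [[b + d for b, d in zip(rb, rd)] for rb, rd in zip(base, delta)]


# Tiny hypothetical shapes: d = 3, k = 2, r = 1.
W_base = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # d x k, frozen
A = [[0.1], [0.2], [0.3]]                      # d x r, trainable
B = [[0.0, 0.0]]                               # r x k, trainable
x = [[1.0, 2.0, 3.0]]

print(lora_forward(x, W_base, A, B))
```

Computing (x A) first keeps the extra work proportional to r, and only A and B need gradients and optimizer state.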
Why LoRA is efficient
For a 4096 x 4096 matrix:
- full update = 16,777,216 params
- rank-16 LoRA = 131,072 params
- about 0.78% of full matrix params
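The counts above follow from d x k for the full matrix versus d x r + r x k for the two factors. A short sketch that repeats the calculation for the ranks this file lists as typical:

```python
def lora_param_counts(d, k, r):
    """Return (full-matrix params, LoRA params, LoRA fraction of full)."""
    full = d * k
    lora = d * r + r * k
    return full, lora, lora / full


for r in (8, 16, 32, 64):
    full, lora, frac = lora_param_counts(4096, 4096, r)
    print(f"r={r}: full={full:,} lora={lora:,} ({frac:.2%})")
```

The LoRA count grows linearly with r, so doubling the rank doubles the trainable parameters for that matrix.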
QLoRA
- frozen 4-bit base model
- train LoRA adapters on top
- huge memory savings relative to full fine-tuning
- great for limited-hardware adaptation
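A back-of-envelope comparison gives a feel for the savings. Every constant below is an assumption for illustration, not a measurement: roughly 16 bytes per trainable parameter for mixed-precision Adam training (weights, gradients, fp32 master copy, two moments), 0.5 bytes per parameter for the frozen 4-bit base, and adapters equal to 0.5% of the base parameter count. Activations and quantization metadata are ignored.

```python
def full_finetune_gb(n_params, bytes_per_param=16.0):
    """Rough training-state memory when every parameter is trainable."""
    return n_params * bytes_per_param / 1e9


def qlora_gb(n_params, adapter_fraction=0.005, base_bytes=0.5, trainable_bytes=16.0):
    """Rough memory for a frozen 4-bit base plus trainable adapters."""
    base = n_params * base_bytes
    adapters = n_params * adapter_fraction * trainable_bytes
    return (base + adapters) / 1e9


n = 7e9  # a 7B-parameter model
print("full fine-tune (GB):", full_finetune_gb(n))
print("QLoRA (GB):         ", qlora_gb(n))
```

Under these assumptions the frozen base dominates the QLoRA footprint, while optimizer state dominates full fine-tuning.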
Other PEFT methods
| Method | Idea | Typical use |
|---|---|---|
| prompt tuning | learn soft prompt vectors | very lightweight steering |
| prefix tuning | learn trainable attention prefixes | stronger than soft prompts |
| classic adapters | insert small trainable modules | explicit modular adaptation |
| IA3-style methods | learn scaling vectors | ultra-light adaptation |
Cross-ref: 02_explainer.md §5.1-§5.5.
Section 7: Decision framework: prompt vs fine-tune vs RAG
| Problem smell | Best first lever | Why |
|---|---|---|
| Format / instruction quality is weak | prompt | cheapest change |
| Knowledge is changing / private / fresh | RAG | do not bake volatile knowledge into weights |
| Stable domain behavior / tone / schema needs repetition | LoRA / PEFT | behavior change is systematic |
| Large, strategic capability shift with lots of data + budget | full fine-tune | highest capacity, highest cost |
| Limited hardware but need learned adaptation | QLoRA | compressed base + small trainable overlay |
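The table can be encoded as a small rule function for quick triage. The order in which the rules are checked is an assumption of this sketch, as are the flag names; the table itself does not rank the rows.

```python
def first_lever(knowledge_volatile, needs_learned_behavior,
                large_data_and_budget=False, limited_hardware=False):
    """Suggest the first lever to try, following the decision table's rows."""
    if knowledge_volatile:
        return "RAG"              # do not bake volatile knowledge into weights
    if not needs_learned_behavior:
        return "prompt"           # cheapest change
    if large_data_and_budget:
        return "full fine-tune"   # highest capacity, highest cost
    if limited_hardware:
        return "QLoRA"            # compressed base + small trainable overlay
    return "LoRA / PEFT"          # systematic behavior change


print(first_lever(knowledge_volatile=True, needs_learned_behavior=False))
print(first_lever(knowledge_volatile=False, needs_learned_behavior=True,
                  limited_hardware=True))
```

Real problems often match more than one row, in which case the levers combine, for example RAG for fresh knowledge plus a LoRA adapter for tone.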
Cross-ref: 02_explainer.md §5.6, §6.6.
Section 8: Quick interview answers
Why bf16 over fp16 for training?
Because bf16 keeps the 8-bit exponent of fp32, so very large or very small activations and gradients are far less likely to overflow or underflow.
Why per-channel over per-tensor?
Because different channels often have different natural scales; one global scale can erase small-but-important channels.
GPTQ vs AWQ?
GPTQ minimizes reconstruction error on calibration data; AWQ uses activation importance to protect salient weights/channels.
Why can a quantized model still OOM?
Because KV cache grows with sequence length and concurrency and is often still stored in fp16/bf16.
Why LoRA?
Because full fine-tuning updates all weights; LoRA learns a low-rank overlay with a tiny fraction of parameters.
When is RAG better than fine-tuning?
When the missing capability is knowledge freshness / private-doc access, not stable behavior change.
Reading list
- 02_explainer.md: primary narrative
- LoRA paper (Hu et al., 2021)
- QLoRA paper (Dettmers et al., 2023)
- GPTQ + AWQ papers / blogs
- vLLM / PagedAttention blog posts
Reference material
YouTube
- Tim Dettmers | QLoRA: Efficient Finetuning of Quantized Large Language Models — best single talk on why 4-bit bases + adapters changed practical fine-tuning.
- A Hacker's Guide to Language Models — practical intuition for adaptation choices, evals, and serving reality.
Blogs
- Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA — practical walkthrough of 4-bit loading and adapter fine-tuning.
- PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware — overview of the major PEFT family and trade-offs.
Self-check
- Do raw weight-memory math for 7B, 13B, and 70B. (explainer §1-§2)
- Explain bf16 vs fp16 without hand-waving. (§2.3-§2.4)
- Explain per-tensor vs per-channel with numbers. (§3.2-§3.3)
- Derive why KV cache scales with seq length and concurrency. (§4.2)
- Explain LoRA vs QLoRA vs RAG as different tools for different problems. (§5.4-§5.6)