01. Week 6: Quantization & Fine-Tuning
Key concepts to master
- Raw weight memory: `params × bytes_per_param`
- fp32 vs fp16 vs bf16 vs int8 vs int4: what changes in practice
- Why bf16 is usually preferred to fp16 for training stability
- Per-tensor vs per-channel quantization; why local scales matter
- GPTQ vs AWQ — calibration/error-aware vs activation-aware
- KV cache math; why serving cost changes with context and concurrency
- MQA vs GQA vs standard MHA — memory-quality trade-off
- PagedAttention as cache-memory management, not “new attention math”
- LoRA low-rank decomposition and rank trade-offs
- QLoRA as frozen 4-bit base + trainable adapters
- Decision rule: prompt vs PEFT vs RAG
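
The raw weight-memory rule above (`params × bytes_per_param`) can be checked with a few lines of arithmetic. This is a minimal sketch: the function name `weight_memory_gb` and the `BYTES_PER_PARAM` table are illustrative, and the int4 figure assumes two weights packed per byte with no allowance for scales, zero-points, activations, or KV cache.

```python
# Bytes per parameter for common storage formats.
# Assumption: int4 packs two weights per byte; quantization metadata is ignored.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Raw weight memory in GB (10^9 bytes): params x bytes_per_param."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
        row = ", ".join(
            f"{dtype}={weight_memory_gb(params, dtype):.1f} GB"
            for dtype in BYTES_PER_PARAM
        )
        print(f"{name}: {row}")
```

For a 70B model this gives 140 GB in fp16 or bf16, 70 GB in int8, and 35 GB in int4, which is why fp16 weights alone exceed a single 80 GB GPU. Real deployments need headroom beyond these figures.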
🧠 Mental models
- Per-channel quantization: "Give each feature map its own ruler; one shared ruler crushes small channels."
- GPTQ: "Compress the model, then sand down the worst reconstruction errors."
- AWQ: "Protect the weights attached to the busiest neurons before shrinking everything else."
- LoRA: "Bolt a small steering wheel onto a frozen truck instead of rebuilding the engine."
- QLoRA: "Keep the base model zipped in 4-bit form and train only a lightweight patch layer."
- Prompt vs PEFT vs RAG: "Choose between better instructions, a behavior patch, or an external memory."
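
The "one shared ruler" picture for per-channel quantization can be made concrete with a toy round-trip. This is a sketch under stated assumptions: symmetric absmax int8 quantization, two made-up channels (`big` and `small`), and hypothetical helper names. Real quantizers add details such as zero-points, group sizes, and calibration data.

```python
def absmax_scale(values, qmax=127):
    """Scale that maps the largest magnitude onto the top integer level."""
    return max(abs(v) for v in values) / qmax


def quantize_dequantize(values, scale, qmax=127):
    """Symmetric int8 round-trip: snap to the integer grid, clamp, rescale."""
    restored = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        restored.append(q * scale)
    return restored


def max_abs_error(original, restored):
    """Largest absolute reconstruction error across the channel."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Two hypothetical output channels with very different magnitudes.
big = [10.0, -8.0, 6.0, -4.0]
small = [0.02, -0.015, 0.012, -0.005]

# Per-tensor: one shared scale, dominated by the big channel.
shared_scale = absmax_scale(big + small)
small_per_tensor = quantize_dequantize(small, shared_scale)

# Per-channel: the small channel gets its own scale.
small_per_channel = quantize_dequantize(small, absmax_scale(small))

if __name__ == "__main__":
    print("per-tensor :", small_per_tensor, max_abs_error(small, small_per_tensor))
    print("per-channel:", small_per_channel, max_abs_error(small, small_per_channel))
```

With the shared scale, one quantization step is roughly 0.079, so every value in the small channel rounds to zero and the channel is erased. With its own scale, the same channel is reconstructed to within half of a much smaller step.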
⚠️ Common traps
- Assuming INT4 automatically gives 4× faster inference; memory drops faster than end-to-end latency.
- Ignoring KV-cache growth, so a quantized model still OOMs under long context or high concurrency.
- Confusing GPTQ and AWQ: GPTQ minimizes post-quantization error; AWQ preserves activation-salient weights.
- Fine-tuning to inject frequently changing facts that belong in RAG, creating stale memorized knowledge.
- Merging adapters too early and losing the ability to swap tasks, compare domains, or roll back bad tuning.
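
The KV-cache trap is easiest to see with numbers. The sketch below assumes the usual accounting (one key and one value vector per layer, per KV head, per token, per sequence) and a hypothetical 70B-class shape of 80 layers, 64 query heads, and head dimension 128. The function name and the traffic figures (8192-token context, 16 concurrent sequences, 2-byte cache values) are illustrative, not taken from any specific model or server.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Keys plus values for every layer, KV head, token position, and sequence."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


GIB = 2 ** 30

# Hypothetical model shape and traffic; only the number of KV heads changes.
LAYERS, HEAD_DIM, SEQ_LEN, BATCH = 80, 128, 8192, 16
VARIANTS = {"MHA (64 KV heads)": 64, "GQA (8 KV heads)": 8, "MQA (1 KV head)": 1}

if __name__ == "__main__":
    for name, kv_heads in VARIANTS.items():
        size = kv_cache_bytes(LAYERS, kv_heads, HEAD_DIM, SEQ_LEN, BATCH)
        print(f"{name}: {size / GIB:.0f} GiB")
```

Under these assumptions the cache is 320 GiB with standard MHA, 40 GiB with GQA, and 5 GiB with MQA. The cache grows linearly with both context length and concurrency, so even the GQA figure exceeds the roughly 35 GB that int4 weights for a 70B model would occupy.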
🔗 Prerequisites & connections
Builds on: Module 05 (LLM Training Lifecycle) — model memory math, training stability, and why full fine-tuning is expensive. Feeds into: Module 07 (RAG Fundamentals) — the prompt vs PEFT vs RAG decision boundary becomes the starting point for retrieval systems.
💬 Interview phrasing
- "You have a 70B model that does not fit in fp16 on one GPU. Walk me through INT8, INT4, GPTQ, and AWQ options."
- "Why would you choose QLoRA over full fine-tuning for a domain adaptation task?"
- "A quantized model is cheap in VRAM at rest but still crashes with long prompts. What is consuming memory?"
- "When is prompt engineering enough, when do you use LoRA, and when do you stop and build RAG instead?"
⏱️ Difficulty markers
- 🟢 raw weight-memory math
- 🟢 fp16 vs bf16 trade-offs
- 🟡 per-tensor vs per-channel quantization
- 🟡 LoRA rank trade-offs
- 🔴 GPTQ vs AWQ calibration logic
- 🔴 QLoRA, adapter merging, and serving implications
Self-check questions
For fuller answers and worked examples, see 02_explainer.md §1-§6.
- A 70B model in fp16 does not fit on an 80GB GPU. What are your options, and what trade-off does each imply? (explainer §1.1-§1.3)
- Why is bf16 usually better than fp16 for training, even though fp16 has more mantissa bits? (§2.3-§2.4)
- Show with numbers why per-tensor quantization can damage a small channel. (§3.2-§3.3)
- GPTQ vs AWQ — what signal is each method using to reduce quality loss? (§3.4-§3.5)
- Why can a quantized model still fail under long-context traffic? (§4.1-§4.2)
- MQA vs GQA — what memory do they reduce, and what do they trade away? (§4.3-§4.4)
- LoRA rank `r`: what do higher and lower ranks buy you? (§5.2-§5.3)
- When is fine-tuning the wrong tool, and RAG the right tool? (§5.6, §6.6)
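
As a starting point for the LoRA rank question, the cost side of the trade-off is simple arithmetic. The sketch assumes the standard low-rank update, where a frozen `d_out × d_in` weight is adapted by two trainable factors of shapes `d_out × r` and `r × d_in`. The 4096 × 4096 layer size and the function names are illustrative.

```python
def full_params(d_in, d_out):
    """Parameters in the frozen dense weight matrix."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters in the two low-rank factors (r x d_in and d_out x r)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d_in = d_out = 4096  # hypothetical projection size
    dense = full_params(d_in, d_out)
    for rank in (4, 8, 16, 64):
        trainable = lora_params(d_in, d_out, rank)
        print(f"r={rank:>2}: {trainable:>7} trainable params "
              f"({100 * trainable / dense:.2f}% of the dense layer)")
```

Trainable parameters grow linearly with `r`: at `r=8` the adapter for this layer has 65,536 parameters, about 0.39% of the 16.8M in the dense matrix. Higher ranks buy more adapter capacity at proportionally higher memory and compute. Whether that capacity is needed depends on the task and is not something this arithmetic can answer.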
Health check
- [ ] All 6 explainer sections read at least once
- [ ] Can do raw weight-memory math for 7B, 13B, 70B from memory
- [ ] Can explain KV cache without hand-waving
- [ ] Can explain LoRA vs QLoRA cleanly
- [ ] Assignment shipped with comparison table and decision memo
- [ ] Ready to enter Module 07 with a clear “fine-tune vs RAG” boundary