06. Module 06 Review: Quantization & Fine-Tuning
Focus: raw memory math, number formats, quantization trade-offs, KV-cache serving reality, LoRA/QLoRA, and the boundary between PEFT and RAG.
Review loop
- Skim the TOC of 02_explainer.md, then re-read the weakest chapter only.
- Re-answer the self-check questions in 01_weekly_plan.md without notes.
- Re-do the hardest prompts in 04_daily_recall.md from memory.
- Sketch the failure-fix table from explainer §6.1 with at least 10 rows.
- Review 05_hands_on_lab.md and note one measurement you trust, one assumption you do not trust, and one next experiment.
- Re-say the bridge to Module 07 from explainer §6.7 without looking.
Reflection
- Which concept is still fuzzy: number formats, quantization methods, KV cache, or PEFT choice?
- Where did your intuition improve most this week: memory math, deployment, or fine-tuning strategy?
- What would you now explain differently to a PM who asks, “Why can’t we just use the biggest model?”
Conceptual checkpoint
- Why is a 70B fp16 model impossible on one 80GB GPU before runtime overhead even enters the picture? (explainer §1.1)
- fp16 vs bf16 — what is the real trade-off? (explainer §2.3-§2.4)
- Why can fewer mantissa bits still be the better training choice? (explainer §2.4)
- Per-tensor vs per-channel — what failure does per-channel fix? (explainer §3.3)
- GPTQ vs AWQ — what signal is each method using? (explainer §3.4-§3.5)
- Why does int4 often hurt edge cases before average demos? (explainer §3.6)
- Why can weight quantization leave serving failures unsolved? (explainer §4.1-§4.2)
- MQA vs GQA — what is shared, and what memory term drops? (explainer §4.3-§4.4)
- PagedAttention — why is the OS analogy appropriate? (explainer §4.5)
- Why is LoRA parameter-efficient? Use the A @ B picture. (explainer §5.2-§5.3)
- For QLoRA, what is compressed and what remains trainable? (explainer §5.4)
- When is RAG the right answer instead of fine-tuning? (explainer §5.6, §6.6)
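Two of the questions above (raw weight memory and the LoRA A @ B picture) reduce to arithmetic you can check by hand. The following is a minimal sketch, not code from the explainer: the helper names `weight_bytes` and `lora_params`, the 4096-wide layer, and rank 8 are illustrative assumptions, and A is assumed to be d_in x r with B as r x d_out so that A @ B matches the frozen weight's shape (the explainer's convention may differ).

```python
GB = 10**9  # decimal gigabytes, assumed here for simplicity


def weight_bytes(n_params: int, bits_per_param: int) -> int:
    """Raw weight storage only. Ignores KV cache, activations,
    and any quantization metadata such as scales or zero points."""
    return n_params * bits_per_param // 8


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in a low-rank pair:
    A is (d_in x rank), B is (rank x d_out)."""
    return d_in * rank + rank * d_out


n_params = 70 * 10**9
for name, bits in [("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: {weight_bytes(n_params, bits) / GB:.0f} GB of weights")

# Hypothetical square projection layer, chosen only for illustration.
d = 4096
full = d * d
adapter = lora_params(d, d, rank=8)
print(f"full layer: {full} params, LoRA pair: {adapter} params "
      f"({adapter / full:.2%} of the layer)")
```

The fp16 figure exceeds 80 GB before any KV cache or activation memory is counted, and in this example the adapter pair trains well under 1% of the layer's parameters.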
Applied checkpoint
- You have one 80GB GPU and must serve a 70B model at 8K context. What memory terms do you estimate before committing? (explainer §6.4)
- Your quantized model fits at idle but crashes under load. What is the most likely missing memory term? (explainer §4.1-§4.2)
- You have a 24GB GPU and a 7B model. Which fine-tuning path is most realistic and why? (explainer §5.1-§5.4)
- The product requirement changes from “better tone” to “fresh private policy answers.” What architecture decision changes immediately? (explainer §5.6, §6.7)
- A teammate says “just fine-tune the docs into the model.” Give the clean rebuttal. (explainer §5.6, §6.6)
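The first two scenarios above hinge on counting the KV cache as its own memory term next to the weights. The following is a rough sketch under stated assumptions, not a sizing tool: the function name `kv_cache_bytes` is hypothetical, the model shape (80 layers, 8 KV heads under GQA, head dimension 128) is an assumed 70B-class configuration, and the estimate ignores activations, allocator fragmentation, and quantization metadata.

```python
GB = 10**9  # decimal gigabytes, assumed here for simplicity


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int,
                   bytes_per_elem: int = 2) -> int:
    """KV cache size: one key and one value vector per layer,
    per KV head, per token, per sequence in the batch."""
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed configuration; check the real model config before committing.
layers, kv_heads, head_dim, context = 80, 8, 128, 8192

weights_int4 = 70 * 10**9 * 4 // 8          # raw int4 weights only
per_sequence = kv_cache_bytes(layers, kv_heads, head_dim, context, 1)
headroom = 80 * GB - weights_int4           # what is left for everything else

print(f"int4 weights:        {weights_int4 / GB:.1f} GB")
print(f"KV cache per 8K seq: {per_sequence / GB:.2f} GB (fp16 cache)")
print(f"headroom after weights: {headroom / GB:.1f} GB")
print(f"upper bound on concurrent 8K sequences: {headroom // per_sequence}")
```

The cache term scales with batch size and context length, which is why a model that fits at idle can still run out of memory once concurrent requests arrive.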
Foundation-gap audit before Module 07
Module 07 assumes these are automatic:
- [ ] I can do raw weight-memory math from memory
- [ ] I understand why KV cache is a separate serving bill
- [ ] I can explain GQA in one clean paragraph
- [ ] I know why LoRA exists and when QLoRA is needed
- [ ] I can distinguish prompt vs PEFT vs RAG without mixing them
If any box is unchecked, revisit 02_explainer.md chapters 4-5 and 03_study_material.md §5-§7 before moving on.
Self-evaluation
| Section | Score | Out of |
|---|---|---|
| Conceptual checkpoint | __ | 12 |
| Applied checkpoint | __ | 5 |
| Foundation-gap audit | __ | 5 |
| Total | __ | 22 |
Interpretation:
- 19-22 = strong; move to Module 07
- 15-18 = okay; re-read weak sections before moving on
- <15 = do not rush; rework the memory math and decision boundaries
Completion gate
- [ ] All explainer chapters read once
- [ ] Failure-fix table reconstructed from memory
- [ ] Assignment shipped with decision memo
- [ ] Can explain bf16, GPTQ, AWQ, GQA, LoRA, and QLoRA cleanly
- [ ] Can defend “fine-tune vs RAG” with examples
- [ ] Ready to enter 08_rag_system_design