06. Module 06 Review — Quantization & Fine-Tuning

Focus: raw memory math, number formats, quantization trade-offs, KV-cache serving reality, LoRA/QLoRA, and the boundary between PEFT and RAG.

Review loop

  1. Skim the TOC of 02_explainer.md, then re-read the weakest chapter only.
  2. Re-answer the self-check questions in 01_weekly_plan.md without notes.
  3. Re-do the hardest prompts in 04_daily_recall.md from memory.
  4. Sketch the failure-fix table from explainer §6.1 with at least 10 rows.
  5. Review 05_hands_on_lab.md and note one measurement you trust, one assumption you do not trust, and one next experiment.
  6. Restate the bridge to Module 07 from explainer §6.7 without looking.

Reflection

  • Which concept is still fuzzy: number formats, quantization methods, KV cache, or PEFT choice?
  • Where did your intuition improve most this week: memory math, deployment, or fine-tuning strategy?
  • What would you now explain differently to a PM who asks, “Why can’t we just use the biggest model?”

Conceptual checkpoint

  1. Why is a 70B fp16 model impossible on one 80GB GPU before runtime overhead even enters the picture? (explainer §1.1)
  2. fp16 vs bf16 — what is the real trade-off? (explainer §2.3-§2.4)
  3. Why can fewer mantissa bits still be the better training choice? (explainer §2.4)
  4. Per-tensor vs per-channel — what failure does per-channel fix? (explainer §3.3)
  5. GPTQ vs AWQ — what signal is each method using? (explainer §3.4-§3.5)
  6. Why does int4 often hurt edge cases before average demos? (explainer §3.6)
  7. Why can weight quantization leave serving failures unsolved? (explainer §4.1-§4.2)
  8. MQA vs GQA — what is shared, and what memory term drops? (explainer §4.3-§4.4)
  9. PagedAttention — why is the OS analogy appropriate? (explainer §4.5)
  10. Why is LoRA parameter-efficient? Use the A @ B picture. (explainer §5.2-§5.3)
  11. QLoRA — what is compressed and what remains trainable? (explainer §5.4)
  12. When is RAG the right answer instead of fine-tuning? (explainer §5.6, §6.6)
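
Questions 1 and 10 both come down to arithmetic that can be checked with a short Python sketch. The helper names (`weight_memory_gb`, `lora_trainable_params`) and the 4096-wide layer with rank 8 are illustrative assumptions, not values taken from the explainer. GB here means 10^9 bytes.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Raw weight memory in GB (1 GB = 1e9 bytes), ignoring all runtime overhead."""
    return n_params * bytes_per_param / 1e9


def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameter count of the low-rank pair A (d_in x rank) and B (rank x d_out).

    A @ B has the same shape as a full d_in x d_out update,
    but only the two thin factors are trained.
    """
    return rank * (d_in + d_out)


if __name__ == "__main__":
    # Question 1: 70B parameters at 2 bytes each (fp16 or bf16) vs 0.5 bytes (int4).
    print("70B fp16 weights (GB):", weight_memory_gb(70e9, 2))
    print("70B int4 weights (GB):", weight_memory_gb(70e9, 0.5))

    # Question 10: one hypothetical 4096 x 4096 projection, LoRA rank 8.
    full = 4096 * 4096
    lora = lora_trainable_params(4096, 4096, rank=8)
    print("full update params:", full)
    print("LoRA params:", lora, "ratio:", full // lora)
```

At 2 bytes per parameter, 70B parameters need about 140 GB for weights alone, which already exceeds 80 GB before any activations or KV cache are counted. For the assumed layer, the low-rank pair holds 65,536 parameters against 16,777,216 for the full matrix, a 256x reduction.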

Applied checkpoint

  1. You have one 80GB GPU and must serve a 70B model at 8K context. What memory terms do you estimate before committing? (explainer §6.4)
  2. Your quantized model fits at idle but crashes under load. What is the most likely missing memory term? (explainer §4.1-§4.2)
  3. You have a 24GB GPU and a 7B model. Which fine-tuning path is most realistic and why? (explainer §5.1-§5.4)
  4. The product requirement changes from “better tone” to “fresh private policy answers.” What architecture decision changes immediately? (explainer §5.6, §6.7)
  5. A teammate says “just fine-tune the docs into the model.” Give the clean rebuttal. (explainer §5.6, §6.6)
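
For applied questions 1 and 2, the KV-cache term can be estimated separately from the weights. This is a minimal sketch, assuming a hypothetical 70B-class shape (80 layers, 128-dim heads, 64 query heads, 8 KV heads under GQA) and fp16 cache entries. Substitute the real config of the model you serve.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int,
    bytes_per_elem: int = 2,
) -> int:
    """KV-cache size in bytes for a decoder-only transformer.

    The leading factor 2 accounts for storing both K and V in every layer.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Assumed shape, not taken from the explainer.
    layers, head_dim, ctx = 80, 128, 8192

    gqa = kv_cache_bytes(layers, n_kv_heads=8, head_dim=head_dim, seq_len=ctx, batch_size=1)
    mha = kv_cache_bytes(layers, n_kv_heads=64, head_dim=head_dim, seq_len=ctx, batch_size=1)

    print(f"GQA (8 KV heads), 1 sequence:  {gqa / 1e9:.2f} GB")
    print(f"MHA (64 KV heads), 1 sequence: {mha / 1e9:.2f} GB")

    # The cache grows linearly with concurrent sequences.
    for batch in (1, 8, 16):
        total = kv_cache_bytes(layers, 8, head_dim, ctx, batch)
        print(f"GQA, batch {batch}: {total / 1e9:.2f} GB")
```

Under these assumptions, one 8K sequence costs about 2.7 GB with 8 KV heads and about 21.5 GB with 64. The cost scales linearly with batch size, which is why a quantized model that fits at idle can still run out of memory under load.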

Foundation-gap audit before Module 07

Module 07 assumes these are automatic:

  • [ ] I can do raw weight-memory math from memory
  • [ ] I understand why KV cache is a separate serving bill
  • [ ] I can explain GQA in one clean paragraph
  • [ ] I know why LoRA exists and when QLoRA is needed
  • [ ] I can distinguish prompt vs PEFT vs RAG without mixing them

If any box is unchecked, revisit 02_explainer.md chapters 4-5 and 03_study_material.md §5-§7 before moving on.

Self-evaluation

Section                 Score   Out of
Conceptual checkpoint   __      12
Applied checkpoint      __      5
Foundation-gap audit    __      5
Total                   __      22

Interpretation:

  • 19-22 = strong; move to Module 07
  • 15-18 = okay; re-read weak sections before moving on
  • <15 = do not rush; rework the memory math and decision boundaries

Completion gate

  • [ ] All explainer chapters read once
  • [ ] Failure-fix table reconstructed from memory
  • [ ] Assignment shipped with decision memo
  • [ ] Can explain bf16, GPTQ, AWQ, GQA, LoRA, and QLoRA cleanly
  • [ ] Can defend “fine-tune vs RAG” with examples
  • [ ] Ready to enter 08_rag_system_design