01. Week 6 — Quantization & Fine-Tuning

Key concepts to master

  • Raw weight memory: params × bytes_per_param
  • fp32 vs fp16 vs bf16 vs int8 vs int4 — what changes in practice
  • Why bf16 is usually preferred over fp16 for training stability (same 8 exponent bits as fp32, so far fewer overflows)
  • Per-tensor vs per-channel quantization; why local scales matter
  • GPTQ vs AWQ — calibration/error-aware vs activation-aware
  • KV cache math; why serving cost changes with context and concurrency
  • MQA vs GQA vs standard MHA — memory-quality trade-off
  • PagedAttention as cache-memory management, not “new attention math”
  • LoRA low-rank decomposition and rank trade-offs
  • QLoRA as frozen 4-bit base + trainable adapters
  • Decision rule: prompt vs PEFT vs RAG
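The first bullet's formula, params × bytes_per_param, can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the function name `weight_memory_gb` is made up for this example, 1 GB is taken as 1e9 bytes, and the result ignores KV cache, activations, and the small overhead of quantization scales.

```python
# Bytes per parameter for common storage formats (int4 packs two weights per byte).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Raw weight memory in GB (1 GB = 1e9 bytes); weights only."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for size in (7e9, 13e9, 70e9):
    row = ", ".join(f"{d}={weight_memory_gb(size, d):.1f} GB" for d in BYTES_PER_PARAM)
    print(f"{size / 1e9:.0f}B: {row}")
```

Under these assumptions a 70B model needs about 140 GB in fp16 and about 35 GB in int4, which is why the fp16 version does not fit on a single 80 GB GPU.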

🧠 Mental models

  • Per-channel quantization: "Give each output channel its own ruler; one shared ruler crushes small channels."
  • GPTQ: "Compress the model, then sand down the worst reconstruction errors."
  • AWQ: "Protect the weights attached to the busiest neurons before shrinking everything else."
  • LoRA: "Bolt a small steering wheel onto a frozen truck instead of rebuilding the engine."
  • QLoRA: "Keep the base model zipped in 4-bit form and train only a lightweight patch layer."
  • Prompt vs PEFT vs RAG: "Choose between better instructions, a behavior patch, or an external memory."
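The "one shared ruler" picture can be checked with a toy example. The sketch below assumes symmetric absmax int8 quantization with round-to-nearest; the two channels and their values are invented for illustration, and real quantizers add details (grouping, zero points, clipping search) that are left out here.

```python
def absmax_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` to the int8 limit."""
    return max(abs(v) for v in values) / qmax


def quantize_dequantize(values, scale, qmax=127):
    """Symmetric round-to-nearest quantization, then mapping back to floats."""
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out


# Two hypothetical output channels: one with large weights, one with small weights.
big = [12.7, -6.35, 3.0]
small = [0.02, -0.03, 0.01]

# Per-tensor: a single scale is set by the largest weight across both channels.
shared_scale = absmax_scale(big + small)
small_per_tensor = quantize_dequantize(small, shared_scale)

# Per-channel: the small channel gets a scale fitted to its own range.
small_per_channel = quantize_dequantize(small, absmax_scale(small))

print("per-tensor :", small_per_tensor)   # every small weight rounds to zero
print("per-channel:", small_per_channel)  # values stay close to the originals
```

With the shared scale of roughly 0.1, every weight in the small channel falls below half a quantization step and collapses to zero, while the per-channel scale keeps each value within a fraction of a percent of its range.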

⚠️ Common traps

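  • Assuming INT4 automatically gives 4× faster inference; weight memory shrinks about 4× versus fp16, but end-to-end latency usually improves less because dequantization overhead and KV-cache traffic remain.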
  • Ignoring KV-cache growth, so a quantized model still OOMs under long context or high concurrency.
  • Confusing GPTQ and AWQ: both use calibration data, but GPTQ minimizes layer-wise reconstruction error after quantization, while AWQ uses activation statistics to find and protect the most salient weights.
  • Fine-tuning to inject frequently changing facts that belong in RAG, creating stale memorized knowledge.
  • Merging adapters too early and losing the ability to swap tasks, compare domains, or roll back bad tuning.
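The KV-cache trap above becomes concrete with a rough size estimate. The sketch assumes the usual accounting (one key and one value vector per layer, per KV head, per token, per sequence) and an fp16 cache at 2 bytes per element. The model shape (32 layers, 32 heads, head dimension 128) is a hypothetical chosen only for illustration, and `kv_cache_bytes` is not a library function.

```python
GIB = 1024 ** 3


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem=2):
    """KV cache size in bytes; the leading 2 accounts for storing both K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Hypothetical model shape, for illustration only.
N_LAYERS, N_HEADS, HEAD_DIM = 32, 32, 128

for name, n_kv_heads in (("MHA", N_HEADS), ("GQA (8 KV heads)", 8), ("MQA", 1)):
    for seq_len, batch in ((4096, 1), (4096, 16), (32768, 16)):
        size = kv_cache_bytes(N_LAYERS, n_kv_heads, HEAD_DIM, seq_len, batch)
        print(f"{name:18s} seq_len={seq_len:6d} batch={batch:3d} -> {size / GIB:8.2f} GiB")
```

With these assumed numbers, standard MHA needs 2 GiB of cache for a single 4096-token sequence, and the total grows linearly with both context length and concurrency. Weight quantization does not touch this term, while GQA and MQA shrink it by reducing the number of KV heads.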

🔗 Prerequisites & connections

Builds on: Module 05 (LLM Training Lifecycle), which covers model memory math, training stability, and why full fine-tuning is expensive.

Feeds into: Module 07 (RAG Fundamentals), where the prompt vs PEFT vs RAG decision boundary becomes the starting point for retrieval systems.

💬 Interview phrasing

  • "You have a 70B model that does not fit in fp16 on one GPU. Walk me through INT8, INT4, GPTQ, and AWQ options."
  • "Why would you choose QLoRA over full fine-tuning for a domain adaptation task?"
  • "A quantized model is cheap in VRAM at rest but still crashes with long prompts. What is consuming memory?"
  • "When is prompt engineering enough, when do you use LoRA, and when do you stop and build RAG instead?"

⏱️ Difficulty markers

  • 🟢 raw weight-memory math
  • 🟢 fp16 vs bf16 trade-offs
  • 🟡 per-tensor vs per-channel quantization
  • 🟡 LoRA rank trade-offs
  • 🔴 GPTQ vs AWQ calibration logic
  • 🔴 QLoRA, adapter merging, and serving implications

Self-check questions

For fuller answers and worked examples, see 02_explainer.md §1-§6.

  1. A 70B model in fp16 does not fit on an 80GB GPU. What are your options, and what trade-off does each imply? (explainer §1.1-§1.3)
  2. Why is bf16 usually better than fp16 for training, even though fp16 has more mantissa bits? (§2.3-§2.4)
  3. Show with numbers why per-tensor quantization can damage a small channel. (§3.2-§3.3)
  4. GPTQ vs AWQ — what signal is each method using to reduce quality loss? (§3.4-§3.5)
  5. Why can a quantized model still fail under long-context traffic? (§4.1-§4.2)
  6. MQA vs GQA — what memory do they reduce, and what do they trade away? (§4.3-§4.4)
  7. LoRA rank r — what do higher and lower ranks buy you? (§5.2-§5.3)
  8. When is fine-tuning the wrong tool, and RAG the right tool? (§5.6, §6.6)
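For question 7, the cost side of the rank trade-off is simple arithmetic. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors, B (d_out × r) and A (r × d_in), so the trainable count is r × (d_in + d_out). The sketch below assumes a single hypothetical 4096 × 4096 projection; the helper name is invented for this example.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two LoRA factors: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


D_IN = D_OUT = 4096  # hypothetical projection size, for illustration only
full = D_IN * D_OUT  # parameters in the frozen base matrix

for rank in (4, 8, 16, 64):
    n = lora_trainable_params(D_IN, D_OUT, rank)
    print(f"rank={rank:3d}: {n:>9,d} trainable vs {full:,d} frozen ({100 * n / full:.2f}%)")
```

Adapter size grows linearly with rank, so doubling r doubles the trainable parameters and adapter memory. Whether the extra capacity improves quality depends on the task, which is the judgment the question asks for.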

Health check

  • [ ] All 6 explainer sections read at least once
  • [ ] Can do raw weight-memory math for 7B, 13B, 70B from memory
  • [ ] Can explain KV cache without hand-waving
  • [ ] Can explain LoRA vs QLoRA cleanly
  • [ ] Assignment shipped with comparison table and decision memo
  • [ ] Ready to enter Module 07 with a clear “fine-tune vs RAG” boundary