01. Week 6: Quantization & Fine-Tuning

Key concepts to master

Raw weight memory: params × bytes_per_param
fp32 vs fp16 vs bf16 vs int8 vs int4 — what changes in practice
Why bf16 is usually preferred to fp16 for training stability
Per-tensor vs per-channel quantization; why local scales matter
GPTQ vs AWQ: error-compensating (minimizes layer output error on calibration data) vs activation-aware (protects weights that see large activations)
KV cache math; why serving cost changes with context and concurrency
MQA vs GQA vs standard MHA — memory-quality trade-off
PagedAttention as cache-memory management, not “new attention math”
LoRA low-rank decomposition and rank trade-offs
QLoRA as frozen 4-bit base + trainable adapters
Decision rule: prompt vs PEFT vs RAG
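
The raw weight-memory rule at the top of this list can be checked with a few lines of Python. This is a minimal sketch: the function name and the GB = 10^9 bytes convention are choices made here, and the result covers weights only. It ignores activations, KV cache, optimizer state, and the per-group scales that real int8/int4 formats store alongside the weights.

```python
# Bytes per parameter for common storage formats (int4 packs two weights per byte).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Raw weight memory in GB (10^9 bytes): params x bytes_per_param."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for billions in (7, 13, 70):
        row = ", ".join(
            f"{dtype}={weight_memory_gb(billions * 1e9, dtype):.1f} GB"
            for dtype in BYTES_PER_PARAM
        )
        print(f"{billions}B: {row}")
```

For a 70B model this gives 140 GB in fp16 and 35 GB in int4, which is why the same model can go from "needs several GPUs" to "fits on one 80 GB card" before any serving overhead is counted.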

🧠 Mental models

Per-channel quantization: "Give each channel its own ruler; one shared ruler crushes small channels."
GPTQ: "Compress the model, then sand down the worst reconstruction errors."
AWQ: "Protect the weights attached to the busiest neurons before shrinking everything else."
LoRA: "Bolt a small steering wheel onto a frozen truck instead of rebuilding the engine."
QLoRA: "Keep the base model zipped in 4-bit form and train only a lightweight patch layer."
Prompt vs PEFT vs RAG: "Choose between better instructions, a behavior patch, or an external memory."
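
The "shared ruler" picture for per-channel quantization can be made concrete with a toy example. The sketch below assumes symmetric round-to-nearest int8 quantization and two made-up weight channels; the values and helper names are illustrative, not taken from any real model or library.

```python
def quantize_dequantize(values, scale, qmax=127):
    """Symmetric round-to-nearest quantization followed by dequantization."""
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))  # integer code, clamped to int8 range
        out.append(q * scale)
    return out


def max_abs(values):
    return max(abs(v) for v in values)


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Two hypothetical output channels: one with large weights, one with small weights.
big_channel = [12.7, -8.0, 5.1, -12.0]
small_channel = [0.03, -0.02, 0.045, -0.01]

# Per-tensor: one scale shared by both channels, set by the largest weight overall.
shared_scale = max_abs(big_channel + small_channel) / 127
per_tensor_small = quantize_dequantize(small_channel, shared_scale)

# Per-channel: the small channel gets a scale fitted to its own range.
own_scale = max_abs(small_channel) / 127
per_channel_small = quantize_dequantize(small_channel, own_scale)

if __name__ == "__main__":
    print("per-tensor :", per_tensor_small, "max error", max_error(small_channel, per_tensor_small))
    print("per-channel:", per_channel_small, "max error", max_error(small_channel, per_channel_small))
```

With the shared scale of about 0.1, every weight in the small channel rounds to zero, so the channel is erased. With its own scale the same channel keeps all four values to within a fraction of a percent of its range.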

⚠️ Common traps

Assuming INT4 automatically gives 4× faster inference; weight memory shrinks roughly 4× relative to fp16, but end-to-end latency usually improves by less.
Ignoring KV-cache growth, so a quantized model still OOMs under long context or high concurrency.
Confusing GPTQ and AWQ: GPTQ minimizes post-quantization error; AWQ preserves activation-salient weights.
Fine-tuning to inject frequently changing facts that belong in RAG, creating stale memorized knowledge.
Merging adapters too early and losing the ability to swap tasks, compare domains, or roll back bad tuning.
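
The KV-cache trap above is easiest to see with numbers. The sketch below uses the standard per-token accounting (one K and one V vector per layer and per KV head) and a hypothetical 70B-class shape; the layer count, head counts, context length, and batch size are assumptions chosen for illustration, not the configuration of a specific model.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size in bytes: a K and a V vector per layer, KV head, and token."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical 70B-class shape (assumed for illustration, not a specific model).
LAYERS, HEAD_DIM = 80, 128
SEQ_LEN, BATCH = 8192, 16

if __name__ == "__main__":
    # MHA keeps one KV head per query head; GQA shares KV heads across groups; MQA keeps one.
    for name, kv_heads in (("MHA", 64), ("GQA", 8), ("MQA", 1)):
        gb = kv_cache_bytes(LAYERS, kv_heads, HEAD_DIM, SEQ_LEN, BATCH) / 1e9
        print(f"{name}: {gb:.1f} GB of fp16 KV cache")
```

Under these assumed numbers the MHA cache comes to roughly 344 GB at 16 concurrent 8K-token sequences, far more than the quantized weights, while GQA with 8 KV heads cuts it by 8×. Weight quantization does not touch this term, which is why a model that fits at rest can still run out of memory under load.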

🔗 Prerequisites & connections

Builds on: Module 05 (LLM Training Lifecycle), which covers model memory math, training stability, and why full fine-tuning is expensive.
Feeds into: Module 07 (RAG Fundamentals), where the prompt vs PEFT vs RAG decision boundary becomes the starting point for retrieval systems.

💬 Interview phrasing

"You have a 70B model that does not fit in fp16 on one GPU. Walk me through INT8, INT4, GPTQ, and AWQ options."
"Why would you choose QLoRA over full fine-tuning for a domain adaptation task?"
"A quantized model is cheap in VRAM at rest but still crashes with long prompts. What is consuming memory?"
"When is prompt engineering enough, when do you use LoRA, and when do you stop and build RAG instead?"

⏱️ Difficulty markers

🟢 raw weight-memory math
🟢 fp16 vs bf16 trade-offs
🟡 per-tensor vs per-channel quantization
🟡 LoRA rank trade-offs
🔴 GPTQ vs AWQ calibration logic
🔴 QLoRA, adapter merging, and serving implications
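
For the LoRA rank trade-off marked above, a quick parameter count shows what the rank controls. This is a sketch under the usual LoRA formulation, where a frozen weight of shape d_out × d_in gets a trainable update B·A with B of shape d_out × r and A of shape r × d_in. The hidden size of 4096 and the function names are assumptions for illustration.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole d_out x d_in weight matrix is trained."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters in the low-rank update B @ A (B: d_out x r, A: r x d_in)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size
    for rank in (4, 16, 64):
        share = lora_params(d, d, rank) / full_params(d, d)
        print(f"r={rank}: {lora_params(d, d, rank):,} trainable params ({share:.2%} of full)")
```

Trainable parameters grow linearly with r, so a higher rank buys adapter capacity at the cost of more memory and optimizer state, while a lower rank stays cheap but can underfit harder adaptations.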

Self-check questions

For fuller answers and worked examples, see 02_explainer.md §1-§6.

A 70B model in fp16 (about 140 GB of weights) does not fit on an 80 GB GPU. What are your options, and what trade-off does each imply? (§1.1-§1.3)
Why is bf16 usually better than fp16 for training, even though fp16 has more mantissa bits? (§2.3-§2.4)
Show with numbers why per-tensor quantization can damage a small channel. (§3.2-§3.3)
GPTQ vs AWQ — what signal is each method using to reduce quality loss? (§3.4-§3.5)
Why can a quantized model still fail under long-context traffic? (§4.1-§4.2)
MQA vs GQA — what memory do they reduce, and what do they trade away? (§4.3-§4.4)
LoRA rank r — what do higher and lower ranks buy you? (§5.2-§5.3)
When is fine-tuning the wrong tool, and RAG the right tool? (§5.6, §6.6)
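
As a warm-up for the bf16 vs fp16 question, the range/precision split can be observed with the Python standard library alone. fp16 has 5 exponent bits and 10 mantissa bits; bf16 has 8 exponent bits and 7 mantissa bits. The sketch below is an emulation, not how any framework implements these types: fp16 uses the `struct` module's half-precision format, and bf16 is approximated by truncating the fp32 encoding to its top 16 bits (real hardware typically rounds instead of truncating).

```python
import struct


def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; values beyond its range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping the top 16 bits of the fp32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


if __name__ == "__main__":
    for value in (70000.0, 1.001):
        print(f"{value}: fp16={to_fp16(value)}, bf16={to_bf16(value)}")
    # 70000.0 overflows fp16 (max finite value 65504) but stays finite in bf16 as 69632.0.
    # 1.001 keeps more digits in fp16 (1.0009765625) than in bf16 (1.0).
```

The first case is the training-stability story: a large gradient or activation overflows fp16 unless loss scaling is used, while bf16 absorbs it at the cost of coarser precision.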

Health check

[ ] All 6 explainer sections read at least once
[ ] Can do raw weight-memory math for 7B, 13B, 70B from memory
[ ] Can explain KV cache without hand-waving
[ ] Can explain LoRA vs QLoRA cleanly
[ ] Assignment shipped with comparison table and decision memo
[ ] Ready to enter Module 07 with a clear “fine-tune vs RAG” boundary