05. Assignment 6: Quantize, Adapt, Decide
Week 6. Take a small open model, make it cheaper to serve, make it more useful for one domain, and justify the trade-offs in writing.
Required reading first:
02_explainer.md chapters 1-5 and 03_study_material.md §1-§7. If you cannot do the memory math and explain the prompt-vs-PEFT-vs-RAG distinction cleanly, do not start the experiment yet.
Goal
Build a mini evidence pack that answers three real engineering questions:
- Should we quantize?
- Should we fine-tune?
- If we fine-tune, should we use LoRA / QLoRA instead of full fine-tuning?
Recommended setup
Choose one open model you can realistically run on your hardware:
- Llama 3.x 8B
- Mistral 7B
- Phi-3 mini / small
- Qwen 2.5 7B
If hardware is tight, pick from the smaller end of the 3B–8B range.
The point is the trade-off reasoning, not bragging rights.
Required experiment design
Part A: Baseline memory math
Produce a short note (memory_math.md) with:
- raw weight memory in fp16, int8, and int4
- a serving-memory caveat: the KV cache comes on top of weight memory and grows with context length and batch size
- hardware assumption (GPU type / VRAM)
Cross-ref: 02_explainer.md §1.2, §2.5, §4.2.
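The arithmetic behind this note can be sketched in a few lines. This is a minimal sketch, not a definitive calculator: the 8B parameter count and the KV-cache shape (32 layers, 8 KV heads, head dimension 128, fp16 cache) are illustrative assumptions, and the function names are hypothetical. Read the real values from your chosen model's config. Real quantized artifacts also store scales and zero points, so treat the int8 and int4 figures as lower bounds.

```python
GIB = 1024 ** 3

# Bytes needed to store one parameter at each precision (raw weights only).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Raw weight memory: parameter count times bytes per parameter."""
    return n_params * BYTES_PER_PARAM[dtype] / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch_size: int,
                 bytes_per_elem: float = 2.0) -> float:
    """KV-cache size: keys and values (factor 2) for every layer and token."""
    elems = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return elems * bytes_per_elem / GIB


# Assumed example values; replace with your model's actual numbers.
n_params = 8e9
for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gib(n_params, dtype):.2f} GiB of weights")

kv = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128,
                  seq_len=4096, batch_size=1)
print(f"KV cache at 4096 tokens, batch 1: {kv:.2f} GiB")
```

Re-running `kv_cache_gib` with a larger `seq_len` or `batch_size` shows why weight memory alone understates what serving needs.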
Part B: Base model benchmark
Run the unmodified base model on a small eval set.
Minimum deliverables:
- latency / tokens per second
- rough GPU memory usage
- quality score on your chosen task
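One way to collect the latency and throughput numbers is a small timing harness that is independent of the serving stack. The sketch below assumes a hypothetical `generate` callable that runs one prompt through your model and returns the number of newly generated tokens; `fake_generate` is a stand-in so the example runs without a model.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def benchmark(generate: Callable[[str], int], prompts: Sequence[str],
              warmup: int = 1) -> dict:
    """Time `generate` over `prompts` and report latency and throughput."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up calls are not timed

    latencies = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        total_tokens += generate(prompt)
        latencies.append(time.perf_counter() - start)

    total_time = sum(latencies)
    return {
        "mean_latency_s": mean(latencies),
        "tokens_per_sec": total_tokens / total_time,
        "total_tokens": total_tokens,
    }


def fake_generate(prompt: str) -> int:
    """Stand-in for a real model call (assumption for illustration only)."""
    time.sleep(0.01)
    return 20


result = benchmark(fake_generate, ["q1", "q2", "q3"])
print(result)
```

Use the same harness and the same prompts for every configuration so the numbers stay comparable. If you run on PyTorch with CUDA, `torch.cuda.max_memory_allocated()` is one way to read peak allocated GPU memory for the "rough GPU memory usage" deliverable.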
Part C: Quantized benchmark
Quantize the model using one of:
- GPTQ
- AWQ
- GGUF-based local quantized artifact (if you are testing local serving rather than GPU-server deployment)
Re-run the same eval.
Document:
- memory change
- speed change
- quality change
Cross-ref: 02_explainer.md §3.3-§3.7.
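The three changes to document can be reported as signed relative deltas against the base run. The sketch below is one possible way to do that; the metric names are hypothetical and the numbers are placeholders for illustration, not measurements from any model.

```python
def relative_change(base: float, new: float) -> float:
    """Signed fractional change from base to new (-0.5 means halved)."""
    if base == 0:
        raise ValueError("base value must be non-zero")
    return (new - base) / base


def compare_runs(base: dict, other: dict) -> dict:
    """Relative change for every metric present in the base run."""
    return {key: relative_change(base[key], other[key]) for key in base}


# Placeholder values only; substitute your own measured results.
base_run = {"gpu_mem_gib": 16.0, "tokens_per_sec": 40.0, "quality": 0.80}
quant_run = {"gpu_mem_gib": 6.0, "tokens_per_sec": 50.0, "quality": 0.76}

deltas = compare_runs(base_run, quant_run)
for metric, change in deltas.items():
    print(f"{metric}: {change:+.1%}")
```

Reporting all three deltas side by side keeps the trade-off visible: a memory saving only counts as a win if the quality delta is acceptable for the use case.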
Part D: LoRA or QLoRA adaptation
Create a small domain dataset (minimum 50–200 examples) and fine-tune with:
- LoRA if fp16/bf16 training is feasible on your hardware
- QLoRA if you need the compressed base to make training feasible
Re-run the same eval.
Document:
- training setup
- trainable parameter count
- memory considerations
- before/after quality change
Cross-ref: 02_explainer.md §5.1-§5.5.
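The trainable parameter count can be estimated before training. For a frozen weight of shape d_out x d_in, LoRA adds two low-rank factors, A (rank x d_in) and B (d_out x rank), so each adapted matrix contributes rank * (d_in + d_out) trainable parameters. The sketch below assumes two square 4096 x 4096 projections per layer, 32 layers, and rank 8; these shapes are illustrative assumptions, so read the real ones from your model, and the function names are hypothetical.

```python
from typing import Iterable, Tuple


def lora_params_per_matrix(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


def lora_trainable_params(shapes: Iterable[Tuple[int, int]],
                          rank: int, n_layers: int) -> int:
    """Total LoRA parameters when the same matrices are adapted in every layer."""
    per_layer = sum(lora_params_per_matrix(d_in, d_out, rank)
                    for d_in, d_out in shapes)
    return n_layers * per_layer


# Assumed shapes: two adapted projections per layer, each 4096 x 4096.
adapted_shapes = [(4096, 4096), (4096, 4096)]
trainable = lora_trainable_params(adapted_shapes, rank=8, n_layers=32)
base_params = 8e9  # assumed base model size

print(f"trainable LoRA parameters: {trainable:,}")
print(f"fraction of base model: {trainable / base_params:.4%}")
```

Compare this estimate with the count your training framework reports; a large mismatch usually means the adapted modules or shapes differ from what you assumed.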
Part E: Decision memo
Write decision_memo.md answering:
1. Should this use case ship with base, quantized base, LoRA, or QLoRA?
2. Which metric mattered most: quality, latency, memory, or cost?
3. If the use case needed fresh private knowledge next week, would you fine-tune again or switch to RAG?
Cross-ref: 02_explainer.md §5.6, §6.6.
Eval choices
Choose one:
- custom eval set (30–50 domain questions)
- summarization task with ROUGE / judge score
- classification / extraction task with accuracy / F1
- support-bot style FAQ set with exact-format grading
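For the classification or exact-format options, scoring can stay very simple. The sketch below shows exact-match accuracy and a per-label F1 in plain Python; the labels and predictions are made up for illustration, and the function names are hypothetical. Whatever metric you choose, keep the scoring code fixed across the base, quantized, and adapted runs.

```python
from typing import Sequence


def exact_match_accuracy(predictions: Sequence[str],
                         references: Sequence[str]) -> float:
    """Share of predictions equal to the reference after trimming whitespace."""
    if len(predictions) != len(references) or not references:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


def f1_for_label(predictions: Sequence[str], references: Sequence[str],
                 positive: str) -> float:
    """F1 score treating `positive` as the target class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example data for illustration only.
preds = ["refund", "billing", "refund", "other"]
refs = ["refund", "refund", "refund", "other"]

accuracy = exact_match_accuracy(preds, refs)
refund_f1 = f1_for_label(preds, refs, positive="refund")
print(f"accuracy={accuracy:.2f} refund_f1={refund_f1:.2f}")
```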
Required deliverables
- memory_math.md
- benchmark_results.csv or benchmark_results.md
- train.py or notebook for LoRA/QLoRA run
- eval.py or notebook for repeatable evaluation
- decision_memo.md
- README.md with setup, hardware, results, and conclusion
Required comparison table
| Config | Weight memory | KV-cache note | Tokens/sec | Quality score | Training cost | Recommendation |
|---|---|---|---|---|---|---|
| Base fp16/bf16 | | | | | N/A | |
| Quantized (int8 or int4) | | | | | N/A | |
| LoRA or QLoRA | | | | | | |
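If you keep results in benchmark_results.csv, the same columns can be written with the standard library. This is a minimal sketch: the snake_case column names are hypothetical, and placing N/A under training cost for the two untrained configurations is an assumption about the template. Cells left empty here are for you to fill with measured values.

```python
import csv
import io

# Hypothetical column names mirroring the comparison table headers.
COLUMNS = ["config", "weight_memory_gib", "kv_cache_note", "tokens_per_sec",
           "quality_score", "training_cost", "recommendation"]


def write_results(rows, handle) -> None:
    """Write one CSV row per configuration, leaving unknown cells empty."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        writer.writerow({col: row.get(col, "") for col in COLUMNS})


# Skeleton rows only; no measurements are filled in.
rows = [
    {"config": "Base fp16/bf16", "training_cost": "N/A"},
    {"config": "Quantized (int8 or int4)", "training_cost": "N/A"},
    {"config": "LoRA or QLoRA"},
]

buffer = io.StringIO()  # swap for open("benchmark_results.csv", "w", newline="")
write_results(rows, buffer)
print(buffer.getvalue())
```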
Success criteria
- Raw memory math is correct and explained clearly
- Quantization trade-off is measured, not guessed
- Adapter method is justified by hardware reality
- Decision memo cleanly distinguishes prompt vs PEFT vs RAG
- README has a plain-English “when to use which” conclusion
Common pitfalls
- Counting only weight memory and ignoring KV cache (explainer §4.1-§4.2)
- Treating GGUF as a quantization method rather than a serving artifact / format (explainer §3.7)
- Claiming int4 is “basically free” without eval evidence (explainer §3.6)
- Fine-tuning to solve a freshness problem that should be RAG (explainer §5.6)
- Using too tiny or too noisy a dataset and then overclaiming the result
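To make the last pitfall concrete, a confidence interval shows how little a 30–50 question eval can resolve. The sketch below uses the Wilson score interval for a proportion, which is one common choice and not something the assignment prescribes; the 30-of-40 result is a made-up example.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width


# Made-up example: 30 correct answers out of 40 eval questions.
low, high = wilson_interval(30, 40)
print(f"accuracy 0.75, approx. 95% interval: {low:.2f} to {high:.2f}")
```

With 40 questions the interval spans roughly 0.60 to 0.86, so a before/after difference of a few points is within noise. State that limit in the memo instead of claiming an improvement the data cannot support.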
Stretch goal
Add a small section to decision_memo.md titled:
“If this system moved to a private-doc setting tomorrow, what would Module 07 change?”
That forces the bridge from fine-tuning to RAG.
Why this hands-on lab matters
A Lead AI Engineer is asked versions of the same question again and again:
- Can we make it fit?
- Can we make it cheaper?
- Can we make it better for our domain?
- Do we really need to retrain, or do we just need better retrieval?
This hands-on lab gives you the habit that matters: measure first, then decide.