05. Assignment 6: Quantize, Adapt, Decide

Week 6. Take a small open model, make it cheaper to serve, make it more useful for one domain, and justify the trade-offs in writing.

Required reading first: 02_explainer.md chapters 1-5 and 03_study_material.md §1-§7. If you cannot yet do the memory math or explain the prompt-vs-PEFT-vs-RAG distinction cleanly, do not start the experiment.

Goal

Build a mini evidence pack that answers three real engineering questions:

Should we quantize?
Should we fine-tune?
If we fine-tune, should we use LoRA / QLoRA instead of full fine-tuning?

Recommended setup

Choose one open model you can realistically run on your hardware:

- Llama 3.x 8B
- Mistral 7B
- Phi-3 mini / small
- Qwen 2.5 7B

If hardware is tight, use a 3B–8B model.

The point is the trade-off reasoning, not bragging rights.

Required experiment design

Part A: Baseline memory math

Produce a short note (memory_math.md) with:

- raw weight memory in fp16, int8, and int4
- expected serving-memory caveat for KV cache
- hardware assumption (GPU type / VRAM)

Cross-ref: 02_explainer.md §1.2, §2.5, §4.2.
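The arithmetic behind memory_math.md can be sketched as follows. This is a minimal illustration, not a required implementation. It assumes an 8B-parameter model, decimal gigabytes (1 GB = 10^9 bytes), and an fp16 KV cache. The architecture numbers in the example (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, so replace them with the values from your model's config. Real int4 artifacts also store scales and zero points, so measured size will be somewhat above the raw figure.

```python
# Bytes per parameter for each weight precision (raw weights only,
# ignoring quantization metadata such as scales and zero points).
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Raw weight memory in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch_size: int,
                bytes_per_elem: float = 2.0) -> float:
    """KV-cache size in decimal GB.

    The leading factor 2 accounts for storing both keys and values
    for every layer, KV head, and token position.
    """
    n_bytes = (2 * n_layers * n_kv_heads * head_dim
               * seq_len * batch_size * bytes_per_elem)
    return n_bytes / 1e9


if __name__ == "__main__":
    n_params = 8e9  # assumption: an 8B-parameter model
    for precision in BYTES_PER_PARAM:
        print(f"{precision}: {weight_memory_gb(n_params, precision):.1f} GB")
    # Assumed architecture and workload: 32 layers, 8 KV heads,
    # head_dim 128, 8192-token context, batch size 1.
    print(f"KV cache: {kv_cache_gb(32, 8, 128, 8192, 1):.2f} GB")
```

The KV-cache term grows linearly with both context length and batch size, which is why weight memory alone understates serving memory.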

Part B: Base model benchmark

Run the unmodified base model on a small eval set.

Minimum deliverables:

- latency and throughput (tokens/sec)
- rough GPU memory usage
- quality score on your chosen task
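One way to collect the latency and throughput numbers is a small timing harness like the sketch below. It is framework-agnostic by design: `generate_fn` is a hypothetical callable that you would wrap around your own model and that returns the number of tokens it generated. `fake_generate` is only a stand-in so the sketch runs without a model.

```python
import time
from statistics import mean
from typing import Callable, Iterable


def benchmark(generate_fn: Callable[[str], int],
              prompts: Iterable[str],
              warmup: int = 1) -> dict:
    """Time generate_fn over prompts and report latency and throughput.

    generate_fn must return the number of tokens it generated.
    """
    prompts = list(prompts)
    # Warm-up calls are not timed; they absorb one-off startup costs.
    for prompt in prompts[:warmup]:
        generate_fn(prompt)

    latencies = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        n_tokens = generate_fn(prompt)
        latencies.append(time.perf_counter() - start)
        total_tokens += n_tokens

    total_time = sum(latencies)
    return {
        "mean_latency_s": mean(latencies),
        "tokens_per_sec": total_tokens / total_time,
        "total_tokens": total_tokens,
    }


def fake_generate(prompt: str) -> int:
    """Hypothetical stand-in for a real model call."""
    time.sleep(0.01)
    return 20


if __name__ == "__main__":
    print(benchmark(fake_generate, ["q1", "q2", "q3"]))
```

For the rough GPU memory figure, assuming PyTorch on CUDA, one option is to call `torch.cuda.reset_peak_memory_stats()` before the run and read `torch.cuda.max_memory_allocated()` afterwards.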

Part C: Quantized benchmark

Quantize the model using one of:

- GPTQ
- AWQ
- GGUF-based local quantized artifact (if you are testing local serving rather than GPU-server deployment)

Re-run the same eval.

Document:

- memory change
- speed change
- quality change

Cross-ref: 02_explainer.md §3.3-§3.7.
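To report the three changes consistently, it can help to express each as a percent change relative to the base run. The sketch below assumes both runs are stored as dictionaries with the same keys; the key names and the numbers in the example are placeholders, not measured results.

```python
def relative_change(base: float, new: float) -> float:
    """Percent change from base to new (negative means a decrease)."""
    if base == 0:
        raise ValueError("base value must be non-zero")
    return (new - base) / base * 100.0


def compare_runs(base: dict, other: dict) -> dict:
    """Percent change for every metric present in the base run."""
    return {name: relative_change(base[name], other[name]) for name in base}


if __name__ == "__main__":
    # Placeholder numbers for illustration only; use your own measurements.
    base_run = {"memory_gb": 16.0, "tokens_per_sec": 40.0, "quality": 0.80}
    quant_run = {"memory_gb": 4.0, "tokens_per_sec": 50.0, "quality": 0.76}
    for name, pct in compare_runs(base_run, quant_run).items():
        print(f"{name}: {pct:+.1f}%")
```

Reporting all three deltas side by side keeps the quality cost visible next to the memory and speed gains.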

Part D: LoRA or QLoRA adaptation

Create a small domain dataset (at least 50 examples, ideally closer to 200) and fine-tune with:

- LoRA if fp16/bf16 training is feasible on your hardware
- QLoRA if you need the compressed base to make training feasible

Re-run the same eval.

Document:

- training setup
- trainable parameter count
- memory considerations
- before/after quality change

Cross-ref: 02_explainer.md §5.1-§5.5.
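The trainable parameter count can be sanity-checked by hand before you trust what a training library reports. LoRA adds two low-rank matrices per adapted weight: for a weight of shape (d_out, d_in) and rank r, that is r × (d_in + d_out) new parameters. The sketch below assumes 32 layers, rank 8, and two adapted 4096 × 4096 projections per layer. These shapes are illustrative assumptions, so check your model's actual projection sizes (models with grouped-query attention have smaller key/value projections).

```python
from typing import Iterable, Tuple


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def total_lora_params(target_shapes: Iterable[Tuple[int, int]],
                      n_layers: int, rank: int) -> int:
    """Total trainable LoRA parameters across all layers.

    target_shapes lists (d_in, d_out) for each adapted matrix in one layer.
    """
    per_layer = sum(lora_params(d_in, d_out, rank)
                    for d_in, d_out in target_shapes)
    return per_layer * n_layers


if __name__ == "__main__":
    # Assumed setup: two 4096x4096 projections adapted per layer.
    shapes = [(4096, 4096), (4096, 4096)]
    trainable = total_lora_params(shapes, n_layers=32, rank=8)
    print(f"trainable LoRA parameters: {trainable:,}")
```

Comparing this number against the base model's total parameter count gives the trainable fraction to record in your training setup notes.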

Part E: Decision memo

Write decision_memo.md answering:

1. Should this use case ship with base, quantized base, LoRA, or QLoRA?
2. Which metric mattered most: quality, latency, memory, or cost?
3. If the use case needed fresh private knowledge next week, would you fine-tune again or switch to RAG?

Cross-ref: 02_explainer.md §5.6, §6.6.

Eval choices

Choose one:

- custom eval set (30–50 domain questions)
- summarization task with ROUGE / judge score
- classification / extraction task with accuracy / F1
- support-bot style FAQ set with exact-format grading
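For the classification or exact-format options, the scoring itself is small enough to write by hand, which keeps eval.py repeatable and dependency-free. The sketch below assumes predictions and gold labels are parallel lists of strings and that F1 is computed for a single positive label. The labels in the example are made up.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and equal length")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def f1_score(preds: Sequence[str], golds: Sequence[str], positive: str) -> float:
    """Binary F1 for the given positive label."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up labels for illustration only.
    preds = ["yes", "no", "yes", "yes"]
    golds = ["yes", "no", "no", "yes"]
    print(f"accuracy: {accuracy(preds, golds):.2f}")
    print(f"F1 (positive='yes'): {f1_score(preds, golds, 'yes'):.2f}")
```

Whatever metric you choose, run the same scoring code on the base, quantized, and adapted models so the numbers are comparable.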

Required deliverables

memory_math.md
benchmark_results.csv or benchmark_results.md
train.py or notebook for LoRA/QLoRA run
eval.py or notebook for repeatable evaluation
decision_memo.md
README.md with setup, hardware, results, and conclusion

Required comparison table

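| Config | Weight memory | KV-cache note | Tokens/sec | Quality score | Training cost | Recommendation |
| --- | --- | --- | --- | --- | --- | --- |
| Base fp16/bf16 | | | | | N/A | |
| Quantized (int8 or int4) | | | | | N/A | |
| LoRA or QLoRA | | | | | | |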
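If you keep results in benchmark_results.csv, generating the file from code keeps the columns aligned with the table above. The sketch below uses Python's standard `csv` module. The function name `results_to_csv` is hypothetical, and the rows contain only the fixed cells from the template, with measured values left blank for you to fill in.

```python
import csv
import io
from typing import Dict, List

# Column names taken from the required comparison table.
COLUMNS = ["Config", "Weight memory", "KV-cache note", "Tokens/sec",
           "Quality score", "Training cost", "Recommendation"]


def results_to_csv(rows: List[Dict[str, str]]) -> str:
    """Render result rows as CSV text; missing cells are left empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    template_rows = [
        {"Config": "Base fp16/bf16", "Training cost": "N/A"},
        {"Config": "Quantized (int8 or int4)", "Training cost": "N/A"},
        {"Config": "LoRA or QLoRA"},
    ]
    # newline="" avoids doubled line endings when writing CSV text to a file.
    with open("benchmark_results.csv", "w", newline="") as f:
        f.write(results_to_csv(template_rows))
```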

Success criteria

Raw memory math is correct and explained clearly
Quantization trade-off is measured, not guessed
Adapter method is justified by hardware reality
Decision memo cleanly distinguishes prompt vs PEFT vs RAG
README has a plain-English “when to use which” conclusion

Common pitfalls

Counting only weight memory and ignoring KV cache (explainer §4.1-§4.2)
Treating GGUF as a quantization method rather than a serving artifact / format (explainer §3.7)
Claiming int4 is “basically free” without eval evidence (explainer §3.6)
Fine-tuning to solve a freshness problem that should be RAG (explainer §5.6)
Using too tiny or too noisy a dataset and then overclaiming the result

Stretch goal

Add a small section to decision_memo.md titled:

“If this system moved to a private-doc setting tomorrow, what would Module 07 change?”

That forces the bridge from fine-tuning to RAG.

Why this hands-on lab matters

A Lead AI Engineer is asked versions of the same questions again and again:

Can we make it fit?
Can we make it cheaper?
Can we make it better for our domain?
Do we really need to retrain, or do we just need better retrieval?

This hands-on lab gives you the habit that matters: measure first, then decide.