05. Assignment 6 — Quantize, Adapt, Decide

Week 6. Take a small open model, make it cheaper to serve, make it more useful for one domain, and justify the trade-offs in writing.

Required reading first: 02_explainer.md chapters 1-5 and 03_study_material.md §1-§7. If you cannot do the memory math and explain the prompt-vs-PEFT-vs-RAG distinction cleanly, do not start the experiment yet.

Goal

Build a mini evidence pack that answers three real engineering questions:

  1. Should we quantize?
  2. Should we fine-tune?
  3. If we fine-tune, should we use LoRA / QLoRA instead of full fine-tuning?

Choose one open model you can realistically run on your hardware:

  • Llama 3.x 8B
  • Mistral 7B
  • Phi-3 mini / small
  • Qwen 2.5 7B

If hardware is tight, use a 3B–8B model.

The point is the trade-off reasoning, not bragging rights.

Required experiment design

Part A — Baseline memory math

Produce a short note (memory_math.md) with:

  • raw weight memory in fp16, int8, and int4
  • expected serving-memory caveat for KV cache
  • hardware assumption (GPU type / VRAM)

Cross-ref: 02_explainer.md §1.2, §2.5, §4.2.
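The memory math for this part can be sketched in a few lines. This is a minimal estimate, not a measurement: it assumes weight memory is parameter count times bytes per parameter, and it ignores the per-group scale and zero-point overhead that real quantized formats add. The 8B parameter count and the architecture numbers (32 layers, 8 KV heads, head dimension 128) are hypothetical placeholders; substitute the values from your chosen model's config.

```python
# Rough memory math for Part A. All figures are estimates, not measurements.

BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
GIB = 1024 ** 3


def weight_memory_gib(n_params: float, dtype: str) -> float:
    """Raw weight memory in GiB: parameter count times bytes per parameter."""
    return n_params * BYTES_PER_PARAM[dtype] / GIB


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 seq_len: int, batch_size: int,
                 bytes_per_value: float = 2.0) -> float:
    """KV cache in GiB: keys and values (factor 2) per layer, per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size
    return values * bytes_per_value / GIB


if __name__ == "__main__":
    n_params = 8e9  # hypothetical 8B-parameter model
    for dtype in ("fp16", "int8", "int4"):
        print(f"{dtype}: {weight_memory_gib(n_params, dtype):.1f} GiB")
    # Hypothetical architecture: 32 layers, 8 KV heads, head_dim 128
    kv = kv_cache_gib(32, 8, 128, seq_len=8192, batch_size=1)
    print(f"KV cache at 8192 tokens, batch 1: {kv:.2f} GiB")
```

The KV-cache term grows linearly with both sequence length and batch size, which is why memory_math.md should state it as a caveat on top of the weight figures rather than fold it into them.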

Part B — Base model benchmark

Run the unmodified base model on a small eval set.

Minimum deliverables:

  • latency and throughput (tokens/sec)
  • rough GPU memory usage
  • quality score on your chosen task
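One way to collect the latency and throughput numbers is a small timing harness around whatever generation call you use. The sketch below is framework-agnostic: `generate` is a hypothetical callable that you supply, and it is assumed to run one prompt and return the number of newly generated tokens. The warm-up pass is a design choice so that one-time setup cost does not skew the timings.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def benchmark(generate: Callable[[str], int],
              prompts: Sequence[str],
              warmup: int = 1) -> dict:
    """Time generate() per prompt; generate must return the new-token count."""
    for prompt in prompts[:warmup]:
        generate(prompt)  # warm-up runs are not timed

    latencies = []
    total_tokens = 0
    for prompt in prompts:
        start = time.perf_counter()
        total_tokens += generate(prompt)
        latencies.append(time.perf_counter() - start)

    elapsed = sum(latencies)
    return {
        "mean_latency_s": mean(latencies),
        "tokens_per_sec": total_tokens / elapsed if elapsed > 0 else float("inf"),
        "total_tokens": total_tokens,
    }
```

Reuse the same prompts and the same harness for every configuration so the numbers stay comparable. If you run on PyTorch with a CUDA device, `torch.cuda.max_memory_allocated()` is one way to read a rough peak-memory figure after the run.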

Part C — Quantized benchmark

Quantize the model using one of:

  • GPTQ
  • AWQ
  • GGUF-based local quantized artifact (if you are testing local serving rather than GPU-server deployment)

Re-run the same eval.

Document:

  • memory change
  • speed change
  • quality change

Cross-ref: 02_explainer.md §3.3-§3.7.
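The three changes to document can be reported as percent deltas against the base run. The helper below is a minimal sketch; the metric names and the numbers in the usage example are placeholders for illustration only, not measured results.

```python
def relative_change(base: float, new: float) -> float:
    """Percent change from base to new; negative means the value went down."""
    if base == 0:
        raise ValueError("base must be non-zero")
    return (new - base) / base * 100.0


def compare(base: dict, quantized: dict) -> dict:
    """Percent change for every metric present in the base run."""
    return {m: relative_change(base[m], quantized[m]) for m in base}


if __name__ == "__main__":
    # Placeholder values, NOT measurements: replace with your own runs.
    base_run = {"memory_gib": 16.0, "tokens_per_sec": 40.0, "quality": 0.80}
    quant_run = {"memory_gib": 4.0, "tokens_per_sec": 50.0, "quality": 0.76}
    for metric, delta in compare(base_run, quant_run).items():
        print(f"{metric}: {delta:+.1f}%")
```

Reporting all three deltas side by side makes the trade-off explicit: a large memory saving only counts as a win if the quality delta stays within what the use case tolerates.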

Part D — LoRA or QLoRA adaptation

Create a small domain dataset (at least 50 examples; 50–200 is a reasonable range) and fine-tune with:

  • LoRA if fp16/bf16 training is feasible on your hardware
  • QLoRA if you need the compressed base to make training feasible

Re-run the same eval.

Document:

  • training setup
  • trainable parameter count
  • memory considerations
  • before/after quality change

Cross-ref: 02_explainer.md §5.1-§5.5.
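The trainable parameter count can be sanity-checked by hand before training. For a weight matrix of shape d_out × d_in, a rank-r LoRA adapter adds two low-rank factors with r × (d_in + d_out) parameters in total. The sketch below applies that formula; the layer count, hidden size, and choice of two adapted square projections per layer are hypothetical assumptions for illustration, so compare the result against what your training library reports for your actual model.

```python
from typing import Iterable, Tuple


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in one LoRA adapter: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


def total_lora_params(shapes: Iterable[Tuple[int, int]],
                      rank: int, n_layers: int) -> int:
    """Sum adapters over the adapted matrices in a layer, times layer count."""
    per_layer = sum(lora_params(d_in, d_out, rank) for d_in, d_out in shapes)
    return n_layers * per_layer


if __name__ == "__main__":
    # Hypothetical setup: 32 layers, two 4096x4096 projections adapted per layer.
    adapted = [(4096, 4096), (4096, 4096)]
    trainable = total_lora_params(adapted, rank=8, n_layers=32)
    base = 8e9  # hypothetical base parameter count
    print(f"trainable: {trainable:,} ({trainable / base:.4%} of base)")
```

Under these assumed shapes the adapter is a small fraction of one percent of the base model, which is the reason adapter training fits on hardware where full fine-tuning does not.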

Part E — Decision memo

Write decision_memo.md answering:

  1. Should this use case ship with base, quantized base, LoRA, or QLoRA?
  2. Which metric mattered most: quality, latency, memory, or cost?
  3. If the use case needed fresh private knowledge next week, would you fine-tune again or switch to RAG?

Cross-ref: 02_explainer.md §5.6, §6.6.

Eval choices

Choose one:

  • custom eval set (30–50 domain questions)
  • summarization task with ROUGE / judge score
  • classification / extraction task with accuracy / F1
  • support-bot style FAQ set with exact-format grading
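For the classification or exact-format options, the scoring itself is simple enough to write by hand, which keeps eval.py repeatable and free of extra dependencies. The following is a minimal sketch assuming predictions and gold labels are already normalized strings of equal length; the function names are illustrative, not part of any library.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and equal length")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def f1_for_label(preds: Sequence[str], golds: Sequence[str], label: str) -> float:
    """F1 for one label: harmonic mean of precision and recall."""
    pairs = list(zip(preds, golds))
    tp = sum(p == label and g == label for p, g in pairs)
    fp = sum(p == label and g != label for p, g in pairs)
    fn = sum(p != label and g == label for p, g in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With an eval set of only 30 to 50 items, a single flipped answer moves accuracy by two to three points, so treat small differences between configurations as noise unless they repeat across runs.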

Required deliverables

  1. memory_math.md
  2. benchmark_results.csv or benchmark_results.md
  3. train.py or notebook for LoRA/QLoRA run
  4. eval.py or notebook for repeatable evaluation
  5. decision_memo.md
  6. README.md with setup, hardware, results, and conclusion

Required comparison table

| Config | Weight memory | KV-cache note | Tokens/sec | Quality score | Training cost | Recommendation |
|---|---|---|---|---|---|---|
| Base fp16/bf16 | | | | | N/A | |
| Quantized (int8 or int4) | | | | | N/A | |
| LoRA or QLoRA | | | | | | |
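The columns of the comparison table can double as the header of benchmark_results.csv, so the table and the raw results never drift apart. The sketch below uses only the standard library; the snake_case column names are an assumed mapping of the table headings, and cells you have not filled yet are written as empty strings.

```python
import csv
import io
from typing import Iterable, TextIO

# Assumed snake_case names for the comparison-table columns.
COLUMNS = [
    "config", "weight_memory_gib", "kv_cache_note", "tokens_per_sec",
    "quality_score", "training_cost", "recommendation",
]


def write_results(rows: Iterable[dict], out: TextIO) -> None:
    """Write one CSV row per configuration; missing cells stay empty."""
    writer = csv.DictWriter(out, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)  # raises ValueError on unknown column names


def results_to_csv_text(rows: Iterable[dict]) -> str:
    """Same output as write_results, returned as a string."""
    buffer = io.StringIO()
    write_results(rows, buffer)
    return buffer.getvalue()
```

To produce the deliverable, open the file with `open("benchmark_results.csv", "w", newline="")` and pass the handle to `write_results`. Rejecting unknown column names is deliberate: a typo in a metric name fails loudly instead of silently dropping a measurement.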

Success criteria

  • Raw memory math is correct and explained clearly
  • Quantization trade-off is measured, not guessed
  • Adapter method is justified by hardware reality
  • Decision memo cleanly distinguishes prompting vs PEFT vs RAG
  • README has a plain-English “when to use which” conclusion

Common pitfalls

  • Counting only weight memory and ignoring KV cache (explainer §4.1-§4.2)
  • Treating GGUF as a quantization method rather than a serving artifact / format (explainer §3.7)
  • Claiming int4 is “basically free” without eval evidence (explainer §3.6)
  • Fine-tuning to solve a freshness problem that should be RAG (explainer §5.6)
  • Using too tiny or too noisy a dataset and then overclaiming the result

Stretch goal

Add a small section to decision_memo.md titled:

“If this system moved to a private-doc setting tomorrow, what would Module 07 change?”

That forces the bridge from fine-tuning to RAG.

Why this hands-on lab matters

A Lead AI Engineer is asked versions of the same question again and again:

  • Can we make it fit?
  • Can we make it cheaper?
  • Can we make it better for our domain?
  • Do we really need to retrain, or do we just need better retrieval?

This hands-on lab gives you the habit that matters: measure first, then decide.