Assignment 6 — Quantize, Adapt, Decide

This folder implements a runnable local evidence pack for Module 06.

Files

  • memory_math.md — raw weight-memory math and KV-cache caveat
  • benchmark_results.md — comparison table companion
  • train.py — LoRA training script
  • eval.py — repeatable benchmark for base, local INT8 stand-in, and LoRA adapter
  • decision_memo.md — prompt vs PEFT vs RAG recommendation
  • config.yaml — model, dataset, and LoRA settings
  • data/train.jsonl and data/eval.jsonl — small domain dataset

What this workspace does

It gives you a local, runnable version of the Week 6 decision loop:

  1. calculate memory math
  2. benchmark the base model
  3. benchmark a quantized stand-in
  4. adapt with LoRA
  5. compare results and write the recommendation
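Step 1 can be sketched as a few lines of arithmetic. The snippet below is a minimal illustration, not the contents of memory_math.md: the function name, the nominal bytes-per-parameter table, the decimal GB convention (1 GB = 1e9 bytes), and the 124M parameter count (roughly the smallest GPT-2 checkpoint) are all assumptions chosen for the example.

```python
# Nominal bytes per parameter (assumed values; real quantized formats add
# some overhead for scales and zero-points that is ignored here).
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: int, precision: str) -> float:
    """Raw weight memory in GB (1 GB = 1e9 bytes, an assumed convention).

    Covers stored weights only: no activations, no KV cache.
    """
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    illustrative_params = 124_000_000  # hypothetical model size for the demo
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(illustrative_params, precision)
        print(f"{precision}: {gb:.3f} GB")
```

The same function applies to any parameter count; only the precision table changes when a different storage format is benchmarked.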

Important scope note

The hands_on_lab spec asks for GPTQ, AWQ, or GGUF-style quantization evidence. This workspace does not claim to replace that.

Instead, it uses a local INT8 smoke quantizer for GPT-2-style layers so the benchmark/eval path can be validated in this environment. For the real hands_on_lab, keep the same evaluation flow and swap in:

  • a GPTQ artifact,
  • an AWQ artifact,
  • or a GGUF-served model you can benchmark consistently.
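For intuition about what an INT8 stand-in does to a weight tensor, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It illustrates the general technique only; it is an assumption, not a description of how the quantizer in this workspace is written, and the function names are hypothetical. GPTQ, AWQ, and GGUF formats use more elaborate schemes.

```python
from __future__ import annotations


def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor INT8 quantization: w is approximated by q * scale."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the symmetric INT8 range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the INT8 codes."""
    return [v * scale for v in q]


if __name__ == "__main__":
    original = [0.5, -1.0, 0.25, 0.0]  # toy weights, not real model data
    codes, scale = quantize_int8(original)
    restored = dequantize_int8(codes, scale)
    worst = max(abs(a - b) for a, b in zip(original, restored))
    print(f"scale={scale:.6f} worst-case error={worst:.6f}")
```

Each weight is stored in one byte plus a shared scale, and the rounding error per weight is bounded by half the scale.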

Commands

python3 train.py --config config.yaml --max-train-samples 8 --max-eval-samples 4
python3 eval.py --config config.yaml --adapter-path outputs/lora_adapter --max-eval-samples 4
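A repeatable tokens/sec measurement can be sketched as follows. This is a hedged illustration of the idea, not the code in eval.py: `tokens_per_second` and `fake_generate` are hypothetical names, and the `generate` callable is assumed to return the number of tokens it produced for a prompt.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(
    generate: Callable[[str], int],
    prompts: Sequence[str],
    warmup: int = 1,
    repeats: int = 3,
) -> float:
    """Total generated tokens divided by wall-clock time over all timed runs."""
    # Warm-up passes are excluded from timing so one-off setup cost is not counted.
    for _ in range(warmup):
        for prompt in prompts:
            generate(prompt)

    total_tokens = 0
    start = time.perf_counter()
    for _ in range(repeats):
        for prompt in prompts:
            total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure")
    return total_tokens / elapsed


def fake_generate(prompt: str) -> int:
    """Stand-in for a model call (hypothetical): sleeps, then reports 16 tokens."""
    time.sleep(0.001)
    return 16


if __name__ == "__main__":
    rate = tokens_per_second(fake_generate, ["hello", "world"])
    print(f"{rate:.1f} tokens/sec")
```

Using the same prompts, warm-up count, and repeat count for the base, quantized, and adapter runs is what keeps the three numbers comparable.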

Smoke validation snapshot

The local tiny-model smoke path completed in this workspace.

  • base fp16-style weight memory: 0.000191 GB
  • local int8 smoke weight memory: 0.000189 GB
  • LoRA adapter runtime weight memory: 0.000385 GB
  • base tokens/sec: 2160.038
  • local int8 smoke tokens/sec: 1968.874
  • LoRA adapter tokens/sec: 2050.982

Interpretation:

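  • the local INT8 stand-in reduced stored weight memory only slightly (0.000191 GB to 0.000189 GB, about 1%)
  • it did not improve end-to-end CPU throughput here: tokens/sec fell from 2160.038 to 1968.874, about 9% below base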
  • the LoRA path trained and benchmarked successfully, which is the main workflow proof for this module
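The relative differences behind this interpretation can be recomputed from the snapshot values. The sketch below only restates the numbers listed above; the dictionary layout and the `percent_change` helper are illustrative choices, not part of the workspace scripts.

```python
def percent_change(baseline: float, value: float) -> float:
    """Signed change of value relative to baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# Values copied from the smoke validation snapshot.
snapshot = {
    "base": {"weight_gb": 0.000191, "tokens_per_sec": 2160.038},
    "int8_smoke": {"weight_gb": 0.000189, "tokens_per_sec": 1968.874},
    "lora_adapter": {"weight_gb": 0.000385, "tokens_per_sec": 2050.982},
}

if __name__ == "__main__":
    base = snapshot["base"]
    for name in ("int8_smoke", "lora_adapter"):
        run = snapshot[name]
        mem = percent_change(base["weight_gb"], run["weight_gb"])
        tps = percent_change(base["tokens_per_sec"], run["tokens_per_sec"])
        print(f"{name}: weight memory {mem:+.2f}%, tokens/sec {tps:+.2f}%")
```

These are single smoke-run figures on a tiny model, so the percentages are best read as a sanity check on the pipeline rather than as performance evidence.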

Expected conclusion pattern

  • quantization helps the model fit in memory and usually lowers serving cost
  • quantization does not erase KV-cache growth
  • LoRA helps behavior specialization
  • LoRA is not the right fix for fresh private knowledge
  • RAG becomes the better move when facts change faster than you want to retrain
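The KV-cache point can be made concrete with a rough size estimate. The sketch below assumes standard multi-head attention (no grouped-query sharing), a cache kept at 2 bytes per element regardless of how the weights are stored, and GPT-2-small-like dimensions (12 layers, 12 heads, head size 64); the function name and these defaults are illustrative assumptions.

```python
def kv_cache_bytes(
    n_layers: int,
    n_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_elem: int = 2,
) -> int:
    """Approximate KV-cache size: one key and one value vector per token per layer."""
    # The leading factor of 2 accounts for storing both keys and values.
    return 2 * n_layers * n_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    for seq_len in (256, 1024, 4096):
        size = kv_cache_bytes(n_layers=12, n_heads=12, head_dim=64, seq_len=seq_len)
        print(f"seq_len={seq_len}: {size / 1e9:.4f} GB")
```

Under these assumptions the cache grows linearly with sequence length and batch size, and quantizing the weights leaves this term unchanged unless the cache itself is stored at lower precision.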