Assignment 6 — Quantize, Adapt, Decide

This folder implements a runnable local evidence pack for Module 06.

Files

memory_math.md — raw weight-memory math and KV-cache caveat
benchmark_results.md — comparison table companion
train.py — LoRA training script
eval.py — repeatable benchmark for base, local INT8 stand-in, and LoRA adapter
decision_memo.md — prompt vs PEFT vs RAG recommendation
config.yaml — model, dataset, and LoRA settings
data/train.jsonl and data/eval.jsonl — small domain dataset

What this workspace does

It gives you a local, runnable version of the Week 6 decision loop:

calculate memory math
benchmark the base model
benchmark a quantized stand-in
adapt with LoRA
compare results and write the recommendation
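
The first step of the loop, raw weight-memory math, is parameter count times bytes per parameter. The sketch below is illustrative only: the function name, the dtype table, and the 124M parameter count (roughly GPT-2 small) are assumptions made for the example, not values taken from memory_math.md or config.yaml.

```python
# Bytes per parameter for common storage formats.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: int, dtype: str) -> float:
    """Raw weight memory in GB (1 GB = 1e9 bytes).

    Counts stored weights only; activations, KV cache, and any
    quantization metadata (scales, zero points) are ignored.
    """
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Hypothetical 124M-parameter model, not the tiny smoke-test model.
for dtype in ("fp32", "fp16", "int8", "int4"):
    print(f"{dtype}: {weight_memory_gb(124_000_000, dtype):.3f} GB")
```

Real quantized artifacts usually come out somewhat larger than this estimate, because per-group scales are stored alongside the weights and some layers are often left unquantized.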

Important scope note

The hands_on_lab spec asks for GPTQ, AWQ, or GGUF-style quantization evidence. This workspace does not claim to replace that.

Instead, it uses a local INT8 smoke quantizer for GPT-2-style layers so the benchmark/eval path can be validated in this environment. For the real hands_on_lab, keep the same evaluation flow and swap in:

a GPTQ artifact,
an AWQ artifact,
or a GGUF-served model you can benchmark consistently.
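
To show what an INT8 smoke quantizer does in principle, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is not the quantizer used in this workspace, and the function names and sample weights are hypothetical. GPTQ and AWQ use more careful, calibration-based schemes.

```python
def quantize_int8(weights):
    """Symmetric per-tensor INT8: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize_int8(q, scale):
    """Recover approximate float weights from INT8 values and the scale."""
    return [v * scale for v in q]


# Hypothetical weights, chosen only to illustrate the round trip.
weights = [0.5, -1.27, 0.0, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The round-trip error per weight is bounded by half the scale, which is why a single outlier weight (a large `max_abs`) degrades precision for the whole tensor.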

Commands

python3 train.py --config config.yaml --max-train-samples 8 --max-eval-samples 4
python3 eval.py --config config.yaml --adapter-path outputs/lora_adapter --max-eval-samples 4
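
The internals of eval.py are not reproduced here. As a rough sketch of how a tokens/sec figure can be measured repeatably, the snippet below times a generation callable over a fixed prompt set. `generate_fn`, `fake_generate`, and the `repeats` parameter are hypothetical names introduced for this example.

```python
import time


def tokens_per_second(generate_fn, prompts, repeats=3):
    """Return the best observed generated-tokens-per-second over several runs.

    generate_fn is a hypothetical callable that takes a prompt and returns
    the number of tokens it generated.
    """
    best = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        total_tokens = sum(generate_fn(p) for p in prompts)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, total_tokens / elapsed)
    return best


def fake_generate(prompt):
    # Stand-in workload so the sketch runs without loading a model.
    time.sleep(0.001)
    return len(prompt.split())


rate = tokens_per_second(fake_generate, ["a b c", "d e"])
print(f"{rate:.1f} tokens/sec")
```

Taking the best of several runs reduces noise from warm-up and background load. Whichever aggregation you choose, apply the same one to the base, quantized, and LoRA variants so the comparison stays fair.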

Smoke validation snapshot

The local tiny-model smoke path completed in this workspace.

base fp16-style weight memory: 0.000191 GB
local int8 smoke weight memory: 0.000189 GB
LoRA adapter runtime weight memory: 0.000385 GB
base tokens/sec: 2160.038
local int8 smoke tokens/sec: 1968.874
LoRA adapter tokens/sec: 2050.982

Interpretation:

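the local INT8 stand-in reduced stored weight memory only slightly (0.000191 GB to 0.000189 GB, about 1%)
it did not improve end-to-end CPU throughput here: tokens/sec fell from 2160.038 to 1968.874, about 9% lower
the LoRA path trained and benchmarked successfully, which is the main workflow proof for this module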
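
The relative changes behind this interpretation can be recomputed from the snapshot values above. This is a small sketch; the dictionary layout and the `pct_change` helper are illustrative, not part of eval.py.

```python
def pct_change(baseline: float, value: float) -> float:
    """Percentage change of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


# Values copied from the smoke validation snapshot.
snapshot = {
    "weight_gb": {"base": 0.000191, "int8": 0.000189, "lora": 0.000385},
    "tokens_per_sec": {"base": 2160.038, "int8": 1968.874, "lora": 2050.982},
}

for metric, values in snapshot.items():
    for variant in ("int8", "lora"):
        change = pct_change(values["base"], values[variant])
        print(f"{metric} {variant}: {change:+.1f}% vs base")
```

On a model this small, differences of a few percent can come from run-to-run noise, so treat the direction of the change as more informative than the exact figure.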

Expected conclusion pattern

quantization helps a model fit in memory and usually lowers serving cost
quantization does not erase KV-cache growth
LoRA helps behavior specialization
LoRA is not the right fix for fresh private knowledge
RAG becomes the better move when facts change faster than you want to retrain
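
The KV-cache point can be made concrete with a little arithmetic. The sketch below assumes standard multi-head attention (no grouped-query or multi-query sharing) and a cache kept in fp16 even when the weights are quantized. The function name and the GPT-2-small-like shape (12 layers, hidden size 768) are illustrative assumptions, not settings read from config.yaml.

```python
def kv_cache_gb(layers, hidden_size, seq_len, batch_size=1, bytes_per_value=2):
    """Approximate KV-cache size in GB (1 GB = 1e9 bytes).

    The factor 2 covers one key tensor plus one value tensor per layer.
    Assumes standard multi-head attention, where each token stores
    hidden_size values per tensor per layer.
    """
    values = 2 * layers * hidden_size * seq_len * batch_size
    return values * bytes_per_value / 1e9


# Hypothetical GPT-2-small-like shape; the cache grows linearly with context.
for seq_len in (1024, 4096, 16384):
    print(f"seq_len={seq_len}: {kv_cache_gb(12, 768, seq_len):.4f} GB")
```

Weight-only quantization leaves this term unchanged: it scales with sequence length and batch size, not with how the weights are stored.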