03. Week 6 — Quantization & Fine-Tuning¶

For deep understanding see 02_explainer.md — narrative, diagrams, worked memory math, retrieval prompts. This file is the quick-reference glossary: formulas, tables, definitions, and decision frameworks.

Section 1 — Deployment failure and raw memory math¶

The first question is boring and essential:

Can the model fit?

Raw weight memory¶

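raw_weight_memory ≈ parameter_count × bytes_per_parameter

Raw means weights only: KV cache, activations, and quantization metadata (scales, zero-points) come on top.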

Format	Bytes / parameter
fp32	4
fp16	2
bf16	2
int8	1
int4	0.5

Reference table¶

Model size	fp32	fp16 / bf16	int8	int4 raw
7B	28GB	14GB	7GB	3.5GB
13B	52GB	26GB	13GB	6.5GB
70B	280GB	140GB	70GB	35GB
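The reference table can be reproduced with a few lines of arithmetic. This is a minimal sketch; the function name `raw_weight_memory_gb` is made up for illustration, and GB here means decimal gigabytes (1e9 bytes), which is an assumption about how the table was rounded.

```python
# Bytes per parameter, copied from the format table above.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}

def raw_weight_memory_gb(parameter_count, fmt):
    """Raw weight memory in decimal GB (1 GB = 1e9 bytes). Weights only."""
    return parameter_count * BYTES_PER_PARAM[fmt] / 1e9

for size in (7e9, 13e9, 70e9):
    row = {fmt: raw_weight_memory_gb(size, fmt) for fmt in BYTES_PER_PARAM}
    print(f"{size / 1e9:.0f}B", row)
```

The numbers are a floor, not a budget: a real deployment also needs room for the KV cache, activations, and runtime overhead.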

Cross-ref: 02_explainer.md §1.1-§1.2, §2.5.

Section 2 — Number formats: fp32, fp16, bf16, int8, int4¶

Format	Bits	Practical meaning	Best use
fp32	32	high precision + wide range	reference math, some optimizer states
fp16	16	more local precision than bf16, narrower range	inference, mixed precision
bf16	16	fp32-like exponent range, less mantissa	training default
int8	8	bucketized representation with scale	conservative inference quantization
int4	4	aggressive bucketization	max compression / QLoRA base

Bit structure intuition¶

Format	Sign	Exponent	Mantissa
fp32	1	8	23
fp16	1	5	10
bf16	1	8	7

The key trade-off¶

Precision = how finely nearby values can be separated
Range = how small/large values can be represented without underflow/overflow

Why bf16 training is usually better than fp16¶

Same 16-bit footprint
fp32-like exponent range
Fewer overflows/underflows in training
Less dependence on loss scaling

One-line answer: bf16 sacrifices some local precision to preserve much safer dynamic range, and training usually benefits more from the range.
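The range vs precision trade-off follows directly from the bit counts in the table above. The sketch below derives the largest finite value and the step size near 1.0 from the exponent and mantissa widths, assuming the usual IEEE-style layout (biased exponent, top exponent code reserved for inf/NaN). The helper names are illustrative only.

```python
# (exponent_bits, mantissa_bits), copied from the bit-structure table above.
FORMATS = {"fp32": (8, 23), "fp16": (5, 10), "bf16": (8, 7)}

def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value: all-ones mantissa at the largest normal exponent."""
    bias = 2 ** (exponent_bits - 1) - 1
    return (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias

def spacing_near_one(mantissa_bits):
    """Gap between 1.0 and the next representable value."""
    return 2.0 ** -mantissa_bits

for name, (e_bits, m_bits) in FORMATS.items():
    print(f"{name}: max ~ {max_finite(e_bits, m_bits):.4g}, "
          f"step near 1.0 = {spacing_near_one(m_bits):.3g}")
```

fp16 tops out at 65504 while bf16 reaches roughly 3.4e38, the same order as fp32. In exchange, bf16 resolves values near 1.0 in steps of 2^-7 instead of fp16's 2^-10.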

Cross-ref: 02_explainer.md §2.2-§2.4.

Section 3 — Quantization basics¶

Core equations¶

q = round(w / s)
w_hat = q * s

Where:
w = original floating weight
s = scale
q = integer bucket
w_hat = dequantized approximation

Symmetric quantization intuition¶

choose a max magnitude (typically max |w| over the tensor, channel, or group)
map it to the max integer level (127 for int8, 7 for int4)
round everything else onto that grid
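The three steps above, combined with the core equations, can be sketched in plain Python. This is an illustrative abs-max scheme with one scale for the whole list; the function names and sample weights are made up, and the clamp is a safety guard that never triggers when the scale comes from the maximum magnitude.

```python
def quantize_symmetric(weights, bits=8):
    """Symmetric quantization: q = round(w / s), with s = max|w| / q_max."""
    q_max = 2 ** (bits - 1) - 1                 # 127 for int8, 7 for int4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    q = [max(-q_max, min(q_max, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """w_hat = q * s."""
    return [qi * scale for qi in q]

weights = [0.50, -1.27, 0.013, 0.90]
q, s = quantize_symmetric(weights, bits=8)
w_hat = dequantize(q, s)
errors = [abs(w - wh) for w, wh in zip(weights, w_hat)]
print("q     =", q)
print("scale =", s)
print("max rounding error =", max(errors))
```

Every rounding error is bounded by half a grid step (s / 2), which is why a smaller scale, and therefore a narrower range, gives a finer approximation.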

The inevitable cost¶

rounding error
information loss
possible quality degradation on edge cases

Per-tensor vs per-channel¶

Scheme	Scale granularity	Pros	Cons
per-tensor	one scale per tensor	simple, low metadata	small-magnitude channels can round to zero
per-channel	one scale per channel / row / output dim	better quality	more metadata / complexity
group-wise	one scale per block	compromise	tuning/runtime complexity
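A toy comparison makes the per-tensor vs per-channel row concrete. The matrix below is invented for illustration: one channel has large weights, the other has weights roughly 300x smaller. Treating each row as a channel is an assumption; real layouts vary by framework.

```python
def max_abs(values):
    return max(abs(v) for v in values)

def quantize_row(row, scale, q_max=127):
    """Round one row onto the int8 grid defined by `scale`."""
    return [max(-q_max, min(q_max, round(w / scale))) for w in row]

W = [
    [8.0, -6.0, 3.0],       # large-magnitude channel
    [0.02, -0.01, 0.03],    # small-magnitude channel
]
Q_MAX = 127

# Per-tensor: one scale shared by every row.
tensor_scale = max_abs([w for row in W for w in row]) / Q_MAX
per_tensor = [quantize_row(row, tensor_scale) for row in W]

# Per-channel: each row gets its own scale.
per_channel = [quantize_row(row, max_abs(row) / Q_MAX) for row in W]

print("per-tensor :", per_tensor)
print("per-channel:", per_channel)
```

With the shared scale, every weight in the small channel rounds to 0 and the channel is effectively deleted. With its own scale, the same channel uses the full integer range.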

Cross-ref: 02_explainer.md §3.1-§3.3.

Section 4 — GPTQ, AWQ, and what they actually optimize¶

GPTQ¶

post-training quantization
calibration-based
tries to minimize layer-output reconstruction error on representative inputs
excellent for offline deployment artifacts

Memory hook: GPTQ = “protect the output behavior during rounding.”
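The quantity GPTQ cares about can be illustrated without implementing GPTQ itself. The sketch below only measures layer-output reconstruction error, the squared difference between W @ x and W_hat @ x over calibration inputs; it does not perform GPTQ's actual weight-update procedure. All values and names are made up for illustration.

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def output_reconstruction_error(W, W_hat, calibration_inputs):
    """Sum over inputs of ||W @ x - W_hat @ x||^2."""
    total = 0.0
    for x in calibration_inputs:
        y, y_hat = matvec(W, x), matvec(W_hat, x)
        total += sum((a - b) ** 2 for a, b in zip(y, y_hat))
    return total

W = [[0.4, 0.4]]
# Calibration data in which the first input feature is much larger.
calibration_inputs = [[10.0, 0.1], [8.0, -0.2]]

# Two candidate roundings onto a grid with step 1.0.
# Both move one weight by 0.4 and the other by 0.6, so weight-space error ties.
candidate_a = [[0.0, 1.0]]
candidate_b = [[1.0, 0.0]]

err_a = output_reconstruction_error(W, candidate_a, calibration_inputs)
err_b = output_reconstruction_error(W, candidate_b, calibration_inputs)
print("output error A:", err_a)
print("output error B:", err_b)
```

The two candidates are equally far from W in weight space, yet A disturbs the layer output far less on this data. That gap is what a calibration-based method can exploit and plain round-to-nearest cannot see.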

AWQ¶

activation-aware weight quantization
identifies salient channels/weights using representative activations
protects weights that matter most under real activations

Memory hook: AWQ = “importance is weight × activation, not weight alone.”
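The memory hook can be turned into a toy saliency score. This is a rough proxy (mean |activation| times mean |weight| per input channel), not the published AWQ procedure, which goes on to search for per-channel scaling factors. The matrix, activations, and function names are invented for illustration.

```python
def mean_abs(values):
    return sum(abs(v) for v in values) / len(values)

def input_channel_importance(W, activations):
    """Toy score for input channel j: mean|x_j| * mean|W[:, j]|."""
    scores = []
    for j in range(len(W[0])):
        act = mean_abs([x[j] for x in activations])
        wgt = mean_abs([row[j] for row in W])
        scores.append(act * wgt)
    return scores

W = [
    [0.9, 0.1, 0.5],
    [-0.8, 0.2, 0.4],
]
# Representative activations: channel 1 is consistently large.
activations = [
    [0.1, 9.0, 1.0],
    [0.2, 11.0, 0.5],
]

scores = input_channel_importance(W, activations)
most_salient = max(range(len(scores)), key=scores.__getitem__)
weight_only = [mean_abs([row[j] for row in W]) for j in range(len(W[0]))]
largest_weights = max(range(len(weight_only)), key=weight_only.__getitem__)
print("scores:", scores)
print("most salient channel:", most_salient, "| largest-weight channel:", largest_weights)
```

Channel 0 has the largest weights but channel 1 dominates once activations are taken into account, so ranking by weight magnitude alone would protect the wrong channel.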

Important distinction¶

GGUF is a model file format / ecosystem artifact, not the same thing as the quantization algorithm itself.

What tends to degrade first at int4¶

exact formatting
code quality
multilingual edge cases
small-label classification boundaries
multimodal quality

Cross-ref: 02_explainer.md §3.4-§3.7.

Section 5 — KV cache, MQA, GQA, PagedAttention¶

KV cache memory¶

KV bytes ≈ concurrency × seq_len × layers × kv_heads × head_dim × 2 × bytes_per_value

2 is for K and V.

70B-style 8K example (GQA)¶

Assume:
80 layers
8 KV heads
head dim 128
8K context (8,192 tokens)
bf16 cache (2 bytes)

Then the KV cache per request is roughly:

8,192 × 80 × 8 × 128 × 2 × 2 bytes ≈ 2.68 × 10^9 bytes ≈ 2.5 GiB

So 8 concurrent 8K requests need:

≈ 20 GiB KV cache (about 21.5 GB)
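The worked example maps one-to-one onto the KV cache formula. A minimal sketch, assuming 8K means 8,192 tokens and reporting in GiB (2^30 bytes); the function name is illustrative.

```python
def kv_cache_bytes(concurrency, seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV bytes = concurrency * seq_len * layers * kv_heads * head_dim * 2 * bytes_per_value."""
    return concurrency * seq_len * layers * kv_heads * head_dim * 2 * bytes_per_value

GIB = 2 ** 30

per_request = kv_cache_bytes(1, 8192, 80, 8, 128)
eight_requests = kv_cache_bytes(8, 8192, 80, 8, 128)
print(f"per request: {per_request / GIB:.2f} GiB")
print(f"8 requests : {eight_requests / GIB:.2f} GiB")
```

The cost is linear in both concurrency and sequence length, so doubling either one doubles the cache regardless of how aggressively the weights were quantized.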

MQA vs GQA vs standard MHA¶

Method	KV heads	Memory effect	Trade-off
MHA	one KV head per query head	baseline	highest flexibility
GQA	fewer KV heads than query heads	big KV savings	small quality trade-off
MQA	one shared KV head	massive KV savings	more aggressive sharing
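The memory column can be quantified by changing only the KV head count. The sketch reuses the 80-layer, head-dim-128, bf16 setup from the example in this section; the count of 64 query heads is an assumption added here for the MHA baseline and does not come from this file.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """Bytes of K and V stored for one token across all layers."""
    return layers * kv_heads * head_dim * 2 * bytes_per_value

LAYERS, HEAD_DIM = 80, 128
QUERY_HEADS = 64   # assumption: MHA keeps one KV head per query head

variants = {"MHA": QUERY_HEADS, "GQA": 8, "MQA": 1}
baseline = kv_bytes_per_token(LAYERS, QUERY_HEADS, HEAD_DIM)

for name, kv_heads in variants.items():
    per_token = kv_bytes_per_token(LAYERS, kv_heads, HEAD_DIM)
    print(f"{name}: {per_token / 1024:.0f} KiB per token "
          f"(MHA / {name} = {baseline // per_token}x)")
```

Under these assumptions GQA with 8 KV heads stores 8x less than MHA, and MQA stores 64x less. The saving is exactly the ratio of query heads to KV heads.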

PagedAttention¶

allocates KV cache in blocks/pages
reduces fragmentation
improves utilization under variable request lengths
enables better serving throughput
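The utilization claim can be illustrated by counting reserved token slots. This is a toy accounting model, not vLLM's allocator: it ignores block tables, sharing, and eviction. The block size of 16 tokens and the request lengths are assumptions chosen for illustration.

```python
import math

def contiguous_slots(request_lengths, max_seq_len):
    """Slots reserved when every request pre-allocates max_seq_len tokens."""
    return len(request_lengths) * max_seq_len

def paged_slots(request_lengths, block_size=16):
    """Slots reserved when each request holds just enough fixed-size blocks."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)

lengths = [100, 900, 37, 2048]      # actual tokens per request
used = sum(lengths)

reserved_contiguous = contiguous_slots(lengths, max_seq_len=2048)
reserved_paged = paged_slots(lengths, block_size=16)

print(f"contiguous: {used}/{reserved_contiguous} slots used "
      f"({used / reserved_contiguous:.1%})")
print(f"paged     : {used}/{reserved_paged} slots used "
      f"({used / reserved_paged:.1%})")
```

With paging, waste is limited to the unfilled tail of each request's last block, so short requests no longer reserve memory sized for the longest possible sequence.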

Cross-ref: 02_explainer.md §4.1-§4.6.

Section 6 — LoRA, QLoRA, and adapter methods¶

LoRA formula¶

W_finetuned = W_base + (alpha / r) * A @ B

Where:
W_base is frozen (shape d x k)
A has shape d x r
B has shape r x k
r is small (8, 16, 32, 64)
alpha is a scaling hyperparameter

Why LoRA is efficient¶

For a 4096 x 4096 matrix:
full update = 16,777,216 params (4096 × 4096)
rank-16 LoRA = 131,072 params (4096 × 16 + 16 × 4096)
about 0.78% of full matrix params
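Both the parameter count and the merge formula are short enough to check by hand. The sketch uses this file's convention (A is d x r, B is r x k) and plain lists instead of a tensor library; the function names and the tiny 2 x 2 example are illustrative only.

```python
def lora_param_count(d, k, r):
    """Trainable parameters in A (d x r) plus B (r x k)."""
    return d * r + r * k

def lora_merge(W_base, A, B, alpha, r):
    """Return W_base + (alpha / r) * A @ B for nested-list matrices."""
    scale = alpha / r
    d, k = len(W_base), len(W_base[0])
    return [
        [W_base[i][j] + scale * sum(A[i][t] * B[t][j] for t in range(r))
         for j in range(k)]
        for i in range(d)
    ]

d = k = 4096
full = d * k
lora = lora_param_count(d, k, r=16)
print(full, lora, f"{lora / full:.2%}")

# Tiny rank-1 merge to show the shapes line up.
W_base = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0], [2.0]]          # d x r = 2 x 1
B = [[0.5, 0.0]]            # r x k = 1 x 2
merged = lora_merge(W_base, A, B, alpha=2, r=1)
print(merged)
```

Only A and B receive gradients during training. Merging them into W_base afterwards yields a single dense matrix, so inference needs no extra matrix multiply.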

QLoRA¶

frozen 4-bit base model
train LoRA adapters on top
huge memory savings relative to full fine-tuning
great for limited-hardware adaptation

Other PEFT methods¶

Method	Idea	Typical use
prompt tuning	learn soft prompt vectors	very lightweight steering
prefix tuning	learn prefix vectors injected into attention at each layer	more expressive than prompt tuning
classic adapters	insert small trainable modules	explicit modular adaptation
IA3-style methods	learn scaling vectors	ultra-light adaptation

Cross-ref: 02_explainer.md §5.1-§5.5.

Section 7 — Decision framework: prompt vs fine-tune vs RAG¶

Problem smell	Best first lever	Why
Format / instruction quality is weak	prompt	cheapest change
Knowledge is changing / private / fresh	RAG	do not bake volatile knowledge into weights
Stable domain behavior / tone / schema needs repetition	LoRA / PEFT	behavior change is systematic
Large, strategic capability shift with lots of data + budget	full fine-tune	highest capacity, highest cost
Limited hardware but need learned adaptation	QLoRA	compressed base + small trainable overlay

Cross-ref: 02_explainer.md §5.6, §6.6.

Section 8 — Quick interview answers¶

Why bf16 over fp16 for training?¶

Because bf16 keeps the 8-bit exponent of fp32, so it is much more numerically stable for large/small activations and gradients.

Why per-channel over per-tensor?¶

Because different channels often have different natural scales; one global scale can erase small-but-important channels.

GPTQ vs AWQ?¶

GPTQ minimizes reconstruction error on calibration data; AWQ uses activation importance to protect salient weights/channels.

Why can a quantized model still OOM?¶

Because KV cache grows with sequence length and concurrency and is often still stored in fp16/bf16.

Why LoRA?¶

Because full fine-tuning updates every weight, while LoRA freezes the base and learns a low-rank overlay with a tiny fraction of the parameters.

When is RAG better than fine-tuning?¶

When the missing capability is knowledge freshness / private-doc access, not stable behavior change.

Reading list¶

02_explainer.md — primary narrative
LoRA paper (Hu et al., 2021)
QLoRA paper (Dettmers et al., 2023)
GPTQ + AWQ papers / blogs
vLLM / PagedAttention blog posts

Reference material¶

YouTube¶

Tim Dettmers | QLoRA: Efficient Finetuning of Quantized Large Language Models — best single talk on why 4-bit bases + adapters changed practical fine-tuning.
A Hacker's Guide to Language Models — practical intuition for adaptation choices, evals, and serving reality.

Blogs¶

Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA — practical walkthrough of 4-bit loading and adapter fine-tuning.
PEFT: Parameter-Efficient Fine-Tuning of Billion-Scale Models on Low-Resource Hardware — overview of the major PEFT family and trade-offs.

Self-check¶

Do raw weight-memory math for 7B, 13B, and 70B. (explainer §1-§2)
Explain bf16 vs fp16 without hand-waving. (§2.3-§2.4)
Explain per-tensor vs per-channel with numbers. (§3.2-§3.3)
Derive why KV cache scales with seq length and concurrency. (§4.2)
Explain LoRA vs QLoRA vs RAG as different tools for different problems. (§5.4-§5.6)