03. Week 6 — Quantization & Fine-Tuning

For deep understanding see 02_explainer.md — narrative, diagrams, worked memory math, retrieval prompts. This file is the quick-reference glossary: formulas, tables, definitions, and decision frameworks.

Section 1 — Deployment failure and raw memory math

The first question is boring and essential:

Can the model fit?

Raw weight memory

raw_weight_memory ≈ parameter_count × bytes_per_parameter

| Format | Bytes / parameter |
| --- | --- |
| fp32 | 4 |
| fp16 | 2 |
| bf16 | 2 |
| int8 | 1 |
| int4 | 0.5 |

Reference table

| Model size | fp32 | fp16 / bf16 | int8 | int4 (raw) |
| --- | --- | --- | --- | --- |
| 7B | 28 GB | 14 GB | 7 GB | 3.5 GB |
| 13B | 52 GB | 26 GB | 13 GB | 6.5 GB |
| 70B | 280 GB | 140 GB | 70 GB | 35 GB |
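
The reference table can be reproduced with a few lines of Python. This is a minimal sketch that assumes decimal gigabytes (1 GB = 1e9 bytes) and counts weights only; the function and dictionary names are illustrative.

```python
# Bytes per parameter for each format, as listed in this section.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def raw_weight_gb(param_count: float, fmt: str) -> float:
    """Raw weight memory in decimal gigabytes (weights only, no KV cache or overhead)."""
    return param_count * BYTES_PER_PARAM[fmt] / 1e9


for size in (7e9, 13e9, 70e9):
    row = {fmt: raw_weight_gb(size, fmt) for fmt in BYTES_PER_PARAM}
    print(f"{size / 1e9:.0f}B", row)
```

Real deployments need headroom on top of these numbers for activations, KV cache, and runtime buffers.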

Cross-ref: 02_explainer.md §1.1-§1.2, §2.5.

Section 2 — Number formats: fp32, fp16, bf16, int8, int4

| Format | Bits | Practical meaning | Best use |
| --- | --- | --- | --- |
| fp32 | 32 | high precision + wide range | reference math, some optimizer states |
| fp16 | 16 | more local precision than bf16, narrower range | inference, mixed precision |
| bf16 | 16 | fp32-like exponent range, less mantissa | training default |
| int8 | 8 | bucketized representation with scale | conservative inference quantization |
| int4 | 4 | aggressive bucketization | max compression / QLoRA base |

Bit structure intuition

| Format | Sign bits | Exponent bits | Mantissa bits |
| --- | --- | --- | --- |
| fp32 | 1 | 8 | 23 |
| fp16 | 1 | 5 | 10 |
| bf16 | 1 | 8 | 7 |

The key trade-off

  • Precision = how finely nearby values can be separated
  • Range = how small/large values can be represented without underflow/overflow

Why bf16 training is usually better than fp16

  • Same 16-bit footprint
  • fp32-like exponent range
  • Fewer overflows/underflows in training
  • Less dependence on loss scaling

One-line answer: bf16 sacrifices some local precision to preserve much safer dynamic range, and training usually benefits more from the range.
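
The range/precision trade-off follows directly from the exponent and mantissa bit counts. The sketch below derives the largest finite value and the spacing just above 1.0 for each format, assuming the usual IEEE-style layout (biased exponent, all-ones exponent reserved for inf/NaN); `format_limits` is an illustrative helper, not a library function.

```python
def format_limits(exponent_bits: int, mantissa_bits: int) -> tuple[float, float]:
    """Return (largest finite value, spacing just above 1.0) for an IEEE-style format."""
    bias = 2 ** (exponent_bits - 1) - 1
    # Largest normal number: all mantissa bits set, largest non-reserved exponent.
    max_finite = (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    # Gap between 1.0 and the next representable value.
    epsilon = 2.0 ** -mantissa_bits
    return max_finite, epsilon


FORMATS = {"fp32": (8, 23), "fp16": (5, 10), "bf16": (8, 7)}
for name, (e_bits, m_bits) in FORMATS.items():
    max_finite, eps = format_limits(e_bits, m_bits)
    print(f"{name}: max ~ {max_finite:.3e}, epsilon = {eps:.3e}")
```

fp16 tops out at 65,504 while bf16 reaches roughly the same ceiling as fp32, at the price of a coarser step between neighboring values.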

Cross-ref: 02_explainer.md §2.2-§2.4.

Section 3 — Quantization basics

Core equations

q = round(w / s)
w_hat = q * s

Where:

  • w = original floating-point weight
  • s = scale
  • q = integer bucket
  • w_hat = dequantized approximation

Symmetric quantization intuition

  • choose a max magnitude
  • map it to max integer level
  • round everything else onto that grid
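
The three steps can be sketched in pure Python as follows. This is a minimal round-to-nearest example with one scale for the whole list; the function names and the sample weights are illustrative, and production quantizers add details (grouping, packing, outlier handling) that are not shown.

```python
def quantize_symmetric(weights, bits=8):
    """Symmetric round-to-nearest quantization with a single scale."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 7 for int4
    max_abs = max(abs(w) for w in weights)          # step 1: choose a max magnitude
    scale = max_abs / qmax if max_abs > 0 else 1.0  # step 2: map it to the max level
    # Step 3: round everything onto the grid and clamp to the valid range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """w_hat = q * s"""
    return [qi * scale for qi in q]


weights = [0.5, -1.27, 0.013, 1.0]
q, scale = quantize_symmetric(weights)
w_hat = dequantize(q, scale)
max_error = max(abs(w - wh) for w, wh in zip(weights, w_hat))
print(q, scale, max_error)
```

For values inside the clamped range, the rounding error per weight is at most half a grid step (scale / 2).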

The inevitable cost

  • rounding error
  • information loss
  • possible quality degradation on edge cases

Per-tensor vs per-channel

| Scheme | Scale granularity | Pros | Cons |
| --- | --- | --- | --- |
| per-tensor | one scale per tensor | simple, low metadata | small channels can get crushed |
| per-channel | one scale per channel / row / output dim | better quality | more metadata / complexity |
| group-wise | one scale per block | compromise | tuning/runtime complexity |
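
A small numeric example makes the "crushed channel" problem concrete. The weight matrix below is hypothetical: one row has large values and the other has tiny ones. With a single int8 scale the small row rounds entirely to zero; with one scale per row it keeps its structure.

```python
def quantize_row(row, scale, qmax=127):
    """Round one row onto the integer grid defined by `scale`."""
    return [max(-qmax, min(qmax, round(w / scale))) for w in row]


def max_abs(values):
    return max(abs(v) for v in values)


# Hypothetical 2-channel weight matrix: channel 0 is large, channel 1 is tiny.
W = [[10.0, -8.0, 6.0],
     [0.02, -0.012, 0.015]]
QMAX = 127

# Per-tensor: one scale derived from the global max magnitude.
tensor_scale = max_abs([w for row in W for w in row]) / QMAX
per_tensor = [quantize_row(row, tensor_scale) for row in W]

# Per-channel: one scale per row.
channel_scales = [max_abs(row) / QMAX for row in W]
per_channel = [quantize_row(row, s) for row, s in zip(W, channel_scales)]

print("per-tensor :", per_tensor)
print("per-channel:", per_channel)
```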

Cross-ref: 02_explainer.md §3.1-§3.3.

Section 4 — GPTQ, AWQ, and what they actually optimize

GPTQ

  • post-training quantization
  • calibration-based
  • tries to minimize layer-output reconstruction error on representative inputs
  • excellent for offline deployment artifacts

Memory hook: GPTQ = “protect the output behavior during rounding.”
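
To see why output error is a different target from weight error, consider the toy layer below. This sketch only evaluates the objective (mean squared difference in layer outputs on calibration inputs); it is not the GPTQ algorithm, which additionally uses approximate second-order information to compensate for rounding. All matrices and names are hypothetical.

```python
def matvec(W, x):
    """y = W @ x for a weight matrix W (rows = output dims) and input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def output_error(W, W_hat, calibration_inputs):
    """Mean squared difference between original and quantized layer outputs."""
    total, count = 0.0, 0
    for x in calibration_inputs:
        for y, y_hat in zip(matvec(W, x), matvec(W_hat, x)):
            total += (y - y_hat) ** 2
            count += 1
    return total / count


W = [[0.30, -0.70], [0.55, 0.20]]
calibration_inputs = [[1.0, 0.1], [2.0, -0.1], [1.5, 0.0]]

# Two hypothetical roundings with the same per-weight error (0.05),
# placed on different input columns.
W_hat_a = [[0.35, -0.70], [0.60, 0.20]]  # error on the column that sees large inputs
W_hat_b = [[0.30, -0.65], [0.55, 0.25]]  # error on the column that sees small inputs

err_a = output_error(W, W_hat_a, calibration_inputs)
err_b = output_error(W, W_hat_b, calibration_inputs)
print(err_a, err_b)
```

Both candidates have identical weight-space error, yet the first one distorts the layer output far more on this calibration set.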

AWQ

  • activation-aware weight quantization
  • identifies salient channels/weights using representative activations
  • protects weights that matter most under real activations

Memory hook: AWQ = “importance is weight × activation, not weight alone.”
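
The salience idea can be illustrated by ranking input channels by their average activation magnitude over a calibration batch. This is a simplified sketch of the selection signal only; how AWQ then protects those channels (per-channel scaling before quantization) is not shown, and the activation values are made up.

```python
def channel_salience(activations):
    """Mean absolute activation per input channel over a calibration batch."""
    n = len(activations)
    channels = len(activations[0])
    return [sum(abs(x[c]) for x in activations) / n for c in range(channels)]


# Hypothetical calibration activations: 3 samples, 3 input channels.
activations = [[0.1, 4.0, 0.2],
               [0.2, 5.0, 0.1],
               [0.1, 3.0, 0.3]]

salience = channel_salience(activations)
most_salient = max(range(len(salience)), key=salience.__getitem__)
print(salience, most_salient)
```

Weights attached to channel 1 would be the ones to protect here, regardless of how large or small those weights are on their own.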

Important distinction

GGUF is a model file format / ecosystem artifact, not the same thing as the quantization algorithm itself.

What tends to degrade first at int4

  • exact formatting
  • code quality
  • multilingual edge cases
  • small-label classification boundaries
  • multimodal quality

Cross-ref: 02_explainer.md §3.4-§3.7.

Section 5 — KV cache, MQA, GQA, PagedAttention

KV cache memory

KV bytes ≈ concurrency × seq_len × layers × kv_heads × head_dim × 2 × bytes_per_value

The factor of 2 accounts for storing both K and V.

70B-style 8K example (GQA)

Assume:

  • 80 layers
  • 8 KV heads
  • head dim 128
  • 8K context (8,192 tokens)
  • bf16 cache (2 bytes per value)

Per request, that is roughly:

≈ 2.5 GiB (about 2.7 GB)

So 8 concurrent 8K requests need:

≈ 20 GiB KV cache (about 21.5 GB)
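
The worked example can be checked directly from the formula. The sketch below assumes 8K means 8,192 tokens; with that assumption the ≈ 2.5 and ≈ 20 figures come out exactly in binary gigabytes (GiB). The function name is illustrative.

```python
def kv_cache_bytes(concurrency, seq_len, layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache size in bytes; the factor 2 covers K and V."""
    return concurrency * seq_len * layers * kv_heads * head_dim * 2 * bytes_per_value


GIB = 2 ** 30

per_request = kv_cache_bytes(1, 8192, layers=80, kv_heads=8, head_dim=128)
eight_requests = kv_cache_bytes(8, 8192, layers=80, kv_heads=8, head_dim=128)

print(per_request / GIB, eight_requests / GIB)  # 2.5 20.0
```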

MQA vs GQA vs standard MHA

| Method | KV heads | Memory effect | Trade-off |
| --- | --- | --- | --- |
| MHA | one KV per query head | baseline | highest flexibility |
| GQA | fewer KV heads than query heads | big KV savings | small quality trade-off |
| MQA | one shared KV head | massive KV savings | more aggressive sharing |
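
Plugging different KV head counts into the KV cache formula shows the size of the savings. The layer count and head dim follow the 70B-style example; the query head count of 64 is an assumption made for illustration.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache bytes needed for one token of one request."""
    return layers * kv_heads * head_dim * 2 * bytes_per_value


LAYERS, HEAD_DIM = 80, 128
QUERY_HEADS = 64  # assumed value, used as the MHA KV head count

variants = {"MHA": QUERY_HEADS, "GQA": 8, "MQA": 1}
per_token = {name: kv_bytes_per_token(LAYERS, h, HEAD_DIM) for name, h in variants.items()}
savings_vs_mha = {name: per_token["MHA"] / b for name, b in per_token.items()}

for name in variants:
    print(f"{name}: {per_token[name]:,} bytes/token, {savings_vs_mha[name]:.0f}x smaller than MHA")
```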

PagedAttention

  • allocates KV cache in blocks/pages
  • reduces fragmentation
  • improves utilization under variable request lengths
  • enables better serving throughput
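
The fragmentation benefit can be shown with simple slot accounting. This toy sketch compares reserving a full max-length region per request against allocating fixed-size blocks on demand; it is not vLLM's implementation, and the block size of 16 tokens and the request lengths are assumptions.

```python
import math


def contiguous_slots(request_lengths, max_len):
    """Token slots reserved when every request gets a full max_len region up front."""
    return len(request_lengths) * max_len


def paged_slots(request_lengths, block_size):
    """Token slots reserved when each request only holds the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in request_lengths)


lengths = [100, 900, 2000]          # hypothetical current lengths of three requests
used = sum(lengths)

reserved_contiguous = contiguous_slots(lengths, max_len=2048)
reserved_paged = paged_slots(lengths, block_size=16)

print("used:", used)
print("contiguous waste:", reserved_contiguous - used)
print("paged waste:", reserved_paged - used)
```

With paging, waste is bounded by less than one block per request, which is what lets a server pack more concurrent requests into the same memory.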

Cross-ref: 02_explainer.md §4.1-§4.6.

Section 6 — LoRA, QLoRA, and adapter methods

LoRA formula

W_finetuned = W_base + (alpha / r) * A @ B

Where:

  • W_base is frozen
  • A has shape d x r
  • B has shape r x k
  • r is small (8, 16, 32, 64)
  • alpha is a scaling hyperparameter
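
The formula can be traced on a tiny example. This sketch uses plain Python lists, follows this file's convention (A is d x r, B is r x k), and uses made-up values with rank 1; in practice the merge is done by a PEFT library on real tensors.

```python
def matmul(X, Y):
    """Plain-Python matrix product of X (m x n) and Y (n x p)."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def merge_lora(W_base, A, B, alpha):
    """W_finetuned = W_base + (alpha / r) * A @ B"""
    r = len(B)                      # rank = inner dimension shared by A and B
    delta = matmul(A, B)            # d x k low-rank update
    scale = alpha / r
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W_base, delta)]


W_base = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]          # d = 2, k = 3 (frozen)
A = [[1.0], [2.0]]                  # d x r with r = 1
B = [[0.5, 0.0, -0.5]]              # r x k

W_ft = merge_lora(W_base, A, B, alpha=2.0)
print(W_ft)  # [[2.0, 0.0, -1.0], [2.0, 1.0, -2.0]]
```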

Why LoRA is efficient

For a 4096 x 4096 matrix:

  • full update = 4096 × 4096 = 16,777,216 params
  • rank-16 LoRA = 16 × (4096 + 4096) = 131,072 params
  • about 0.78% of full matrix params
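
The same count generalizes to any matrix shape and rank: A contributes d × r parameters and B contributes r × k. A minimal sketch, with an illustrative helper name:

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable parameters in a rank-r LoRA pair: A is d x r, B is r x k."""
    return r * (d + k)


full = 4096 * 4096
lora = lora_params(4096, 4096, 16)
fraction = lora / full

print(full, lora, f"{fraction:.2%}")  # 16777216 131072 0.78%
```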

QLoRA

  • frozen 4-bit base model
  • train LoRA adapters on top
  • huge memory savings relative to full fine-tuning
  • great for limited-hardware adaptation
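
A rough estimate shows where the savings come from. The sketch below rests on loudly stated assumptions: full fine-tuning with mixed-precision Adam costs about 16 bytes per trainable parameter (bf16 weights and gradients plus fp32 master weights and two Adam moments), the 4-bit base costs 0.5 bytes per parameter with quantization metadata ignored, the adapter size of 40M parameters is hypothetical, and activation memory is excluded in both cases.

```python
def full_finetune_gb(params: float, bytes_per_param: float = 16.0) -> float:
    """Weights + gradients + optimizer state for full fine-tuning (assumed 16 B/param)."""
    return params * bytes_per_param / 1e9


def qlora_gb(base_params: float, adapter_params: float,
             base_bytes: float = 0.5, adapter_bytes: float = 16.0) -> float:
    """Frozen 4-bit base plus fully trained adapters (same assumed 16 B/param)."""
    return (base_params * base_bytes + adapter_params * adapter_bytes) / 1e9


base = 7e9          # 7B model
adapters = 40e6     # hypothetical adapter parameter count

full_gb = full_finetune_gb(base)
qlora_total_gb = qlora_gb(base, adapters)
print(f"full fine-tune ~ {full_gb:.0f} GB, QLoRA ~ {qlora_total_gb:.2f} GB (before activations)")
```

Under these assumptions the frozen 4-bit base dominates the QLoRA footprint, while the trainable state is well under 1 GB.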

Other PEFT methods

| Method | Idea | Typical use |
| --- | --- | --- |
| prompt tuning | learn soft prompt vectors | very lightweight steering |
| prefix tuning | learn trainable attention prefixes | stronger than soft prompts |
| classic adapters | insert small trainable modules | explicit modular adaptation |
| IA3-style methods | learn scaling vectors | ultra-light adaptation |

Cross-ref: 02_explainer.md §5.1-§5.5.

Section 7 — Decision framework: prompt vs fine-tune vs RAG

| Problem smell | Best first lever | Why |
| --- | --- | --- |
| Format / instruction quality is weak | prompt | cheapest change |
| Knowledge is changing / private / fresh | RAG | do not bake volatile knowledge into weights |
| Stable domain behavior / tone / schema needs repetition | LoRA / PEFT | behavior change is systematic |
| Large, strategic capability shift with lots of data + budget | full fine-tune | highest capacity, highest cost |
| Limited hardware but need learned adaptation | QLoRA | compressed base + small trainable overlay |
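
The table can be read as a first-match rule list. The sketch below is one possible ordering of those rules, not a prescribed procedure; the flag names are hypothetical and real decisions usually combine several levers.

```python
def first_lever(*, knowledge_is_volatile: bool, needs_learned_behavior: bool,
                large_budget_and_data: bool, limited_hardware: bool) -> str:
    """Pick the first lever to try, checking the cheapest explanations first."""
    if knowledge_is_volatile:
        return "RAG"                 # do not bake volatile knowledge into weights
    if not needs_learned_behavior:
        return "prompt"              # cheapest change
    if large_budget_and_data:
        return "full fine-tune"      # highest capacity, highest cost
    if limited_hardware:
        return "QLoRA"               # compressed base + small trainable overlay
    return "LoRA / PEFT"             # systematic behavior change


choice = first_lever(knowledge_is_volatile=False, needs_learned_behavior=True,
                     large_budget_and_data=False, limited_hardware=True)
print(choice)  # QLoRA
```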

Cross-ref: 02_explainer.md §5.6, §6.6.

Section 8 — Quick interview answers

Why bf16 over fp16 for training?

Because bf16 keeps the 8-bit exponent of fp32, so it is much more numerically stable for large/small activations and gradients.

Why per-channel over per-tensor?

Because different channels often have different natural scales; one global scale can erase small-but-important channels.

GPTQ vs AWQ?

GPTQ minimizes reconstruction error on calibration data; AWQ uses activation importance to protect salient weights/channels.

Why can a quantized model still OOM?

Because KV cache grows with sequence length and concurrency and is often still stored in fp16/bf16.

Why LoRA?

Because full fine-tuning updates all weights; LoRA learns a low-rank overlay with a tiny fraction of parameters.

When is RAG better than fine-tuning?

When the missing capability is knowledge freshness / private-doc access, not stable behavior change.

Reading list

  1. 02_explainer.md — primary narrative
  2. LoRA paper (Hu et al., 2021)
  3. QLoRA paper (Dettmers et al., 2023)
  4. GPTQ + AWQ papers / blogs
  5. vLLM / PagedAttention blog posts

Self-check

  1. Do raw weight-memory math for 7B, 13B, and 70B. (explainer §1-§2)
  2. Explain bf16 vs fp16 without hand-waving. (§2.3-§2.4)
  3. Explain per-tensor vs per-channel with numbers. (§3.2-§3.3)
  4. Derive why KV cache scales with seq length and concurrency. (§4.2)
  5. Explain LoRA vs QLoRA vs RAG as different tools for different problems. (§5.4-§5.6)