02. Quantization & Fine-Tuning — Narrative Explainer¶
Companion to 03_study_material.md.
That file is the lookup sheet.
This file is the story, the mental picture, and the deployment math.
Table of contents¶
- ELI5 — the sculptor, the blueprint, the field notes, the overlay sketch
- Chapter 1: The 70B model that does not fit
- 1.1 Opening failure
- 1.2 The first memory math
- 1.3 Why this matters to a Lead AI Engineer
- Chapter 2: Numbers are storage decisions
- 2.1 Bits are budget
- 2.2 fp32, fp16, bf16, int8, int4
- 2.3 Precision vs range
- 2.4 Why bf16 training is usually better than fp16
- 2.5 Worked memory calculations
- Chapter 3: Quantization methods
- 3.1 The core trick
- 3.2 A rounding example with real numbers
- 3.3 Per-tensor vs per-channel
- 3.4 GPTQ
- 3.5 AWQ
- 3.6 What usually breaks first
- 3.7 How to choose in practice
- Chapter 4: KV cache and inference optimization
- 4.1 Why weights are only half the story
- 4.2 KV cache memory math
- 4.3 MQA
- 4.4 GQA
- 4.5 PagedAttention
- 4.6 Serving mental models
- Chapter 5: Parameter-efficient fine-tuning
- 5.1 Why full fine-tuning hurts
- 5.2 LoRA
- 5.3 The low-rank picture
- 5.4 QLoRA
- 5.5 Other adapter methods
- 5.6 Full fine-tune vs PEFT vs prompt vs RAG
- 5.7 Retrieval prompts
- 5.8 Honest admission
- Chapter 6: Recap and application
- 6.1 Failure-fix chain
- 6.2 Key points to remember
- 6.3 Important interview questions
- 6.4 Production experience with memory calculations
- 6.5 Apply now — exercises
- 6.6 Foundation-gap audit for Module 07
- 6.7 Bridge to the next module
ELI5 — the sculptor, the blueprint, the field notes, the overlay sketch¶
Imagine a sculptor who owns a huge warehouse full of very detailed blueprints. Those blueprints describe every statue, every curve, every screw position. That full warehouse is powerful. But there is a problem. The construction site is far away. The road is narrow. The truck is small. So the sculptor cannot carry the entire warehouse to the site. Here are the five placeholders for the whole module. - the blueprint = full-precision weights - the field notes = quantized weights - the rounding error = quantization noise - the overlay sketch = LoRA adapter - the site constraint = GPU memory Now keep these in your head. The original detailed blueprint is like fp16 or fp32 weights. Very accurate. Very large. Heavy to carry. The sculptor makes compressed field notes. The notes do not copy every tiny detail. They keep the important shape. This is quantization. You lose some detail. That loss is the rounding error. But if the notes are made carefully, the sculpture still looks correct to most people. Good. Now second problem. Suppose this site is not building a generic statue. Suppose it is building a very special temple gate for one city. The old blueprint is still useful. But the sculptor adds a thin transparent overlay on top with just the custom changes. That thin extra layer is the overlay sketch. This is LoRA. The base model stays mostly unchanged. The task-specific behavior is stored in a small add-on. So the whole module is one sentence. When the blueprint is too large for the site constraint, you carry field notes instead; when the building needs local customization, you add an overlay sketch on top. That is quantization plus parameter-efficient fine-tuning. One more subtle point. The field notes are not magic. If you compress too aggressively, you may lose a wall, a measurement, or a corner. In model terms, that means lower accuracy, worse reasoning on edge cases, or broken output format. So the job is not “compress as much as possible.” The job is “compress enough to fit the site constraint, but not so much that the building becomes unusable.” And if the building needs company-specific details, do not redraw the entire blueprint warehouse. Just add the overlay sketch. That is the engineer’s mindset. Not worship. Trade-off.
Chapter 1: The 70B model that does not fit¶
1.1 Opening failure¶
You have a 70B model.
It needs 140GB in fp16.
Your GPU has 80GB.
You cannot serve it.
What do you do?
Pause there.
This is not a toy problem.
This is the production problem.
A lot of AI-engineering discussion becomes vague because people talk about models as if model choice is only about benchmark scores.
It is not.
A model is also a memory object.
A latency object.
A cost object.
A batching object.
A concurrency object.
Let us do the most basic math.
70B parameters × 2 bytes ≈ 140GB
More precisely, if you use decimal GB, yes, roughly 140GB.
If you use binary GiB, it is a little smaller.
But operationally the conclusion is the same.
An 80GB GPU cannot hold the weights in fp16.
And be careful — weights are not the whole bill.
You still need runtime overhead.
You still need the KV cache.
You still need scratch buffers.
So “140GB vs 80GB” is already impossible before the rest of the system even enters the room.
1.2 The first memory math¶
Here is the raw weight memory table.
| Model size | fp32 | fp16 / bf16 | int8 | int4 raw |
|---|---:|---:|---:|---:|
| 7B | 28GB | 14GB | 7GB | 3.5GB |
| 13B | 52GB | 26GB | 13GB | 6.5GB |
| 70B | 280GB | 140GB | 70GB | 35GB |
Do not memorize only the table.
Memorize the rule.
memory ≈ parameter_count × bytes_per_parameter
That is the first lever.
If fp16 does not fit, you have five broad options.
1. Use more GPUs.
2. Use a smaller model.
3. Quantize the model.
4. Distill to a smaller model.
5. Change the product requirement so the biggest model is not necessary.
Now think like a Lead AI Engineer.
Each option has a consequence.
More GPUs means higher infra cost and often more serving complexity.
Smaller model means possible quality loss.
Quantization means possible quality loss but often much better cost-efficiency.
Distillation means extra engineering effort and eval work.
Changing the requirement means product negotiation, not just technical work.
That is why this module matters.
The engineer who can do this math and frame this trade-off is valuable.
The engineer who only says “use the latest 70B” is expensive and dangerous.
1.3 Why this matters to a Lead AI Engineer¶
At Lead level, the question is almost never “what is quantization?” The question is more like: - Can we serve this model on our current hardware budget? - Can we meet the latency SLO? - Can we survive long-context traffic? - Should we fine-tune, or should we do RAG, or should we just improve the prompt? - Are we paying for weights we do not need? Notice the pattern. The title is not “Senior Benchmark Reader.” The title is Lead AI Engineer. That means deployment decisions. That means cost optimization. That means capacity planning. That means knowing when not to fine-tune. Good. Now we need the language of storage. Because quantization is just storage discipline applied to neural networks.
Chapter 2: Numbers are storage decisions¶
2.1 Bits are budget¶
A number format is not an abstract math decoration. It is a storage budget. It answers two questions. 1. How much detail can I keep? 2. How wide a range can I represent? If you want extreme detail, you usually spend more bits on the mantissa. If you want a huge range, you spend more bits on the exponent. If you want tiny storage, you throw away both. That is the whole trade-off.
2.2 fp32, fp16, bf16, int8, int4¶
Here is the quick map. | Format | Bits | Rough structure | Strength | Weakness | Typical use | |---|---:|---|---|---|---| | fp32 | 32 | 1 sign / 8 exp / 23 mantissa | High precision + wide range | Heavy memory | Training reference, some optimizer states | | fp16 | 16 | 1 sign / 5 exp / 10 mantissa | Good precision per bit | Narrower range | Inference, mixed-precision training | | bf16 | 16 | 1 sign / 8 exp / 7 mantissa | Wide range like fp32 | Less local precision than fp16 | Modern training default | | int8 | 8 | fixed integer levels | 2× smaller than fp16 | Needs scale mapping | Inference quantization | | int4 | 4 | very few integer levels | 4× smaller than fp16 | Much more rounding error | Aggressive inference / QLoRA base | Floating point is dynamic. It can represent very large and very small values because of the exponent. Integer quantization is much more rigid. It says: “I will store only a small set of buckets, then use a scale to map them back.” This is why quantization works surprisingly well for weights. The network can tolerate some bucketization. But not infinite bucketization.
2.3 Precision vs range¶
Let us separate two ideas that beginners mix up. - Precision = how finely you can distinguish nearby values - Range = how far outward you can go without overflow or underflow Now the visual.
Number-line picture — local precision¶
Imagine you want to represent values near 1.0.
More available points means better local precision.
Near 1.0 on the number line
int4 : 0.57 0.71 0.86 1.00 1.14 1.29 1.43
int8 : 0.97 0.98 0.99 1.00 1.01 1.02 1.03
fp16 : many closely packed representable values
bf16 : fewer nearby points than fp16, but still enough for many training ops
fp32 : even denser nearby points
Number-line picture — dynamic range¶
Now look at how far the format can stretch.
Smaller magnitudes <-----------------------------------------------> Larger magnitudes
int4 int8 fp16 bf16 ≈ fp32
very narrow narrow moderate very wide
2.4 Why bf16 training is usually better than fp16¶
Now listen carefully. This is a classic interview question. fp16 and bf16 both use 16 bits. So why is bf16 usually preferred for training? Because bf16 keeps the 8-bit exponent of fp32. That gives it a much wider dynamic range than fp16. fp16 has only a 5-bit exponent. That narrower range means more risk of overflow and underflow during training. In older fp16 training stacks, people often used loss scaling to keep gradients from disappearing. Loss scaling is a workaround. bf16 reduces the need for that workaround. So the answer is not “bf16 is more accurate.” The better answer is: bf16 is usually better for training because it preserves fp32-like range, which improves numerical stability in the presence of widely varying activations and gradients. One sentence more. Training is a stability problem. Inference is more often a storage-and-throughput problem. That is why the best format for training and the best format for deployment are often different.
2.5 Worked memory calculations¶
Let us do the kind of mental math you should be able to do in an interview without a calculator.
Example A — 7B model¶
7B × 2 bytes = 14GB
So a 7B model in fp16 or bf16 is about 14GB just for the weights.
That means a 24GB GPU can often hold it for inference.
Not always comfortably.
But often.
Example B — 13B model in int8¶
13B × 1 byte = 13GB
Good.
Now 13B int8 fits much more comfortably than 13B fp16.
That is the immediate business value of quantization.
Example C — 70B model in int4¶
70B × 0.5 bytes = 35GB
This is the headline number.
And yes, this is why people get excited.
A model that was impossible in fp16 now becomes potentially deployable on a single 80GB GPU.
But be disciplined.
35GB is raw packed weight storage.
Real runtimes add metadata, scales, zero-points, kernels, allocator overhead, and sometimes dequantization workspace.
So the actual footprint may be closer to 38–45GB depending on the method and runtime.
Still wonderful.
But not free.
Example D — Why “fits in memory” is not enough¶
Suppose your 70B int4 model takes 42GB effective memory at load time. On an 80GB GPU, you have about 38GB left. Sounds comfortable. Then long-context traffic arrives. Then concurrency arrives. Then the KV cache arrives. Then you learn the painful lesson: weight memory and serving memory are different line items. That is Chapter 4. Before we go there, we need to understand how quantization is actually done.
Chapter 3: Quantization methods¶
3.1 The core trick¶
Quantization maps a floating-point value to a small integer bucket. Then, during use, we approximately reconstruct it with a scale. Very simple version:
Where: -w = original weight
- s = scale
- q = quantized integer
- w_hat = dequantized approximation used in computation
This is the whole game.
Replace a large continuous space with a smaller discrete grid.
Storage drops.
Speed may improve.
Some information is lost.
That lost information is the rounding error from the ELI5 story.
3.2 A rounding example with real numbers¶
Let us quantize four weights into symmetric int4.
Use the integer range [-7, 7].
Original weights:
W = [0.18, -0.91, 1.62, 2.94]
We choose a single scale based on the largest absolute value.
max_abs = 2.94
s = max_abs / 7 = 2.94 / 7 = 0.42
Now quantize.
0.18 / 0.42 = 0.43 -> round -> 0
-0.91 / 0.42 = -2.17 -> round -> -2
1.62 / 0.42 = 3.86 -> round -> 4
2.94 / 0.42 = 7.00 -> round -> 7
Q = [0, -2, 4, 7]
Now dequantize back.
Now compare.
| Original | Dequantized | Error |
|---:|---:|---:|
| 0.18 | 0.00 | -0.18 |
| -0.91 | -0.84 | +0.07 |
| 1.62 | 1.68 | +0.06 |
| 2.94 | 2.94 | 0.00 |
See the pattern?
The small value 0.18 got crushed hardest.
Why?
Because one global scale had to cover the largest value 2.94.
So tiny values receive coarse buckets.
This is the first reason per-tensor quantization can fail.
3.3 Per-tensor vs per-channel¶
Now we make the example more realistic. Suppose a weight matrix has two output channels. One channel has tiny weights. The other has much larger weights.
Per-tensor quantization¶
One scale for the whole matrix.
Largest absolute value is 2.80.
So:
s_tensor = 2.80 / 7 = 0.40
Quantized rows:
Row 1 / 0.40 -> [ 0.30, -0.45, 0.63, -0.78 ] -> [ 0, 0, 1, -1 ]
Row 2 / 0.40 -> [ 4.00, -5.25, 6.00, -7.00 ] -> [ 4, -5, 6, -7 ]
Per-channel quantization¶
Now give each row its own scale.
For Row 1:
s_row1 = 0.31 / 7 ≈ 0.0443
For Row 2:
s_row2 = 2.80 / 7 = 0.40
Quantize Row 1 separately.
0.12 / 0.0443 ≈ 2.71 -> 3
-0.18 / 0.0443 ≈ -4.06 -> -4
0.25 / 0.0443 ≈ 5.64 -> 6
-0.31 / 0.0443 ≈ -7.00 -> -7
3.4 GPTQ¶
Now we move from “plain rounding” to smarter post-training quantization. GPTQ stands for a calibration-based weight quantization approach. The idea is not merely “round every weight independently.” The idea is: Choose quantized weights so that the layer outputs stay close to the original outputs on representative calibration data. This is the important shift. You are not optimizing the weight values directly. You are optimizing the damage to behavior. Very rough objective:
WhereX comes from calibration inputs.
The method uses second-order information to understand which directions are more sensitive.
In plain English:
Some weight errors hurt a lot.
Some weight errors barely matter.
GPTQ tries to spend the quantization pain where the model can tolerate it.
Why people like GPTQ:
- It is post-training.
- You do not need to retrain the whole model.
- It often gives strong int4 quality.
- It is widely supported in open-model tooling.
Where GPTQ shines:
- Offline quantization pipeline
- Single model prepared carefully for deployment
- Weight-only quantization when you have representative calibration prompts
What to remember in one line:
GPTQ is not just rounding; it is error-aware rounding guided by calibration data.
3.5 AWQ¶
AWQ stands for Activation-aware Weight Quantization.
Here the key observation is beautiful.
A small weight is not necessarily unimportant.
Importance depends on activation magnitude too.
A tiny weight multiplied by a huge activation can matter more than a large weight multiplied by a tiny activation.
Let us do a toy example.
Suppose two channels contribute to an output.
Channel A:
weight = 0.20
activation = 20
Contribution:
0.20 × 20 = 4.0
Channel B:
weight = 1.20
activation = 0.5
Contribution:
1.20 × 0.5 = 0.6
See the trap?
If you only look at weight magnitude, Channel B looks more important.
If you look at actual effect on output, Channel A matters much more.
That is why activation-aware methods help.
AWQ uses representative activations to identify salient channels and protect them better during quantization.
In practice, people often find AWQ attractive because it gives strong quality-speed trade-offs for deployment.
What to remember in one line:
AWQ asks not only “how big is the weight?” but also “how much does this weight matter under real activations?”
3.6 What usually breaks first¶
Here is a practical truth. When you move from fp16 to int8, many tasks survive with tiny degradation. When you move from int8 to int4, the risk rises sharply. What breaks first? Usually not the average demo. Usually the edges. Examples: - exact JSON formatting - multilingual long-tail inputs - small but important classification distinctions - code generation consistency - math reasoning on rare patterns - multimodal models, which are often more quantization-sensitive than plain LLMs Why edges break first is intuitive. Average behavior has redundancy. Edge behavior often relies on fragile internal margins. Rounding error pushes those margins the wrong way. This is why evaluation cannot be only “looks fine on three prompts.” You need task-specific evals. Always.
3.7 How to choose in practice¶
Here is the simple field guide.
If you need a conservative quality drop¶
Choose int8 before int4. You lose less accuracy. You save less memory. Good default for enterprise caution.
If you need maximum compression for serving¶
Consider int4 with a proven method like GPTQ or AWQ. But only after evaluating on your actual workload.
If your model has channels with very different scales¶
Per-channel or group-wise quantization is safer than per-tensor.
If you are preparing a deployment artifact¶
GPTQ and AWQ are both serious options. The right choice depends on runtime support, calibration data, model family, and your quality target.
If someone says “we use GGUF”¶
Remember this carefully. GGUF is a format / packaging ecosystem, not the same thing as “the quantization algorithm.” Do not confuse file format with quantization method. That confusion shows up in interviews all the time.
If you are fine-tuning a quantized base model¶
You are entering QLoRA territory, not standard GPTQ deployment territory. That is Chapter 5. Before that, there is another huge memory bill you must understand. The KV cache.
Chapter 4: KV cache and inference optimization¶
4.1 Why weights are only half the story¶
Many engineers do this once. They quantize the weights. They load the model successfully. They feel victorious. Then they increase context length or concurrency. Then the service OOMs. Why? Because during autoregressive generation, the model stores keys and values for past tokens. That stored history is the KV cache. The cache grows with sequence length. It also grows with concurrent requests. So weight compression solves only part of the deployment problem. This is the line to store in your head: Weights are mostly fixed cost; KV cache is traffic-shaped cost. If traffic changes, KV memory changes. If context changes, KV memory changes. If concurrency changes, KV memory changes.
4.2 KV cache memory math¶
Very rough formula:
KV bytes ≈ batch_or_concurrency
× seq_len
× num_layers
× kv_heads
× head_dim
× 2 (K and V)
× bytes_per_value
kv_heads = num_heads.
For GQA, kv_heads is smaller.
That matters a lot.
Let us do a concrete 70B-style example.
Assume:
- num_layers = 80
- num_query_heads = 64
- num_kv_heads = 8 (GQA)
- head_dim = 128
- seq_len = 8192
- bytes_per_value = 2 (bf16 or fp16 cache)
- concurrency = 1 request
Now calculate.
Per request.
Read that again.
About 2.5GB per 8K request even with GQA.
Now concurrency 8.
2.5GB × 8 ≈ 20GB
Now concurrency 16.
2.5GB × 16 ≈ 40GB
So a model whose weights fit after quantization can still fail under traffic because the KV cache multiplies with live requests.
This is one of the most important production intuitions in LLM serving.
4.3 MQA¶
MQA means Multi-Query Attention. Many query heads. But only one shared key head and one shared value head. Visual:
The benefit is obvious. KV cache becomes dramatically smaller because you are not storing separate K and V per query head. Relative memory compared with full multi-head KV cache: - MHA with 64 KV heads -> baseline 1.0× - GQA with 8 KV heads -> 1/8× - MQA with 1 KV head -> 1/64× So for cache memory, MQA is fantastic. The trade-off is representational flexibility. Sharing K and V so aggressively can hurt quality in some settings. That is why GQA became popular as a middle ground.4.4 GQA¶
GQA means Grouped-Query Attention. It says: Many query heads can share one KV head within a group. Visual:
So instead of 64 KV heads, maybe you keep only 8. You save 8× KV memory relative to classic multi-head cache. But you preserve more flexibility than MQA. That is why modern production models often use GQA. It is a lovely engineering compromise. Less memory pain. Little quality pain.4.5 PagedAttention¶
Now the allocator problem. Suppose request A is 400 tokens. Request B is 6000 tokens. Request C is still decoding slowly. If you reserve one big contiguous KV block per request, memory gets fragmented and wasteful. PagedAttention treats KV cache more like virtual memory pages in an operating system. Instead of demanding one giant continuous block, it stores cache in fixed-size pages or blocks. Benefits: - less fragmentation - better memory utilization - easier sharing of common prefixes - better throughput under many requests with different lengths The intuition is straight from operating systems. Do not require a perfectly continuous mansion. Allow memory to be managed in reusable flats. That is PagedAttention. This is why vLLM became such a big deal. Not because attention math changed. Because memory management changed. And memory management is performance in LLM serving.
4.6 Serving mental models¶
Here are four mental models that will save you pain.
Mental model 1 — prefill vs decode¶
During prefill, you ingest the prompt. Compute load is heavy. During decode, you generate token by token. KV cache is already there. Memory pressure dominates more clearly. So throughput tuning is not one knob. Prefill and decode behave differently.
Mental model 2 — quantized weights do not automatically mean quantized KV cache¶
Very important. Many runtimes keep the KV cache in fp16 or bf16 even when the weights are int4. So you may save huge weight memory and still pay big cache memory. This is the reason people say “the model fits, but our long context serving still dies.”
Mental model 3 — concurrency is multiplication¶
If one 8K request uses 2.5GB KV cache, then 10 such requests are not philosophically interesting. They are about 25GB. That is the difference between smooth service and pager duty.
Mental model 4 — the winning deployment is not always the smartest model¶
Suppose model A is slightly better than model B. But model A requires 2 GPUs and has lower concurrency. Model B fits on 1 GPU, is 30% cheaper, and still passes your eval threshold. In production, model B may be the correct decision. That is not compromise. That is engineering.
Chapter 5: Parameter-efficient fine-tuning¶
5.1 Why full fine-tuning hurts¶
Let us say the base model already knows English, code, and general world structure. You do not want to relearn all of language. You want it to behave better for your task. Full fine-tuning updates every parameter. That is expensive. Why expensive? Because training memory is not just weights. You also need: - gradients - optimizer states - activations for backward pass - framework overhead A rough rule of thumb for Adam-style full fine-tuning is that training memory can be many times larger than raw weight memory. So a model that fits for inference may still be impossible to fine-tune fully on your hardware. This is why PEFT exists. PEFT = Parameter-Efficient Fine-Tuning. Idea: Keep most of the model frozen. Train only a small number of additional parameters. The star of this family is LoRA.
5.2 LoRA¶
LoRA means Low-Rank Adaptation.
Suppose a weight matrix is W.
Instead of updating all of W, LoRA learns a low-rank update ΔW.
Formula:
W_base is frozen
- A has shape d x r
- B has shape r x k
- r is small, like 8, 16, 32, or 64
The key is that r is much smaller than d and k.
So instead of learning all d x k values, you learn only d x r + r x k values.
That can be dramatically smaller.
Concrete parameter example¶
Take one square matrix of size 4096 x 4096.
Full update size:
4096 × 4096 = 16,777,216 parameters
LoRA with rank r = 16:
4096 × 16 + 16 × 4096 = 131,072 parameters
Now compare.
131,072 / 16,777,216 ≈ 0.0078
So LoRA is about 0.78% of the full matrix parameter count in this example.
That is why it is so attractive.
You are not repainting the whole building.
You are adding the overlay sketch from the ELI5 story.
5.3 The low-rank picture¶
Here is the visual.
Full matrix update
Delta W (4096 x 4096)
Too large to train directly on small hardware.
Approximate it as:
A (4096 x r) @ B (r x 4096)
where r is small:
r = 8 -> very thin update
r = 16 -> common practical choice
r = 32 -> more capacity
r = 64 -> even more capacity, more memory/compute
Where LoRA is usually attached¶
Common targets:
- attention projection matrices (q_proj, k_proj, v_proj, o_proj)
- MLP projection layers
Some recipes attach LoRA only to Q and V.
Some attach to more linear layers.
More targets = more capacity.
More targets = more trainable parameters.
Same trade-off again.
5.4 QLoRA¶
QLoRA combines two ideas. 1. Store the base model in 4-bit form. 2. Train LoRA adapters on top. So the base stays frozen and compressed. The trainable part stays small. That is why QLoRA made fine-tuning on limited hardware dramatically more accessible. One careful phrasing. People sometimes overstate the hardware headline. A better way to say it is: QLoRA lets you fine-tune models that would be impractical to full-fine-tune on the same hardware, because the frozen base is 4-bit and only the adapters receive gradient updates. In the original work, NF4 quantization and memory tricks like paged optimizers mattered. You do not need to memorize all kernel details today. But you must understand the architecture. - frozen quantized base - small trainable adapters - backprop only through the small adapter weights That is the reason the memory bill drops so much.
Rule-of-thumb memory picture¶
For a 7B model: - fp16 base weights ≈ 14GB raw - 4-bit base weights ≈ 3.5GB raw, maybe ~5–6GB effective with overhead - LoRA adapters = comparatively tiny So on a 24GB GPU, QLoRA can be feasible where full fine-tuning is not. Again, sequence length and activations still matter. The base weights are not the whole bill. Never forget activations during training.
5.5 Other adapter methods¶
LoRA is famous. But it is not alone. Here is the family picture.
Prompt tuning¶
Learn a small set of soft prompt vectors at the input. Very lightweight. Less expressive for deeper behavior change.
Prefix tuning¶
Learn trainable prefix vectors that influence attention as if extra virtual tokens were prepended. More expressive than just soft prompts. Still lighter than full fine-tuning.
Classic adapters¶
Insert small trainable modules between frozen layers. More architectural changes. Still PEFT.
IA3 and related methods¶
Learn small scaling vectors that modulate existing activations or weights. Even smaller than LoRA in some cases. What to remember: - LoRA is the most common mental model - prompt/prefix tuning are lighter but narrower - classic adapters add explicit modules - all are attempts to change behavior without updating every base weight
5.6 Full fine-tune vs PEFT vs prompt vs RAG¶
This is the decision table people actually need. | Situation | Best first move | Why | |---|---|---| | Model already knows the knowledge; output style is the problem | Better prompt | Cheapest lever | | Knowledge is private or changes weekly | RAG | Do not bake fresh knowledge into weights | | Need stable format / tone / domain behavior across many requests | LoRA / PEFT | Behavior change repeats, data is stable | | Need deep capability shift and you have lots of high-quality data + budget | Full fine-tune | Maximum capacity, maximum cost | | Need limited-hardware fine-tuning | QLoRA | Quantized frozen base + small adapters | This is a crucial bridge to the next module. Do not use fine-tuning to solve a freshness problem. That is what RAG is for. Do not use RAG to solve a stable behavior-formatting problem if the model must always respond in a specific structured way and you have enough examples. That is where PEFT may help. And do not fine-tune before checking whether a stronger prompt already solves it. Senior behavior is often “try the cheapest lever first.”
5.7 Retrieval prompts¶
Use these when you want to pull the whole chapter back into memory quickly. 1. “A 70B model in fp16 does not fit on one 80GB GPU. Walk me through every memory term I should count before deciding on quantization.” 2. “Why is bf16 usually preferred to fp16 for training even though fp16 has more mantissa bits?” 3. “Show me with actual numbers why per-tensor quantization can destroy a small channel and why per-channel fixes it.” 4. “Explain GPTQ vs AWQ in plain engineering language: what signal do they use to decide what errors are acceptable?” 5. “When should I use LoRA, when QLoRA, when full fine-tuning, and when should I stop and use RAG instead?” If you can answer those five from memory, the module is sitting in your head properly.
5.8 Honest admission¶
Let us be honest. After this module, you should be able to reason well about quantization and fine-tuning trade-offs. But you should not pretend to be an expert in everything. For example, this module does not make you an expert in: - writing custom CUDA quantization kernels from scratch - deriving GPTQ second-order updates mathematically from first principles - implementing FlashAttention internals at kernel level - proving exactly when low-rank adaptation will match full fine-tuning - guaranteeing that int4 quality loss will be negligible for every task That is fine. Senior credibility comes from knowing both what you know and what you do not know. Your honest claim after this module is stronger and cleaner: I can do the memory math, explain the major methods, choose the right lever for a product situation, and design the evals that tell us whether the trade-off is acceptable. That is already very useful.
Chapter 6: Recap and application¶
6.1 Failure-fix chain¶
This table is the whole module compressed. | Failure | Symptom | Fix | Why it helps | |---|---|---|---| | 70B fp16 weights do not fit | Model will not load on target GPU | Quantize to int8 or int4 | Fewer bits per parameter | | fp16 training is numerically unstable | overflow / underflow / loss scaling pain | Use bf16 | Wider exponent range | | One global scale crushes small channels | small values become zero | Per-channel or group-wise quantization | Local scales preserve structure | | Naive rounding hurts accuracy too much | eval score drops sharply | GPTQ | Minimizes output reconstruction error on calibration data | | Weight magnitude alone misses salient channels | some important paths degrade badly | AWQ | Uses activation information to protect important weights | | Model fits at load time but OOMs under long context | service crashes under real traffic | Count KV cache separately | Serving memory grows with seq length and concurrency | | KV cache is too large with standard multi-head setup | poor concurrency | GQA or MQA | Share K/V across query heads | | KV allocation wastes memory across variable request lengths | fragmentation, throughput loss | PagedAttention | Page-based cache management improves utilization | | Full fine-tuning is too expensive | training memory explodes | LoRA | Train only low-rank updates | | Even LoRA on full-precision base is too heavy | consumer GPU cannot cope | QLoRA | Frozen 4-bit base + tiny trainable adapters | | Team wants fresh proprietary knowledge in answers | model is stale or ignorant of private docs | RAG | Retrieve data at inference instead of baking it into weights | If this table is automatic in your head, you are in good shape.
6.2 Key points to remember¶
- Quantization is not magic. It is controlled information loss.
- Bits are budget. Fewer bits reduce memory and often improve throughput.
- fp16 vs bf16 is not “which is more modern?” It is “precision near a value vs safe dynamic range across values.”
- bf16 is usually better for training because range matters a lot for numerical stability.
- int8 is safer. int4 is more aggressive.
- Per-channel usually beats per-tensor because channels rarely share the same scale naturally.
- GPTQ is calibration-based error-aware quantization.
- AWQ is activation-aware quantization.
- Quantized weights do not automatically solve KV cache growth.
- Concurrency multiplies KV cache memory.
- GQA is a very important production compromise.
- PagedAttention is basically smart cache memory management.
- Full fine-tuning is expensive because training stores much more than just weights.
- LoRA learns a low-rank overlay instead of rewriting the whole base.
- QLoRA combines a quantized frozen base with trainable adapters.
- Fine-tune changes behavior. RAG changes knowledge access. Prompting changes instruction framing. Different tools.
6.3 Important interview questions¶
Here are strong interview-style prompts. 1. Why is bf16 usually preferred to fp16 for training? - Good answer mentions exponent range, stability, overflow/underflow, reduced need for loss scaling. 2. Why does per-channel quantization often outperform per-tensor quantization? - Good answer mentions heterogeneous channel scales and the small-channel-crushing problem. 3. GPTQ vs AWQ — what is the conceptual difference? - Good answer says GPTQ optimizes output reconstruction on calibration data; AWQ uses activation importance to protect salient weights/channels. 4. If a 70B int4 model fits on one 80GB GPU, why might the service still fail at 8K or 16K context? - Good answer mentions KV cache growth with sequence length and concurrency. 5. What does GQA buy you relative to standard multi-head attention? - Good answer mentions a big KV cache reduction with limited quality loss. 6. When would you choose LoRA over full fine-tuning? - Good answer mentions limited hardware, smaller task adaptation, multi-tenant adapters, faster iteration. 7. When is fine-tuning the wrong tool? - Good answer mentions fresh/private knowledge problems where RAG is better.
6.4 Production experience with memory calculations¶
Now let us do three realistic back-of-the-envelope calculations.
Scenario A — Serving a 70B model on one 80GB GPU¶
Assume effective int4 weight footprint is 42GB after runtime overhead for quantized weights.
Assume runtime scratch and allocator overhead take another 5GB.
So fixed memory is:
42GB + 5GB = 47GB
Remaining memory on an 80GB GPU:
80GB - 47GB = 33GB
Suppose each 8K request uses about 2.5GB KV cache with GQA.
Then max theoretical concurrent 8K requests is roughly:
33GB / 2.5GB ≈ 13
That is before safety margin.
So you would not promise 13.
You might promise something like 8–10 depending on jitter, batching, and actual runtime behavior.
This is how a Lead thinks.
Not with one neat number.
With headroom.
Scenario B — What if context doubles?¶
Same model.
Same hardware.
Now context goes from 8K to 16K.
KV cache roughly doubles.
So per request becomes about 5GB.
Now:
33GB / 5GB ≈ 6
Your concurrency roughly halves.
Same weights.
Same GPU.
Only context changed.
This is why product requirements and infra planning are linked.
Scenario C — Choosing a fine-tuning method on a 24GB GPU¶
Suppose you have a 7B model.
fp16 base weights are about 14GB raw.
Full fine-tuning with Adam-style optimizer, gradients, and activations can easily push you well beyond 24GB.
So full fine-tuning is risky or impossible.
LoRA on fp16 might work with careful settings on some setups.
QLoRA is even safer on limited hardware because the frozen base is 4-bit and the adapters are small.
So the good engineering answer is not “full fine-tune because it is strongest.”
The good answer is “pick the method that your hardware can actually support while still meeting eval targets.”
6.5 Apply now — exercises¶
Easy¶
- A 13B model in fp16 uses how much raw memory for weights?
- The same 13B model in int8 uses how much raw memory?
- Why can bf16 be better for training even with fewer mantissa bits than fp16?
Medium¶
- A weight row is
[0.05, -0.08, 0.11, -0.14]and another is[1.4, -1.7, 2.1, -2.8]. Explain why a single per-tensor scale is dangerous. - A 70B-style model has 80 layers, 8 KV heads, head dim 128, context 4096, bf16 cache. Estimate per-request KV cache size.
- For a
4096 x 4096matrix with LoRA rank 32, how many trainable parameters are added?
Hard¶
- Your company wants a support bot that must answer from fresh internal policy docs and also follow a strict JSON schema. Which parts should be solved by prompt, which by RAG, and which by PEFT if needed?
- You have an 80GB GPU, a 70B int4 model, and an 8K latency target. What measurements do you run before declaring the architecture production-ready?
- Explain why “the model fits on the GPU” is not sufficient as a deployment statement. If you can answer all nine clearly, you are no longer thinking at glossary level. You are thinking at system-design level.
6.6 Foundation-gap audit for Module 07¶
Module 08_rag_system_design quietly assumes three things from this module.
If these are weak, RAG decisions become sloppy.
Assumption 1 — model serving basics¶
You should already understand: - weight memory vs KV cache memory - why context and concurrency change serving cost - why inference optimization is not only about model size If not, revisit Chapter 4.
Assumption 2 — memory constraints of production¶
You should be able to do rough calculations like: - 7B fp16 vs int8 vs int4 - 70B on 80GB with long context - why a model that loads may still fail under concurrent traffic If not, revisit Chapters 1, 2, and 4.
Assumption 3 — when to fine-tune vs prompt vs RAG¶
You should already know: - prompt first for instruction/format improvements - PEFT for stable behavior changes with data - RAG for fresh or private knowledge If not, revisit Chapter 5. This matters because Module 07 is not just “what is a vector DB?” It is also “why are we choosing retrieval instead of changing the weights?” That decision depends directly on this module.
6.7 Bridge to the next module¶
Next module — 08_rag_system_design — addresses the other big deployment challenge: the model's knowledge is frozen at training time. RAG lets it access fresh, private data without retraining.
That is the bridge.
This module taught you how to make the model fit and how to adapt its behavior efficiently.
The next module teaches you how to make the model know the right current information at inference time.
Store that distinction carefully.
It is one of the cleanest mental separations in AI engineering.