12. QLoRA: compressed base, tiny trainable overlay
~12 min read. The trick that lets a large model fit before training even starts.
Builds on the ELI5 in 00-eli5.md. The overlay sketch (the thin custom layer) and the field notes (the compressed weights) work together here: you carry compact field notes to the site, place the overlay sketch on top, and never alter the base.
1) Why plain LoRA is not enough on small hardware
LoRA saves trainable parameters.
Good.
But plain LoRA still keeps the base model in regular precision.
That is the catch.
Suppose you want a 7B model.
In fp16, raw weight memory is about 14GB.
Now add runtime overhead.
Now add activations.
Now add gradients for adapters.
Now add optimizer state for adapters.
A 24GB GPU can feel tight very quickly.
So what to do?
Freeze the base model.
Quantize the frozen base to 4-bit.
Then train only LoRA adapters on top.
That combination is QLoRA.
See the picture.
training time
┌────────────────────────────────────────────────────┐
│ frozen 4-bit base = the field notes │
│ small LoRA adapters = the overlay sketch │
│ gradients flow only through adapters │
└──────────────────────────────┬─────────────────────┘
│
▼
fit a larger base on one GPU
2) The memory picture that makes QLoRA click
Start with raw weight math.
A 7B model in fp16 uses roughly:
7B × 2 bytes = 14GB
That is only the raw weight bill.
Not the whole training bill.
Now quantize that frozen base to 4-bit.
Raw math becomes:
7B × 0.5 bytes = 3.5GB
In practice, the effective footprint is often more like 3.5GB to 5GB.
Why the range?
Because metadata, scales, packing, and kernels add overhead.
Still, it is far smaller than 14GB.
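The weight arithmetic above, and one source of the 3.5GB to 5GB range, can be sketched in a few lines. This is a rough estimate, not a measurement: it assumes block-wise quantization in which every block of 64 weights stores one 32-bit scale. `block_size` and `scale_bits` are illustrative parameters, and real runtimes add packing and kernel overhead on top.

```python
def quantized_weight_gb(n_params, bits_per_weight=4, block_size=64, scale_bits=32):
    """Rough footprint of block-quantized weights in GB (1 GB = 1e9 bytes).

    Assumption: each block of `block_size` weights carries one scale of
    `scale_bits` bits, so the scale cost is amortised across the block.
    """
    bits_per_param = bits_per_weight + scale_bits / block_size
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9

fp16_gb = N_PARAMS * 16 / 8 / 1e9               # 2 bytes per weight
raw_4bit_gb = N_PARAMS * 4 / 8 / 1e9            # 0.5 bytes per weight
with_scales_gb = quantized_weight_gb(N_PARAMS)  # 4.5 bits per weight under the assumption above

print(f"fp16 weights:         {fp16_gb:.2f} GB")
print(f"raw 4-bit weights:    {raw_4bit_gb:.2f} GB")
print(f"4-bit + block scales: {with_scales_gb:.2f} GB")
```

Under these assumptions the scales alone push 3.5GB to roughly 3.9GB, which is why the raw number is a floor and not the real footprint.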
Now add tiny adapters.
Suppose your LoRA adapters total 100MB to 300MB.
Now add activations and temporary buffers.
A 24GB GPU now becomes realistic.
Worked example.
Imagine this rough budget on one 24GB card.
| Item | Plain LoRA (fp16 base) | QLoRA (4-bit base) |
|---|---:|---:|
| Frozen 7B base | 14GB | 4.2GB |
| LoRA adapters | 0.2GB | 0.2GB |
| Activations + misc | 10GB | 10GB |
| Total | 24.2GB | 14.4GB |
This is not a universal formula.
It is a planning picture.
Look at the contrast.
Plain fp16 base plus activations would already crowd the card.
QLoRA leaves breathing room.
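The same planning picture can be re-added in code so you can swap in your own numbers. This is a minimal sketch using the rough figures from the table above. The names `SHARED_COSTS_GB` and `total_gb` are made up for illustration, and the 10GB activation figure is a placeholder, not a measurement.

```python
GPU_BUDGET_GB = 24.0

# Costs that are the same whether the base is fp16 or 4-bit (rough planning numbers).
SHARED_COSTS_GB = {
    "LoRA adapters": 0.2,
    "Activations + misc": 10.0,
}


def total_gb(base_gb):
    """Frozen base weights plus the shared training costs."""
    return base_gb + sum(SHARED_COSTS_GB.values())


plain_lora_gb = total_gb(14.0)  # fp16 base
qlora_gb = total_gb(4.2)        # 4-bit base, including quantization overhead

for name, used in (("plain LoRA", plain_lora_gb), ("QLoRA", qlora_gb)):
    headroom = GPU_BUDGET_GB - used
    verdict = "fits" if headroom > 0 else "does not fit"
    print(f"{name:>10}: {used:.1f} GB used, {headroom:+.1f} GB headroom ({verdict})")
```

With these inputs the fp16 base overshoots the 24GB card slightly, while the 4-bit base leaves close to 10GB free for longer sequences or larger batches.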
So when someone says, "LoRA is small," ask the next question.
"Small compared to what?"
If the base still dominates memory, plain LoRA is not the full answer.
QLoRA is.
3) What actually receives gradients?
Only the adapters receive gradient updates.
This point matters.
The base model participates in forward and backward computation.
But its 4-bit weights are frozen.
They are dequantized on the fly to a higher-precision compute type for each matrix multiply, and nothing writes back to them.
You do not optimize them directly.
You backprop through the graph so the LoRA weights can learn.
That keeps optimizer state tiny.
That keeps trainable memory tiny.
That keeps the update focused.
See the flow.
input tokens
│
▼
[frozen 4-bit base]
│
├──────────────► activations
│
└──► [LoRA adapter] ──► loss
▲
│
gradients update only here
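The flow above can be made concrete with a toy layer in plain Python. This is a sketch under simplifying assumptions, not how a real framework implements it: a 2x2 matrix `W` stands in for the dequantized frozen base, the adapter is rank 1, and gradients are written out by hand for a squared-error loss. All names are illustrative.

```python
def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_sgd_step(W, A, B, x, target, lr=0.05):
    """One SGD step on L = 0.5 * ||y - target||^2 with y = W x + B (A x).

    W is only read. Gradients are computed and applied for A and B alone.
    """
    h = matvec(A, x)  # adapter down-projection, length r
    y = [base + delta for base, delta in zip(matvec(W, x), matvec(B, h))]
    err = [yi - ti for yi, ti in zip(y, target)]  # dL/dy
    loss = 0.5 * sum(e * e for e in err)

    grad_B = [[e * hj for hj in h] for e in err]  # dL/dB = err h^T
    grad_h = [sum(B[i][j] * err[i] for i in range(len(B))) for j in range(len(h))]  # B^T err
    grad_A = [[g * xk for xk in x] for g in grad_h]  # dL/dA = (B^T err) x^T

    new_A = [[a - lr * g for a, g in zip(ra, rg)] for ra, rg in zip(A, grad_A)]
    new_B = [[b - lr * g for b, g in zip(rb, rg)] for rb, rg in zip(B, grad_B)]
    return new_A, new_B, loss


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (stand-in for the dequantized 4-bit base)
A = [[0.5, -0.5]]             # LoRA down-projection, rank 1
B = [[0.0], [0.0]]            # LoRA up-projection, zero-initialised so the adapter starts as a no-op
x, target = [1.0, 2.0], [2.0, 1.0]

W_before = [row[:] for row in W]
losses = []
for _ in range(200):
    A, B, loss = lora_sgd_step(W, A, B, x, target)
    losses.append(loss)

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.6f}")
print("base unchanged:", W == W_before)
```

The error signal still passes through the layer that contains `W`, but no gradient is ever stored or applied for `W`. Only `A` and `B` would need optimizer state.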
4) NF4 and paged optimizers are the practical enablers
Two ideas made QLoRA especially useful.
First, NF4.
NF4 stands for 4-bit NormalFloat.
The short intuition is enough here.
Pretrained weights are not spread evenly across their range.
They cluster around zero, roughly like a normal distribution.
NF4 places its 16 levels to match that shape: dense near zero, sparse in the tails.
So it usually preserves more useful signal than a naive equally spaced int4 scheme.
It is still quantization.
So the rounding error still exists.
But it is a better-shaped compromise.
Second, paged optimizers.
Training can create memory spikes.
Especially with long sequences.
Paged optimizers let optimizer state spill from GPU memory to CPU memory during a spike and page back in when it is needed.
So you avoid some ugly out-of-memory crashes.
Does this mean QLoRA solves everything?
No.
Sequence length still matters.
Batch size still matters.
Activation memory still matters.
The base weights are not the whole bill.
Say this sentence out loud.
"Quantizing weights is not the same as quantizing the entire training problem."
Yes?
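The NF4 intuition above can be tested on synthetic data. The sketch below is not the real NF4 table. It builds a simplified 16-level codebook from evenly spaced normal quantiles and compares it with an equally spaced 16-level grid on Gaussian samples. For simplicity it normalises the whole tensor by one absolute maximum, where real implementations work block by block. The function names and the 0.02 standard deviation are illustrative assumptions.

```python
import random
from statistics import NormalDist


def quantile_codebook(n_levels=16):
    """Levels at evenly spaced normal quantiles, rescaled to [-1, 1].

    A simplified stand-in for a normal-shaped codebook, not the actual NF4 values.
    """
    nd = NormalDist()
    raw = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def uniform_codebook(n_levels=16):
    """Equally spaced levels covering [-1, 1]."""
    return [-1.0 + 2.0 * i / (n_levels - 1) for i in range(n_levels)]


def quantize_mse(values, codebook):
    """Absmax-normalise, snap each value to its nearest level, return MSE in original units."""
    absmax = max(abs(v) for v in values)
    total = 0.0
    for v in values:
        nearest = min(codebook, key=lambda level: abs(level - v / absmax))
        total += (nearest * absmax - v) ** 2
    return total / len(values)


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(4096)]  # synthetic, roughly weight-like values

mse_quantile = quantize_mse(weights, quantile_codebook())
mse_uniform = quantize_mse(weights, uniform_codebook())

print(f"quantile-shaped codebook MSE: {mse_quantile:.3e}")
print(f"equally spaced codebook MSE:  {mse_uniform:.3e}")
```

On normal-shaped data like this, the quantile-shaped grid spends its levels where most values live, so its error comes out lower. Real weights are only approximately normal, so treat this as intuition and measure on your own model.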
5) When QLoRA shines, and when to be careful
QLoRA shines when hardware is limited and the task shift is moderate.
It is excellent for domain adaptation on one or a few GPUs.
It is excellent when the base model is already competent.
It is excellent when you need repeated behavior, not fresh weekly facts.
But be careful in four cases.
First, very long context.
Even with a tiny base footprint, activations can dominate.
Second, weak runtime support.
If kernels are poor, memory wins may not convert into speed wins.
Third, unrealistic expectations.
QLoRA does not magically turn a weak base into a frontier model.
Fourth, aggressive compression on fragile tasks.
Some tasks tolerate 4-bit well.
Some do not.
Test on your workload.
Not someone else's benchmark.
That is the adult rule.
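The long-context caution can be seen with a deliberately crude estimate. The sketch below assumes activation memory grows linearly with batch size, sequence length, hidden size, and layer count. `saved_per_layer` is a made-up multiplier for how many hidden-sized tensors each layer keeps for the backward pass, and the default shape (hidden size 4096, 32 layers) is assumed to be typical of a 7B Llama-style model. It ignores attention score matrices and gradient checkpointing, both of which can change the picture a lot.

```python
def activation_gb(batch, seq_len, hidden=4096, n_layers=32,
                  bytes_per_value=2, saved_per_layer=16):
    """Crude linear estimate of saved activations in GB (1 GB = 1e9 bytes).

    `saved_per_layer` is a hypothetical fudge factor, not a measured constant.
    """
    values = batch * seq_len * hidden * n_layers * saved_per_layer
    return values * bytes_per_value / 1e9


BASE_4BIT_GB = 4.2  # rough 4-bit base footprint from the planning table

for seq_len in (512, 2048, 8192):
    act = activation_gb(batch=1, seq_len=seq_len)
    print(f"seq_len={seq_len:>5}: activations ~{act:5.1f} GB vs base {BASE_4BIT_GB} GB")
```

Even under this simple model, the fixed 4-bit base stops being the largest item once the sequence gets long. That is the sense in which activations can dominate.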
Where this lives in the wild
- Hugging Face Transformers + bitsandbytes — practitioners load 4-bit frozen bases and train LoRA adapters on a single workstation GPU.
- Unsloth — fine-tuning engineers use optimized QLoRA recipes to squeeze longer sequences and faster training into prosumer hardware.
- Predibase — platform teams offer adapter-style fine-tuning so customers avoid storing and retraining full model copies.
- Databricks Mosaic AI — enterprise builders choose memory-efficient adapter training for domain tasks that must fit practical GPU budgets.
- NVIDIA NeMo — ML platform teams combine quantized loading and PEFT methods to adapt large models under hardware limits.
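For the first entry in the list, a typical setup looks roughly like the sketch below. It is a starting point under stated assumptions, not a tuned recipe. `build_qlora_model` is an illustrative wrapper name, `model_id` is whatever checkpoint you choose, the rank and dropout values are arbitrary, and the `target_modules` names assume a Llama-style architecture. Imports are kept inside the function so the file can still be loaded on a machine without a GPU or these packages.

```python
def build_qlora_model(model_id: str, lora_rank: int = 16):
    """Load a frozen 4-bit base and wrap it with trainable LoRA adapters.

    Requires torch, transformers, peft, and bitsandbytes, plus a supported GPU.
    """
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

    # Store the base in 4-bit NF4, compute in bfloat16.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    base = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        device_map="auto",
    )
    base = prepare_model_for_kbit_training(base)

    lora_config = LoraConfig(
        r=lora_rank,
        lora_alpha=2 * lora_rank,             # illustrative choice, tune for your task
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # assumption: Llama-style layer names
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, lora_config)
    model.print_trainable_parameters()        # reports trainable vs total parameter counts
    return model
```

Paged optimizers are usually selected on the trainer side. In Transformers, for example, `TrainingArguments` accepts `optim="paged_adamw_8bit"`. Check the option names against the versions you have installed.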
Pause and recall
- Why can plain LoRA still fail on a 24GB GPU?
- What changes in memory when the frozen base moves from fp16 to 4-bit?
- Why do gradients update only the adapter weights?
- Why do sequence length and activations still matter after quantization?
Interview Q&A
Q1. Why QLoRA rather than plain LoRA when hardware is tight?
A. Because plain LoRA shrinks the trainable update but still keeps a large regular-precision base in memory, while QLoRA compresses that frozen base as well.
Common wrong answer to avoid: "QLoRA is just faster LoRA with no accuracy trade-off to consider."
Q2. Why freeze the 4-bit base instead of training all quantized weights directly?
A. Because the practical memory win comes from keeping the base compressed and fixed while learning only small higher-precision adapter weights.
Common wrong answer to avoid: "Because 4-bit weights cannot participate in backpropagation at all."
Q3. Why NF4 rather than naive uniform int4?
A. Because NF4 better matches the observed weight distribution, so the same 4-bit budget preserves more useful signal.
Common wrong answer to avoid: "NF4 is better only because it uses larger files."
Q4. Why paged optimizers rather than just a bigger GPU?
A. Because memory spikes from optimizer state and long sequences still matter in real systems, and software tricks can make existing hardware usable sooner and cheaper.
Common wrong answer to avoid: "Paged optimizers remove activation memory, so sequence length stops mattering."
Apply now (5 min)
Exercise.
Take a 7B base model.
Write the raw fp16 weight memory.
Then write the raw 4-bit weight memory.
Now add a rough 0.2GB adapter budget.
Then write one line on why that changes feasibility on a 24GB GPU.
Sketch from memory.
Draw one box for the field notes.
Draw one small side box for the overlay sketch.
Then draw a gradient arrow only into the adapter box.
Finally, write under the picture: "Base frozen. Adapter learns. Activations still cost."
Bridge. We now know how to adapt cheaply. But cheap adaptation is still only one lever. The next question is harder: when should we prompt, retrieve, adapt, or fully fine-tune? → 13-peft-vs-rag-decision.md