11. LoRA: thin adapters, not full rewrites

~12 min read. The trick that changes behavior without redrawing the whole blueprint.

Built on the ELI5 in 00-eli5.md. The overlay sketch — a thin transparent layer placed over the frozen blueprint with just the custom changes for this building — is exactly how LoRA stores useful edits cheaply.

1) Why not update the whole matrix?

Full fine-tuning updates every weight. That sounds flexible. It is also expensive. A big transformer has many large matrices. If you update all of them, gradient and optimizer-state memory explode.

Look at one matrix only. Suppose a projection weight is 4096 × 4096. That is already huge: 4096 × 4096 = 16,777,216 trainable values. So one full update for one matrix means about 16.8 million numbers. Now multiply that across many layers. Now add gradients. Now add optimizer state. Now remember the site constraint. GPU memory is not a polite suggestion. It is a hard wall.

So what to do? Keep the original weight frozen. Do not redraw the entire blueprint. Add only the task-specific change. That is the mental move. LoRA says the useful edit is often much smaller than the full matrix. Simple, no? You keep the base knowledge. You learn only the deviation. That deviation is your overlay sketch.
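To put rough numbers on "now add gradients, now add optimizer state", here is a back-of-envelope sketch for one 4096 × 4096 matrix. The fp32 storage and the Adam-style optimizer with two extra buffers per weight are assumptions for illustration, not something the text fixes.

```python
# Rough training-memory estimate for fully fine-tuning ONE 4096 x 4096 matrix.
# Assumptions (not from the text): fp32 storage (4 bytes per value) and an
# Adam-style optimizer that keeps two extra buffers per trainable weight.

BYTES_PER_VALUE = 4          # fp32, assumed
ADAM_BUFFERS_PER_WEIGHT = 2  # first and second moment estimates, assumed


def full_update_memory_bytes(rows: int, cols: int) -> dict:
    """Return the value count and byte counts for weights, gradients, optimizer."""
    n = rows * cols
    weights = n * BYTES_PER_VALUE
    gradients = n * BYTES_PER_VALUE
    optimizer = n * BYTES_PER_VALUE * ADAM_BUFFERS_PER_WEIGHT
    return {
        "params": n,
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


report = full_update_memory_bytes(4096, 4096)
for name, value in report.items():
    if name == "params":
        print(f"{name:>10}: {value:,} values")
    else:
        print(f"{name:>10}: {value / 2**20:.0f} MiB")
```

Under these assumptions one matrix costs 64 MiB of weights, 64 MiB of gradients, and 128 MiB of optimizer state, so 256 MiB in total. That is before activations, and before multiplying across layers.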

2) Picture first: wide to thin to wide

Before the formula, see the picture. The original matrix is wide. LoRA does not learn another full wide matrix. It learns a narrow bridge. Then it expands back.

input h
   │
   ├──────────────►┌──────────────────────┐
   │               │ frozen W: 4096×4096  │
   │               └──────────┬───────────┘
   │                          │
   │                          ▼
   └─►┌──────────────┐  ┌──────────────┐  scaled and added
      │ A: 4096×16   │→ │ B: 16×4096   │ ───────────────► ΔW(h)
      └──────┬───────┘  └──────┬───────┘
             │                 │
             └── wide → thin bridge → wide ───────────────┘

See it carefully. A squeezes from many dimensions into rank r. B expands back to the original width. So LoRA learns a low-rank update. The common equation is:

ΔW = (alpha / r) * A @ B

Here A is d × r, B is r × k, and alpha is a fixed scaling constant. And r is small. Usually 8, 16, 32, or 64. So the trainable path is tiny compared with the full d × k.

Why might this work? Because the task-specific shift is often low-dimensional. Meaning what? The base model already knows language, syntax, and broad reasoning. Your new task may need only a narrow steering signal. Maybe legal tone. Maybe telecom jargon. Maybe stable JSON output. Maybe one support workflow. That extra behavior can often sit inside a small subspace. So the frozen blueprint stays. The overlay sketch captures only the adjustment.
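The wide → thin → wide path can be sketched in a few lines of NumPy. This is a minimal illustration, not a training implementation. The toy sizes, the random values, and the zero initialization of B (a common choice so the adapter starts as a no-op) are assumptions, not details from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes so this runs instantly; the text uses d = k = 4096 and r = 16.
d, k, r, alpha = 64, 64, 4, 8

W = rng.normal(size=(d, k))               # frozen base weight, never updated
A = rng.normal(scale=0.01, size=(d, r))   # wide -> thin
B = np.zeros((r, k))                      # thin -> wide, assumed to start at zero


def lora_forward(h, W, A, B, alpha, r):
    """Frozen path plus the scaled low-rank path: h @ W + (alpha / r) * h @ A @ B."""
    base = h @ W
    delta = (alpha / r) * ((h @ A) @ B)   # never materializes the d x k update
    return base + delta


h = rng.normal(size=(2, d))               # a batch of two input vectors

# With B at zero the overlay is blank, so the output equals the frozen model.
out_start = lora_forward(h, W, A, B, alpha, r)

# Stand-in for "learned" values: random numbers, purely for illustration.
B_trained = rng.normal(scale=0.01, size=(r, k))
out_after = lora_forward(h, W, A, B_trained, alpha, r)

# The implied full-size update has rank at most r.
delta_W = (alpha / r) * (A @ B_trained)
print(out_start.shape)
print(np.allclose(out_start, h @ W))
print(np.linalg.matrix_rank(delta_W), "<=", r)
```

Note the order of operations: computing (h @ A) @ B keeps every intermediate thin, so the full d × k update is never built during the forward pass.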

3) The parameter math is the whole selling point

Now do the exact count.

Full update: 4096 × 4096 = 16,777,216 parameters. That is about 16.8 million trainable values.

LoRA rank 16 uses two thin matrices.

- A: 4096 × 16 = 65,536
- B: 16 × 4096 = 65,536
- Total: 65,536 + 65,536 = 131,072 parameters

Now compare. 131,072 / 16,777,216 = 0.0078125. So LoRA rank 16 is about 0.78% of the full update. That is the headline. Less than one percent. Same output shape. Far fewer trainable values.

See the rank ladder too.

| Rank r | Trainable params | Percent of full update |
|---:|---:|---:|
| 8 | 65,536 | 0.39% |
| 16 | 131,072 | 0.78% |
| 32 | 262,144 | 1.56% |
| 64 | 524,288 | 3.13% |

As rank rises, capacity rises. Cost also rises. So what is rank really? It is the width of the adapter bridge. Too small, and the adapter cannot express the shift. Too large, and you start paying unnecessary cost.

Worked mini-example. Suppose a helpdesk task needs only tone control. Rank 8 may be enough. Suppose you want domain style plus structured extraction. Rank 16 or 32 may fit better. Suppose you target many modules and want broader behavior changes. Rank 64 may help, but test it. There is no holy rank. There is a trade-off. Yes?
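The rank ladder is easy to reproduce. A small sketch, where the helper name `lora_params` is hypothetical and chosen only for this illustration:

```python
def lora_params(d: int, k: int, r: int) -> int:
    """Trainable values in the two thin matrices: A is d x r, B is r x k."""
    return d * r + r * k


d = k = 4096
full = d * k  # 16,777,216 values for one full update

ladder = {r: lora_params(d, k, r) for r in (8, 16, 32, 64)}

for r, n in ladder.items():
    percent = 100 * n / full
    print(f"rank {r:>2}: {n:>8,} params, {percent:g}% of full, 1/{full // n} of the matrix")
```

Because d = k here, the count is simply 2 × 4096 × r, so it grows linearly with rank. Doubling the rank doubles the adapter.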

4) Where LoRA is attached, and why that choice matters

LoRA is usually attached to attention projections. Common targets are:

- q_proj
- k_proj
- v_proj
- o_proj

Why these? Because attention routing shapes how information flows. Small changes there can strongly affect behavior. Many teams start with q_proj and v_proj. Why not everything immediately? Because more targets mean more capacity and more cost. See the lever.

fewer targets
   │
   ├─► cheaper training
   ├─► smaller adapters
   └─► less expressive change
more targets
   │
   ├─► higher capacity
   ├─► larger adapters
   └─► more memory and tuning work
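The lever above can be made concrete with a quick budget calculation. This is a sketch under loud assumptions: the 32-layer depth is hypothetical, and every target is assumed to be 4096 × 4096, which real models do not always satisfy (for example, k_proj and v_proj can be narrower).

```python
ATTENTION_TARGETS = ("q_proj", "k_proj", "v_proj", "o_proj")


def adapter_params(targets, rank, num_layers, d=4096, k=4096):
    """Total trainable LoRA values, assuming every target is a d x k matrix."""
    per_matrix = d * rank + rank * k
    return per_matrix * len(targets) * num_layers


NUM_LAYERS = 32  # hypothetical depth, for illustration only

fewer = adapter_params(("q_proj", "v_proj"), rank=16, num_layers=NUM_LAYERS)
more = adapter_params(ATTENTION_TARGETS, rank=16, num_layers=NUM_LAYERS)

print(f"q_proj + v_proj : {fewer:,} trainable values")
print(f"all four targets: {more:,} trainable values")
print(f"ratio           : {more / fewer:.0f}x")
```

Under these assumptions, covering all four attention targets in all 32 layers costs 16,777,216 values. That is the same as fully updating a single 4096 × 4096 matrix once.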

You can also target feed-forward layers. That may help harder task shifts. But again, more capacity is not free. Suppose one layer has four attention targets. If each target uses rank 16 LoRA, the adapter cost for that layer is four times the cost of one target. That may still be cheap compared with full fine-tuning. But it is not zero.

This is why adapter design matters. Which modules get adapters? In how many layers? At what rank? With what alpha? These are capacity decisions. Not decoration.

One more practical note. At inference time, you can keep one frozen base model. Then swap many different overlay sketch files on top. That is operationally beautiful. One base. Many behaviors. Much less storage than many full model copies.
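The "one base, many behaviors" idea can be sketched with a shared frozen matrix and a dictionary of adapters. Everything here is illustrative: the toy sizes, the random values, and the adapter names `legal_tone` and `json_output` are hypothetical, and this is not how any particular serving system is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 32, 32, 4, 8   # toy sizes, assumed for illustration

W = rng.normal(size=(d, k))     # one shared frozen base weight
W_snapshot = W.copy()           # kept only to show the base never changes


def make_adapter():
    """Random stand-in for a trained adapter (A is d x r, B is r x k)."""
    return {
        "A": rng.normal(scale=0.1, size=(d, r)),
        "B": rng.normal(scale=0.1, size=(r, k)),
    }


# Hypothetical adapter names; each entry is one small "overlay sketch".
adapters = {"legal_tone": make_adapter(), "json_output": make_adapter()}


def forward(h, adapter_name=None):
    """Run the frozen base, optionally adding one adapter's low-rank path."""
    out = h @ W
    if adapter_name is not None:
        ad = adapters[adapter_name]
        out = out + (alpha / r) * ((h @ ad["A"]) @ ad["B"])
    return out


h = rng.normal(size=(1, d))
base_out = forward(h)
legal_out = forward(h, "legal_tone")
json_out = forward(h, "json_output")

base_values = W.size
adapter_values = sum(m.size for m in adapters["legal_tone"].values())
print(f"base: {base_values} values, each adapter: {adapter_values} values")
print("base untouched:", np.array_equal(W, W_snapshot))
```

Switching behavior is just a dictionary lookup. The base matrix is loaded once and shared by every request.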

5) What LoRA is good at, and where it can disappoint

LoRA is strong when the base model already has the core skill. You want to steer it reliably. Not teach it an entirely alien capability from nothing.

Good fits include:

- stable response format
- brand or domain tone
- extraction schemas
- tool-call style consistency
- narrow workflow behavior

Weak fits include:

- missing world knowledge that changes weekly
- deep reasoning gaps in the base model
- tasks with too little data and too much ambition

If the problem is freshness, use retrieval. Do not paint new facts into the weights every week. If the problem is only instructions, fix the prompt first. Try the cheapest lever first.

LoRA sits in the middle. Cheaper than full fine-tuning. Stronger than prompt-only control for repeated behavior shifts. But not magic. If the required change is not low-rank, LoRA may plateau. If your data is weak, LoRA will faithfully learn weak habits. If you attach too little capacity, it will underfit. If you attach too much, it may overfit or waste memory. So what to do? Measure on the task. Always.

Where this lives in the wild

Hugging Face PEFT — ML engineers attach LoRA modules to q_proj and v_proj for task-specific tuning without cloning the whole model.
Predibase LoRAX — serving engineers host many customer adapters on one frozen base and hot-swap behavior per request.
NVIDIA NeMo — enterprise teams train LoRA adapters for domain style, extraction, and instruction-following on limited GPU budgets.
Databricks Mosaic AI — platform users fine-tune open models with adapter-style methods when repeated format control matters more than full weight updates.
Together AI fine-tuning API — product teams train lightweight adapters for custom assistants while keeping the large base fixed.

Pause and recall

Why does LoRA use two thin matrices instead of one full update?
In the 4096 × 4096 example, why is rank 16 so cheap?
Why do teams often start with q_proj and v_proj?
What happens when you target more modules or raise the rank?

Interview Q&A

Q1. Why LoRA not full fine-tuning for a narrow formatting change?
A. Because the base model already knows language generation, so a small low-rank edit often captures the repeated format shift at a fraction of the memory and optimizer cost.
Common wrong answer to avoid: "LoRA is always better because fewer parameters automatically means better quality."

Q2. Why rank 16 not rank 256 by default?
A. Because rank is capacity, and extra capacity should be earned by evaluation, not assumed; a larger rank costs more memory and can still overfit weak data.
Common wrong answer to avoid: "Higher rank is always safer because it is closer to full fine-tuning."

Q3. Why target q_proj and v_proj first, not every matrix immediately?
A. Because attention projections often give strong behavior leverage, while full coverage increases adapter size, tuning complexity, and training cost.
Common wrong answer to avoid: "Those layers are chosen only because libraries hard-code them."

Q4. Why serve many LoRA adapters on one base instead of many full model copies?
A. Because the frozen base can be shared, so storage, loading time, and deployment complexity all fall sharply.
Common wrong answer to avoid: "Adapters help only during training; they do not matter for serving architecture."

Apply now (5 min)

Exercise. Take one transformer layer with q_proj, k_proj, v_proj, and o_proj. Assume each matrix is 4096 × 4096. Compute the full update count for one matrix. Then compute rank 16 LoRA for one matrix. Then multiply by four targets. Write one sentence on the memory win.

Sketch from memory. Draw the frozen wide matrix. Then draw the thin bridge A and B beside it. Label it wide → thin bridge → wide. Finally, mark the adapter as the overlay sketch sitting on top of the blueprint.

Bridge. LoRA is elegant, but it still needs the frozen base model in memory. If the base itself is too heavy, we need one more trick. → 12-qlora.md