13. Choosing the right lever: cheapest fix first

~11 min read. The hard part is not knowing every tool. The hard part is choosing the cheapest correct one.

Built on the ELI5 in 00-eli5.md. The overlay sketch — a thin custom layer over the blueprint — reminds us of one important truth: sometimes the overlay is useful, and sometimes a better instruction on the existing blueprint is enough.

1) Start with the cheapest lever that can work

Engineers get into trouble here. They learn one shiny trick. Then they apply it everywhere. Bad move.

A strong prompt is cheaper than RAG. RAG is cheaper than training. PEFT is cheaper than full fine-tuning. QLoRA is cheaper than full fine-tuning when hardware is tight. So the first rule is simple. Try the cheapest lever first.

If the model already knows the knowledge and skill, do not train. If the model only needs better instructions, fix the prompt. If the knowledge changes weekly or is private, use retrieval. Do not try to bake fast-changing facts into weights. If the behavior shift repeats across requests, use PEFT. If the capability gap is deep and you have real budget, consider full fine-tuning. If the hardware is limited, QLoRA becomes attractive. See the flow.

user problem
   │
   ├─► model knows it, but response is sloppy?
   │      └─► better prompt
   │
   ├─► facts are private or change often?
   │      └─► RAG
   │
   ├─► format/tone/domain behavior must repeat?
   │      └─► LoRA / PEFT
   │
   ├─► deep capability shift and strong budget?
   │      └─► full fine-tune
   │
   └─► same need, but weak hardware?
          └─► QLoRA

Simple, no? The goal is not to sound advanced. The goal is to solve the product problem cheaply and reliably.
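The flow above can be sketched as a small function. This is only an illustration: the `Problem` class, its field names, and `first_lever` are hypothetical, and the way the "weak hardware" branch is folded into any adaptation need is one reading of the diagram, not the only one.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    """Hypothetical problem description; one flag per question in the flow."""
    sloppy_but_capable: bool = False      # model knows it, but response is sloppy
    facts_private_or_fresh: bool = False  # facts are private or change often
    behavior_must_repeat: bool = False    # format/tone/domain behavior must repeat
    deep_shift_with_budget: bool = False  # deep capability shift and strong budget
    weak_hardware: bool = False           # adaptation needed, but GPUs are small


def first_lever(p: Problem) -> str:
    """Return the first lever to try, checking the cheapest branches first."""
    if p.sloppy_but_capable:
        return "better prompt"
    if p.facts_private_or_fresh:
        return "RAG"
    needs_adaptation = p.behavior_must_repeat or p.deep_shift_with_budget
    # Assumption: "same need, but weak hardware" applies to any adaptation need.
    if needs_adaptation and p.weak_hardware:
        return "QLoRA"
    if p.behavior_must_repeat:
        return "LoRA / PEFT"
    if p.deep_shift_with_budget:
        return "full fine-tune"
    # No branch matched: stay on the cheapest lever until a failure mode is clear.
    return "better prompt"


print(first_lever(Problem(facts_private_or_fresh=True)))                    # RAG
print(first_lever(Problem(behavior_must_repeat=True, weak_hardware=True)))  # QLoRA
```

The function returns a single lever on purpose. A real problem can trip more than one branch, and then the answer is a combination of levers rather than the first match.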

2) The decision table you should actually memorize

Do not memorize slogans only. Memorize the mapping.

| Situation | Best first lever | Why |
|---|---|---|
| Model knows knowledge, style is wrong | Better prompt | Cheapest control when capability already exists |
| Knowledge is private or changes weekly | RAG | Freshness belongs in documents, not weights |
| Need stable format, tone, or domain behavior across requests | LoRA / PEFT | Repeated behavioral shift is worth storing once |
| Need deep capability shift, lots of data, and budget | Full fine-tune | Large change may exceed small adapters |
| Need adaptation but hardware is limited | QLoRA | Quantized base plus adapters fits smaller GPUs |

Now look at two easy mistakes.

Mistake one. Using fine-tuning to solve freshness. Suppose refund policy changes every Friday. If you fine-tune on every update, you build a bad process. Use RAG.

Mistake two. Using RAG to force stable formatting. Suppose every answer must be valid insurance JSON. Documents can help with content. But stable output structure is a behavior problem. That is where prompt or PEFT helps more.

So ask the right question first. Is the gap about knowledge? Or behavior? Or both? That one split saves weeks.
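One way to confirm that a formatting failure is a behavior problem is to measure it directly. The sketch below counts how often replies fail to parse as the expected JSON. The required keys are hypothetical placeholders, not a real insurance schema, and the sample replies are invented for illustration.

```python
import json

# Hypothetical schema: placeholder field names, not a real insurance standard.
REQUIRED_KEYS = {"policy_id", "claim_status", "payout_amount"}


def is_valid_reply(text: str) -> bool:
    """True if the reply parses as a JSON object containing every required key."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS.issubset(obj)


def drift_rate(replies: list[str]) -> float:
    """Fraction of replies that break the expected structure."""
    if not replies:
        return 0.0
    bad = sum(not is_valid_reply(r) for r in replies)
    return bad / len(replies)


# Invented sample replies: two valid, one with chatter around the JSON,
# one missing a required key.
sample = [
    '{"policy_id": "P-1", "claim_status": "open", "payout_amount": 0}',
    'Sure! Here is the JSON: {"policy_id": "P-2"}',
    '{"policy_id": "P-3", "claim_status": "paid"}',
    '{"policy_id": "P-4", "claim_status": "paid", "payout_amount": 120.5}',
]
print(drift_rate(sample))  # 0.5
```

If the rate stays high after tightening the prompt, that is evidence for PEFT. Adding more documents to retrieval would not move this number, because the content was never the issue.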

3) Worked examples: same model, different levers

Example A. The model already knows SQL. But your analysts want replies in one strict template. That is not a knowledge gap. Start with a stronger prompt. If the format still drifts across thousands of requests, move to LoRA.

Example B. Your company policy handbook changes every week. The model answers old policy from memory. That is a freshness gap. Use RAG. Do not keep repainting the overlay sketch for new facts.

Example C. You need a radiology assistant with stable terminology, output style, and report schema. The base model is already strong, but the behavior must be consistent. PEFT is a good next lever.

Example D. You want the model to perform a deep new task with large labeled data and budget. Maybe a domain-specific code model or speech-text hybrid behavior. Now full fine-tuning may be justified.

Example E. Same as Example C, but you only have one 24GB GPU. Then QLoRA is the practical version.

Look at the escalation ladder.

prompt
  ↓ if not enough
RAG or PEFT
  ↓ if still not enough
QLoRA or full fine-tune

Notice something important. The ladder is not linear for every problem. RAG and PEFT solve different failure types. RAG fixes missing or fresh evidence. PEFT fixes repeated behavior. Do not substitute blindly.
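Example E's 24GB constraint can be checked with back-of-envelope arithmetic. The sketch below is a rough estimate under stated assumptions: a hypothetical 13B-parameter model, 2 bytes per parameter for a frozen 16-bit base, 0.5 bytes per parameter for a 4-bit base, and roughly 16 bytes per parameter for full fine-tuning with Adam in mixed precision (a commonly quoted rule of thumb). It ignores activations, adapter weights, and quantization overhead, so real usage is higher.

```python
GB = 1e9  # decimal gigabytes keep the arithmetic readable


def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate memory for parameters (and optimizer state, if included)."""
    return n_params * bytes_per_param / GB


n_params = 13e9   # hypothetical 13B model
gpu_budget = 24   # the single 24GB GPU from Example E

# Bytes per parameter are rough assumptions, not measurements.
estimates = {
    "full fine-tune": weights_gb(n_params, 16),  # weights + grads + Adam state
    "LoRA": weights_gb(n_params, 2),             # frozen 16-bit base
    "QLoRA": weights_gb(n_params, 0.5),          # frozen 4-bit base
}

for lever, gb in estimates.items():
    verdict = "fits" if gb < gpu_budget else "does not fit"
    print(f"{lever}: ~{gb:.1f} GB, {verdict} in {gpu_budget} GB")
```

Under these assumptions only the 4-bit base leaves room on the card, which is why QLoRA is the practical version when hardware is the binding constraint.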

4) A small cost example keeps you honest

Suppose you have 100,000 internal policy pages. They change weekly. You also want answers in a strict bullet format. What should you do?

Break the problem apart. Fresh policy knowledge belongs in retrieval. Strict bullet formatting may be handled by prompt first. If the format still drifts badly, PEFT may help. So the answer is not one lever. It is a stack.

Now compare rough cost logic.

| Lever | Setup cost | Refresh cost | Best for |
|---|---|---|---|
| Prompting | Very low | Very low | existing skill, weak instructions |
| RAG | Medium | Low after pipeline exists | fresh or private knowledge |
| LoRA / PEFT | Medium | Medium when behavior changes | repeated style or schema |
| Full fine-tune | High | High | deep capability shift |

See why the phrase matters. "Try the cheapest lever first." Not because we are lazy. Because each heavier lever creates new operational work. Training jobs. Evaluation loops. Versioning. Serving complexity. Rollback plans. And maybe extra risk.

So what to do? Choose the lever that matches the failure mode. Not the trend on social media. Yes?
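The cost logic in the table can be made concrete as setup cost plus refresh cost times the number of refreshes. The numbers below are arbitrary relative units chosen only to preserve the table's ordering. They are not measured costs, and the names `SCALE`, `LEVERS`, and `cumulative_cost` are hypothetical.

```python
# Arbitrary relative units: only the ordering mirrors the table, not real costs.
SCALE = {"very low": 1, "low": 2, "medium": 5, "high": 20}

# (setup cost, refresh cost) per lever, using the table's labels.
LEVERS = {
    "prompting": ("very low", "very low"),
    "RAG": ("medium", "low"),
    "LoRA / PEFT": ("medium", "medium"),
    "full fine-tune": ("high", "high"),
}


def cumulative_cost(lever: str, refreshes: int) -> int:
    """Setup cost plus one refresh cost per update cycle."""
    setup, refresh = LEVERS[lever]
    return SCALE[setup] + SCALE[refresh] * refreshes


weeks = 52  # one year of weekly policy changes

# The stack: RAG refreshed weekly for facts, plus a prompt written once for format.
stack = cumulative_cost("RAG", weeks) + cumulative_cost("prompting", 0)

# The mistake: retraining the weights on every policy update.
retrain_weekly = cumulative_cost("full fine-tune", weeks)

print(stack, retrain_weekly)  # 110 1060
```

The exact ratio means nothing, since the units are invented. The shape is the point: a high refresh cost multiplied by a weekly cadence dominates everything else.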

5) A practical rulebook for teams

If you are unsure, ask these five questions.

1. Does the base model already know the task?
2. Is the missing piece mainly fresh or private knowledge?
3. Is the required behavior stable across many requests?
4. Do we have enough labeled data for training?
5. What does our hardware budget allow?

These questions map almost directly to the table. They also force better stakeholder conversations.

A product manager may say, "The model is wrong." You should ask, "Wrong because it lacks facts, or because it behaves badly?"

A founder may say, "Let's fine-tune." You should ask, "Have we already exhausted prompt and retrieval?"

A team may say, "Let's use RAG." You should ask, "Are we actually solving formatting drift?"

This is senior judgment. Not tool worship. Sometimes the overlay sketch is perfect. Sometimes it is waste. Sometimes the frozen blueprint already knows enough, and better instructions are the real fix.

Where this lives in the wild

Morgan Stanley GPT-4 wealth assistant — retrieval is the right lever because fresh internal research and private documents should not be baked into model weights.
GitHub Copilot custom instructions — prompt engineering is the first lever when the base model already knows coding but needs workflow-specific steering.
Predibase fine-tuning platform — teams use LoRA or QLoRA when they need repeatable domain tone, extraction, or schema behavior across requests.
Databricks Mosaic AI model training — full fine-tune becomes the lever when companies have enough data and need a deeper task shift than prompts or adapters deliver.
Hugging Face PEFT on workstation GPUs — QLoRA is the practical lever when teams want adaptation but cannot afford large multi-GPU training runs.

Pause and recall

Why should prompting be tested before heavier adaptation?
Why is RAG better than fine-tuning for weekly-changing policy documents?
Why is PEFT better than RAG for stable formatting drift?
When does full fine-tuning become worth the cost?

Interview Q&A

Q1. Why RAG not fine-tuning for knowledge that changes every week?
A. Because freshness belongs in retrievable documents, while repeatedly retraining weights is slower, costlier, and harder to keep current.
Common wrong answer to avoid: "Because fine-tuning cannot represent factual knowledge at all."

Q2. Why PEFT not RAG for stable JSON formatting across requests?
A. Because repeated output structure is a behavior problem, and retrieval fetches facts, not consistent generation habits.
Common wrong answer to avoid: "Because RAG works only for search products."

Q3. Why prompt first, not LoRA first, when the base model already knows the task?
A. Because prompt changes are cheapest to test, fastest to roll back, and often enough when the gap is instruction clarity rather than model capability.
Common wrong answer to avoid: "Prompting is beginner work, so senior teams should skip it."

Q4. Why QLoRA not full fine-tuning on limited hardware?
A. Because QLoRA preserves most of the adaptation benefit while shrinking the base-memory requirement enough to fit realistic GPUs.
Common wrong answer to avoid: "QLoRA is chosen only when you care about inference, not training."

Apply now (5 min)

Exercise. Take one real product idea. Write one failure mode caused by missing knowledge. Write one failure mode caused by unstable behavior. Now choose the first lever for each. Then justify the cheaper choice in one sentence.

Sketch from memory. Draw the decision flow. Start with prompt. Branch to RAG for fresh knowledge. Branch to PEFT for repeated behavior. Then put full fine-tune and QLoRA at the expensive end.

Bridge. We now have the practical toolkit. But good engineers also admit uncertainty. Quantization and adapters work well, yes, but some parts still resist clean theory. → 14-honest-admission.md