
Decision Memo — Quantize, Adapt, Decide

Question

Should this use case ship with:

  • base model,
  • quantized base,
  • LoRA,
  • or QLoRA?

Initial answer

For a private support or runbook assistant on constrained hardware:

  1. Serve a quantized base if latency and memory are the binding constraints.
  2. Add LoRA if the model needs a consistent behavior shift toward your domain voice or response format.
  3. Use QLoRA when training adapters on a full-precision base is not feasible because the base model fits in memory only in quantized form.
  4. Switch to RAG instead of re-tuning when what is missing is fresh private knowledge, not behavior.
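The LoRA versus QLoRA choice in step 3 often comes down to a back-of-envelope memory check. The sketch below estimates weight memory as parameters × bits per weight / 8, then adds a fixed training allowance. The function names and the `overhead_gb` default are hypothetical, chosen only for illustration; real overhead depends on batch size, sequence length, and optimizer, so measure it before relying on the result.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def adapter_training_mode(n_params: float, gpu_gb: float,
                          overhead_gb: float = 4.0) -> str:
    """Pick LoRA or QLoRA from a rough memory estimate.

    overhead_gb is an assumed allowance for activations, adapter weights,
    and optimizer state. It is a placeholder, not a measured value.
    """
    # LoRA on a 16-bit base: the full-precision weights must fit first.
    if weight_memory_gb(n_params, 16) + overhead_gb <= gpu_gb:
        return "LoRA"
    # QLoRA: the base is held in 4-bit form while the adapters train.
    if weight_memory_gb(n_params, 4) + overhead_gb <= gpu_gb:
        return "QLoRA"
    return "does not fit"


if __name__ == "__main__":
    for gpu_gb in (24, 12):
        print(gpu_gb, "GB ->", adapter_training_mode(7e9, gpu_gb))
```

Under these assumptions a 7B-parameter model needs about 14 GB for 16-bit weights and about 3.5 GB at 4 bits, so it lands on LoRA with a 24 GB budget and on QLoRA with 12 GB.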

Metric priority

For most internal assistant systems, the decision order is:

  1. quality floor,
  2. memory fit,
  3. latency,
  4. training cost.

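Quality comes first only among options that can actually be deployed. A model that does not fit in memory or cannot meet the latency target may win on quality in theory, but it loses in deployment reality.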
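One way to read this priority order is as three gates followed by a ranking: drop candidates below the quality floor, then those that miss the memory or latency budget, and choose the cheapest survivor to train. The sketch below assumes that reading. The `Candidate` fields, thresholds, and all numbers are made-up illustrations, not measurements.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    quality: float         # evaluation score in [0, 1]
    memory_gb: float       # serving memory footprint
    p95_latency_ms: float  # measured 95th-percentile latency
    training_cost: float   # relative cost; 0 means no training


def pick(candidates: List[Candidate], quality_floor: float,
         memory_budget_gb: float, latency_budget_ms: float) -> Optional[Candidate]:
    """Apply the gates in priority order, then rank by training cost."""
    viable = [
        c for c in candidates
        if c.quality >= quality_floor              # 1. quality floor
        and c.memory_gb <= memory_budget_gb        # 2. memory fit
        and c.p95_latency_ms <= latency_budget_ms  # 3. latency
    ]
    if not viable:
        return None
    # 4. cheapest to train wins; higher quality breaks ties
    return min(viable, key=lambda c: (c.training_cost, -c.quality))


# Illustrative numbers only.
OPTIONS = [
    Candidate("base", 0.82, 14.0, 900.0, 0.0),
    Candidate("quantized base", 0.80, 4.5, 400.0, 0.0),
    Candidate("QLoRA", 0.88, 4.8, 420.0, 1.0),
]

if __name__ == "__main__":
    choice = pick(OPTIONS, quality_floor=0.78,
                  memory_budget_gb=8.0, latency_budget_ms=500.0)
    print(choice.name if choice else "nothing deployable")
```

With these illustrative values the unquantized base is excluded by the memory and latency gates even though it scores above the floor. Raising the floor to 0.85 leaves only the QLoRA candidate.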

Prompt vs PEFT vs RAG

  • Prompting: use when the base model already knows the task and just needs clearer instruction.
  • PEFT: use when the response style, structure, or behavior should shift consistently over many requests.
  • RAG: use when the problem is fresh knowledge, private documents, or frequently changing facts.

Default recommendation

For the kind of AI-platform notes used in this hands-on lab:

  • start with prompting + evaluation
  • if the model still misses the behavior repeatedly, try LoRA
  • if the model does not fit comfortably, consider quantized serving and QLoRA training
  • if the knowledge changes every week, stop tuning and build RAG
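The escalation ladder above can be sketched as a single function. This is one possible reading, not a fixed policy: the function name, the argument names, and the choice to check knowledge churn first (because it overrides any tuning decision) are assumptions made for illustration.

```python
def next_step(pass_rate: float, target_pass_rate: float,
              fits_comfortably: bool, knowledge_changes_weekly: bool) -> str:
    """Return the next move on the ladder for one evaluation round.

    pass_rate is the fraction of evaluation cases the prompted model passes.
    All inputs are hypothetical signals the team would have to define.
    """
    if knowledge_changes_weekly:
        # Tuning cannot keep up with weekly changes; retrieve instead.
        return "stop tuning and build RAG"
    if pass_rate >= target_pass_rate:
        # Prompting already meets the bar; no adapter needed.
        return "keep prompting + evaluation"
    if not fits_comfortably:
        # Behavior gap on tight hardware.
        return "quantized serving + QLoRA training"
    # Behavior gap with room to train on the full-precision base.
    return "try LoRA"


if __name__ == "__main__":
    print(next_step(0.70, 0.90, fits_comfortably=True,
                    knowledge_changes_weekly=False))
```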

Stretch answer

If this system moved to a private-doc setting tomorrow, what would Module 07 change?

Module 07 would change the center of gravity from memorizing domain facts to retrieving them. The decision would move away from repeated fine-tuning and toward chunking, embeddings, retrieval quality, citations, and refusal-when-empty behavior.
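A minimal sketch of that retrieval-centered shape, covering chunking, scoring, citations, and refusal-when-empty. To stay dependency-free it swaps embeddings for a toy keyword-overlap score, so it illustrates the control flow only and not retrieval quality. Every name, threshold, and document ID here is hypothetical.

```python
import re
from typing import Dict, List, Tuple

REFUSAL = "No supporting document found."


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def chunk(text: str, size: int = 40) -> List[str]:
    """Split text into passages of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def score(query: str, passage: str) -> float:
    """Fraction of query tokens found in the passage (stand-in for embeddings)."""
    q = tokens(query)
    return len(q & tokens(passage)) / len(q) if q else 0.0


def retrieve(query: str, docs: Dict[str, str],
             min_score: float = 0.5, top_k: int = 2) -> List[Tuple[str, str]]:
    """Return up to top_k (citation, passage) pairs scoring at least min_score."""
    hits = []
    for doc_id, text in docs.items():
        for n, passage in enumerate(chunk(text)):
            s = score(query, passage)
            if s >= min_score:
                hits.append((s, f"{doc_id}#{n}", passage))
    hits.sort(key=lambda h: h[0], reverse=True)
    return [(cite, passage) for _, cite, passage in hits[:top_k]]


def answer(query: str, docs: Dict[str, str]) -> str:
    """Quote retrieved passages with citations, or refuse when nothing matches."""
    hits = retrieve(query, docs)
    if not hits:
        return REFUSAL
    return "\n".join(f"{passage} [{cite}]" for cite, passage in hits)


DOCS = {
    "runbook-12": "To restart the ingest worker, run the restart script "
                  "and then check the queue depth.",
}

if __name__ == "__main__":
    print(answer("restart ingest worker", DOCS))
    print(answer("rotate tls certificates", DOCS))
```

The first query returns the passage tagged with `[runbook-12#0]`. The second finds no passage above the threshold and returns the refusal string instead of guessing.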