14. Honest admission — the parts we still learn by testing

~11 min read. This is the file that saves you from fake certainty.

Built on the ELI5 in 00-eli5.md. The rounding error — the detail lost when field notes replace the full blueprint — reminds us of the deepest truth in this module: we know compression works, but we do not always know exactly when the quality bill will arrive.


1) We still cannot predict every quantization trade-off cleanly

Here is the honest sentence. We cannot predict exactly when int4 will degrade versus int8 without testing. There is no universal formula. Yes, some patterns are common. Yes, some models tolerate int4 surprisingly well. But tolerance is task-dependent. Runtime-dependent too. Prompt-dependent sometimes. Look at a tiny fictional benchmark picture.

| Model | Task | int8 drop | int4 drop |
|---|---|---:|---:|
| Model A | classification | 0.3 | 1.2 |
| Model B | code completion | 0.4 | 5.8 |
| Model C | extraction | 0.1 | 0.9 |

See the problem. The same compression level does not cause the same pain everywhere. So when someone asks, "Will int4 be fine?" The adult answer is, "Maybe. Show me the eval."

This uncertainty also appears in method choice. GPTQ versus AWQ has no universal winner. Some models run better with one. Some tasks preserve quality better with the other. Some runtimes support one more efficiently. So what to do? Benchmark on your target stack. Not on internet folklore. That is where the rounding error becomes real engineering.
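One way to see why the same bit width can hurt differently is to measure the rounding error directly. The sketch below is a minimal illustration, not any library's quantizer: it assumes plain symmetric per-tensor quantization, and the two weight lists (`smooth` and `spiky`) are made-up data. Real methods such as GPTQ and AWQ are considerably more sophisticated.

```python
import random

def quantize_roundtrip(values, bits):
    """Symmetric per-tensor quantization: float -> signed integer grid -> float.

    Assumption for illustration: one scale for the whole tensor and a
    symmetric range (-7..7 for 4 bits, -127..127 for 8 bits).
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        return list(values)  # all zeros, nothing to round
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, bits):
    """Average absolute difference between original and restored values."""
    restored = quantize_roundtrip(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

rng = random.Random(0)
smooth = [rng.gauss(0.0, 1.0) for _ in range(4096)]  # hypothetical well-behaved weights
spiky = smooth[:-1] + [40.0]                         # same weights plus one large outlier

for name, weights in [("smooth", smooth), ("spiky", spiky)]:
    err8 = mean_abs_error(weights, 8)
    err4 = mean_abs_error(weights, 4)
    print(f"{name:>6}: int8 error {err8:.4f}, int4 error {err4:.4f}")
```

A single outlier stretches the scale, so the 4-bit grid becomes too coarse for everything else. This is only one possible mechanism, and weight error is not task quality. It shows why a fixed bit width cannot come with a fixed promise, which is why the eval still has to run.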


2) LoRA is powerful, but its theory is still incomplete

LoRA assumes the useful task shift is low-rank. Often that is good enough. Not always. Some task shifts may not fit a tiny low-dimensional bridge. If the change is broad, rank-limited adapters may plateau. That is one uncertainty.

Second uncertainty. Optimal rank is still mostly empirical. People talk about 8, 16, 32, or 64. Fine. But those are heuristics. You still need hyperparameter search. You still need task evaluation. There is no clean law saying rank 16 is always enough for domain X.

Third uncertainty. We do not have a sharp theory for when LoRA matches full fine-tuning quality and when it cannot. Sometimes the gap is tiny. Sometimes it is meaningful. Sometimes extra targets close the gap. Sometimes they do not. See the picture.

task shift
   ├─► narrow and repeated ──► LoRA often strong
   ├─► broad but moderate ───► maybe higher rank or more targets
   └─► deep capability shift ─► LoRA may hit a ceiling
So yes, LoRA is practical. But no, it is not fully explained. The overlay sketch often works beautifully. We just cannot promise it for every building.
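The cost side of a rank search is easy to compute. The quality side is not. As a rough sketch, assuming the standard LoRA setup where one weight matrix of shape d_out x d_in gets two adapter matrices of shapes d_out x r and r x d_in, the trainable parameter count per adapted matrix is r * (d_in + d_out). The 4096 x 4096 size below is a hypothetical example, not a specific model.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters for one LoRA-adapted matrix: B is d_out x r, A is r x d_in."""
    return rank * (d_in + d_out)

def full_params(d_in, d_out):
    """Parameters updated when the same matrix is fully fine-tuned."""
    return d_in * d_out

d_in = d_out = 4096  # hypothetical projection size, chosen only for illustration

for rank in (8, 16, 32, 64):
    lora = lora_trainable_params(d_in, d_out, rank)
    share = 100.0 * lora / full_params(d_in, d_out)
    print(f"rank {rank:>2}: {lora:>9,} trainable params ({share:.2f}% of the full matrix)")
```

Parameter count grows linearly with rank, so trying several ranks is cheap compared with full fine-tuning. What this arithmetic cannot tell you is which rank is enough for your task. That part still comes from evaluation.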


3) Memory beyond weights is still a moving target

Teams often talk as if quantization solves memory. Only partly. KV cache quantization is still evolving. Int8 KV cache can save memory with modest pain. Int4 KV cache can save more memory, but quality risk rises. The safe choice depends on model, context length, and workload. There is no settled final answer yet.

The same is true for broader scaling laws. We do not fully understand how quantized models scale with size, data, and compression together. Do larger models remain more robust under aggressive quantization? Sometimes yes. Always? Not proven cleanly.

Even training memory remains messy. Weights are one bill. Activations are another bill. Sequence length changes the story again. Kernel support changes it again. So one memory number can mislead badly. You may compress weights and still hit out-of-memory errors from activations. Simple, no? Memory is not one box. It is a stack of boxes. And the rounding error is only one part of the risk.
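A back-of-the-envelope estimate makes the "stack of boxes" concrete. The sketch below assumes a standard transformer KV cache (one key and one value vector per layer, per KV head, per token) and ignores activations, quantization scales, and runtime overhead. The model shape (32 layers, 32 KV heads, head size 128, 7 billion parameters) is a hypothetical configuration picked for round numbers.

```python
def weight_bytes(n_params, bits):
    """Memory for the weights alone at a given bit width."""
    return n_params * bits // 8

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bits):
    """Memory for keys and values across all layers, tokens, and sequences."""
    elements = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch  # 2 = key + value
    return elements * bits // 8

GIB = 2 ** 30

# Hypothetical model shape, for illustration only.
weights_int4 = weight_bytes(7_000_000_000, bits=4)
print(f"int4 weights:          {weights_int4 / GIB:5.1f} GiB")

for batch in (1, 8):
    for kv_bits in (16, 8, 4):
        kv = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=batch, bits=kv_bits)
        print(f"KV cache, batch {batch}, {kv_bits:>2}-bit: {kv / GIB:5.1f} GiB")
```

Under these assumptions, a 16-bit KV cache at batch 8 is several times larger than the int4 weights. Compressing the weights moved one box. The other boxes still have to be measured on the real context lengths and batch sizes.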


4) Formats and ecosystems are still fragmented

Another honest admission. GGUF is useful. So are other formats. But the ecosystem is fragmented. One runtime loves one format. Another runtime prefers a different quantization path. One serving stack optimizes GPTQ. Another is stronger with AWQ. One local runtime leans hard into GGUF. Another production stack ignores it. That means tooling choices leak into model choices. Which is annoying. But real. A model file is not just math. It is an ecosystem decision. It affects conversion tools. It affects deployment speed. It affects who on the team can debug it. See the layout.

model choice
   ├─► quantization method
   │      ├─ GPTQ
   │      └─ AWQ
   ├─► file format
   │      ├─ GGUF
   │      └─ others
   └─► runtime support
          ├─ laptop / edge
          ├─ single GPU server
          └─ multi-GPU production stack
So when someone asks for the best format, be careful. Best for what? Laptop serving? vLLM cluster? llama.cpp on CPU? TensorRT-LLM on NVIDIA GPUs? Different answers. No universal crown.


5) What a serious engineer does with all this uncertainty

Uncertainty is not failure. Hidden uncertainty is failure. So the correct behavior is straightforward.

  • Benchmark int8 and int4 on your real tasks.
  • Compare GPTQ and AWQ on your actual runtime.
  • Test LoRA ranks instead of debating them abstractly.
  • Measure KV cache options with your context lengths.
  • Record both quality and latency.
  • Keep the eval set honest.
  • Keep the rollback path ready.

And say "we do not know yet" when you do not know yet. That sentence is strength. Not weakness. The whole module has been about trade-offs. This file is the final reminder. Trade-offs without measurement become superstition. Measurement turns uncertainty into engineering. Yes?
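The "record both quality and latency" habit can be sketched as a small decision gate. Everything here is hypothetical: the `Result` type, the `pick` function, the candidate names, and every number are invented for illustration, and a real harness would fill them from your own eval runs.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Result:
    name: str          # candidate configuration, e.g. a precision plus a method
    quality: float     # task metric from your eval set, higher is better
    latency_ms: float  # measured on the target runtime

def pick(results, baseline_quality, max_drop, max_latency_ms):
    """Return the fastest candidate within both budgets, or None if none qualifies."""
    ok = [
        r for r in results
        if baseline_quality - r.quality <= max_drop and r.latency_ms <= max_latency_ms
    ]
    return min(ok, key=lambda r: r.latency_ms) if ok else None

# Made-up measurements, standing in for real benchmark output.
measured = [
    Result("int8", quality=79.6, latency_ms=41.0),
    Result("int4-method-a", quality=74.2, latency_ms=28.0),
    Result("int4-method-b", quality=79.1, latency_ms=30.0),
]

winner = pick(measured, baseline_quality=80.0, max_drop=1.0, max_latency_ms=50.0)
print(winner.name if winner else "no candidate passes: keep the baseline and keep testing")
```

Returning `None` is deliberate. When nothing passes, the honest output is "we do not know yet", and the rollback path stays in place.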


Where this lives in the wild

  • vLLM — inference engineers benchmark AWQ, GPTQ, and KV cache strategies because runtime behavior changes across serving setups.
  • llama.cpp — edge-deployment builders lean on GGUF, then tune quantization levels based on laptop, CPU, and small-GPU quality trade-offs.
  • TensorRT-LLM — performance teams test precision formats and kernels per NVIDIA stack because the fastest option depends on hardware and model family.
  • Hugging Face Transformers + bitsandbytes — practitioners compare 4-bit recipes, adapter ranks, and sequence lengths because one default does not fit every workload.
  • Ollama — local-serving users feel ecosystem fragmentation directly when model packaging, quantization choice, and runtime support all interact.

Pause and recall

  • Why can you not promise that int4 will behave like int8 on every task?
  • Why is LoRA rank selection still mostly empirical?
  • Why is there no universal winner between GPTQ and AWQ?
  • Why can memory still break after weight quantization succeeds?

Interview Q&A

Q1. Why benchmark int4 against int8 instead of assuming the smaller format is always acceptable?
A. Because degradation depends on the model, task, runtime, and evaluation target, so the same 4-bit choice can be harmless in one case and damaging in another.
Common wrong answer to avoid: "Because int4 always destroys accuracy, so benchmarking is only for documentation."

Q2. Why can LoRA not replace full fine-tuning in every adaptation problem?
A. Because the low-rank assumption may fail for broader capability shifts, and we still lack a clean theory for exactly where that ceiling appears.
Common wrong answer to avoid: "LoRA and full fine-tuning are theoretically equivalent once alpha is tuned."

Q3. Why search LoRA rank instead of fixing rank 16 forever?
A. Because the best rank depends on the task shift, target modules, data quality, and acceptable cost, so one default cannot be trusted universally.
Common wrong answer to avoid: "Rank matters only for training speed, not model behavior."

Q4. Why is GGUF not the universal final format for everyone?
A. Because format choice is tied to runtime support, tooling, hardware, and deployment style, so a great edge format may not be the best production-cluster format.
Common wrong answer to avoid: "All formats are interchangeable once the weights are quantized."


Apply now (5 min)

Exercise. Pick one model you care about. Write two candidate precisions. Write one task where quality matters most. Write one runtime where latency matters most. Now list the three measurements you would collect before declaring a winner.

Sketch from memory. Draw three boxes. Label them quantization method, adapter choice, and runtime. Then draw arrows into one final box called eval. Finally, write beside it: "Assume less. Test more."


Bridge. The model now fits, serves, and adapts. But its knowledge is frozen at training time. When facts change or private documents arrive, retrieval is the answer. → ../08_rag_system_design/00-eli5.md