10. Quantized serving — running smaller weights without pretending quality stays identical¶

~17 min read. Quantization changes the storage format of the pantry, not the user-facing contract.

Built on the ELI5 in 00-eli5.md. The kitchen can store ingredients in smaller containers, so more fits on one shelf, but some recipes become a little less precise.

1) Picture first: smaller number buckets¶

A weight matrix is a giant pile of numbers.

In fp16, each number takes 2 bytes.

In 4-bit quantization, each number takes half a byte, plus some scale metadata.

fp16 shelf                 4-bit shelf
┌──────────────┐           ┌──────────────┐
│ 2 bytes each │           │ 0.5 byte each│
│ bulky bins   │           │ tighter bins │
└──────────────┘           └──────────────┘

See.

The model is meant to behave almost the same as before.

The storage format changes.

That smaller pantry often decides whether one model fits on one GPU at all.
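To make "half a byte each" concrete, here is a minimal sketch of how two 4-bit codes can share one byte. The function names `pack_int4` and `unpack_int4` are made up for this illustration, and the low-nibble-first layout is an assumption. Real formats each define their own packing order.

```python
def pack_int4(codes):
    """Pack unsigned 4-bit codes (0..15) two per byte, low nibble first."""
    if len(codes) % 2 != 0:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("every code must fit in 4 bits")
    # The even-indexed code goes in the low nibble, the next one in the high nibble.
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_int4(packed):
    """Reverse pack_int4: split every byte back into two 4-bit codes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)  # low nibble
        codes.append(byte >> 4)    # high nibble
    return codes


codes = [3, 15, 0, 9, 7, 1]
packed = pack_int4(codes)
print(len(codes), "codes stored in", len(packed), "bytes")
```

Six codes land in three bytes. The scale metadata mentioned above is stored separately and is not shown here.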

2) GPTQ and AWQ in practical terms¶

You will hear two names often.

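GPTQ is a post-training weight quantization method. It quantizes each layer's weights using a small calibration set and adjusts the remaining weights to compensate for rounding error, so the layer's outputs stay close to the original.

AWQ (activation-aware weight quantization) uses activation statistics to find the small fraction of weight channels that matter most, then rescales them so they lose less precision when quantized.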

Look. The important practical point is not paper taxonomy. It is this: quantize offline, store smaller weights, then use kernels that know how to read and dequantize them efficiently at serving time. The serving engine must support the chosen format well. Otherwise the theoretical win stays on paper.
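The "quantize offline, dequantize at serving time" loop can be sketched with plain round-to-nearest quantization of one small group of weights. This is neither GPTQ nor AWQ. Both choose codes and scales more carefully, and real kernels fuse dequantization into the matrix multiply. The helper names and the sample weights are assumptions for illustration only.

```python
def quantize_group(weights, bits=4):
    """Offline step: map floats to integer codes plus a scale and an offset."""
    levels = (1 << bits) - 1                  # 15 steps for 4-bit codes
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels if hi > lo else 1.0
    # Round to the nearest step, then clamp to guard against float rounding.
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, lo):
    """Serving step: rebuild approximate floats from the stored codes."""
    return [lo + c * scale for c in codes]


group = [-0.8, -0.1, 0.0, 0.25, 0.4, 0.7]     # made-up weights
codes, scale, lo = quantize_group(group)
restored = dequantize_group(codes, scale, lo)
worst_error = max(abs(a - b) for a, b in zip(group, restored))
print("codes:", codes)
print("worst round-trip error:", worst_error, "with step size", scale)
```

The round-trip error per weight is at most half a step. That bounded but nonzero error is why quality can shift slightly after quantization.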

3) Worked example: model size math¶

Suppose a model has 7 billion parameters. At fp16, rough weight storage is:

7B × 2 bytes
= 14 GB

At pure 4-bit, rough storage is:

7B × 0.5 bytes
= 3.5 GB

Now add scales and metadata. Suppose the overhead brings the total to 4.2 GB, about 20% on top of the raw 3.5 GB. That is still far below 14 GB. The savings are:

14 - 4.2 = 9.8 GB
reduction ≈ 70%

Simple, no? This difference can turn “needs multi-GPU” into “fits on one device.”
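The same arithmetic as a small helper, in case you want to try other sizes. The function name `weight_size_gb` is hypothetical, and the 20% overhead is simply the value that turns 3.5 GB into the 4.2 GB supposed above. Real overhead depends on group size and format.

```python
def weight_size_gb(num_params, bits_per_weight, overhead_fraction=0.0):
    """Rough weight storage in decimal GB (1 GB = 1e9 bytes)."""
    raw_bytes = num_params * bits_per_weight / 8
    return raw_bytes * (1 + overhead_fraction) / 1e9


fp16_gb = weight_size_gb(7e9, 16)
int4_raw_gb = weight_size_gb(7e9, 4)
int4_gb = weight_size_gb(7e9, 4, overhead_fraction=0.20)  # assumed overhead
saved_gb = fp16_gb - int4_gb
reduction = saved_gb / fp16_gb

print(f"fp16: {fp16_gb:.1f} GB, 4-bit raw: {int4_raw_gb:.1f} GB, "
      f"4-bit with overhead: {int4_gb:.1f} GB")
print(f"saved: {saved_gb:.1f} GB ({reduction:.0%})")
```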

4) What gets better, and what can get worse¶

Quantized serving usually improves:

model fit,
memory bandwidth pressure (fewer bytes to read per token),
cost per replica,
deployability on smaller GPUs.

It can worsen:

exact output quality,
accuracy on tricky domains that the calibration data did not cover,
kernel portability,
latency if dequant kernels are poorly implemented.

So what to do? Benchmark both accuracy and serving speed. Do not assume smaller weights automatically mean faster end-to-end behavior.
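A back-of-envelope sketch shows why smaller weights can help speed and why they can still lose. It assumes single-stream decoding is limited by reading every weight once per token, and it ignores KV cache reads, compute, and scheduling. The 1000 GB/s bandwidth and both `kernel_efficiency` values are hypothetical numbers, not measurements.

```python
def decode_tokens_per_second(weight_gb, bandwidth_gb_per_s, kernel_efficiency=1.0):
    """Rough ceiling on decode speed when weight reads are the bottleneck.

    kernel_efficiency is the fraction of peak bandwidth the kernel achieves.
    """
    if not 0 < kernel_efficiency <= 1:
        raise ValueError("kernel_efficiency must be in (0, 1]")
    return kernel_efficiency * bandwidth_gb_per_s / weight_gb


BANDWIDTH = 1000.0  # GB/s, hypothetical GPU

fp16_good_kernel = decode_tokens_per_second(14.0, BANDWIDTH, kernel_efficiency=0.9)
int4_good_kernel = decode_tokens_per_second(4.2, BANDWIDTH, kernel_efficiency=0.9)
int4_poor_kernel = decode_tokens_per_second(4.2, BANDWIDTH, kernel_efficiency=0.25)

print(f"fp16, efficient kernel:  {fp16_good_kernel:.0f} tokens/s ceiling")
print(f"4-bit, efficient kernel: {int4_good_kernel:.0f} tokens/s ceiling")
print(f"4-bit, poor kernel:      {int4_poor_kernel:.0f} tokens/s ceiling")
```

Under these assumed numbers, the 4-bit model with a weak dequant kernel ends up below the fp16 baseline. Treat the estimate as a reason to measure, not as a prediction.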

5) Good practical rules¶

Most teams quantize weights first, not every tensor in the whole serving path. KV cache may stay in fp16 or fp8. Activations may use different precision again. That mixed strategy often gives the best trade-off.
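A rough memory budget makes the mixed strategy visible. The sketch below estimates KV cache size next to the 4.2 GB of 4-bit weights from the worked example. The layer count, head count, and head dimension are assumed values for a generic 7B-class model, and the batch and context length are arbitrary.

```python
def kv_cache_gb(layers, kv_heads, head_dim, seq_len, batch, bytes_per_value):
    """Rough KV cache size in decimal GB."""
    # Factor 2: one key and one value vector per token, per layer, per KV head.
    values = 2 * layers * kv_heads * head_dim * seq_len * batch
    return values * bytes_per_value / 1e9


# Assumed shape and load, for illustration only.
shape = dict(layers=32, kv_heads=32, head_dim=128, seq_len=4096, batch=8)

weights_gb = 4.2                                   # 4-bit weights from the worked example
kv_fp16_gb = kv_cache_gb(**shape, bytes_per_value=2)
kv_fp8_gb = kv_cache_gb(**shape, bytes_per_value=1)

print(f"weights: {weights_gb:.1f} GB")
print(f"KV cache in fp16: {kv_fp16_gb:.1f} GB, in fp8: {kv_fp8_gb:.1f} GB")
print(f"total with fp16 KV: {weights_gb + kv_fp16_gb:.1f} GB, "
      f"with fp8 KV: {weights_gb + kv_fp8_gb:.1f} GB")
```

With a large batch and long context, the cache can outgrow the quantized weights. That is why precision is chosen per tensor type rather than once for the whole stack.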

The prep station still matters. The batch window still matters. Quantization is one lever among many. But it is a huge lever, because model weights often dominate the fixed memory footprint. Next we move from backend efficiency to user-visible experience: streaming token delivery on the plating line.

Where this lives in the wild¶

Ollama local serving — quantized weights make large open models usable on consumer laptops and desktops.
LM Studio deployments — GGUF and related quantized formats let one machine host models that fp16 would not fit.
TensorRT-LLM optimized servers — quantized kernels can increase replica density on NVIDIA hardware.
vLLM open-model endpoints — quantized checkpoints reduce VRAM pressure and may improve serving economics.
Edge assistants running compact LLMs — model fit often depends on quantization before any other optimization matters.

Pause and recall¶

What changes in quantized serving: the task, the model meaning, or the number format?
In the worked example, how large was the 7B model in fp16 and in 4-bit plus overhead?
Why is kernel support essential for practical quantized serving?
Why might a quantized model still fail to speed up end to end?

Interview Q&A¶

Q: Why quantize weights instead of immediately quantizing every tensor in the serving stack?

A: Because weights dominate the fixed memory footprint and are the most powerful first lever. More aggressive whole-stack quantization can add risk without proportional gain.

Common wrong answer to avoid: "Everything should always be quantized as much as possible." The best precision split is workload-dependent.

Q: Why can a 4-bit model be cheaper to host but not always obviously faster?

A: Because speed depends on kernel quality, dequant overhead, cache behavior, and scheduler interactions. Smaller storage alone does not guarantee faster serving.

Common wrong answer to avoid: "Smaller bytes always means lower latency." Runtime details matter.

Q: Why compare GPTQ or AWQ with real task accuracy, not only perplexity summaries?

A: Because serving decisions live in product behavior: coding quality, retrieval grounding, JSON reliability, and domain-specific robustness. Aggregate metrics can hide deployment pain.

Common wrong answer to avoid: "If perplexity moved little, the product is unchanged." Product tasks may be more sensitive.
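One product-level check from that answer, JSON reliability, is easy to sketch. The function name and both output lists are hypothetical. The strings are hand-written placeholders, not real model outputs, and in practice you would collect responses from the fp16 model and the quantized model on the same prompts.

```python
import json


def json_validity_rate(outputs):
    """Fraction of response strings that parse as valid JSON."""
    if not outputs:
        raise ValueError("need at least one output")
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
        except json.JSONDecodeError:
            continue  # malformed response, not counted
        valid += 1
    return valid / len(outputs)


# Placeholder strings standing in for responses to the same four prompts.
reference_outputs = ['{"a": 1}', '{"b": [1, 2]}', '{"c": "x"}', '{"d": null}']
candidate_outputs = ['{"a": 1}', '{"b": [1, 2}', '{"c": "x"}', '{"d": null']

print("reference validity:", json_validity_rate(reference_outputs))
print("candidate validity:", json_validity_rate(candidate_outputs))
```

A gap like this would not show up in a perplexity summary, which is the point of testing the task itself.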

Q: Why is quantization often a deployment unlock rather than only a micro-optimization?

A: Because it can turn an impossible fit problem into a solvable one. Fitting the model on one cheaper device changes the whole serving architecture.

Common wrong answer to avoid: "Quantization just shaves a few percent." Sometimes it changes the entire topology choice.

Apply now (5 min)¶

Take parameter counts for 3B, 7B, and 13B models. Compute rough fp16 size and rough 4-bit size. Add a small metadata overhead estimate. Then ask which ones fit on your target GPU. Sketch from memory:

the shelf comparison,
the 14 GB versus 4.2 GB example,
and the pros-versus-risks list.

Bridge. Backend throughput means little if the user sees nothing for seconds. Next we study streaming token delivery, where the plating line sends confirmed output early over SSE or WebSocket. → 11-streaming-token-delivery.md