06. GPTQ — preserve the output, not the illusion¶

~12 min read. The thing that says: do not round blindly. Round where the layer can survive.

Built on the ELI5 in 00-eli5.md. The field notes (compressed notes from the blueprint) get written carefully in GPTQ so that layer outputs stay close to the original building plan.

1) Why plain rounding is too careless¶

Look. Naive quantization says, "Take every weight and round it." That sounds efficient. It is also careless. A model does not care about weights in isolation. It cares about outputs after weights meet real activations. So two weights with the same size may have very different impact. One bad rounding choice can bend a sensitive direction. Then the next layer inherits that damage. Then quality drops in a place you did not expect. That is why GPTQ is not just smaller storage. It is smarter writing of the field notes. The target is simple. For calibration inputs X, choose W_hat so W_hat X stays close to W X. Same structure. Smaller memory bill. Less output drift. See the mental picture first.

┌─────────────────┐   calibration X    ┌──────────────┐
│  the blueprint  │ ─────────────────→ │ layer output │
└────────┬────────┘                    └──────┬───────┘
         │                                    │
         │ write smaller notes                │ compare
         ▼                                    ▼
┌─────────────────┐ same calibration X ┌──────────────┐
│ the field notes │ ─────────────────→ │  new output  │
└─────────────────┘                    └──────────────┘
                 minimize the gap ───────────→

So what is GPTQ really asking? Not "Which numbers look pretty after rounding?" It asks, "Which mistakes are tolerated by the layer?" That is the whole mood.
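To make "same-size mistake, different damage" concrete, here is a minimal NumPy sketch. The weight row, the calibration batch `X`, and the helper `output_error` are all made up for illustration; the only point is that an identical 0.1 nudge hurts far more on the coordinate that sees large activations.

```python
import numpy as np

# Hypothetical weight row and calibration batch (rows of X are inputs).
# Feature 0 usually carries large activations, feature 1 small ones.
w = np.array([0.8, -0.4])
X = np.array([[4.0, 0.1],
              [3.0, -0.2],
              [5.0, 0.1]])

def output_error(w_hat):
    """Sum of squared differences between original and perturbed outputs."""
    return float(np.sum((X @ w - X @ w_hat) ** 2))

# Apply the same-size nudge (0.1) to each weight in turn.
err_first = output_error(w + np.array([0.1, 0.0]))
err_second = output_error(w + np.array([0.0, 0.1]))

print(f"nudge weight 0: {err_first:.4f}")
print(f"nudge weight 1: {err_second:.4f}")
```

Both perturbations look identical if you only compare weights. The output error does not, which is why the objective is written in terms of outputs.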

2) The core objective with one small example¶

Take one row of weights. Let the original row be W = [0.80, -0.40]. Suppose 2-bit buckets force us near multiples of 0.5. A blunt round gives W_hat = [1.0, -0.5]. Now use two calibration inputs.

x1 = [2, 1]
x2 = [1, 3]

Original outputs:

W x1 = 0.80*2 + (-0.40)*1 = 1.20
W x2 = 0.80*1 + (-0.40)*3 = -0.40

Naively rounded outputs:

W_hat x1 = 1.0*2 + (-0.5)*1 = 1.50
W_hat x2 = 1.0*1 + (-0.5)*3 = -0.50

Errors are 0.30 and 0.10 in absolute value. Squared error sum is 0.09 + 0.01 = 0.10.

Now try a different quantized choice. Use W_hat = [0.5, -0.5]. Then:

W_hat x1 = 0.5*2 + (-0.5)*1 = 0.50
W_hat x2 = 0.5*1 + (-0.5)*3 = -1.00

Errors become 0.70 and 0.60. Squared error sum is 0.49 + 0.36 = 0.85. Much worse. So GPTQ would prefer the first option. Yes, both are low-bit. But one keeps outputs much closer on the seen inputs. That is the real objective:

minimize ||W X - W_hat X||

Simple, no? The damage is measured after weights touch data. Not before. That is why the rounding error is judged by behavior. Not by pretty-looking buckets.
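The arithmetic above is easy to check by machine. This is a small NumPy sketch that reuses the numbers from the example; the helper name `squared_output_error` is just an illustrative choice, and inputs are stored as rows of `X`.

```python
import numpy as np

W = np.array([0.80, -0.40])          # original weight row
X = np.array([[2.0, 1.0],            # x1
              [1.0, 3.0]])           # x2

def squared_output_error(w_hat):
    """Sum over calibration inputs of (W x - W_hat x)^2."""
    diff = X @ W - X @ w_hat
    return float(np.sum(diff ** 2))

err_a = squared_output_error(np.array([1.0, -0.5]))
err_b = squared_output_error(np.array([0.5, -0.5]))

print(f"[1.0, -0.5] -> {err_a:.2f}")
print(f"[0.5, -0.5] -> {err_b:.2f}")
```

The first candidate wins by a wide margin on these two inputs, which is the comparison the objective cares about.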

3) How GPTQ decides where pain should land¶

Now the sharper question. If every weight cannot stay perfect, which one should suffer more? GPTQ uses second-order information. In plain language, it asks which directions are sensitive. Think of curvature. If output changes steeply along one direction, be gentle there. If output changes softly along another direction, spend error there. That is why people say GPTQ is error-aware. It does not treat all coordinates equally. It uses calibration activations to build the Hessian of the layer's output error, which for this squared-error objective is proportional to X X^T. Then it quantizes one column, or one block of columns, at a time and adjusts the weights not yet quantized to compensate. See the flow.

┌──────────────┐
│ collect X    │
│ calibration  │
└──────┬───────┘
       ▼
┌──────────────┐
│ measure which│
│ directions   │
│ are sensitive│
└──────┬───────┘
       ▼
┌──────────────┐
│ quantize one │
│ weight block │
└──────┬───────┘
       ▼
┌──────────────┐
│ push leftover│
│ error into   │
│ safer places │
└──────┬───────┘
       ▼
┌──────────────┐
│ move to next │
│ column block │
└──────────────┘
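For readers who want symbols behind that flow, here is one common way to write it, following the second-order update that GPTQ builds on. The notation is an assumption of this sketch: w is one weight row, the columns of X are calibration inputs, quant(.) rounds to the nearest allowed level, and W_hat in the text appears as \hat{w}.

```latex
% Layer-wise objective for one weight row
\min_{\hat{w}} \; \lVert w X - \hat{w} X \rVert_2^2

% Hessian of that objective with respect to w
H = 2 X X^{\top}

% After fixing coordinate q to its quantized value,
% shift the remaining free weights by
\delta = - \frac{w_q - \operatorname{quant}(w_q)}{[H^{-1}]_{qq}} \, (H^{-1})_{q,:}
```

The division by [H^{-1}]_{qq} scales the correction, and the row (H^{-1})_{q,:} decides which remaining weights absorb it.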

So GPTQ spends quantization pain where the model tolerates it. That sentence matters. A lot. Because that is why the same 4-bit budget can behave very differently. One method scatters damage blindly. GPTQ steers it. The result is better field notes from the same memory budget. And note one practical gift. This is post-training quantization. No retraining loop is required. You take the finished blueprint. You run calibration. You export smaller weights. Done.
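The flow above can be sketched in a few lines of NumPy. Treat this as a simplified, single-row illustration under stated assumptions, not the production algorithm: real implementations process all rows together, work in column blocks, and use a Cholesky factorization for stability. The names `quantize_to_grid` and `gptq_like_row`, the toy grid, the damping value, and the example row [0.70, -0.30] are all hypothetical choices for this sketch.

```python
import numpy as np

def quantize_to_grid(values, step=0.5, lo=-0.5, hi=1.0):
    """Round to the nearest multiple of `step`, clipped to a toy 4-level grid."""
    # Adding 0.0 turns a possible -0.0 into 0.0 for cleaner printing.
    return np.clip(np.round(values / step) * step, lo, hi) + 0.0

def gptq_like_row(w, X, damp=0.01):
    """Quantize one weight row left to right, compensating the remaining weights.

    Rows of X are calibration inputs. `damp` adds a small ridge to the
    Hessian so that it can be inverted (an assumption of this sketch).
    """
    w = w.astype(float).copy()
    d = w.shape[0]
    H = 2.0 * X.T @ X                                  # Hessian of the squared output error
    H += damp * np.mean(np.diag(H)) * np.eye(d)
    Hinv = np.linalg.inv(H)
    w_hat = np.zeros(d)
    for q in range(d):
        w_hat[q] = quantize_to_grid(w[q])
        err = (w[q] - w_hat[q]) / Hinv[q, q]
        w[q:] -= err * Hinv[q, q:]                     # push the leftover error onto later weights
        Hinv -= np.outer(Hinv[:, q], Hinv[q, :]) / Hinv[q, q]  # drop coordinate q from the inverse
    return w_hat

def output_error(w, w_hat, X):
    return float(np.sum((X @ w - X @ w_hat) ** 2))

w = np.array([0.70, -0.30])
X = np.array([[2.0, 1.0], [1.0, 3.0]])

w_rtn = quantize_to_grid(w)          # plain round-to-nearest
w_gptq = gptq_like_row(w, X)         # rounding with compensation

print("round-to-nearest:", w_rtn, output_error(w, w_rtn, X))
print("with compensation:", w_gptq, output_error(w, w_gptq, X))
```

In this toy case, rounding the first weight down nudges the second weight upward before it is rounded, so it lands on a different grid point than blind rounding would pick, and the output error on the calibration inputs is lower. That is one example, not a guarantee for every input.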

4) What GPTQ is good at, and what it is not¶

GPTQ is strong when you want a deployable artifact quickly. You already have the trained model. You do not want another large fine-tune job. You want lower memory and acceptable quality. That is the sweet spot. It is widely supported for that reason. But do not oversell it. Calibration data matters. If your calibration set is unrepresentative, the protected directions may be the wrong ones. Then the model behaves well on sampled traffic and worse on real traffic. Also, GPTQ mainly protects output reconstruction. It is not directly asking which channels are active most often. That question becomes important in the next topic. So when should an engineer reach for GPTQ? Use it when the model is already trained, memory is tight, and you need a robust post-training path. Use it when you want 4-bit deployment without retraining the whole system. Use it when your infra team wants standard tooling. Do not use it as a religion. Measure task quality after quantization. Especially on formatting, code, multilingual edges, and long prompts. Those often show damage first.
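The calibration warning can be made concrete with a toy search. This is not GPTQ itself: it brute-forces every candidate on a hypothetical 2-bit grid and keeps the one with the lowest output error on whatever calibration set it is given. The names `grid`, `best_on`, `X_real`, and `X_skewed` are illustrative assumptions.

```python
import itertools
import numpy as np

w = np.array([0.8, -0.4])
grid = [-0.5, 0.0, 0.5, 1.0]         # hypothetical 2-bit levels

def output_error(w_hat, X):
    """Sum of squared output differences on the inputs in X (one per row)."""
    return float(np.sum((X @ (w - w_hat)) ** 2))

def best_on(X):
    """Exhaustively pick the grid candidate with the lowest error on X."""
    candidates = [np.array(c) for c in itertools.product(grid, repeat=2)]
    return min(candidates, key=lambda c: output_error(c, X))

X_real = np.array([[2.0, 1.0], [1.0, 3.0]])       # stands in for real traffic
X_skewed = np.array([[1.0, -2.0], [2.0, -4.0]])   # unrepresentative calibration set

w_good = best_on(X_real)
w_skewed = best_on(X_skewed)

print("calibrated on real-like data:", w_good, output_error(w_good, X_real))
print("calibrated on skewed data:   ", w_skewed, output_error(w_skewed, X_real))
```

The skewed set only ever probes one direction, so the search protects that direction and gives up accuracy where the real inputs live. That is the failure mode to watch for when choosing calibration data.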

Where this lives in the wild¶

Hugging Face Optimum GPTQ — converts full checkpoints into GPTQ deployment artifacts without retraining.
AutoGPTQ — packages and loads GPTQ models for practical 4-bit local and server inference.
vLLM GPTQ support — serves GPTQ-quantized models behind OpenAI-style APIs with lower weight memory.
Text Generation WebUI GPTQ loaders — popular path for fitting Llama-family models on smaller consumer GPUs.
Alibaba Qwen GPTQ releases — distributes ready-made GPTQ checkpoints for lower-VRAM deployment.

Pause and recall¶

Why does GPTQ optimize ||W X - W_hat X|| instead of only weight difference?
What role do calibration inputs play in choosing quantized weights?
What does second-order sensitivity tell GPTQ in plain language?
Why is GPTQ called post-training quantization?

Interview Q&A¶

Q1. Why GPTQ, not simple round-to-nearest, for a production 4-bit deployment?
Because GPTQ chooses low-bit weights to preserve layer outputs on representative inputs, while naive rounding ignores sensitivity.
Common wrong answer to avoid: "GPTQ is just faster rounding with a better file format."

Q2. Why calibration-based reconstruction, not full retraining after quantization?
Because GPTQ targets a post-training workflow where you keep the trained model and solve output-preservation offline.
Common wrong answer to avoid: "GPTQ works by fine-tuning all weights again for a few epochs."

Q3. Why second-order information, not only weight magnitude?
Because magnitude alone misses which directions are fragile; curvature tells you where a small change causes a big output shift.
Common wrong answer to avoid: "Large weights are always the most important ones."

Q4. Why GPTQ, not AWQ, when both make 4-bit field notes?
Because GPTQ directly minimizes output reconstruction error, while AWQ asks which channels matter most under real activations.
Common wrong answer to avoid: "They are the same method with different names."

Apply now (5 min)¶

Quick exercise. Take W = [1.1, -0.7] and calibration inputs [1, 2] and [3, 1]. Compare two quantized choices: [1.0, -0.5] versus [1.0, -1.0]. Compute both output errors. Pick the better one. Then say in one sentence why GPTQ prefers that choice.

Sketch from memory. Draw the flow from the blueprint to calibration outputs to the field notes. Add one box called "sensitive directions." Add one arrow showing where the rounding error gets pushed into safer places.

Bridge. Good. GPTQ watches output error after quantization. Next we ask a different question: which channels matter most because real activations keep hitting them? → 07-awq.md