07. AWQ: protect what traffic actually uses

~11 min read. The thing that says: a small weight can matter more than a large one.

Builds on the ELI5 in 00-eli5.md. The field notes (compressed notes from the blueprint) get simplified in AWQ only after asking which parts of the structure real activations visit most.

1) Why weight size alone misleads you

See. Many people inspect weights and say, "This one is tiny. Ignore it." That sounds reasonable. It is often wrong. A weight matters only when it meets an activation. So importance is not weight size alone. It is about contribution to the output. That contribution depends on both pieces. Weight. Activation.

Here is the toy example you should remember.

0.20 × 20 = 4.0
1.20 × 0.5 = 0.6

Look carefully. The smaller weight created the bigger effect. Why? Because its channel carried much larger activation traffic.

So AWQ asks a very practical question. Before simplifying the field notes, which channels does real traffic lean on? That is the right instinct. Not "Which number looks big inside the tensor?" But "Which path does the model actually use under representative inputs?"
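The two rankings can be put side by side in a few lines. This is a minimal sketch using only the toy numbers from the text; the names `channels` and `contribution` are made up for illustration.

```python
# Toy channels from the text: (name, weight, representative activation).
channels = [("A", 0.20, 20.0), ("B", 1.20, 0.5)]

def contribution(weight, activation):
    """Output contribution of one channel: weight times activation."""
    return weight * activation

# Rank by raw weight size (the misleading view).
by_weight = sorted(channels, key=lambda c: abs(c[1]), reverse=True)

# Rank by contribution magnitude (the activation-aware view).
by_contribution = sorted(
    channels, key=lambda c: abs(contribution(c[1], c[2])), reverse=True
)

print([name for name, _, _ in by_weight])        # ['B', 'A']
print([name for name, _, _ in by_contribution])  # ['A', 'B']
```

Same two channels. Opposite order. The only thing that changed is whether activations were allowed into the score.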

2) The mental model: protect salient channels

Imagine a building blueprint. One hallway line on paper is thin. Another wall line is thick. Would you simplify only by line thickness? No. You first ask which hallway gets the most people. That is AWQ. It uses representative activations. Then it identifies salient channels. Those are the channels whose weight-activation interaction matters a lot. Protected channels get gentler quantization treatment. Less protected channels absorb more distortion. See the picture.

input traffic
     │
     ▼
┌──────────────┐
│ activations  │
│ tell us which│
│ channels are │
│ busy         │
└──────┬───────┘
       ▼
┌──────────────┐
│ protect these│
│ salient paths│
└──────┬───────┘
       ▼
┌──────────────┐
│ quantize the │
│ calmer paths │
│ more harshly │
└──────────────┘

Simple, no? AWQ is activation-aware weight quantization. The phrase is literal. Activations decide what deserves protection. That is why the field notes become traffic-aware.
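The first box in the picture, "activations tell us which channels are busy", could be sketched like this. Treat it as an illustration, not the reference implementation: using the mean absolute activation per channel as the salience score is an assumption made here, and `channel_salience`, `top_salient`, and the calibration numbers are all hypothetical.

```python
def channel_salience(calibration_samples):
    """Mean absolute activation per channel across calibration samples.

    Assumption for this sketch: "busy" means large average magnitude.
    """
    n_channels = len(calibration_samples[0])
    totals = [0.0] * n_channels
    for sample in calibration_samples:
        for i, activation in enumerate(sample):
            totals[i] += abs(activation)
    return [total / len(calibration_samples) for total in totals]

def top_salient(salience, k):
    """Indices of the k busiest channels, busiest first."""
    ranked = sorted(range(len(salience)), key=lambda i: salience[i], reverse=True)
    return ranked[:k]

# Hypothetical calibration activations: 3 samples, 4 channels.
calibration = [
    [18.0, 0.4, -1.0, 0.2],
    [-22.0, 0.6, 1.5, -0.1],
    [20.0, -0.5, 0.5, 0.3],
]

salience = channel_salience(calibration)
print(top_salient(salience, k=1))  # [0]
print(top_salient(salience, k=2))  # [0, 2]
```

Note the absolute value. A channel that swings between -22 and +20 is busy, even though its plain average is close to zero.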

3) One worked numerical example

Take two channels. Channel A weight is 0.20. Channel B weight is 1.20. Now use representative activations. Channel A activation is 20. Channel B activation is 0.5. Their contributions are:

A: 0.20 × 20 = 4.0
B: 1.20 × 0.5 = 0.6

Suppose 4-bit rounding changes A from 0.20 to 0.10. Contribution becomes 0.10 × 20 = 2.0. Error in contribution is 2.0.

Now suppose the same absolute rounding change hits B. B goes from 1.20 to 1.10. Contribution becomes 1.10 × 0.5 = 0.55. Error in contribution is only 0.05.

So which channel should we protect more? A. Even though its raw weight is smaller. This is the heart of AWQ. It is not impressed by weight magnitude alone. It watches where activations amplify mistakes. That is how it keeps the rounding error away from busy channels. Same absolute rounding. Very different damage. So the channel score must include activation behavior.

Look. A quiet large weight is like a wide road at midnight. A busy small weight is like a narrow lane at office closing time. Which one hurts if blocked? Usually the busy lane. AWQ thinks like traffic control. Not like a ruler measuring line thickness.
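The same arithmetic, as a small check you can run. It is only a sketch of the worked example above; `output_damage` is a hypothetical helper name, and the rounded values 0.10 and 1.10 are the ones the text supposes, not the output of a real 4-bit quantizer.

```python
def output_damage(w_original, w_rounded, activation):
    """Absolute change in a channel's contribution caused by rounding its weight."""
    return abs(w_original * activation - w_rounded * activation)

# Same absolute rounding step (0.10) applied to both weights.
damage_a = output_damage(0.20, 0.10, 20.0)
damage_b = output_damage(1.20, 1.10, 0.5)

print(f"A: {damage_a:.2f}")  # A: 2.00
print(f"B: {damage_b:.2f}")  # B: 0.05
print("protect first:", "A" if damage_a > damage_b else "B")  # protect first: A
```

The damage is the weight error multiplied by the activation. So with equal weight error, the ratio of damage is just the ratio of activations: 20 / 0.5 = 40.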

4) AWQ versus GPTQ, and when engineers choose it

Now compare the two mindsets. GPTQ says, "Match original outputs on calibration data." AWQ says, "Protect channels that matter under real activations." Both are smart. They are not identical. GPTQ is output-reconstruction focused. AWQ is activation-importance focused.

So what to do in practice? If your deployment stack supports both, benchmark both. Some models respond very well to AWQ. Especially when a few salient channels dominate quality. AWQ is also attractive because the logic is intuitive. Busy channels get care. Quiet channels take more compression pain.

But again, no worship. Representative activations matter. If your calibration traffic is fake, salience estimates become fake too. Then you protect the wrong places in the blueprint. And your supposedly clever field notes disappoint on real prompts. So always evaluate on the tasks you actually serve. Code generation. JSON format. Tool calls. Long prompts. Multilingual inputs. That is the adult workflow.
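Here is a small sketch of what "fake calibration traffic" does to the protected set. Everything in it is hypothetical: the activation numbers, the helper names, and the choice of mean absolute activation as the salience score are assumptions for illustration only.

```python
def mean_abs_per_channel(samples):
    """Average activation magnitude per channel over a set of samples."""
    n_channels = len(samples[0])
    return [
        sum(abs(sample[i]) for sample in samples) / len(samples)
        for i in range(n_channels)
    ]

def protected_channels(samples, k):
    """Set of the k channels that look busiest on the given traffic."""
    scores = mean_abs_per_channel(samples)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

# Hypothetical traffic: calibration leans on channel 0, real prompts lean on channel 2.
synthetic_calibration = [[9.0, 0.5, 0.3, 0.2], [11.0, 0.4, 0.2, 0.3]]
served_traffic = [[0.6, 0.4, 14.0, 0.2], [0.5, 0.3, 16.0, 0.1]]

planned = protected_channels(synthetic_calibration, k=1)
needed = protected_channels(served_traffic, k=1)

print(planned, needed)  # {0} {2}
print("overlap:", len(planned & needed) / len(needed))  # overlap: 0.0
```

Zero overlap means the protection went to a channel real prompts barely use. A check like this does not replace task-level evaluation, but it is a cheap early warning.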

Where this lives in the wild

MIT Han Lab AWQ — reference implementation showing activation-aware channel protection for low-bit serving.
NVIDIA TensorRT-LLM AWQ — deploys AWQ-style 4-bit weights on NVIDIA inference stacks.
Hugging Face Transformers AWQ loading — brings AWQ checkpoints into standard model-loading workflows.
vLLM AWQ support — serves AWQ-quantized models with API-compatible inference endpoints.
Alibaba Qwen AWQ releases — publishes AWQ variants for fitting stronger models into smaller GPU memory.

Pause and recall

Why is a small weight not automatically unimportant?
In the toy example, why did 0.20 × 20 matter more than 1.20 × 0.5?
What does AWQ protect using representative activations?
How is AWQ's question different from GPTQ's question?

Interview Q&A

Q1. Why does AWQ not rank channels by weight magnitude alone?
Because contribution depends on activation size too, so small weights on hot channels can matter more than large weights on quiet channels.
Common wrong answer to avoid: "The largest absolute weights are always the most important ones."

Q2. Why activation salience and not only output reconstruction error?
Because AWQ wants to know which channels repeatedly matter under real inputs before spending the quantization budget.
Common wrong answer to avoid: "AWQ just copies GPTQ and renames the loss."

Q3. Why does AWQ not protect every channel equally?
Because the point of low-bit compression is selective sacrifice; equal protection wastes budget on low-impact paths.
Common wrong answer to avoid: "Uniform treatment is safer because it is fair to all channels."

Q4. Why not pick AWQ over GPTQ for every model by default?
Because support, architecture behavior, and downstream task sensitivity differ, so you must benchmark on representative traffic.
Common wrong answer to avoid: "AWQ is always better because it uses activations."

Apply now (5 min)

Quick exercise. Make three channels with weights 0.3, 1.0, and -0.2. Assign activations 15, 0.8, and 12. Compute all three contributions. Circle the channel you would protect first. Then explain your choice in one line.

Sketch from memory. Draw one box for activations and one box for weights. Join them into channel contributions. Mark one busy channel as protected from the rounding error. Write this sentence under the diagram: "AWQ asks which part of the field notes gets the most traffic."
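Do the exercise by hand first. Then, if you want to check your arithmetic, a sketch like the one below will do it. One assumption is baked in: the channel to protect first is taken to be the one with the largest contribution magnitude, so the sign of the weight is ignored.

```python
# Exercise inputs from the text.
weights = [0.3, 1.0, -0.2]
activations = [15.0, 0.8, 12.0]

# Contribution of each channel: weight times activation.
contributions = [w * a for w, a in zip(weights, activations)]

# Assumption: rank by magnitude, because a negative contribution
# is damaged by rounding just as much as a positive one.
protect_first = max(range(len(contributions)), key=lambda i: abs(contributions[i]))

for i, c in enumerate(contributions):
    print(f"channel {i}: contribution {c:+.2f}")
print("protect first: channel", protect_first)
```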

Bridge. Good. Weight compression is now clearer. But fitting the weights is only half the story. Serving still breaks if runtime memory explodes with context and concurrency. → 08-kv-cache-memory.md