05. Pre-norm vs post-norm: where the inspector stands

7 minutes. Same station. Move the inspector one step, and deep training changes.

Built on the ELI5 in 00-eli5.md. The quality inspector — layer normalization checking scale — now changes how clean the shortcut pipe stays.

Mental model first

One station has two heavy jobs. First the social bench. Then the private bench. The only question here is location. Where does the quality inspector stand? In pre-norm, the inspector stands before each heavy job. The branch gets normalized. The shortcut pipe carries the old packet unchanged. Then the add happens. In post-norm, the heavy job runs first. Then the old packet and new edit are added. Then the combined packet is normalized. So the next layer never sees a perfectly untouched identity path. See the two layouts.

Pre-norm
x ------------------------------+
|                               |
|--> LN --> Attention ----------+--> u
u ------------------------------+
|                               |
|--> LN --> FFN ----------------+--> y

Post-norm
x ------------------------------+
|                               |
|--> Attention -----------------+--> add --> LN --> u
u ------------------------------+
|                               |
|--> FFN -----------------------+--> add --> LN --> y

Now say it in ELI5 language. Pre-norm says, "Inspect the branch before work." Post-norm says, "Inspect the combined stream after work." That tiny move changes optimization. Why? Because the shortcut pipe is either clean or touched. In pre-norm, the pipe carries raw x to the add. In post-norm, the pipe value gets mixed and then normalized. So the direct copy path is less direct. This is why pre-norm became the practical default for deep LLMs. Depth punishes dirty identity paths. A clean pipe buys stability.

Formula view

Write one block with two sublayers.

Pre-norm
u = x + Attention(LN(x))
y = u + FFN(LN(u))

Post-norm
u = LN(x + Attention(x))
y = LN(u + FFN(u))
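
As a minimal sketch of the two formulas above, both layouts can be written in plain Python. The `layer_norm` helper omits the learned gain and bias, and `toy_attention`, `toy_ffn`, and `zero_branch` are hypothetical stand-ins for real sublayers (small fixed maps, not actual attention or feed-forward code). Only the placement of the normalization is the point.

```python
import math

def layer_norm(v, eps=1e-5):
    """Normalize to zero mean and (almost) unit variance; no learned gain or bias."""
    mean = sum(v) / len(v)
    centered = [a - mean for a in v]
    var = sum(c * c for c in centered) / len(v)
    return [c / math.sqrt(var + eps) for c in centered]

def add(a, b):
    """Elementwise residual add."""
    return [p + q for p, q in zip(a, b)]

# Hypothetical stand-ins for the real sublayers, chosen only for illustration.
def toy_attention(v):
    return [0.1 * a for a in v]

def toy_ffn(v):
    return [0.2 * max(a, 0.0) for a in v]

def zero_branch(v):
    return [0.0 for _ in v]

def pre_norm_block(x, attention=toy_attention, ffn=toy_ffn):
    u = add(x, attention(layer_norm(x)))   # u = x + Attention(LN(x))
    return add(u, ffn(layer_norm(u)))      # y = u + FFN(LN(u))

def post_norm_block(x, attention=toy_attention, ffn=toy_ffn):
    u = layer_norm(add(x, attention(x)))   # u = LN(x + Attention(x))
    return layer_norm(add(u, ffn(u)))      # y = LN(u + FFN(u))

x = [4.0, 2.0, 0.0]
# With both branches silenced, pre-norm returns x itself; post-norm still normalizes.
print("pre-norm :", pre_norm_block(x, zero_branch, zero_branch))
print("post-norm:", [round(a, 3) for a in post_norm_block(x, zero_branch, zero_branch)])
```

Silencing the branches is the smallest version of the clean-pipe difference: pre-norm collapses to the identity, while post-norm still rescales the stream.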

Now look only at the copy route. For pre-norm, the block is close to x + small edits. For post-norm, the block is closer to LN(x + small edits). That difference matters in backprop too. For one pre-norm sublayer,

x_(l+1) = x_l + f(LN(x_l))
∂x_(l+1)/∂x_l ≈ I + J_f J_LN

The important symbol is I. There is always a direct gradient route. The model can fall back to near-identity behavior. For one post-norm sublayer,

x_(l+1) = LN(x_l + f(x_l))
∂x_(l+1)/∂x_l ≈ J_LN (I + J_f)

Now the route is multiplied by J_LN. The identity path still exists, but it is no longer untouched. Each layer can shrink or stretch some directions. Stack many layers and the chain gets harder to manage. A useful memory line is this.

pre-norm  : clean pipe + normalized branch
post-norm : normalized combined stream
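
One way to see the untouched I term is a finite-difference probe. The sketch below runs under assumed toy settings: a hypothetical linear branch `toy_branch(v) = 0.05 * v`, LayerNorm without epsilon, gain, or bias, and three dimensions. It nudges every coordinate of x by the same small amount. Plain LayerNorm ignores a uniform shift, so this direction isolates the shortcut pipe.

```python
import math

def layer_norm(v):
    """LayerNorm without epsilon, gain, or bias (toy setting)."""
    mean = sum(v) / len(v)
    centered = [a - mean for a in v]
    std = math.sqrt(sum(c * c for c in centered) / len(v))
    return [c / std for c in centered]

def toy_branch(v):
    # Hypothetical stand-in for attention or the FFN: a small linear edit.
    return [0.05 * a for a in v]

def pre_norm_sublayer(x):
    return [a + b for a, b in zip(x, toy_branch(layer_norm(x)))]

def post_norm_sublayer(x):
    return layer_norm([a + b for a, b in zip(x, toy_branch(x))])

def directional_derivative(fn, x, direction, h=1e-6):
    """Central finite difference of fn at x along the given direction."""
    plus = fn([a + h * d for a, d in zip(x, direction)])
    minus = fn([a - h * d for a, d in zip(x, direction)])
    return [(p - m) / (2 * h) for p, m in zip(plus, minus)]

x = [4.0, 2.0, 0.0]
uniform_shift = [1.0, 1.0, 1.0]

pre_grad = directional_derivative(pre_norm_sublayer, x, uniform_shift)
post_grad = directional_derivative(post_norm_sublayer, x, uniform_shift)
print("pre-norm :", [round(g, 4) for g in pre_grad])
print("post-norm:", [round(g, 4) for g in post_grad])
```

The pre-norm values come out close to 1 and the post-norm values close to 0. This is a deliberately extreme direction; along other directions the post-norm Jacobian rescales the signal rather than erasing it.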

Post-norm does have one real advantage. The block output always passes through normalization. So the outgoing representation is more explicitly standardized. That can help in some settings. But very deep decoder training usually values the cleaner path more.

Worked numerical examples with ASCII diagrams

Start with a toy gradient story. Do not over-read the numbers. They only show the chain. Assume each sublayer contributes a small derivative 0.05. Assume the LayerNorm Jacobian on one direction has size 0.8. For pre-norm, one layer contributes about:

1 + 0.05 = 1.05

Across 12 layers,

1.05^12 ≈ 1.80

For post-norm, one layer contributes about:

0.8 * (1 + 0.05) = 0.84

Across 12 layers,

0.84^12 ≈ 0.12
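
The two products above can be checked in a few lines. This is a sketch of the toy arithmetic only, reusing the assumed per-layer numbers (0.05 for the branch derivative, 0.8 for the LayerNorm factor). It treats each Jacobian as a single scalar, which real networks do not do.

```python
branch_derivative = 0.05   # assumed small derivative contributed by each sublayer
ln_factor = 0.8            # assumed LayerNorm Jacobian size along one direction
num_layers = 12

pre_norm_per_layer = 1 + branch_derivative                 # identity route plus branch term
post_norm_per_layer = ln_factor * (1 + branch_derivative)  # whole route scaled by LayerNorm

pre_norm_chain = pre_norm_per_layer ** num_layers
post_norm_chain = post_norm_per_layer ** num_layers

print(f"pre-norm : {pre_norm_per_layer:.2f}^{num_layers} = {pre_norm_chain:.2f}")
print(f"post-norm: {post_norm_per_layer:.2f}^{num_layers} = {post_norm_chain:.2f}")
# pre-norm : 1.05^12 = 1.80
# post-norm: 0.84^12 = 0.12
```

Changing `num_layers` shows how quickly the gap widens with depth under these assumed factors.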

Same edit size. Different inspector placement. That is the clean-path story in one glance. Now do a forward-pass toy example. Take the incoming residual stream:

x = [4.0, 2.0, 0.0]

In pre-norm, inspect first.

mean = 2.0
centered = [2.0, 0.0, -2.0]
std ≈ 1.633
LN(x) ≈ [1.225, 0.000, -1.225]

Suppose the social bench writes:

Attention(LN(x)) = [0.2, -0.1, 0.0]

Then add through the shortcut pipe.

u = x + edit = [4.2, 1.9, 0.0]

Inspect u again.

mean ≈ 2.033
centered ≈ [2.167, -0.133, -2.033]
std ≈ 1.717
LN(u) ≈ [1.262, -0.078, -1.184]

Let the private bench write:

FFN(LN(u)) = [0.1, 0.2, -0.1]

Final add:

y = [4.3, 2.1, -0.1]

ASCII picture:

raw x -----------+
                 +----> [4.2, 1.9, 0.0] -----------+
LN -> Attn edit -+                                 +----> [4.3, 2.1, -0.1]
                                                   |
                                     LN -> FFN edit-+
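
The pre-norm walk-through can be replayed in code. This sketch assumes LayerNorm with population variance and no epsilon, gain, or bias. It hard-codes the two branch outputs as the fixed edits from the example rather than computing real attention or FFN results.

```python
import math

def layer_norm(v):
    """LayerNorm with population variance; no epsilon, gain, or bias (toy setting)."""
    mean = sum(v) / len(v)
    centered = [a - mean for a in v]
    std = math.sqrt(sum(c * c for c in centered) / len(v))
    return [c / std for c in centered]

x = [4.0, 2.0, 0.0]
attention_edit = [0.2, -0.1, 0.0]   # fixed edit from the example, not a real attention output
ffn_edit = [0.1, 0.2, -0.1]         # fixed edit from the example, not a real FFN output

ln_x = layer_norm(x)                              # what the attention branch reads
u = [a + e for a, e in zip(x, attention_edit)]    # shortcut pipe carries raw x
ln_u = layer_norm(u)                              # what the FFN branch reads
y = [a + e for a, e in zip(u, ffn_edit)]          # shortcut pipe carries raw u

for name, vec in [("LN(x)", ln_x), ("u", u), ("LN(u)", ln_u), ("y", y)]:
    print(f"{name:6s}", [round(a, 3) for a in vec])
```

The normalized vectors are only inputs to the branches. Neither is ever added to the stream, so y stays on the same scale as x.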

Now compare with post-norm from the same input. Suppose attention on raw x is sharper.

Attention(x) = [1.8, -1.0, 0.6]

Add first.

s = x + edit = [5.8, 1.0, 0.6]

Now normalize the combined result.

mean ≈ 2.467
centered ≈ [3.333, -1.467, -1.867]
std ≈ 2.363
LN(s) ≈ [1.411, -0.621, -0.790]

Suppose the private bench writes:

FFN(LN(s)) = [0.7, -0.5, 0.1]

Add and normalize again.

t = LN(s) + edit ≈ [2.111, -1.121, -0.690]
mean ≈ 0.100
std ≈ 1.433
y = LN(t) ≈ [1.404, -0.852, -0.551]

ASCII picture:

raw x ----+
          +--> add --> LN --> intermediate ----+
Attn edit-+                                    +--> add --> LN --> y
                                               |
                                     FFN edit--+
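
The post-norm walk-through can be replayed the same way. As before, this is a sketch under assumptions: LayerNorm uses population variance with no epsilon, gain, or bias, and both branch outputs are the fixed edits from the example, not real sublayer computations. The `mean_and_std` helper is added here only to inspect the result.

```python
import math

def mean_and_std(v):
    """Population mean and standard deviation of a vector."""
    mean = sum(v) / len(v)
    std = math.sqrt(sum((a - mean) ** 2 for a in v) / len(v))
    return mean, std

def layer_norm(v):
    """LayerNorm with population variance; no epsilon, gain, or bias (toy setting)."""
    mean, std = mean_and_std(v)
    return [(a - mean) / std for a in v]

x = [4.0, 2.0, 0.0]
attention_edit = [1.8, -1.0, 0.6]   # fixed edit from the example, not a real attention output
ffn_edit = [0.7, -0.5, 0.1]         # fixed edit from the example, not a real FFN output

s = [a + e for a, e in zip(x, attention_edit)]   # add first
u = layer_norm(s)                                # then normalize the combined stream
t = [a + e for a, e in zip(u, ffn_edit)]         # add the FFN edit to the normalized stream
y = layer_norm(t)                                # normalize again

for name, vec in [("s", s), ("LN(s)", u), ("t", t), ("y", y)]:
    print(f"{name:6s}", [round(a, 3) for a in vec])
print("mean, std of y:", tuple(round(a, 3) for a in mean_and_std(y)))
```

Here the original x never reaches the output unscaled. Everything that leaves the block has passed through normalization twice, and y ends with zero mean and unit spread.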

What changed? In pre-norm, the branch was normalized. In post-norm, the combined stream was normalized. So the direct copy path was cleaner in pre-norm.

Stability cheat sheet

Use this table when the interview gets fast.

| Question | Pre-norm | Post-norm |
|---|---|---|
| LN location | Before each sublayer | After each residual add |
| Identity path | Cleaner | Modified by LN |
| Deep training | Usually easier | Often harder |
| Final block output normalized? | Not guaranteed | Yes |
| Common in modern LLMs? | Very common | Less common |

Why do GPT, LLaMA, Mistral, Gemma, and many Claude-like stacks prefer pre-norm? Because very deep decoder stacks need reliable optimization. Because the residual stream needs a stable highway. Because the shortcut pipe should stay as untouched as possible.

When might someone still like post-norm? When they care about the block output always being normalized. When the stack is not extremely deep. When a specific training setup is tuned for it.

Where this lives in the wild

GPT-2 and later GPT-style models, the family behind ChatGPT, use pre-norm blocks because stable deep training matters more than post-norm neatness.
Meta LLaMA models use RMSNorm-first blocks, which keep the same pre-norm spirit with a slightly different inspector.
Mistral and Mixtral normalize before heavy work, then add edits back into the residual stream.
Google Gemma follows the same normalize-first habit because scaling depth is easier with cleaner residual paths.
Many open Hugging Face decoder checkpoints also choose pre-norm or RMSNorm-first layouts for the same reason.

Interview Q&A

Q: Why do modern LLMs usually prefer pre-norm?
A: Because the shortcut pipe keeps a cleaner identity route, so gradients survive deep stacks more reliably. Common wrong answer to avoid: "Because pre-norm is newer." That is not the reason. The real reason is optimization stability.

Q: What is the main mathematical difference in the gradient path?
A: Pre-norm keeps an I + ... route, while post-norm multiplies the route by the LayerNorm Jacobian at every layer. Common wrong answer to avoid: "Post-norm has no residual connection." Wrong. It still has residual addition. The issue is that the identity route is normalized afterward.

Q: Does post-norm have any advantage?
A: Yes. The block output always passes through normalization, which can help keep representations well-shaped in some settings.

Q: If a 96-layer decoder diverges, what is one architecture question to ask first?
A: Ask whether the stack is post-norm and whether moving the inspector before each sublayer would stabilize training.

Apply now (5 min)

Draw both layouts from memory. Label the quality inspector and the shortcut pipe. Then write these two lines without looking:

pre : x + f(LN(x))
post: LN(x + f(x))

Now answer in one sentence. Why is the shortcut pipe cleaner in pre-norm? Last step. Sketch from memory the two ASCII stations and the gradient comparison 1.05^12 versus 0.84^12.

Bridge. You now know where the inspector stands. Next, see the whole station at once: 06-transformer-block.md.