08. The feed-forward network: the private bench

8 minutes. Attention mixes tokens. The private bench transforms one token quietly.

Built on the ELI5 in 00-eli5.md. The private bench — the per-token feed-forward network — takes the inspected residual stream and writes a local edit without consulting other tokens.

Mental model first

Attention alone is not enough. Attention is great at mixing information across positions. But after consultation, each token still needs private processing. That is the job of the FFN.

The private bench does not look sideways. No consultation. No token-to-token mixing. Pure local computation. Each token goes through the same little neural network. Same weights. Different activations, because each token carries different content.

So the social bench answers, "Whom should I listen to?" The private bench answers, "Now that I heard them, how should I rewrite myself?" Picture the shape.

one token vector h
      |
      v
expand:  d_model --> d_ff
      |
      v
activation
      |
      v
shrink:  d_ff --> d_model
      |
      v
ffn edit
      |
      v
residual add

The middle width is larger. So the token gets a wider scratchpad for nonlinear transformation. Then it is compressed back to the residual-stream width. That is why the private bench is powerful. It can invent richer feature combinations inside one token.
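A minimal sketch of the "same weights, no sideways look" idea, in plain Python. The toy weights and the helper names (`matvec`, `ffn`, `ffn_over_sequence`) are illustrative assumptions, not taken from any library, and biases are left out for brevity. It tries to show one thing: changing a single token in the sequence changes only that token's FFN output.

```python
def relu(vec):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in vec]


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def ffn(h, w1, w2):
    """Expand, apply ReLU, shrink. Biases omitted in this sketch."""
    return matvec(w2, relu(matvec(w1, h)))


# Hypothetical toy weights: d_model = 2, d_ff = 4.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]   # shape 4 x 2
W2 = [[0.5, 0.0, 0.5, 0.2], [0.0, 0.5, -0.5, 0.2]]       # shape 2 x 4


def ffn_over_sequence(tokens):
    """Apply the same FFN to every token independently."""
    return [ffn(h, W1, W2) for h in tokens]


seq_a = [[1.0, 2.0], [0.0, 1.0], [3.0, -1.0]]
seq_b = [[1.0, 2.0], [5.0, 5.0], [3.0, -1.0]]  # only the middle token differs

out_a = ffn_over_sequence(seq_a)
out_b = ffn_over_sequence(seq_b)

# Tokens 0 and 2 are untouched by the change to token 1.
print(out_a[0] == out_b[0], out_a[2] == out_b[2], out_a[1] == out_b[1])
```

Attention would not pass this check: there, editing one token can change the outputs at other positions.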

Formula view

A standard FFN inside the block looks like this.

FFN(h) = W_2 φ(W_1 h + b_1) + b_2

If you think in shapes, it is:

d_model -> d_ff -> d_model
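The formula and its shape flow can be sketched in plain Python as follows. The function names (`linear`, `ffn`) and the tiny weights are assumptions made for illustration. ReLU is used as the default φ only for concreteness, and any elementwise nonlinearity could be passed in instead.

```python
def relu(z):
    """Elementwise max(0, x)."""
    return [max(0.0, x) for x in z]


def linear(weight, bias, x):
    """Return weight @ x + bias for a row-major weight (out_dim rows, in_dim columns)."""
    assert all(len(row) == len(x) for row in weight), "in_dim mismatch"
    assert len(weight) == len(bias), "out_dim mismatch"
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def ffn(h, w1, b1, w2, b2, phi=relu):
    """FFN(h) = W_2 phi(W_1 h + b_1) + b_2."""
    z = linear(w1, b1, h)      # d_model -> d_ff
    a = phi(z)                 # nonlinearity, shape unchanged
    return linear(w2, b2, a)   # d_ff -> d_model


# Hypothetical toy sizes: d_model = 2, d_ff = 3.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
b1 = [0.0, 0.0, 1.0]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
b2 = [0.5, -0.5]

out = ffn([2.0, 1.0], W1, b1, W2, b2)
print(out)  # [4.5, 0.5]
```

The shape assertions inside `linear` are the code version of `d_model -> d_ff -> d_model`: the output of the first layer must match the input width of the second.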

In the original transformer, the activation was ReLU.

φ(z) = max(0, z)

In GPT-2 and many later models, GELU became common. GELU is smoother. It does not hard-cut negatives the way ReLU does. In LLaMA and many modern models, SwiGLU-style FFNs became popular. A common shorthand is:

SwiGLU(h) = W_down( SiLU(W_gate h) ⊙ W_up h )

Do not panic at the extra symbols. The big idea is unchanged. Expand. Apply a nonlinear gate. Shrink back. For the standard two-matrix FFN, the typical width choice is:

d_ff = 4 * d_model

That one ratio already explains a lot. The private bench is wide. So it carries many parameters.
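The three activations and the SwiGLU shorthand above can be sketched with only the standard library. This is a hedged illustration: the weight values and function names are made up, biases are omitted, and `gelu` uses the exact erf form (some implementations use a tanh approximation instead).

```python
import math


def relu(z):
    """Hard cut: negatives become zero."""
    return max(0.0, z)


def gelu(z):
    """Exact GELU: z times the standard normal CDF at z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def silu(z):
    """SiLU (also called swish): z times sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vec):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def swiglu_ffn(h, w_gate, w_up, w_down):
    """W_down( SiLU(W_gate h) * W_up h ), with * meaning elementwise product."""
    gate = [silu(g) for g in matvec(w_gate, h)]   # d_model -> d_ff, then SiLU
    up = matvec(w_up, h)                          # d_model -> d_ff
    hidden = [g * u for g, u in zip(gate, up)]    # gated hidden vector
    return matvec(w_down, hidden)                 # d_ff -> d_model


# Hypothetical toy weights: d_model = 2, d_ff = 3.
W_GATE = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_UP = [[1.0, 1.0], [1.0, -1.0], [0.0, 1.0]]
W_DOWN = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

# ReLU zeroes a negative input; GELU and SiLU let a small negative value through.
print(relu(-1.0), gelu(-1.0), silu(-1.0))
print(swiglu_ffn([1.0, 2.0], W_GATE, W_UP, W_DOWN))
```

Note that the SwiGLU variant carries three matrices (gate, up, down) where the standard FFN carries two, yet the path is still expand, gate, shrink.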

Worked numerical examples with ASCII diagrams

Use an easy arithmetic example. Take one inspected token vector:

h = [1, 2]

Let the first linear layer expand from 2 to 4 dimensions. To keep the arithmetic short, set both biases b_1 and b_2 to zero. Use this toy matrix W_1.

W_1 rows
[1, 0]
[0, 1]
[1, 1]
[2, -1]

Compute the expanded pre-activation.

z_1 = W_1 h
= [1*1 + 0*2, 0*1 + 1*2, 1*1 + 1*2, 2*1 + (-1)*2]
= [1, 2, 3, 0]

For arithmetic clarity, use ReLU in this example. Apply activation.

a = ReLU(z_1) = [1, 2, 3, 0]

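Now project back to 2 dimensions. Use this toy W_2, listed with one row per hidden unit: row k is the vector that hidden unit k writes back. (In the formula W_2 φ(...), these rows are the columns of W_2.)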

W_2 rows
[0.5, 0.0]
[0.0, 0.5]
[0.5, -0.5]
[0.2, 0.2]

Combine the active hidden units.

FFN(h)
= 1[0.5,0.0] + 2[0.0,0.5] + 3[0.5,-0.5] + 0[0.2,0.2]
= [0.5,0.0] + [0.0,1.0] + [1.5,-1.5] + [0.0,0.0]
= [2.0, -0.5]

So the private bench writes the edit:

ffn edit = [2.0, -0.5]

If the incoming residual stream before the add was:

u = [3.0, 1.0]

then after the shortcut add:

y = u + ffn edit
y = [3.0, 1.0] + [2.0, -0.5]
y = [5.0, 0.5]

ASCII picture:

h = [1, 2]
   |
   +--> W_1 --> [1, 2, 3, 0]
                  |
                  v
               ReLU --> [1, 2, 3, 0]
                  |
                  v
                 W_2 --> [2.0, -0.5]
                                 |
incoming residual u -------------+
                                 |
                                 v
                        updated stream [5.0, 0.5]

See the character of the computation. No other token was consulted. No attention weights were formed. One token came in. One token-local edit came out.
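The worked example can be checked line by line with a short script. This is a sketch in plain Python; the variable names (`WRITE_VECTORS`, `edit`) are illustrative, and the biases are taken as zero, as the arithmetic above implies.

```python
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]   # one row per hidden unit
WRITE_VECTORS = [[0.5, 0.0], [0.0, 0.5], [0.5, -0.5], [0.2, 0.2]]  # the "W_2 rows"

h = [1.0, 2.0]   # inspected token vector
u = [3.0, 1.0]   # incoming residual stream

# Expand: each hidden unit takes a dot product with h.
z1 = [sum(w * x for w, x in zip(row, h)) for row in W1]

# Activation: ReLU.
a = [max(0.0, z) for z in z1]

# Shrink: weighted sum of the write vectors, one weight per hidden unit.
edit = [sum(a_k * vec[j] for a_k, vec in zip(a, WRITE_VECTORS)) for j in range(2)]

# Residual add.
y = [u_j + e_j for u_j, e_j in zip(u, edit)]

print(z1)    # [1.0, 2.0, 3.0, 0.0]
print(a)     # [1.0, 2.0, 3.0, 0.0]
print(edit)  # [2.0, -0.5]
print(y)     # [5.0, 0.5]
```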

FFN as key-value memory

Now look at the same example another way. Each hidden unit asks, "Did I detect my favorite pattern?" If yes, it writes something back. So people often describe the FFN as a key-value memory. The hidden detector is the key side. The write-back vector is the value side. In this picture, the third hidden unit is interesting. Its W_1 row is:

[1, 1]

On input h = [1, 2], it fires strongly.

1*1 + 1*2 = 3

That activation then writes using the third W_2 row.

3 * [0.5, -0.5] = [1.5, -1.5]

So a rough story is: "If I see a both-features-present pattern, write this edit back." That is why FFNs can behave like stored feature memories. Many detectors. Many possible write vectors. All local to one token.
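The key-value reading can be made concrete by splitting the FFN output into one contribution per hidden unit. This is only a sketch of that interpretation: the names `KEYS`, `VALUES`, and `unit_contributions` are assumptions for illustration, with `KEYS` holding the W_1 rows and `VALUES` holding the matching W_2 write vectors from the worked example.

```python
KEYS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]       # detectors (W_1 rows)
VALUES = [[0.5, 0.0], [0.0, 0.5], [0.5, -0.5], [0.2, 0.2]]     # write-back vectors


def unit_contributions(h):
    """Return (activation, written vector) for each hidden unit."""
    result = []
    for key, value in zip(KEYS, VALUES):
        match = sum(k * x for k, x in zip(key, h))   # how well h matches the key
        activation = max(0.0, match)                 # ReLU: fire only on a positive match
        result.append((activation, [activation * v for v in value]))
    return result


contribs = unit_contributions([1.0, 2.0])
for index, (activation, written) in enumerate(contribs, start=1):
    print(f"unit {index}: activation={activation}, writes {written}")

# The full FFN edit is just the sum of the per-unit writes.
total = [sum(written[j] for _, written in contribs) for j in range(2)]
print(total)  # [2.0, -0.5]

# The unit with the largest activation dominates the edit.
strongest = max(range(len(contribs)), key=lambda i: contribs[i][0]) + 1
print(strongest)  # 3
```

Unit 4 matches nothing on this input (activation 0), so its value vector is never written. That is the "memory lookup" flavor of the computation.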

Why FFNs hold so many parameters

Compare rough parameter counts in one block, ignoring biases and normalization parameters. Attention projections are about:

Q, K, V, O  -> 4 * d_model^2

A standard FFN with d_ff = 4 * d_model is about:

2 * d_model * d_ff
= 2 * d_model * (4 * d_model)
= 8 * d_model^2

Now compare shares.

FFN share ≈ 8 / (8 + 4) = 2/3

So the private bench often holds about two-thirds of block parameters. That surprises many beginners. But it makes sense. The FFN is very wide. Width costs parameters.
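The accounting above is easy to reproduce. The sketch below uses the same rough counts (weight matrices only, no biases or normalization parameters). The function names are hypothetical, and d_model = 768 is just a sample width.

```python
def attention_params(d_model):
    """Q, K, V, O projections: four d_model x d_model matrices."""
    return 4 * d_model * d_model


def ffn_params(d_model, d_ff):
    """Standard FFN: one d_model x d_ff matrix and one d_ff x d_model matrix."""
    return 2 * d_model * d_ff


def ffn_share(d_model, ratio=4):
    """Fraction of block weights that sit in the FFN when d_ff = ratio * d_model."""
    ffn = ffn_params(d_model, ratio * d_model)
    attn = attention_params(d_model)
    return ffn / (ffn + attn)


d_model = 768
print(attention_params(d_model))         # 2359296
print(ffn_params(d_model, 4 * d_model))  # 4718592
print(round(ffn_share(d_model), 4))      # 0.6667
```

With this rough accounting the share depends only on the ratio d_ff / d_model, not on d_model itself, which is why the two-thirds figure holds across model sizes that keep the 4x width.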

Where this lives in the wild

Meta LLaMA models use SwiGLU-style private benches, which carry three weight matrices (gate, up, down) instead of two, so the FFN remains the parameter-heavy part of each block.
OpenAI GPT-2 and later GPT-style models rely on GELU-based FFNs to transform each token after attention mixing.
Mistral and Mixtral keep strong per-token FFN components, even when expert routing changes which private bench is active.
Google Gemma uses the same expand-activate-shrink idea inside each block because attention alone cannot do all token-local transformation.
BERT-style encoder models used in products like Google Search ranking and document understanding also include FFNs in every transformer block, not just attention.

Interview Q&A

Q: Why is attention alone not enough?
A: Because attention mixes information across tokens, but it does not provide the same rich nonlinear per-token transformation as the FFN. Common wrong answer to avoid: "The FFN is just a tiny classifier head." No. It is part of every block. It rewrites the token representation at every layer.

Q: What is the usual FFN shape in a transformer block?
A: d_model -> d_ff -> d_model, often with d_ff = 4 * d_model.

Q: Why do FFNs often dominate parameter count?
A: Because the hidden width is large, so the two FFN matrices together are roughly 8 * d_model^2, which is often about two-thirds of the block. Common wrong answer to avoid: "Most parameters are in attention because attention is the main idea." Nice story. Wrong parameter accounting. The private bench is usually heavier.

Q: What changes when a model uses GELU or SwiGLU instead of ReLU?
A: The activation and gating behavior change, but the big shape stays the same: expand, apply nonlinearity, shrink, add.

Apply now (5 min)

Close the file. Write this formula from memory.

FFN(h) = W_2 φ(W_1 h + b_1) + b_2

Now answer three quick questions. What mixes tokens? What transforms one token locally? Why is d_ff usually bigger than d_model? Last step. Sketch from memory the path h = [1,2] -> [1,2,3,0] -> [1,2,3,0] -> [2.0,-0.5] and then the residual add to [5.0, 0.5].

Bridge. You now know the private bench. Next, zoom out to the larger layouts built from these stations: 09-encoder-decoder-patterns.md.