02. Activation functions: the bend that saves the rule pile

The fold that turns flat paper into origami. Pick the right one or your gradients die.

Built on 00-eli5.md. The bend is the star here. Without it, the rule pile collapses into a single straight rule.

The picture before the math

See. A neural network layer does two flat things — stretch (multiply by W) and shift (add b). Both keep the data flat. Stack ten such layers. Still flat.

A flat sheet of paper cannot make a cup. To shape anything, you must fold. That fold is the activation function. The bend.

Without bend:                With bend:

flat ──────  flat            flat ─╮       
     ╲      ╱                       ╲      
      stack                          fold     fold
     ╱      ╲                       ╱      ╱   ╲
flat ──────  one               curve     curve  curve
   big flat line                    (any shape)

That is the whole job. Break the linearity. Let curves emerge.

Proof: stacked linear layers collapse to one line

Now what is the problem? Without the bend, depth is fake. Let us prove this with three concrete attempts. Pick any weights you like. Always one line falls out.

Attempt 1: small weights

W₁ = 2, b₁ = 3. W₂ = 4, b₂ = 1. Input x = 5.

Layer 1: h = 2·5 + 3 = 13
Layer 2: y = 4·13 + 1 = 53

Substitute: y = 4(2x + 3) + 1 = 8x + 13. Check at x = 5: 8·5 + 13 = 53. Match. Two layers = one layer with W = 8, b = 13.

Attempt 2: bigger weights, same story

W₁ = -3, b₁ = 0.5. W₂ = 7, b₂ = -2. Input x = 1.

Layer 1: h = -3·1 + 0.5 = -2.5
Layer 2: y = 7·(-2.5) - 2 = -19.5

Substitute: y = 7(-3x + 0.5) - 2 = -21x + 1.5. At x = 1: -21 + 1.5 = -19.5. Match. One line again — W = -21, b = 1.5.

Attempt 3: three layers, one line

W₁ = 2, b₁ = 0. W₂ = -1, b₂ = 1. W₃ = 0.5, b₃ = -3. Input x = 4.

h₁ = 8. h₂ = -1·8 + 1 = -7. y = 0.5·(-7) - 3 = -6.5.

Substitute end-to-end: y = 0.5(-1(2x) + 1) - 3 = -x + 0.5 - 3 = -x - 2.5. At x = 4: -4 - 2.5 = -6.5. Match. Three layers = one line, W = -1, b = -2.5.

The structural reason. Composition of affine maps is affine. W_n W_{n-1} ... W_1 is one matrix. The biases collapse into one bias. The rule pile collapses.
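The three attempts can be replayed in code. A minimal sketch, assuming scalar layers stored as (w, b) pairs; the helper names `collapse` and `forward` are made up for this illustration.

```python
def collapse(layers):
    """Fold a stack of scalar affine layers (w, b) into one equivalent (W, b)."""
    w_total, b_total = 1.0, 0.0              # start from the identity map y = 1*x + 0
    for w, b in layers:                      # first layer first
        # New layer applied on top: w*(w_total*x + b_total) + b
        w_total, b_total = w * w_total, w * b_total + b
    return w_total, b_total


def forward(layers, x):
    """Run x through every layer in order, with no bend in between."""
    for w, b in layers:
        x = w * x + b
    return x


attempts = [
    ([(2, 3), (4, 1)], 5),                   # Attempt 1
    ([(-3, 0.5), (7, -2)], 1),               # Attempt 2
    ([(2, 0), (-1, 1), (0.5, -3)], 4),       # Attempt 3
]

for layers, x in attempts:
    w, b = collapse(layers)
    print(f"stacked = {forward(layers, x)}, collapsed = {w * x + b}, W = {w}, b = {b}")
```

Any depth, any weights: `collapse` always returns a single (W, b) pair that matches the full stack.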

So what to do? We bend.

The four bends: shapes you must own

ReLU: the sharp elbow

ReLU(x) = max(0, x)

   y
   ↑
   │       ╱
   │     ╱
   │   ╱
   │ ╱
───┴──────────────→ x
   0

Negative? Zero. Positive? Pass through.

Worked values: ReLU(-2) = 0. ReLU(-0.01) = 0. ReLU(0) = 0. ReLU(0.3) = 0.3. ReLU(7) = 7.

Derivative: 1 for x > 0, 0 for x < 0. A flat shelf and a 45° ramp.
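The elbow and its slope in a few lines. A minimal sketch: the derivative is undefined at exactly x = 0, and returning 0 there is a convention assumed here, not something the text fixes.

```python
def relu(x):
    """The sharp elbow: zero for negatives, identity for positives."""
    return max(0.0, x)


def relu_grad(x):
    """Slope of ReLU. At x == 0 we return 0.0 by convention (an assumption)."""
    return 1.0 if x > 0 else 0.0


for x in (-2, -0.01, 0, 0.3, 7):
    print(f"x = {x:>5}: relu = {relu(x)}, grad = {relu_grad(x)}")
```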

Sigmoid: the smooth S

sigmoid(x) = 1 / (1 + e^-x)

   y
 1 ┤            _____
   │         ___
   │       _/
0.5┤      /
   │    _/
   │ ___
 0 ┤_____            
───┼───────────→ x
   -∞    0    +∞

Squashes the real line into (0, 1).

Worked values: σ(-4) = 0.018. σ(-1) = 0.27. σ(0) = 0.5. σ(2) = 0.88. σ(5) = 0.993.

Derivative: σ(x)·(1 - σ(x)). Peaks at 0.25 when x = 0. Falls to ~0.018·0.982 ≈ 0.018 at x = -4. Almost dead.
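To check the worked values and the derivative numbers, here is a small sketch using only the standard library. The function names are chosen for this example.

```python
import math


def sigmoid(x):
    """Squash any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative written in terms of the output: s * (1 - s)."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-4, -1, 0, 2, 5):
    print(f"x = {x:>2}: sigmoid = {sigmoid(x):.3f}, grad = {sigmoid_grad(x):.4f}")
```

The gradient column shows the problem directly: it tops out at 0.25 in the middle and shrinks fast on both sides.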

Tanh: sigmoid centered at zero

tanh(x) = (e^x - e^-x) / (e^x + e^-x)

   y
 1 ┤        _____
   │     __/
 0 ┤───/─────
   │ _/
-1 ┤___           
───┼──────────→ x

Output range (-1, 1). Zero-centered. Same S shape as sigmoid, rescaled and shifted: tanh(x) = 2·σ(2x) - 1.

Worked values: tanh(-2) = -0.964. tanh(0) = 0. tanh(1) = 0.762. tanh(3) = 0.995.

Derivative: 1 - tanh²(x). Peaks at 1.0 when x = 0. Stronger gradient than sigmoid in the middle. But still flat-tailed: the gradient at x = 3 is 1 - 0.995² ≈ 0.0099. Dies in the tails.
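A short sketch to confirm the tanh numbers, and to show how tanh relates to sigmoid through the identity tanh(x) = 2·σ(2x) - 1. It uses `math.tanh` from the standard library; the helper names are illustrative.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


for x in (-2.0, 0.0, 1.0, 3.0):
    via_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0   # same curve, built from sigmoid
    print(f"x = {x:>4}: tanh = {math.tanh(x):.4f}, "
          f"2*sigmoid(2x)-1 = {via_sigmoid:.4f}, grad = {tanh_grad(x):.4f}")
```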

GELU: the smooth elbow

GELU(x) = x · Φ(x)     where Φ is the standard normal CDF (fast implementations use a tanh approximation)

   y
   ↑
   │       ╱
   │     ╱
   │   ╱
   │ ╱
───┴──── (slight dip below zero, then smooth merge)
   0

Looks like ReLU. But the corner is rounded. And small negative values are not zeroed — they pass through with a gentle squash.

Worked values: GELU(-2) ≈ -0.0455. GELU(-0.5) ≈ -0.154. GELU(0) = 0. GELU(0.5) ≈ 0.346. GELU(2) ≈ 1.9545.

Derivative is smooth everywhere. No dead corner. No flat tail in the positive region.
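GELU can be computed exactly with `math.erf`, since Φ(x) = 0.5·(1 + erf(x/√2)). The sketch below also includes the widely used tanh approximation for comparison. Function names are chosen for this example, and the derivative Φ(x) + x·φ(x) follows from the product rule.

```python
import math


def normal_cdf(x):
    """Standard normal CDF, Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def normal_pdf(x):
    """Standard normal density, phi(x)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def gelu(x):
    """Exact GELU: x * Phi(x)."""
    return x * normal_cdf(x)


def gelu_tanh(x):
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def gelu_grad(x):
    """Product rule on x * Phi(x): Phi(x) + x * phi(x)."""
    return normal_cdf(x) + x * normal_pdf(x)


for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x = {x:>4}: gelu = {gelu(x):.4f}, "
          f"tanh approx = {gelu_tanh(x):.4f}, grad = {gelu_grad(x):.4f}")
```

Note the gradient for small negative inputs: it is not zero, and it can even be slightly negative. That is the rounded corner at work.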

SwiGLU: gated smooth bend

SwiGLU(x) = Swish(xW₁) ⊙ xW₂

See. One path makes a smooth gate. The other carries content. Multiply them, and the gate decides what passes through. LLaMA and Mistral use this inside FFN blocks, and Gemma uses the GELU-gated sibling GeGLU, because the gate adds expressivity without changing the overall transformer recipe too much.
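The two paths can be sketched with plain lists. This is an illustration under assumptions: Swish is taken with β = 1 (also called SiLU), the weights are made-up toy values, and the final down-projection that a full FFN block adds after the gate is left out.

```python
import math


def swish(z):
    """Swish with beta = 1 (SiLU): z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, W):
    """Row vector times matrix: out[j] = sum_i x[i] * W[i][j]."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


def swiglu(x, W1, W2):
    gate = [swish(z) for z in matvec(x, W1)]     # smooth gate path
    content = matvec(x, W2)                      # content path
    return [g * c for g, c in zip(gate, content)]  # element-wise product


# Toy, hypothetical weights for illustration only.
x = [1.0, 2.0]
W1 = [[1.0, 0.0], [0.0, -1.0]]
W2 = [[0.5, 1.0], [1.0, 0.5]]
print(swiglu(x, W1, W2))
```

Here the second gate value comes from a negative pre-activation, so it shrinks and flips the content on that channel instead of zeroing it.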

Side-by-side: the gradient story

Activation:        Gradient at +3:    Gradient at -3:    Dead zone?

sigmoid (S)        ~0.045             ~0.045             both tails
tanh (S, centered) ~0.0099            ~0.0099            both tails
ReLU (elbow)       1.0                0                  full negative
GELU (smooth)      ~1.012             ~-0.012            soft, no hard zero

This table is the whole "which bend to pick" decision. We come back to it.
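Recomputing the gradients at ±3 from the closed-form derivatives is a useful check on any hand-filled table. A minimal sketch; the `GRADS` dictionary is a name chosen for this example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def normal_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Closed-form derivative of each activation.
GRADS = {
    "sigmoid": lambda x: sigmoid(x) * (1.0 - sigmoid(x)),
    "tanh": lambda x: 1.0 - math.tanh(x) ** 2,
    "relu": lambda x: 1.0 if x > 0 else 0.0,
    "gelu": lambda x: normal_cdf(x) + x * normal_pdf(x),
}

print(f"{'activation':10s} {'grad at +3':>12s} {'grad at -3':>12s}")
for name, grad in GRADS.items():
    print(f"{name:10s} {grad(3.0):12.4f} {grad(-3.0):12.4f}")
```

GELU's gradient slightly overshoots 1 on the positive side and dips slightly below 0 on the negative side. Both effects fade as |x| grows.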

Where each bend ships in production

Pick fresh, named cases — specific roles, not generics.

GELU in GPT-2 / BERT feed-forward blocks. These transformer FFN layers place GELU between the two linear projections; newer LLMs such as LLaMA swap in the gated SwiGLU variant. The smooth gradient lets updates flow cleanly through dozens of stacked blocks. Hard corners are thought to hurt at this depth.
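ReLU in ResNet image backbones. The classic vision residual block uses ReLU after each convolution. Cheap to compute: one comparison per element. Activations come out sparse (often around half zeros), though dense matmul kernels need sparsity-aware support to turn that into speed.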
Tanh in LSTM / GRU candidate and cell-state paths. Recurrent cells need a zero-centered, bounded squash so the state does not blow up across timesteps. Tanh has been the default there because (-1, 1) is stable across many time-unrolls. The gates themselves use sigmoid, since a gate must land in (0, 1).
Swish / SiLU in EfficientNet and modern vision backbones. Like GELU — smooth, slightly negative-passing. Wins small accuracy points on ImageNet-scale training over plain ReLU.
Softmax as an activation in attention. The attention layer computes scaled dot-product scores, then applies softmax over the key axis. Softmax is the bend that turns raw scores into a probability distribution, so each output is a proper weighted average of the values rather than an unbounded, possibly negative-weighted sum.

Pause and recall. Without scrolling — what is sigmoid's derivative at x = 0, and at x = 5? Why does ReLU's flat negative side matter for training? Where does "the bend" from ELI5 fit in this whole picture?

When to pick which bend

A short decision list. Follow top-down.

Hidden layers of a deep feed-forward / vision net? ReLU first. Cheap, sparse, gradient = 1 in positive region. The usual default.
Hidden layers of a transformer FFN? GELU. Smooth corner matters at depth. Or SwiGLU in newer architectures.
Recurrent state update inside an LSTM/GRU cell? Tanh. Bounded + zero-centered keeps state stable across unrolls.
Output layer for binary probability? Sigmoid. Squash to (0, 1).
Output layer for multi-class probability? Softmax. Normalizes a vector into a distribution.
Hidden layer in a deep vanilla net using sigmoid? Don't. Gradients vanish. We cover this in 08-vanishing-gradients.md.

Why ReLU does not "kill" depth the way sigmoid does

See. Sigmoid's derivative is at most 0.25. Stack ten sigmoid layers. Gradients get multiplied, so the activation factors alone contribute at most 0.25¹⁰ ≈ 0.00000095, about one in a million. The nudge dies before reaching layer 1.

ReLU's derivative is 1 (or 0). Multiply ten 1's — still 1. The nudge arrives intact.

But wait. ReLU's zero side. If a neuron's pre-activation is negative for every input, its gradient is exactly zero, so its weights never update and it stays off. A large share of the network can become permanent zeros this way. This is dying ReLU.

The fix — Leaky ReLU (max(0.01x, x)), GELU (smooth, never exactly zero), or careful initialization. We pick this up in the init file.
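Both halves of this section fit in a few lines. A rough sketch: `gradient_through` is a made-up helper that multiplies only the activation's slope across layers and ignores the weight matrices, which also scale the real gradient.

```python
def gradient_through(depth, per_layer_factor):
    """Product of one activation-derivative factor repeated across `depth` layers."""
    return per_layer_factor ** depth


def leaky_relu(x, slope=0.01):
    """Like ReLU, but negatives keep a small slope instead of a hard zero."""
    return max(slope * x, x)


def leaky_relu_grad(x, slope=0.01):
    return 1.0 if x > 0 else slope


print("sigmoid best case, 10 layers:", gradient_through(10, 0.25))
print("ReLU active path, 10 layers: ", gradient_through(10, 1.0))
print("leaky_relu(-2.0):", leaky_relu(-2.0), " grad:", leaky_relu_grad(-2.0))
```

The leaky slope is small, so a neuron stuck on the negative side still learns slowly. The point is that its gradient is no longer exactly zero, so it can recover.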

Q&A

Q: Why ReLU and not sigmoid for hidden layers in a deep net?
A: Sigmoid's derivative caps at 0.25 and shrinks fast in the tails. Stack many layers and the nudge multiplies down to nothing — vanishing gradients. ReLU's derivative is 1 in the active region, so nudges pass through unchanged.
Common wrong answer to avoid: "sigmoid is slower to compute." Speed is a tiny factor. The real reason is gradient flow.

Q: Why GELU over ReLU in transformer FFN blocks?
A: Deep stacks of blocks need every gradient bit. ReLU's hard zero kills gradient for negative pre-activations, a small but real loss at scale. GELU's smooth corner keeps a faint gradient even slightly below zero. It is reported to win empirically on language modeling.
Common wrong answer to avoid: "GELU is more expressive." Both are universal approximators. The win is gradient smoothness, not expressiveness.

Q: Why does tanh survive in RNN cells when ReLU dominates feed-forward nets?
A: An RNN unrolls the same cell many times. Cell state must stay bounded or it explodes. Tanh squashes to (-1, 1) at every step. ReLU has no upper bound: apply it 100 times to a growing state and you get blow-up.
Common wrong answer to avoid: "tanh is outdated everywhere." It still fits RNN hidden states because zero-centered outputs reduce drift across many timesteps.

Q: Is softmax really an activation function?
A: Yes — it is the bend on the output of the attention pre-scores, and on the output of a multi-class classifier. It just operates on a vector instead of element-wise. Same role: turn raw scores into something non-linear and bounded.
Common wrong answer to avoid: "softmax is just a normalizer." It is non-linear. Different inputs that have the same sum can still produce wildly different outputs.
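The "same sum, different output" point is easy to see in code. A minimal sketch of softmax with the usual max-subtraction for numerical stability; the example score vectors are invented for illustration.

```python
import math


def softmax(scores):
    """Turn a list of raw scores into a probability distribution."""
    m = max(scores)                          # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Both inputs sum to 6, yet the outputs look nothing alike.
even = softmax([2.0, 2.0, 2.0])
peaked = softmax([0.0, 0.0, 6.0])
print("even:  ", [round(p, 4) for p in even])
print("peaked:", [round(p, 4) for p in peaked])
```

A pure normalizer that only divided by the sum would treat these two inputs by the same linear rule. Softmax does not, because the exponential comes first.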

Apply now (5 min)

Take a blank sheet. Without notes, sketch:

The four activation curves — sigmoid, tanh, ReLU, GELU. Side by side. Mark the y-axis range for each.
Below each, sketch its derivative curve. Mark the peak value.
Annotate where each goes flat — that is the dead zone where the nudge dies.

Then answer in one line each — when do you reach for ReLU, GELU, tanh, sigmoid?

If you can do this in under 4 minutes from memory, you own the bend. The rule pile will not collapse on your watch.

Bridge. You know what bends to use and why. Next — how a layer of bends actually computes. Stretch, shift, bend, repeat. The forward pass. Read 03-forward-pass.md.