Assignment 3 — Pre-Norm Transformer Block

This folder implements the Week 3 hands-on lab described in extended_notes/05_hands_on_lab.md.

Files

  • attention.py — scaled dot-product attention with optional mask support
  • block.py — pre-norm transformer block with self-attention, FFN, and residual adds
  • verify.py — toy checks for shapes, masking, pre-norm order, and residual behavior
  • block_diagram.txt — ASCII block diagram matching the module explainer

Plain-language explanation

Residual stream

The residual stream is the per-token state that travels through the whole stack. Each block reads that state, computes a small update, and adds the update back. The model therefore refines one shared representation layer by layer instead of replacing the whole state each time.
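
As a rough illustration of the additive update, the sketch below uses NumPy and a made-up `toy_update` function standing in for a sublayer. Neither the function nor the scale factor comes from block.py; they are assumptions chosen only to show that the stream is edited by addition and keeps its shape.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 128                      # stream width, matching the configuration in this README
x = rng.normal(size=(5, d_model))  # 5 tokens, one d_model-wide vector each

def toy_update(h, scale=0.01):
    """Hypothetical stand-in for a sublayer: proposes a small edit to h."""
    return scale * np.tanh(h)

stream = x.copy()
for _ in range(3):                         # three toy "blocks"
    stream = stream + toy_update(stream)   # add the edit, never overwrite

# The stream keeps its shape and stays close to the original input.
print(stream.shape)
print(np.abs(stream - x).max())
```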

LayerNorm

LayerNorm runs before each heavy step. It normalizes each token's feature vector to zero mean and unit variance (usually followed by a learned scale and shift), so the next sublayer sees inputs in a stable range. This lab uses the pre-norm order:

x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
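
The two update lines above can be sketched in NumPy as follows. This is a minimal illustration, not the code in block.py: `layer_norm` omits the learned scale and shift, and the `attn` and `ffn` arguments are hypothetical toy callables used only to exercise the wiring.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token (last axis) to zero mean and unit variance.
    Learned scale and shift are omitted to keep the sketch short."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attn, ffn):
    """Apply the two pre-norm residual updates in order."""
    x = x + attn(layer_norm(x))   # sublayer sees the normalized input
    x = x + ffn(layer_norm(x))    # residual path carries the un-normalized x
    return x

rng = np.random.default_rng(0)
# (batch, seq, d_model), deliberately shifted and scaled away from unit range
x = 10.0 + 3.0 * rng.normal(size=(2, 6, 128))

normed = layer_norm(x)
print(np.abs(normed.mean(axis=-1)).max())  # close to 0 for every token
print(normed.std(axis=-1).min())           # close to 1 for every token

# Hypothetical toy sublayers, just to run the block end to end.
out = pre_norm_block(x, attn=lambda h: 0.1 * h, ffn=lambda h: 0.1 * h)
print(out.shape)
```

Note that only the sublayer input is normalized; the value added back to is the raw `x`, which is what distinguishes pre-norm from post-norm.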

Causal mask

The causal mask blocks attention to future positions, which gives decoder-style behavior. Token 3 may read tokens 1-3, but not token 4. The mask is applied to the attention scores before softmax, typically by setting blocked scores to -inf, so future positions receive exactly zero probability.
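
A minimal NumPy sketch of that masking step is shown below. The helper names `causal_mask` and `masked_softmax` are illustrative and are not taken from attention.py; the sketch also assumes a single head and no batch dimension.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean (seq_len, seq_len) matrix: True where query i may read key j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    """Softmax over the last axis, with blocked positions set to -inf first."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                               # exp(-inf) is exactly 0
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_k = 4, 32
q = rng.normal(size=(seq_len, d_k))
k = rng.normal(size=(seq_len, d_k))

scores = q @ k.T / np.sqrt(d_k)   # scaled dot-product scores
weights = masked_softmax(scores, causal_mask(seq_len))

# Row index 2 (token 3 in 1-based counting) has weight on columns 0-2 only.
print(weights.round(3))
```

Masking before softmax matters: zeroing weights after softmax would leave rows that no longer sum to 1.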

Run

python3 verify.py

Configuration used here

  • d_model = 128
  • n_heads = 4
  • d_ff = 512
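
For orientation, the derived sizes can be worked out as below. This assumes the common convention that d_model is split evenly across heads; the names `d_head` and `ffn_expansion` are illustrative and may not appear in block.py.

```python
d_model = 128
n_heads = 4
d_ff = 512

# Assumption: each head gets an equal slice of the model width.
assert d_model % n_heads == 0, "d_model must split evenly across heads"
d_head = d_model // n_heads        # features per attention head
ffn_expansion = d_ff // d_model    # hidden-layer expansion factor in the FFN

print(d_head, ffn_expansion)       # prints: 32 4
```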

What the verification script checks

  1. Attention output shape matches the expected width.
  2. Causal masking zeroes attention to future positions.
  3. Residual additions preserve d_model.
  4. LayerNorm runs before each sublayer.
  5. Zeroed sublayers reduce the block to an identity path through the residual stream.
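
Checks 3 and 5 can be sketched along the following lines. This is not the content of verify.py: `ToyBlock` is a hypothetical stand-in in which each sublayer is reduced to a single linear map, which is enough to show why zeroed sublayers leave only the residual path.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token normalization over the last axis (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

class ToyBlock:
    """Hypothetical pre-norm block; each sublayer is a single linear map."""
    def __init__(self, d_model, rng):
        self.w_attn = rng.normal(scale=0.02, size=(d_model, d_model))
        self.w_ffn = rng.normal(scale=0.02, size=(d_model, d_model))

    def __call__(self, x):
        x = x + layer_norm(x) @ self.w_attn   # stand-in for the attention sublayer
        x = x + layer_norm(x) @ self.w_ffn    # stand-in for the FFN sublayer
        return x

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 6, 128))   # (batch, seq, d_model)
block = ToyBlock(128, rng)

# Check 3: residual adds keep the (batch, seq, d_model) shape.
y = block(x)
assert y.shape == x.shape

# Check 5: with both sublayers zeroed, only the residual path remains.
block.w_attn[:] = 0.0
block.w_ffn[:] = 0.0
identity_out = block(x)
assert np.array_equal(identity_out, x)
print("toy checks passed")
```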