Assignment 3: Pre-Norm Transformer Block

This folder implements the Week 3 hands-on lab from extended_notes/05_hands_on_lab.md.

Files

attention.py — scaled dot-product attention with optional mask support
block.py — pre-norm transformer block with self-attention, FFN, and residual adds
verify.py — toy checks for shapes, masking, pre-norm order, and residual behavior
block_diagram.txt — ASCII block diagram matching the module explainer

Plain-language explanation

Residual stream

The residual stream is the packet moving through the full stack. Each block reads that packet, proposes a small edit, and adds the edit back. So the model keeps one main highway of meaning instead of replacing the whole state each time.

LayerNorm

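LayerNorm is the quality inspector before each heavy step. It normalizes each token's vector across its feature dimension (zero mean and unit variance, followed by a learned scale and shift) so the next sublayer sees a stable input range. In this hands-on lab the order is pre-norm, meaning the norm is applied to each sublayer's input rather than to the residual sum: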

x = x + Attention(LayerNorm(x))
x = x + FFN(LayerNorm(x))
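
The two update lines above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the code in block.py: `layer_norm` omits the learned scale and shift, and the attention and FFN sublayers are replaced by hypothetical random linear maps so that only the ordering is on display.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token (last axis) to zero mean and unit variance.
    The learned scale and shift are omitted in this sketch."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attn_fn, ffn_fn):
    """Pre-norm order: normalize, run the sublayer, add back to the stream."""
    x = x + attn_fn(layer_norm(x))
    x = x + ffn_fn(layer_norm(x))
    return x

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=5.0, size=(6, 128))  # (seq_len, d_model)

normed = layer_norm(x)

# Hypothetical stand-ins for the real sublayers: small random linear maps.
w_attn = rng.normal(scale=0.02, size=(128, 128))
w_ffn = rng.normal(scale=0.02, size=(128, 128))
out = pre_norm_block(x, lambda h: h @ w_attn, lambda h: h @ w_ffn)

print(out.shape)  # (6, 128)
```

Note that the residual stream `x` itself is never normalized in place; only the copy handed to each sublayer is.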

Causal mask

The causal mask is the future-blocking rule for decoder-style behavior. Token 3 may read tokens 1-3, but not token 4. We apply the mask to the attention scores before softmax, typically by setting blocked scores to negative infinity (or a very large negative value), so blocked future positions get zero probability.
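
The masking rule can be sketched for a single head in NumPy. This is an illustration under assumptions, not the contents of attention.py: the function names, the boolean mask convention (True means "may read"), and the use of `-inf` for blocked scores are choices made for this example.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean (seq_len, seq_len) matrix: True where query i may read key j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_attention(q, k, v, mask):
    """Scaled dot-product attention; blocked scores become -inf before softmax."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores = np.where(mask, scores, -np.inf)
    # Subtract the row max for numerical stability. Each row keeps at least
    # its diagonal entry, so the max is always finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)  # exp(-inf) is exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_head = 4, 32
q, k, v = (rng.normal(size=(seq_len, d_head)) for _ in range(3))
out, weights = masked_attention(q, k, v, causal_mask(seq_len))

# Row index 2 is "token 3": it has nonzero weight on tokens 1-3 and zero on token 4.
print(weights[2])
```

Masking after softmax would leave the remaining weights summing to less than one; masking the scores first lets softmax renormalize over the allowed positions only.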

Run

python3 verify.py

Configuration used here

d_model = 128
n_heads = 4
d_ff = 512
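
For reference, the sizes implied by these three values can be worked out in a few lines. The `BlockConfig` class is a hypothetical container for this example, not something defined in block.py, and the per-head width assumes the common convention that d_model is split evenly across heads.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockConfig:
    """Hypothetical config holder for this lab's hyperparameters."""
    d_model: int = 128
    n_heads: int = 4
    d_ff: int = 512

    def __post_init__(self):
        # Assumed convention: heads split d_model evenly.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")

    @property
    def d_head(self) -> int:
        """Width of each attention head under the even-split assumption."""
        return self.d_model // self.n_heads

cfg = BlockConfig()
print(cfg.d_head)               # 32
print(cfg.d_ff // cfg.d_model)  # 4
```

So each head would work on 32-dimensional queries and keys, and the FFN hidden layer is four times as wide as the residual stream.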

What the verification script checks

Attention output shape matches the expected width.
Causal masking zeroes attention to future positions.
Residual additions preserve d_model.
LayerNorm runs before each sublayer.
Zeroed sublayers reduce the block to an identity path through the residual stream.
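
Three of these checks (shape preservation, norm-before-sublayer order, and the identity path) can be sketched with toy stand-ins. This is not the code in verify.py: the `calls` log, `zero_sublayer`, and the simplified `layer_norm` without learned parameters are all assumptions made for illustration.

```python
import numpy as np

calls = []  # records the order in which each piece runs

def layer_norm(x, eps=1e-5):
    calls.append("norm")
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attn_fn, ffn_fn):
    x = x + attn_fn(layer_norm(x))
    return x + ffn_fn(layer_norm(x))

def zero_sublayer(name):
    """Stand-in for a sublayer whose weights are all zero: it logs and returns zeros."""
    def fn(h):
        calls.append(name)
        return np.zeros_like(h)
    return fn

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 128))  # (seq_len, d_model)
out = pre_norm_block(x, zero_sublayer("attn"), zero_sublayer("ffn"))

assert out.shape == x.shape                      # residual adds preserve d_model
assert calls == ["norm", "attn", "norm", "ffn"]  # LayerNorm runs before each sublayer
assert np.array_equal(out, x)                    # zeroed sublayers give an identity path
print("all toy checks passed")
```

The identity check works because a sublayer that outputs zeros contributes nothing to the sum, so the input passes through the residual stream unchanged.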