Assignment 3: Pre-Norm Transformer Block
This folder implements the Week 3 hands-on lab from extended_notes/05_hands_on_lab.md.
Files
- attention.py: scaled dot-product attention with optional mask support
- block.py: pre-norm transformer block with self-attention, FFN, and residual adds
- verify.py: toy checks for shapes, masking, pre-norm order, and residual behavior
- block_diagram.txt: ASCII block diagram matching the module explainer
Plain-language explanation
Residual stream
The residual stream is the packet moving through the full stack. Each block reads that packet, proposes a small edit, and adds the edit back. So the model keeps one main highway of meaning instead of replacing the whole state each time.
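The read, edit, add-back pattern described above can be sketched in a few lines. This is a minimal illustration using NumPy, which is an assumption here (block.py may use a different framework). The function `run_residual_stack` and the two toy sublayers are hypothetical names invented for this example.

```python
import numpy as np

def run_residual_stack(x, sublayers):
    """Apply each sublayer as an additive edit to the residual stream x."""
    for sublayer in sublayers:
        # Read the stream, propose an edit, and add the edit back.
        x = x + sublayer(x)
    return x

rng = np.random.default_rng(0)
x0 = rng.normal(size=(5, 8))  # toy sizes: 5 tokens, width 8

# Hypothetical sublayers: each proposes a small edit with the same shape as x.
small_edits = [lambda h: 0.01 * np.tanh(h), lambda h: -0.02 * h]
x_out = run_residual_stack(x0, small_edits)

print(x_out.shape)  # (5, 8)
```

Because every step is an addition, the output keeps the same shape as the input, and a sublayer that proposes an all-zero edit leaves the stream unchanged.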
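LayerNorm
LayerNorm is the quality inspector before each heavy step. It rescales one token across its feature dimension so the next sublayer sees a stable input range. In this hands-on lab the order is pre-norm: LayerNorm is applied to the input of each sublayer (self-attention, then FFN), and the sublayer output is added back to the unnormalized residual stream.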
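A minimal sketch of per-token normalization and the pre-norm ordering, assuming NumPy and omitting the learned scale and shift that a full LayerNorm usually carries. The names `layer_norm` and `pre_norm_block` are illustrative and are not taken from block.py; `attn` and `ffn` stand for any callables that map an array to an array of the same shape.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token (last axis) to zero mean and unit variance.

    Learned scale and shift parameters are omitted to keep the sketch short.
    """
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, attn, ffn):
    """Pre-norm order: normalize, run the sublayer, add back to the stream."""
    x = x + attn(layer_norm(x))  # the attention sublayer sees normalized input
    x = x + ffn(layer_norm(x))   # the FFN sublayer sees normalized input
    return x
```

Note that only the sublayer input is normalized. The residual stream itself is never overwritten by a normalized copy, which is what distinguishes pre-norm from post-norm.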
Causal mask
The causal mask is the future-blocking rule for decoder-style behavior. Token 3 may read tokens 1-3, but not token 4. We apply the mask to attention scores before softmax so blocked future positions get zero probability.
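The masking rule above can be illustrated with a single-head sketch. It assumes NumPy and unbatched inputs of shape (seq_len, d_k); the function name `causal_attention` is hypothetical, and attention.py may structure its mask argument differently. Future positions are set to negative infinity before the softmax, so their weights come out as exactly zero.

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (seq_len, d_k).
    Returns the attended values and the attention weights.
    """
    seq_len, d_k = q.shape
    scores = q @ k.T / np.sqrt(d_k)

    # True above the diagonal marks future positions (key index > query index).
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    # Numerically stable softmax over the key axis; exp(-inf) evaluates to 0.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

With this mask, the first token can attend only to itself, so its output row equals the first row of `v`.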
Run
Configuration used here
- d_model = 128
- n_heads = 4
- d_ff = 512
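The arithmetic implied by these values can be sketched as follows. This assumes the common multi-head convention in which `d_model` is split evenly across heads, which the lab files may or may not follow. `BlockConfig` is a hypothetical container invented for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BlockConfig:
    """Hypothetical holder for the three values listed above."""
    d_model: int = 128
    n_heads: int = 4
    d_ff: int = 512

    @property
    def d_head(self) -> int:
        # Assumption: heads split d_model evenly, so it must divide cleanly.
        if self.d_model % self.n_heads != 0:
            raise ValueError("d_model must be divisible by n_heads")
        return self.d_model // self.n_heads

cfg = BlockConfig()
print(cfg.d_head)               # 32
print(cfg.d_ff // cfg.d_model)  # 4
```

Under that assumption each head works with 32 features, and the FFN hidden width is four times the model width.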
What the verification script checks
- Attention output shape matches the expected width.
- Causal masking zeroes attention to future positions.
- Residual additions preserve d_model.
- LayerNorm runs before each sublayer.
- Zeroed sublayers reduce the block to an identity path through the residual stream.
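One way the shape, ordering, and identity checks could be expressed is sketched below. This is not the contents of verify.py: it assumes NumPy, builds its own toy pre-norm block, and uses hypothetical helper names (`make_traced_block`, `check_block`). Each component is wrapped so that the order of calls is recorded in a list.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Per-token normalization without learned scale or shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def make_traced_block(attn, ffn, norm, log):
    """Toy pre-norm block that appends each component's name to `log`."""
    def traced(name, fn):
        def wrapper(h):
            log.append(name)
            return fn(h)
        return wrapper

    norm1, norm2 = traced("norm1", norm), traced("norm2", norm)
    attn, ffn = traced("attn", attn), traced("ffn", ffn)

    def block(x):
        x = x + attn(norm1(x))
        x = x + ffn(norm2(x))
        return x
    return block

def check_block():
    rng = np.random.default_rng(0)
    x = rng.normal(size=(6, 128))  # 6 tokens, d_model = 128
    zero = lambda h: np.zeros_like(h)  # a zeroed sublayer proposes no edit
    log = []
    y = make_traced_block(zero, zero, layer_norm, log)(x)

    assert y.shape == x.shape                         # d_model is preserved
    assert log == ["norm1", "attn", "norm2", "ffn"]   # norm precedes each sublayer
    assert np.array_equal(y, x)                       # identity through the residual path
    return True

check_block()
```

Recording call order in a list keeps the ordering check independent of any framework hooks, at the cost of requiring the block to accept its components as arguments.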