05. Positional encoding: the seat number that saves word order
Same words. Same meanings. Different order. See how one small add-on saves the sentence.
Built on the ELI5 in 00-eli5.md. The seat number, the extra signal telling each token where it sits, now becomes vectors, clocks, and numbers.
The picture before the math
Imagine four students entering an exam hall. Each student wears a subject
badge. Math. History. English. Science. Now imagine two students both wear the
same badge: bank. One sits near river. One sits near loan. If you only
read the badge, both look identical. If you also read the seat number,
confusion drops. That is the job here. Token embedding says what the token
is. Positional encoding says where it is. The model adds both. So every
token arrives with content plus place.
Once the seat number is added, first, middle, and last stop looking the same.
Why order matters immediately
Take these two sentences:
dog bites man
man bites dog
A bag-of-words system only sees the multiset {bites, dog, man}, which is the same for both sentences.
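A minimal sketch of that collapse, using only Python's standard library. The variable names are illustrative. `Counter` stands in for a bag-of-words view: it keeps counts and drops positions.

```python
from collections import Counter

sentence_a = "dog bites man".split()
sentence_b = "man bites dog".split()

# A bag-of-words view keeps word counts and throws away positions.
bag_a = Counter(sentence_a)
bag_b = Counter(sentence_b)

same_bag = bag_a == bag_b              # do the two multisets match?
same_order = sentence_a == sentence_b  # do the two sequences match?

print("same bag:  ", same_bag)
print("same order:", same_order)
```

The bags match while the sequences do not, so anything computed from the bag alone cannot tell the two sentences apart.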
Three rescues that still fail
Rescue 1: Make embeddings bigger
You may say, "Fine. Let the badge board learn richer token vectors." Nice
thought. Still broken. dog gets one richer vector. man gets one richer
vector. But nothing inside those vectors says who came first. Bigger content
does not create order.
Rescue 2: Pool everything together
You may average token vectors. Or sum them. Now the sentence becomes one
pooled blob. But averages destroy arrangement. dog bites man and man bites
dog still collapse together. Pooling is useful later. It is not a replacement
for position.
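Here is a small sketch of why pooling loses order. The three embedding vectors are made-up toy values, not taken from any trained model. Mean pooling is one assumed choice; sum pooling behaves the same way.

```python
# Hypothetical 3-dimensional token embeddings, chosen only for illustration.
toy_emb = {
    "dog":   [1.0, 0.0, 2.0],
    "bites": [0.0, 3.0, 1.0],
    "man":   [2.0, 1.0, 0.0],
}

def mean_pool(tokens):
    """Average the token vectors dimension by dimension."""
    vectors = [toy_emb[t] for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

pooled_a = mean_pool("dog bites man".split())
pooled_b = mean_pool("man bites dog".split())

print(pooled_a)
print(pooled_b)
print("identical after pooling:", pooled_a == pooled_b)
```

Averaging is symmetric in its inputs, so any reordering of the same tokens lands on the same pooled vector.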
Rescue 3: Add n-grams and hope
You may append bigrams or trigrams. Now some local order appears. Good. But only locally. Long-distance structure still slips away. And feature count explodes. For long sequences, this rescue becomes clumsy.
So all three attempts help a little. None solves the core issue cleanly.
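The n-gram rescue can be sketched in a few lines. The helper names `ngrams` and `possible_ngrams` are illustrative, and the vocabulary size of 50,000 is an assumed round number. The count is an upper bound on distinct features, not a measurement.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def possible_ngrams(vocab_size, n):
    """Upper bound on distinct n-gram features for a vocabulary."""
    return vocab_size ** n

bigrams_a = ngrams("dog bites man".split(), 2)
bigrams_b = ngrams("man bites dog".split(), 2)

# Local order is now visible: the two bigram lists differ.
print(bigrams_a)
print(bigrams_b)

# But the feature space grows exponentially with n.
vocab_size = 50_000  # assumed vocabulary size, for illustration only
for n in (1, 2, 3):
    print(n, possible_ngrams(vocab_size, n))
```

Bigrams separate the two sentences, but each window only sees neighbours, and the feature table grows as vocab_size to the power n.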
The one-line fix
Add a position vector to each token vector. That is it. The formula is token_emb + pos_emb.
Same token at different places now gets different final inputs.
bank at position 2: token_emb(bank) + pos_emb(2)
bank at position 9: token_emb(bank) + pos_emb(9)
Same word. Different place. Different final vector. Attention now has
something to work with.
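A toy sketch of the addition. Every number here is hypothetical: the content vector for bank is hand-picked, and the position table is filled with seeded random values as a stand-in for a learned lookup. `model_input` is an illustrative name.

```python
import random

rng = random.Random(0)  # fixed seed so the toy numbers are repeatable
d_model = 4
max_len = 16

# Hypothetical values: one content vector for "bank", one vector per position.
token_emb = {"bank": [0.2, -0.1, 0.4, 0.3]}
pos_emb = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)]
           for _ in range(max_len)]

def model_input(token, position):
    """Element-wise sum token_emb + pos_emb for one token at one position."""
    return [t + p for t, p in zip(token_emb[token], pos_emb[position])]

bank_at_2 = model_input("bank", 2)
bank_at_9 = model_input("bank", 9)

print(bank_at_2)
print(bank_at_9)
print("same final vector:", bank_at_2 == bank_at_9)
```

The content part is shared. Only the position part differs, and that difference is what later layers can pick up.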
The sinusoidal picture
One learned option is simple lookup. Position 0 has a learned vector. Position 1 has another. And so on. But sinusoidal encoding gives a neat picture. Picture many clock hands. Each pair of dimensions is one tiny clock. Some clocks spin fast. Some spin slowly. So each position becomes a bundle of angles. Nearby positions have nearby angles. Far positions have very different angle patterns.
The fast pair changes a lot from one token to the next. The slow pair changes gently. Together they create a multi-scale seat number. Nice, no?
The formula
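For model dimension d_model, position pos, and pair index i:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Dimensions (0,1) form one clock. Dimensions (2,3) form another clock. If
d_model = 4, then: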
- pair (0,1) uses denominator 10000^(0/4) = 1
- pair (2,3) uses denominator 10000^(2/4) = 100
Now you can already see it. First pair spins fast. Second pair spins slow.
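The two denominators above follow one rule, 10000^(2i/d_model), where i counts the pairs from 0. The sketch below lists the denominator for each pair, along with the number of positions one full turn of that clock takes. `pair_denominators` is an illustrative helper name, and an even d_model is assumed.

```python
import math

def pair_denominators(d_model):
    """Denominator 10000^(2i/d_model) for each dimension pair i."""
    return [10000 ** (2 * i / d_model) for i in range(d_model // 2)]

denominators = pair_denominators(4)

# One full turn of a clock takes 2*pi*denominator positions.
periods = [2 * math.pi * d for d in denominators]

for pair_index, (d, period) in enumerate(zip(denominators, periods)):
    print(f"pair {pair_index}: denominator {d:.1f}, "
          f"full turn every {period:.1f} positions")
```

For d_model = 4, the fast clock completes a turn in roughly 6 positions and the slow clock in roughly 628. That gap is why the slow pair barely moves in the worked example.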
Worked example: d_model = 4, positions 0 to 3
We compute four values per position.
Use rough values. That is enough for intuition.
Position 0
sin(0) = 0.0000
cos(0) = 1.0000
sin(0/100) = 0.0000
cos(0/100) = 1.0000
PE(0) = [0.0000, 1.0000, 0.0000, 1.0000]
Position 1
sin(1) ≈ 0.8415
cos(1) ≈ 0.5403
sin(0.01) ≈ 0.0100
cos(0.01) ≈ 0.99995
PE(1) = [0.8415, 0.5403, 0.0100, 0.99995]
Position 2
sin(2) ≈ 0.9093
cos(2) ≈ -0.4161
sin(0.02) ≈ 0.0200
cos(0.02) ≈ 0.99980
PE(2) = [0.9093, -0.4161, 0.0200, 0.99980]
Position 3
sin(3) ≈ 0.1411
cos(3) ≈ -0.9900
sin(0.03) ≈ 0.0300
cos(0.03) ≈ 0.99955
PE(3) = [0.1411, -0.9900, 0.0300, 0.99955]
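The hand-computed rows can be checked with a short script. This is a plain-Python sketch of the sinusoidal scheme, with a sine and cosine per dimension pair and denominators 1 and 100 for d_model = 4. It is not an excerpt from any library. `sinusoidal_pe` is an illustrative name, and an even d_model is assumed.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal encoding for one position: [sin, cos] per dimension pair."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.append(math.sin(angle))  # even dimension 2i
        pe.append(math.cos(angle))  # odd dimension 2i + 1
    return pe

# Rebuild the table for positions 0 to 3.
table = [sinusoidal_pe(pos, 4) for pos in range(4)]
for pos, row in enumerate(table):
    print(f"PE({pos}) =", [round(value, 5) for value in row])
```

The printed rows should agree with the rough values above to about four decimal places.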
Fast pair vs slow pair
Look only at dimensions 0 and 1.
pos 0 -> [0.0000, 1.0000]
pos 1 -> [0.8415, 0.5403]
pos 2 -> [0.9093, -0.4161]
pos 3 -> [0.1411, -0.9900]
Now look only at dimensions 2 and 3.
pos 0 -> [0.0000, 1.0000]
pos 1 -> [0.0100, 0.99995]
pos 2 -> [0.0200, 0.99980]
pos 3 -> [0.0300, 0.99955]
Why attention likes this
Attention compares vectors. If two tokens have the same content vector, they
can still differ by place. That matters a lot. The word "the" at the start of a
sentence is not always used like "the" near the end. A subject token and an
object token may be the same word. Position helps separate them. Now the model
can notice patterns like:
- something near the start often behaves like a subject
- something after "not" changes interpretation
- something just before "?" may complete a question
Content plus place becomes distinguishable. That is the real win.
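A rough sketch of what "compares vectors" buys us. Real attention applies learned query and key projections first. Here a raw dot product stands in for the comparison, which is a simplifying assumption. The content vector is hypothetical and `sinusoidal_pe` is an illustrative helper.

```python
import math

def sinusoidal_pe(pos, d_model):
    """Sinusoidal encoding for one position: [sin, cos] per dimension pair."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000 ** (2 * i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def add(u, v):
    return [a + b for a, b in zip(u, v)]

d_model = 4
content = [0.5, -0.2, 0.1, 0.3]  # hypothetical content vector for one word

# Without position, the same word scores the same wherever it sits.
score_no_position = dot(content, content)

# With position, the same word at positions 2 and 9 scores differently
# against a query sitting at position 3.
query = add(content, sinusoidal_pe(3, d_model))
score_early = dot(query, add(content, sinusoidal_pe(2, d_model)))
score_late = dot(query, add(content, sinusoidal_pe(9, d_model)))
print(score_no_position, score_early, score_late)

# The position-only dot product depends on the offset, not on where you start.
offset_near_start = dot(sinusoidal_pe(2, d_model), sinusoidal_pe(5, d_model))
offset_further_on = dot(sinusoidal_pe(10, d_model), sinusoidal_pe(13, d_model))
print(offset_near_start, offset_further_on)
```

The last two numbers agree up to floating-point error because sin(a)sin(b) + cos(a)cos(b) = cos(a - b) for each clock. That gives the code a relative-distance flavour on top of the absolute one.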
Where this lives in the wild
- Google Search / Translate. Word order changes meaning across languages. Positional signals help preserve subject-object order when translating.
- OpenAI ChatGPT. The model must track where instructions, examples, and the latest user turn sit inside a long prompt.
- GitHub Copilot. In code, "return" near the end of a function means something very different from "return" inside an inner branch.
- Notion AI. Summaries and rewrites depend on the sequence of headings, bullets, and conclusion lines.
- Amazon Alexa. Voice commands like "turn off bedroom light after 10 minutes" depend on token order, not just token presence.
Interview Q&A
Q: Why can't a transformer just rely on token embeddings to know order?
A: Because the embedding table encodes token identity, not token index. The vector for dog is reused everywhere unless a separate positional signal is added.
Common wrong answer to avoid: "The model will infer order from context automatically." Without a seat number, the input vectors for reordered tokens can be identical as a set.
Q: Why add position vectors instead of concatenating them?
A: Addition keeps the width fixed at d_model, so every layer expects the same shape. Concatenation doubles width and forces later projections to relearn the merge.
Q: What is the intuition behind sinusoidal encoding?
A: Different dimension pairs behave like clock hands rotating at different speeds. A position is identified by the combined angle pattern across fast and slow clocks.
Common wrong answer to avoid: "Sine and cosine are used only because they look smooth." Smoothness helps, but the key idea is a structured, multi-frequency code for relative and absolute patterns.
Q: Why does positional encoding help attention specifically?
A: Attention compares token representations. Once position is mixed in, attention can tell apart same-word tokens appearing in different places and can learn place-sensitive patterns.
Apply now (5 min)
Take the sentence man bites dog.
1. Write the three token embeddings as m, b, d.
2. Add seat vectors p0, p1, p2.
3. Write the final inputs as m+p0, b+p1, d+p2.
4. Now swap the sentence to dog bites man and write the new sums.
5. Say out loud why the model can now distinguish the two orders.
Then sketch from memory:
- one fast clock pair
- one slow clock pair
- the formula token_emb + pos_emb
- the dog bites man vs man bites dog contrast
If you can redraw the clocks and explain why bigger embeddings still fail, you
own the seat number.
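After doing the exercise by hand, you can check it with a sketch like this one. All vectors are seeded random stand-ins for m, b, d and p0, p1, p2. The names `token_emb`, `seat`, and `final_inputs` are illustrative.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
d_model = 4

def random_vector():
    return tuple(rng.uniform(-1.0, 1.0) for _ in range(d_model))

# Hypothetical embeddings m, b, d and seat vectors p0, p1, p2.
token_emb = {word: random_vector() for word in ("man", "bites", "dog")}
seat = [random_vector() for _ in range(3)]

def add(u, v):
    return tuple(a + b for a, b in zip(u, v))

def final_inputs(sentence):
    """Return token_emb + pos_emb for every token in the sentence."""
    return [add(token_emb[word], seat[i])
            for i, word in enumerate(sentence.split())]

inputs_a = final_inputs("man bites dog")  # m+p0, b+p1, d+p2
inputs_b = final_inputs("dog bites man")  # d+p0, b+p1, m+p2

# Without seats, the two sentences hold the same bag of vectors.
raw_a = sorted(token_emb[w] for w in "man bites dog".split())
raw_b = sorted(token_emb[w] for w in "dog bites man".split())
print("same bag without seats:", raw_a == raw_b)

# With seats, even the bag of summed vectors differs.
print("same bag with seats:   ", sorted(inputs_a) == sorted(inputs_b))
```

Only bites keeps the same final input, because it sits in seat 1 in both sentences.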
Bridge. Absolute seat numbers help, but long context still hurts. The next file shows two stronger tricks for relative position: rotating vectors with RoPE and subtracting distance penalties with ALiBi. Read 06-rope-alibi.md next.