
05. Positional encoding — the seat number that saves word order

Same words. Same meanings. Different order. See how one small add-on saves the sentence.

Built on the ELI5 in 00-eli5.md. The seat number — the extra signal telling each token where it sits — now becomes vectors, clocks, and numbers.


The picture before the math

Imagine four students entering an exam hall. Each student wears a subject badge. Math. History. English. Science. Now imagine two students both wear the same badge: bank. One sits near river. One sits near loan. If you only read the badge, both look identical. If you also read the seat number, confusion drops. That is the job here. Token embedding says what the token is. Positional encoding says where it is. The model adds both. So every token arrives with content plus place.

word meaning  +  seat number  =  usable input vector
content       +  position      =  transformer input
See. Without seat numbers, order becomes slippery. With seat numbers, first, middle, and last stop looking the same.


Why order matters immediately

Take these two sentences:

dog bites man
man bites dog

A bag-of-words system only sees the multiset:

{dog, bites, man}
Same three tokens. Same counts. So it thinks both sentences are the same. But meaning has flipped. One is ordinary. One is headline material. Word identity alone cannot save you. Order carries meaning. Grammar lives in order. Cause and effect often live in order. Negation scope often lives in order. Question vs statement can live in order. So what to do? Give every token a usable notion of place.
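A quick way to see the collapse is a minimal sketch using Python's `Counter` as the bag. The variable names are only illustrative.

```python
from collections import Counter

a = "dog bites man".split()
b = "man bites dog".split()

# A bag of words keeps token counts and throws away order.
bag_a = Counter(a)
bag_b = Counter(b)

print(bag_a == bag_b)  # True: the two bags are identical
print(a == b)          # False: the ordered token lists differ
```

The ordered lists differ, but the bags do not. Any model that only sees the bag cannot tell the two sentences apart.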


Three rescues that still fail

Rescue 1: Make embeddings bigger

You may say, "Fine. Let the badge board learn richer token vectors." Nice thought. Still broken. dog gets one richer vector. man gets one richer vector. But nothing inside those vectors says who came first. Bigger content does not create order.

Rescue 2: Pool everything together

You may average token vectors. Or sum them. Now the sentence becomes one pooled blob. But averages destroy arrangement. dog bites man and man bites dog still collapse together. Pooling is useful later. It is not a replacement for position.
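Here is a small sketch of that collapse. The 2-dimensional embeddings are made up by hand for illustration. Real embeddings are learned and much wider.

```python
# Hypothetical hand-picked embeddings, not taken from any real model.
emb = {
    "dog":   [1.0, 0.0],
    "bites": [0.0, 1.0],
    "man":   [0.5, 0.5],
}

def mean_pool(tokens):
    """Average the token vectors dimension by dimension."""
    vectors = [emb[t] for t in tokens]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

pooled_a = mean_pool(["dog", "bites", "man"])
pooled_b = mean_pool(["man", "bites", "dog"])

print(pooled_a)
print(pooled_b)
print(pooled_a == pooled_b)  # True: the average ignores arrangement
```

Addition does not care about order, so the pooled vector cannot care either.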

Rescue 3: Add n-grams and hope

You may append bigrams or trigrams. Now some local order appears. Good. But only locally. Long-distance structure still slips away. And feature count explodes. For long sequences, this rescue becomes clumsy. So all three attempts help a little. None solves the core issue cleanly.
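A rough sketch of both halves of this rescue: bigrams do separate the two sentences, but the number of possible n-grams grows as a power of the vocabulary size. The `ngrams` helper and the vocabulary size of 50,000 are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return every contiguous window of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

a = "dog bites man".split()
b = "man bites dog".split()

bigrams_a = set(ngrams(a, 2))
bigrams_b = set(ngrams(b, 2))
print(bigrams_a == bigrams_b)  # False: bigrams capture local order

# Upper bound on distinct n-grams for a vocabulary of size V is V ** n.
vocab_size = 50_000  # assumed vocabulary size, for illustration only
for n in (1, 2, 3):
    print(n, vocab_size ** n)
```

Local order is recovered, but the feature space blows up long before n is large enough to cover a whole sentence.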


The one-line fix

Add a position vector to each token vector. That is it.

input_vector(pos) = token_emb(token) + pos_emb(pos)
Same token at different places now gets different final inputs. bank at position 2:
[token meaning] + [seat 2]
bank at position 9:
[token meaning] + [seat 9]
Same word. Different place. Different final vector. Attention now has something to work with.
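The sum can be sketched in a few lines. All numbers below are hypothetical and hand-picked. In a real model, `token_emb` is learned and `pos_emb` is either learned or computed.

```python
# Hypothetical vectors of width 4, chosen only for illustration.
token_emb = {"bank": [0.2, 0.7, -0.1, 0.4]}
pos_emb = {
    2: [0.1, 0.0, 0.0, 0.1],
    9: [-0.3, 0.2, 0.5, 0.0],
}

def input_vector(token, pos):
    """Element-wise sum of the token vector and the position vector."""
    return [t + p for t, p in zip(token_emb[token], pos_emb[pos])]

bank_at_2 = input_vector("bank", 2)
bank_at_9 = input_vector("bank", 9)

print(bank_at_2)
print(bank_at_9)
print(bank_at_2 != bank_at_9)  # True: same word, different final input
```

The token vector is reused unchanged. Only the added seat vector makes the two inputs differ.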


The sinusoidal picture

One option is a learned lookup. Position 0 has a learned vector. Position 1 has another. And so on. The other classic option is sinusoidal encoding, which needs no learning and gives a neat picture. Picture many clock hands. Each pair of dimensions is one tiny clock. Some clocks spin fast. Some spin slowly. So each position becomes a bundle of angles. Nearby positions have nearby angles. Far positions have very different angle patterns.

fast clock :  |  /  -  \
slow clock :  |  |  /  /
position   :  0  1  2  3
The fast pair changes a lot from one token to the next. The slow pair changes gently. Together they create a multi-scale seat number. Nice, no?


The formula

For model dimension d (written d_model in the examples below), position pos, and pair index i running from 0 to d/2 - 1:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Even index gets sine. Odd index gets cosine. Same denominator for the pair. So dimensions (0,1) form one clock. Dimensions (2,3) form another clock. If d_model = 4, then:

  • pair (0,1) uses denominator 10000^(0/4) = 1
  • pair (2,3) uses denominator 10000^(2/4) = 100

Now you can already see it. First pair spins fast. Second pair spins slow.
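The formula translates almost line for line into code. This is a minimal sketch using only the standard library. The function name `sinusoidal_pe` and the `base` parameter are assumptions made for this example, not names from any particular library.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Return the sinusoidal encoding of one position as a list of floats."""
    if d_model % 2 != 0:
        raise ValueError("d_model must be even so dimensions pair up")
    pe = []
    for i in range(d_model // 2):
        denominator = base ** (2 * i / d_model)  # shared by the whole pair
        angle = pos / denominator
        pe.append(math.sin(angle))  # even index 2i
        pe.append(math.cos(angle))  # odd index 2i + 1
    return pe

# One denominator per pair for d_model = 4.
print([10000.0 ** (2 * i / 4) for i in range(2)])
print(sinusoidal_pe(0, 4))
print(sinusoidal_pe(1, 4))
```

For d_model = 4 the two denominators come out as 1 and 100, up to floating-point rounding, which matches the hand calculation. Production code usually computes all positions at once with an array library, but the arithmetic is the same.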


Worked example: d_model = 4, positions 0 to 3

We compute four values per position.

[sin(pos), cos(pos), sin(pos/100), cos(pos/100)]
Use rough values. That is enough for intuition.

Position 0

sin(0)      = 0.0000
cos(0)      = 1.0000
sin(0/100)  = 0.0000
cos(0/100)  = 1.0000
PE(0) = [0.0000, 1.0000, 0.0000, 1.0000]

Position 1

sin(1)      ≈ 0.8415
cos(1)      ≈ 0.5403
sin(0.01)   ≈ 0.0100
cos(0.01)   ≈ 0.99995
PE(1) = [0.8415, 0.5403, 0.0100, 0.99995]

Position 2

sin(2)      ≈ 0.9093
cos(2)      ≈ -0.4161
sin(0.02)   ≈ 0.0200
cos(0.02)   ≈ 0.99980
PE(2) = [0.9093, -0.4161, 0.0200, 0.99980]

Position 3

sin(3)      ≈ 0.1411
cos(3)      ≈ -0.9900
sin(0.03)   ≈ 0.0300
cos(0.03)   ≈ 0.99955
PE(3) = [0.1411, -0.9900, 0.0300, 0.99955]

Fast pair vs slow pair

Look only at dimensions 0 and 1.

pos 0 -> [0.0000,  1.0000]
pos 1 -> [0.8415,  0.5403]
pos 2 -> [0.9093, -0.4161]
pos 3 -> [0.1411, -0.9900]
Big swings. Now dimensions 2 and 3.
pos 0 -> [0.0000, 1.0000]
pos 1 -> [0.0100, 0.99995]
pos 2 -> [0.0200, 0.99980]
pos 3 -> [0.0300, 0.99955]
Tiny shifts. So one pair catches short-range position. The other pair changes slowly and helps with longer-range distinctions.
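To put numbers on "big swings" versus "tiny shifts", the sketch below measures how far each clock hand moves per step and how many positions it takes to complete one full turn. The helper names are assumptions for this example. `math.dist` needs Python 3.8 or newer.

```python
import math

def clock_pair(pos, denominator):
    """The (sin, cos) pair for one clock at one position."""
    angle = pos / denominator
    return (math.sin(angle), math.cos(angle))

def step_size(pos, denominator):
    """Euclidean distance between the pair at pos and at pos + 1."""
    return math.dist(clock_pair(pos, denominator),
                     clock_pair(pos + 1, denominator))

for pos in range(3):
    fast = step_size(pos, 1.0)
    slow = step_size(pos, 100.0)
    print(f"pos {pos}->{pos + 1}: fast {fast:.4f}, slow {slow:.4f}")

# Positions needed for one full turn of each clock: 2 * pi * denominator.
for name, denominator in (("fast", 1.0), ("slow", 100.0)):
    print(name, round(2 * math.pi * denominator, 1))
```

The fast clock moves almost a full unit per step but wraps around after about 6 positions. The slow clock barely moves per step but does not repeat for roughly 628 positions. That is why the two together work as a multi-scale seat number.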


Why attention likes this

Attention compares vectors. If two tokens have the same content vector, they can still differ by place. That matters a lot. A "the" at the start of a sentence is not always used like a "the" near the end. A subject token and an object token may be the same word. Position helps separate them. Now the model can notice patterns like:

  • something near the start often behaves like a subject
  • something after "not" changes interpretation
  • something just before "?" may complete a question

Content plus place becomes distinguishable. That is the real win.
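A toy version of that comparison, heavily simplified: the content vector for "the" is made up, the raw inputs stand in for queries and keys (no learned projection matrices), and there is no scaling or softmax. It only shows that the dot-product scores stop being equal once position is added.

```python
import math

def sinusoidal_pe(pos, d_model=4):
    """Sinusoidal encoding for one position (same formula as in this file)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / (10000.0 ** (2 * i / d_model))
        pe += [math.sin(angle), math.cos(angle)]
    return pe

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical content vector for "the", invented for illustration.
the = [0.3, -0.2, 0.5, 0.1]

# Without position: two copies of "the" give the same score.
scores_without = (dot(the, the), dot(the, the))

# With position: the same word at positions 1 and 3, seen from position 0.
query = add(the, sinusoidal_pe(0))
key_early = add(the, sinusoidal_pe(1))
key_late = add(the, sinusoidal_pe(3))
scores_with = (dot(query, key_early), dot(query, key_late))

print(scores_without)
print(scores_with)
```

Without position the two scores are forced to match. With position they differ, so the attention weights downstream can differ too.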


Where this lives in the wild

  • Google Search / Translate. Word order changes meaning across languages. Positional signals help preserve subject-object order when translating.
  • OpenAI ChatGPT. The model must track where instructions, examples, and the latest user turn sit inside a long prompt.
  • GitHub Copilot. In code, return near the end of a function means something very different from return inside an inner branch.
  • Notion AI. Summaries and rewrites depend on the sequence of headings, bullets, and conclusion lines.
  • Amazon Alexa. Voice commands like turn off bedroom light after 10 minutes depend on token order, not just token presence.

Interview Q&A

Q: Why can't a transformer just rely on token embeddings to know order?
A: Because the embedding table encodes token identity, not token index. The vector for dog is reused everywhere unless a separate positional signal is added.
Common wrong answer to avoid: "The model will infer order from context automatically." Without a seat number, reordering the tokens just reorders the same input vectors, and plain self-attention has no way to tell the two orders apart.

Q: Why add position vectors instead of concatenating them?
A: Addition keeps the width fixed at d_model, so every layer expects the same shape. Concatenation widens the input (it doubles the width if the position vector is also d_model wide) and forces later projections to learn the merge.

Q: What is the intuition behind sinusoidal encoding?
A: Different dimension pairs behave like clock hands rotating at different speeds. A position is identified by the combined angle pattern across fast and slow clocks.
Common wrong answer to avoid: "Sine and cosine are used only because they look smooth." Smoothness helps, but the key idea is a structured, multi-frequency code for relative and absolute patterns.

Q: Why does positional encoding help attention specifically?
A: Attention compares token representations. Once position is mixed in, attention can tell apart same-word tokens appearing in different places and can learn place-sensitive patterns.
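The add-versus-concatenate answer is easy to check by counting widths. A tiny sketch with made-up vectors, assuming the position vector has the same width as the token vector:

```python
d_model = 4
token_vec = [0.2, 0.7, -0.1, 0.4]  # hypothetical token embedding
pos_vec = [0.0, 1.0, 0.0, 1.0]     # PE(0) for d_model = 4

added = [t + p for t, p in zip(token_vec, pos_vec)]
concatenated = token_vec + pos_vec  # list concatenation, not arithmetic

print(len(added), len(concatenated))  # 4 8
```

The added vector still fits every layer that expects d_model inputs. The concatenated one would need wider weight matrices downstream.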


Apply now (5 min)

Take the sentence man bites dog.

1. Write the three token embeddings as m, b, d.
2. Add seat vectors p0, p1, p2.
3. Write the final inputs as m+p0, b+p1, d+p2.
4. Now swap the sentence to dog bites man and write the new sums.
5. Say out loud why the model can now distinguish the two orders.

Then sketch from memory:

  • one fast clock pair
  • one slow clock pair
  • the formula token_emb + pos_emb
  • the dog bites man vs man bites dog contrast

If you can redraw the clocks and explain why bigger embeddings still fail, you own the seat number.


Bridge. Absolute seat numbers help, but long context still hurts. The next file shows two stronger tricks for relative position — rotating vectors with RoPE and subtracting distance penalties with ALiBi. Read 06-rope-alibi.md next.