06. RoPE and ALiBi — relative position for long context¶
Absolute seat numbers help early. Long prompts expose their weakness. See the relative fix.
Built on the ELI5 in
00-eli5.md. The seat number — the place signal added to each token — now shifts from absolute slots to relative distance.
The picture before the math¶
Absolute positions feel neat. Token 0 gets one vector. Token 1 gets another. Token 2047 gets another. But long context asks a harder question. Not just, "Where am I?" Also, "How far am I from the useful thing?" That is a relative question. Picture two students in a classroom. One asks a doubt. The other answers. You often care less about their absolute roll numbers. You care about their distance. Are they adjacent? Three seats apart? Forty seats apart? RoPE and ALiBi lean into that idea. They make attention sensitive to distance, not just absolute slot ID.
Why absolute positions feel brittle¶
Learned positional embeddings stop at a max length. Train with max length
2048. Now ask the model to read token 4096. There is no learned vector for
that slot. Game over. Sinusoidal encoding can extend beyond training length.
Good. But extension is not the same as strong extrapolation. The model may
technically receive numbers for far positions. Still, behaviour can degrade.
Why? Because many useful patterns are relative. "Look back 3 tokens." "Attend
strongly to the previous line." "Discount things that are far away." Absolute
slot IDs do not express those relations directly.
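The contrast between a learned table and a sinusoidal formula can be sketched in a few lines. This is a minimal illustration, not a real model: `MAX_LEN`, `D_MODEL`, `learned_lookup`, and `sinusoidal` are names assumed here, and the table is filled with zeros as a stand-in for trained weights.

```python
import math

MAX_LEN = 2048   # assumed training-time table size, matching the example in the text
D_MODEL = 8      # tiny width, chosen only for illustration

# Stand-in for a learned table: one row per absolute slot (zeros instead of trained values).
learned_table = [[0.0] * D_MODEL for _ in range(MAX_LEN)]

def learned_lookup(pos):
    """Return the stored vector for a slot; raises IndexError past the table."""
    return learned_table[pos]

def sinusoidal(pos, d_model=D_MODEL, base=10000.0):
    """Classic sinusoidal encoding: a formula, so it is defined for any position."""
    vec = []
    for i in range(0, d_model, 2):
        angle = pos / (base ** (i / d_model))
        vec.extend([math.sin(angle), math.cos(angle)])
    return vec

try:
    learned_lookup(4096)
    learned_ok = True
except IndexError:
    learned_ok = False   # no row exists for slot 4096

far_vector = sinusoidal(4096)   # still produces numbers for the unseen slot
print(learned_ok, len(far_vector))
```

The sinusoidal call succeeds, but that only shows the numbers exist. It says nothing about whether a model trained on shorter spans will use them well.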
Three attempts that sound fine but still hurt¶
Attempt 1: Train on longer sequences¶
Yes, this helps. But compute cost rises sharply. Memory rises. Data packing gets harder. You still may not cover the next real production length.
Attempt 2: Increase the max-position table¶
You may expand learned slots from 2048 to 8192. Nice. Now you store more vectors. But the model still treats them as separate absolute IDs. Distance structure is not built in.
Attempt 3: Hope sinusoidal alone will generalize¶
Sometimes it works okay. Sometimes it fades. Especially when training mostly saw shorter spans. So what to do? Bake relative behaviour into the attention mechanism itself.
RoPE mental model¶
RoPE means Rotary Position Embedding. Do not get scared by the name. Picture
every adjacent pair of dimensions in Q and K as a tiny 2-D arrow on a compass.
For each position, rotate that arrow by some angle. Position 1 rotates a
little. Position 2 rotates more. Position 100 rotates much more. And you
rotate both Q and K using the same rule. Now their dot product depends on the
angle gap. That means it depends on relative distance. Very elegant.
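Why the gap and not the absolute angles? A short derivation for one 2-D pair makes it visible. The notation is assumed here for illustration: R(θ) is the 2-D rotation matrix, and θ_m, θ_n are the angles applied at the query and key positions.

```latex
\begin{aligned}
R(\theta) &= \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}, \\
\langle R(\theta_m)\,q,\; R(\theta_n)\,k \rangle
  &= q^\top R(\theta_m)^\top R(\theta_n)\,k \\
  &= q^\top R(\theta_n - \theta_m)\,k .
\end{aligned}
```

Only the difference θ_n − θ_m survives. If the angle grows linearly with position, that difference is proportional to the distance between the two tokens.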
RoPE formula picture¶
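Take one 2-D pair. Original vector:
$$[x_1, x_2]$$
Rotate by angle θ_pos:
$$[x_1 \cos\theta_{pos} - x_2 \sin\theta_{pos},\; x_1 \sin\theta_{pos} + x_2 \cos\theta_{pos}]$$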
Transformers do this for many 2-D pairs at different frequencies. Fast pairs
rotate quickly. Slow pairs rotate slowly. Same multi-clock idea as sinusoidal.
But now the rotation happens inside Q and K. That is the key shift. RoPE
does not just add a seat vector. It rotates the comparison space itself.
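A minimal sketch of the multi-pair rotation, in plain Python. It assumes adjacent (even, odd) pairing and the common base of 10000 for the frequencies. The names `apply_rope` and `dot` and the sample vectors are made up for illustration, and production implementations may pair dimensions differently.

```python
import math

def apply_rope(vec, pos, base=10000.0):
    """Rotate each adjacent (even, odd) pair of `vec` by pos * frequency."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        freq = base ** (-i / d)          # first pair rotates fastest, later pairs slower
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        x1, x2 = vec[i], vec[i + 1]
        out.extend([x1 * c - x2 * s, x1 * s + x2 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Arbitrary illustrative query and key vectors (assumed values).
q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.9]

# Same offset of 3 between query and key, at two very different absolute positions.
score_near = dot(apply_rope(q, 5), apply_rope(k, 2))
score_far = dot(apply_rope(q, 105), apply_rope(k, 102))
print(abs(score_near - score_far) < 1e-9)
```

Shifting both positions by 100 leaves the score unchanged up to floating-point noise, which is the relative-distance behaviour described above.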
Worked RoPE example in 2-D¶
Use a base vector:
$$[1, 0]$$
Nice and clean. Rotate the query at position 1 by 30°. Rotate the key at
position 3 by 90°.
Step 1: rotate the query¶
For 30°:
$$q' = [\cos 30^\circ, \sin 30^\circ] \approx [0.866, 0.5]$$
Step 2: rotate the key¶
For 90°:
$$k' = [\cos 90^\circ, \sin 90^\circ] = [0, 1]$$
Step 3: compute the dot product¶
$$q' \cdot k' = 0.866 \times 0 + 0.5 \times 1 = 0.5$$
See what happened. The score depends on the angle gap. Query angle is 30°.
Key angle is 90°. Gap is 60°. And cos 60° = 0.5. So the similarity
reflects relative position. Very neat.
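The arithmetic can be checked with a few lines of Python. This sketch assumes the unit base vector [1, 0] for both query and key. The helper `rotate` is a name introduced here for illustration.

```python
import math

def rotate(vec, degrees):
    """Rotate a 2-D vector counter-clockwise by `degrees`."""
    t = math.radians(degrees)
    x, y = vec
    return [x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t)]

base = [1.0, 0.0]            # assumed unit base vector
q_rot = rotate(base, 30)     # query at position 1
k_rot = rotate(base, 90)     # key at position 3

score = q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]
gap_cosine = math.cos(math.radians(90 - 30))

# Both values round to 0.5: the score equals the cosine of the 60° gap.
print(round(score, 6), round(gap_cosine, 6))
```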
ALiBi mental model¶
ALiBi is simpler. No rotation. No trigonometry inside vectors. Just change the score directly. Name says it: Attention with Linear Biases. Start with the raw attention score. Then subtract a penalty based on distance. Farther token? Bigger penalty. Closer token? Smaller penalty. So the model starts with a built-in preference for nearer context.
That is the whole picture. Easy to implement. Easy to scale.
Worked ALiBi example¶
Query is at position 4. Suppose raw scores for four candidate keys are:
Take the slope m = 0.5.
Step 1: compute penalties¶
Step 2: subtract from raw scores¶
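Both steps fit in a short Python sketch. The query position 4 and slope 0.5 come from the text. The raw scores and the key positions 1 to 4 are illustrative assumptions chosen here, and `alibi_adjust` is a hypothetical helper name.

```python
def alibi_adjust(raw_scores, key_positions, query_pos, slope):
    """Subtract slope * distance from each raw score (keys at or before the query)."""
    penalties = [slope * (query_pos - p) for p in key_positions]
    adjusted = [s - pen for s, pen in zip(raw_scores, penalties)]
    return penalties, adjusted

raw_scores = [3.0, 1.0, 2.0, 2.5]   # assumed values; the farthest key has the best raw score
key_positions = [1, 2, 3, 4]        # assumed key positions, distances 3, 2, 1, 0

penalties, adjusted = alibi_adjust(raw_scores, key_positions, query_pos=4, slope=0.5)
print(penalties)  # [1.5, 1.0, 0.5, 0.0]
print(adjusted)   # [1.5, 0.0, 1.5, 2.5]
```

With these assumed numbers, the key at position 1 starts with the highest raw score but ends below the key at position 4 once distance is charged.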
Now the nearby keys are relatively favoured. Even if a far key had a good raw score, distance pulls it down. That helps long-context stability. And unlike a learned table, this penalty rule works beyond the seen maximum length.
When to use which¶
RoPE is richer. It preserves a geometric notion of relative phase inside Q
and K. That is why many decoder LLMs use it. ALiBi is simpler. It adds a
direct distance penalty to scores. That makes the extrapolation story cleaner
and the implementation lighter. A useful rule of thumb:
- choose RoPE when you want strong default performance in modern transformer stacks
- choose ALiBi when simplicity and length extrapolation matter more than geometric richness
See. Both are answers to the same pain. Absolute seat numbers alone feel
brittle on long prompts.
ASCII picture: absolute vs relative¶
absolute view:
slot 0 slot 1 slot 2 slot 3 slot 4
| | | | |
A B C D E
relative view from token E:
E looks to D : distance 1
E looks to C : distance 2
E looks to B : distance 3
E looks to A : distance 4
Where this lives in the wild¶
- Meta Llama. Llama-family models use RoPE so attention keeps a relative-position signal even in long prompts.
- Mistral and Mixtral. These long-context decoder models also rely on RoPE-style relative geometry for practical context handling.
- BLOOM and MPT. These open models adopted ALiBi because the score penalty is simple and length-friendly.
- Anthropic Claude. Long-document assistants need position schemes that stay stable as prompts become very large; relative methods are part of that design space.
- Code completion products like GitHub Copilot. Long files and repeated symbols make relative distance especially important when retrieving the right earlier token.
Interview Q&A¶
Q: Why do absolute learned positional embeddings fail beyond max length?
A: Because the model literally has no learned vector for unseen slots past the
table size. If training stopped at 2048, position 4096 has no representation.
Common wrong answer to avoid: "The model can interpolate automatically."
Learned tables do not magically define unseen rows.
Q: What is the core idea of RoPE in one line?
A: Rotate every 2-D pair of Q and K by a position-dependent angle so their dot
product carries relative distance information.
Q: What is the core idea of ALiBi in one line?
A: Subtract a linear penalty from attention scores based on token distance, so
farther tokens start with a handicap.
Common wrong answer to avoid: "ALiBi changes the values V." It changes
attention scores, not the contributed value vectors.
Q: When would you prefer ALiBi over RoPE?
A: When you want a very simple mechanism, easy extrapolation, and lower
implementation complexity. RoPE is often stronger by default, but ALiBi is
cleaner.
Apply now (5 min)¶
Do two mini drills.
1. Draw the vector [1,0] on x-y axes.
2. Rotate it by 45° and write the new coordinates.
3. Rotate another copy by 90°.
4. Compute their dot product.
5. Say which relative angle that dot product reflects.
Then do the ALiBi side.
1. Take raw scores [9, 8, 7].
2. Distances [2, 1, 0].
3. Slope 0.25.
4. Compute the adjusted scores.
5. Say which key becomes most favoured.
Then sketch from memory:
- a rotating arrow for RoPE
- a straight penalty line for ALiBi
- the sentence "relative distance matters more than absolute slot"
If you can explain why RoPE rotates and ALiBi subtracts, long-context position
has clicked.
Bridge. Position is now handled well enough. Next comes the real mechanism that uses it: one token directly consulting other tokens. That is attention as soft lookup. Read
07-attention-as-lookup.md next.