06. RoPE and ALiBi — relative position for long context¶

Absolute seat numbers help early. Long prompts expose their weakness. See the relative fix.

Built on the ELI5 in 00-eli5.md. The seat number — the place signal added to each token — now shifts from absolute slots to relative distance.

The picture before the math¶

Absolute positions feel neat. Token 0 gets one vector. Token 1 gets another. Token 2047 gets another. But long context asks a harder question. Not just, "Where am I?" Also, "How far am I from the useful thing?" That is a relative question. Picture two students in a classroom. One asks a doubt. The other answers. You often care less about their absolute roll numbers. You care about their distance. Are they adjacent? Three seats apart? Forty seats apart? RoPE and ALiBi lean into that idea. They make attention sensitive to distance, not just absolute slot ID.

Why absolute positions feel brittle¶

Learned positional embeddings stop at a max length. Train with max length 2048. Now ask the model to read token 4096. There is no learned vector for that slot. Game over. Sinusoidal encoding can extend beyond training length. Good. But extension is not the same as strong extrapolation. The model may technically receive numbers for far positions. Still, behaviour can degrade. Why? Because many useful patterns are relative. "Look back 3 tokens." "Attend strongly to the previous line." "Discount things that are far away." Absolute slot IDs do not express those relations directly.
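A minimal sketch of the table problem, assuming a toy learned table stored as a plain Python list. The names `MAX_LEN`, `DIM`, and `lookup_position` are illustrative only, and random numbers stand in for trained values.

```python
import random

MAX_LEN = 2048   # assumed training-time maximum length
DIM = 4          # toy embedding width, chosen only for illustration

random.seed(0)
# One "learned" vector per absolute slot; random values stand in for training.
position_table = [[random.gauss(0.0, 0.02) for _ in range(DIM)]
                  for _ in range(MAX_LEN)]


def lookup_position(pos):
    """Return the vector for an absolute slot, or None if no such row exists."""
    if 0 <= pos < len(position_table):
        return position_table[pos]
    return None


print(lookup_position(2047) is not None)  # True: the last trained slot exists
print(lookup_position(4096))              # None: no row was ever learned
```

Making the table bigger moves the cliff but does not remove it.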

Three attempts that sound fine but still hurt¶

Attempt 1: Train on longer sequences¶

Yes, this helps. But compute cost rises sharply. Memory rises. Data packing gets harder. You still may not cover the next real production length.

Attempt 2: Increase the max-position table¶

You may expand learned slots from 2048 to 8192. Nice. Now you store more vectors. But the model still treats them as separate absolute IDs. Distance structure is not built in.

Attempt 3: Hope sinusoidal alone will generalize¶

Sometimes it works okay. Sometimes it fades. Especially when training mostly saw shorter spans. So what to do? Bake relative behaviour into the attention mechanism itself.

RoPE mental model¶

RoPE means Rotary Positional Embedding. Do not get scared by the name. Picture every adjacent 2-D pair as a tiny arrow on a compass. For each position, rotate that arrow by some angle. Position 1 rotates a little. Position 2 rotates more. Position 100 rotates much more. And you rotate both Q and K using the same rule. Now their dot product depends on the angle gap. That means it depends on relative distance. Very elegant.

base vector -->
pos 0 : -->
pos 1 :  /
pos 2 :  |
pos 3 :  \

Different positions are different rotations of the same content directions. So similarity carries distance information automatically.

RoPE formula picture¶

Take one 2-D pair. Original vector:

[x, y]

Rotate by angle θ_pos:

[x', y'] = [x cosθ - y sinθ,
            x sinθ + y cosθ]

Transformers do this for many 2-D pairs at different frequencies. Fast pairs rotate quickly. Slow pairs rotate slowly. Same multi-clock idea as sinusoidal. But now the rotation happens inside Q and K. That is the key shift. RoPE does not just add a seat vector. It rotates the comparison space itself.
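The multi-pair rotation can be sketched in plain Python. This is an illustration, not any library's implementation: the frequency schedule `base ** (-2 * i / d)` with `base = 10000` is the commonly used convention and is an assumption here, and `rope_rotate`, `dot`, and `rope_score` are hypothetical helper names.

```python
import math


def rope_rotate(vec, pos, base=10000.0):
    """Rotate each adjacent 2-D pair of `vec` by an angle that grows with `pos`.

    Pair i uses angle pos * base**(-2*i/d), so early pairs spin fast
    and later pairs spin slowly.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(d // 2):
        theta = pos * base ** (-2.0 * i / d)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def rope_score(q, k, q_pos, k_pos):
    """Attention score between a rotated query and a rotated key."""
    return dot(rope_rotate(q, q_pos), rope_rotate(k, k_pos))


q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same gap of 2 positions, very different absolute slots.
near = rope_score(q, k, q_pos=5, k_pos=3)
far = rope_score(q, k, q_pos=105, k_pos=103)
print(abs(near - far) < 1e-9)  # True: only the gap matters
```

Shifting both positions by 100 leaves the score unchanged, which is the relative behaviour the rotation is meant to give.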

Worked RoPE example in 2-D¶

Use a base vector:

[1, 0]

Nice and clean. Assume each position adds 30° of rotation. So the query at position 1 rotates by 30°, and the key at position 3 rotates by 90°.

Step 1: rotate the query¶

For 30°:

cos 30° ≈ 0.866
sin 30° = 0.5

So:

q_rot = [1·0.866 - 0·0.5,
         1·0.5   + 0·0.866]
      = [0.866, 0.5]

Step 2: rotate the key¶

For 90°:

cos 90° = 0
sin 90° = 1

So:

k_rot = [1·0 - 0·1,
         1·1 + 0·0]
      = [0, 1]

Step 3: compute the dot product¶

q_rot · k_rot = 0.866·0 + 0.5·1 = 0.5

See what happened. The score depends on the angle gap. Query angle is 30°. Key angle is 90°. Gap is 60°. And cos 60° = 0.5. So the similarity reflects relative position. Very neat.
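The three steps can be checked numerically. A small sketch using only the standard library; `rotate` is a hypothetical helper written for this example.

```python
import math


def rotate(vec, degrees):
    """Rotate a 2-D vector counter-clockwise by `degrees`."""
    t = math.radians(degrees)
    x, y = vec
    return [x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t)]


base_vec = [1.0, 0.0]
q_rot = rotate(base_vec, 30)   # query at position 1
k_rot = rotate(base_vec, 90)   # key at position 3

score = q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]

print([round(v, 3) for v in q_rot])          # [0.866, 0.5]
print([round(v, 3) for v in k_rot])          # [0.0, 1.0]
print(round(score, 3))                       # 0.5
print(round(math.cos(math.radians(60)), 3))  # 0.5
```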

ALiBi mental model¶

ALiBi is simpler. No rotation. No trigonometry inside vectors. Just change the score directly. Name says it: Attention with Linear Biases. Start with the raw attention score. Then subtract a penalty based on distance. Farther token? Bigger penalty. Closer token? Smaller penalty. So the model starts with a built-in preference for nearer context.

adjusted score = raw score - slope × distance

That is the whole picture. Easy to implement. Easy to scale.
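For a whole sequence, the penalty becomes a matrix added to the score matrix. A minimal sketch for one attention head, assuming causal attention (each query sees only itself and earlier keys); the causal mask and the helper names `alibi_bias` and `apply_alibi` are assumptions made for illustration.

```python
def alibi_bias(seq_len, slope):
    """Bias for causal attention: slope * (key_pos - query_pos), which is <= 0.

    Future keys (key_pos > query_pos) get -inf so softmax ignores them.
    """
    neg_inf = float("-inf")
    return [[slope * (j - i) if j <= i else neg_inf
             for j in range(seq_len)]
            for i in range(seq_len)]


def apply_alibi(raw_scores, slope):
    """Add the ALiBi bias to a square matrix of raw attention scores."""
    bias = alibi_bias(len(raw_scores), slope)
    return [[s + b for s, b in zip(score_row, bias_row)]
            for score_row, bias_row in zip(raw_scores, bias)]


for row in alibi_bias(4, slope=0.5):
    print(row)
# last row printed: [-1.5, -1.0, -0.5, 0.0]
```

Nothing in `alibi_bias` depends on a table size, so the same rule applies to any `seq_len`.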

Worked ALiBi example¶

Query is at position 4. Suppose raw scores for four candidate keys are:

[8, 7, 6, 5]

Suppose their distances from the query are:

[3, 2, 1, 0]

Take slope m = 0.5.

Step 1: compute penalties¶

0.5 × [3, 2, 1, 0] = [1.5, 1.0, 0.5, 0.0]

Step 2: subtract from raw scores¶

adjusted = [8, 7, 6, 5] - [1.5, 1.0, 0.5, 0.0]
= [6.5, 6.0, 5.5, 5.0]

Now the nearby keys are relatively favoured. The farthest key still leads here, but its margin over the nearest key shrinks from 3.0 to 1.5. So even when a far key has a good raw score, distance pulls it down. That helps long-context stability. And unlike a learned table, this penalty rule works beyond the seen maximum length.
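To see what the penalty does to the final attention weights, the same numbers can be pushed through a softmax. A sketch with the standard library only; the `softmax` helper is written here for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


raw = [8.0, 7.0, 6.0, 5.0]
distances = [3, 2, 1, 0]
slope = 0.5

adjusted = [r - slope * d for r, d in zip(raw, distances)]
print(adjusted)  # [6.5, 6.0, 5.5, 5.0]

before = softmax(raw)       # attention weights without the penalty
after = softmax(adjusted)   # attention weights with the penalty
print([round(w, 3) for w in before])
print([round(w, 3) for w in after])
```

The farthest key's share of attention drops from roughly 0.64 to roughly 0.46, while the nearest key's share rises.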

When to use which¶

RoPE is richer. It preserves a geometric notion of relative phase inside Q and K. That is why many decoder LLMs use it. ALiBi is simpler. It adds a direct distance penalty to scores. That makes the extrapolation story cleaner and the implementation lighter. A useful rule of thumb:

- choose RoPE when you want strong default performance in modern transformer stacks
- choose ALiBi when simplicity and length extrapolation matter more than geometric richness

See. Both are answers to the same pain. Absolute seat numbers alone feel brittle on long prompts.

ASCII picture: absolute vs relative¶

absolute view:
slot 0   slot 1   slot 2   slot 3   slot 4
|        |        |        |        |
A        B        C        D        E
relative view from token E:
E looks to D : distance 1
E looks to C : distance 2
E looks to B : distance 3
E looks to A : distance 4

Long context usually cares about the second picture.
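The difference between the two views shows up when the same span sits deeper in a prompt. A small sketch; `relative_view` and `start_slot` are hypothetical names for this example.

```python
tokens = ["A", "B", "C", "D", "E"]


def relative_view(tokens, start_slot=0):
    """Distances from the last token back to every earlier token.

    `start_slot` shifts where the span begins in the context window.
    """
    slots = [start_slot + i for i in range(len(tokens))]
    query_slot = slots[-1]
    return {tok: query_slot - slot
            for tok, slot in zip(tokens[:-1], slots[:-1])}


print(relative_view(tokens))                   # {'A': 4, 'B': 3, 'C': 2, 'D': 1}
print(relative_view(tokens, start_slot=1000))  # same distances, different slots
```

The absolute slots change from 0..4 to 1000..1004, but the distances seen from E stay the same.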

Where this lives in the wild¶

Meta Llama. Llama-family models use RoPE so attention keeps a relative-position signal even in long prompts.
Mistral and Mixtral. These long-context decoder models also rely on RoPE-style relative geometry for practical context handling.
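BLOOM and MPT. These open models adopted ALiBi because the score penalty is simple and length-friendly. Falcon is often listed here too, but the main Falcon-7B and Falcon-40B releases use rotary embeddings; ALiBi appears in the Falcon-RW variants.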
Anthropic Claude. Long-document assistants need position schemes that stay stable as prompts become very large; relative methods are part of that design space.
Code completion products like GitHub Copilot. Long files and repeated symbols make relative distance especially important when retrieving the right earlier token.

Interview Q&A¶

Q: Why do absolute learned positional embeddings fail beyond max length?
A: Because the model literally has no learned vector for unseen slots past the table size. If training stopped at 2048, position 4096 has no representation.
Common wrong answer to avoid: "The model can interpolate automatically." Learned tables do not magically define unseen rows.

Q: What is the core idea of RoPE in one line?
A: Rotate every 2-D pair of Q and K by a position-dependent angle so their dot product carries relative distance information.

Q: What is the core idea of ALiBi in one line?
A: Subtract a linear penalty from attention scores based on token distance, so farther tokens start with a handicap.
Common wrong answer to avoid: "ALiBi changes the values V." It changes attention scores, not the contributed value vectors.

Q: When would you prefer ALiBi over RoPE?
A: When you want a very simple mechanism, easy extrapolation, and lower implementation complexity. RoPE is often stronger by default, but ALiBi is cleaner.

Apply now (5 min)¶

Do two mini drills.

1. Draw the vector [1,0] on x-y axes.
2. Rotate it by 45° and write the new coordinates.
3. Rotate another copy by 90°.
4. Compute their dot product.
5. Say which relative angle that dot product reflects.

Then do the ALiBi side.

1. Take raw scores [9, 8, 7].
2. Distances [2, 1, 0].
3. Slope 0.25.
4. Compute the adjusted scores.
5. Say which key becomes most favoured.

Then sketch from memory:

- a rotating arrow for RoPE
- a straight penalty line for ALiBi
- the sentence "relative distance matters more than absolute slot"

If you can explain why RoPE rotates and ALiBi subtracts, long-context position has clicked.

Bridge. Position is now handled well enough. Next comes the real mechanism that uses it: one token directly consulting other tokens. That is attention as soft lookup. Read 07-attention-as-lookup.md next.