
13. Cross-attention — one sequence consulting another

Self-attention is room talk. Cross-attention lets one row query a second row.

Built on the ELI5 in 00-eli5.md. The spotlight beam — one token checking who matters — now shines across two different rows instead of staying inside one.


The picture before the math

See. Self-attention is one classroom. Cross-attention is two classrooms with a glass window. One row holds the source information. Another row holds the unfinished output. The output row asks questions. The source row supplies memory. So the spotlight beam crosses the room boundary.

encoder row : e1  e2  e3  e4  e5
decoder row : d1  d2  d3
d2 spotlight ----> e1
               ---> e3
               ---> e5
That is the whole picture.

Self-attention vs cross-attention

In self-attention, queries, keys, and values all come from the same sequence.

self-attention
Q <- X
K <- X
V <- X

In cross-attention, queries come from one sequence, but keys and values come from another.

cross-attention
Q <- decoder sequence
K <- encoder sequence
V <- encoder sequence
So the scorecard asks a different question. Self-attention asks: "which tokens in my own row matter to me?" Cross-attention asks: "which tokens in that other row matter to me?" Simple, no?

Why we need this extra bridge

Suppose we are translating. The source sentence is "the red car is fast". The target sentence is still being generated. If the decoder is about to write voiture, it needs the source token car. That clue does not live inside the partially generated target row. It lives in the source row. So masked self-attention alone is not enough. We need a second spotlight beam. That beam goes from decoder to encoder. This is the encoder-decoder bridge.

One clean block picture

In a classic decoder block, the order is:

decoder states
   |
   v
masked self-attention
   |
   v
cross-attention over encoder outputs
   |
   v
feed-forward network
Why this order? First, the decoder organizes what it has generated so far. Then, it consults the source memory. Then, it transforms the mixed result further. That is the usual rhythm.
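
The three-step rhythm above can be sketched in plain Python. This is a toy under loud assumptions: projections are identity (no learned W_Q, W_K, W_V), residual connections and layer normalization are left out, and `attention`, `feed_forward`, and `decoder_block` are hypothetical helper names chosen for this illustration, not any library's API.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, memory, causal=False):
    """Toy attention with identity projections (Q = queries, K = V = memory).

    With causal=True, query i may only look at memory positions 0..i.
    """
    width = len(memory[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [
            float("-inf") if causal and j > i
            else sum(a * b for a, b in zip(q, k))
            for j, k in enumerate(memory)
        ]
        weights = softmax(scores)
        outputs.append([sum(w * m[c] for w, m in zip(weights, memory)) for c in range(width)])
    return outputs

def feed_forward(rows):
    """Stand-in for the FFN: a fixed per-position scale, shift, and ReLU."""
    return [[max(0.0, 2.0 * x - 0.5) for x in row] for row in rows]

def decoder_block(decoder_states, encoder_outputs):
    h = attention(decoder_states, decoder_states, causal=True)  # 1. masked self-attention
    h = attention(h, encoder_outputs)                           # 2. cross-attention over encoder outputs
    return feed_forward(h)                                      # 3. feed-forward network

encoder_outputs = [[1.0, 0.0], [2.0, 1.0], [5.0, 1.0], [0.0, 2.0], [1.0, 5.0]]
decoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
block_out = decoder_block(decoder_states, encoder_outputs)
print(len(block_out), "rows of width", len(block_out[0]))
```

The output keeps one row per decoder position. Only step 2 ever touches the encoder rows.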

Picture before matrix shapes

Keep two rectangles in mind.

encoder output E : 5 x d
decoder state D  : 3 x d
Now project them.
Q = D W_Q   -> 3 x d_k
K = E W_K   -> 5 x d_k
V = E W_V   -> 5 x d_v
Then the scorecard is:
scores = Q K^T -> 3 x 5
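Why 3 x 5? Three decoder queries. Five encoder memory slots. Each decoder token scores every encoder token. (Standard scaled dot-product attention also divides these scores by sqrt(d_k) before the softmax; that changes the values, not the shape.) Then: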
weights = row_softmax(scores)
output  = weights V -> 3 x d_v
So each decoder row gets a weighted mix of encoder values.
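
To check the shape bookkeeping, here is a small sketch with random toy matrices in plain Python lists. The sizes d = 4, d_k = 3, d_v = 2 and the helpers `matmul`, `shape`, `row_softmax`, and `random_matrix` are assumptions made for this illustration. The division by sqrt(d_k) is the usual scaling and does not affect any shape.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def shape(m):
    return (len(m), len(m[0]))

def row_softmax(m):
    """Apply a numerically stable softmax to each row independently."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, d_k, d_v = 4, 3, 2            # assumed toy sizes

E = random_matrix(5, d, rng)     # encoder output: 5 x d
D = random_matrix(3, d, rng)     # decoder state:  3 x d
W_Q = random_matrix(d, d_k, rng)
W_K = random_matrix(d, d_k, rng)
W_V = random_matrix(d, d_v, rng)

Q = matmul(D, W_Q)               # queries come from the decoder
K = matmul(E, W_K)               # keys come from the encoder
V = matmul(E, W_V)               # values come from the encoder

K_T = [list(col) for col in zip(*K)]
scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
weights = row_softmax(scores)
output = matmul(weights, V)

print(shape(Q), shape(K), shape(V))   # (3, 3) (5, 3) (5, 2)
print(shape(scores), shape(output))   # (3, 5) (3, 2)
```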

ASCII diagram — where Q, K, and V come from

decoder states (3 x d) --W_Q--> Q (3 x d_k)
                               |
                               v
                        scores QK^T (3 x 5)
                               ^
                               |
encoder states (5 x d) --W_K--> K (5 x d_k)
encoder states (5 x d) --W_V--> V (5 x d_v)

weights = softmax(scores by row)   (3 x 5)
output  = weights x V              (3 x d_v)

That is cross-attention in one picture.

Worked example — 3 decoder queries, 5 encoder outputs

Use a tiny translation-style setup. Encoder tokens:

e1 = the
e2 = red
e3 = car
e4 = is
e5 = fast
Decoder has three query positions:
d1 = la
d2 = voiture
d3 = rapide
Suppose the raw scorecard is:
             e1    e2    e3    e4    e5
d1 "la"      2.2   0.2   0.8  -0.5  -0.8
d2 "voiture" 0.1   0.5   2.3   0.2  -0.4
d3 "rapide" -0.7   0.3   0.2   1.1   2.4
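After row-wise softmax, imagine we get roughly the following weights (rounded for illustration; an exact softmax of the scores above is slightly sharper, for example about 0.68 rather than 0.64 on car for d2):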
d1 -> [0.63, 0.08, 0.18, 0.06, 0.05]
d2 -> [0.07, 0.11, 0.64, 0.10, 0.08]
d3 -> [0.03, 0.08, 0.07, 0.25, 0.57]
Read the rows slowly. d1 = la looks mostly at the. Good. d2 = voiture looks mostly at car. Very good. d3 = rapide looks mostly at fast, with some help from is. Also sensible. So the scorecard shape is:
3 decoder queries x 5 encoder memories
Not square. That is the first visual clue that this is cross-attention.
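
The scorecard can be pushed through an actual row-wise softmax with a few lines of plain Python. The exact weights come out a little sharper than the rounded, illustrative rows in the text, but the top source token for each decoder word is the same. The names `scorecard` and `best_match` are just for this sketch.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

source_tokens = ["the", "red", "car", "is", "fast"]
scorecard = {
    "la":      [2.2, 0.2, 0.8, -0.5, -0.8],
    "voiture": [0.1, 0.5, 2.3, 0.2, -0.4],
    "rapide":  [-0.7, 0.3, 0.2, 1.1, 2.4],
}

attention_rows = {}
best_match = {}
for word, scores in scorecard.items():
    weights = softmax(scores)
    attention_rows[word] = weights
    best_match[word] = source_tokens[weights.index(max(weights))]
    print(word, [round(w, 2) for w in weights], "->", best_match[word])
```

Each printed row sums to 1, and the matrix as a whole is 3 rows by 5 columns.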

One weighted-sum calculation

Now give the encoder values tiny 2D vectors.

the  -> [1, 0]
red  -> [2, 1]
car  -> [5, 1]
is   -> [0, 2]
fast -> [1, 5]
Use the d2 = voiture weights:
[0.07, 0.11, 0.64, 0.10, 0.08]
Compute the mixed value.
o2
= 0.07[1,0] + 0.11[2,1] + 0.64[5,1] + 0.10[0,2] + 0.08[1,5]
= [0.07,0.00] + [0.22,0.11] + [3.20,0.64] + [0.00,0.20] + [0.08,0.40]
= [3.57, 1.35]
See what happened. The output for voiture mostly copies semantic mass from car, not from every source token equally. That is the bridge working.
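
The same weighted sum can be checked in a couple of lines. This sketch uses the illustrative d2 weights and the toy 2D value vectors exactly as given above.

```python
weights = [0.07, 0.11, 0.64, 0.10, 0.08]            # d2 = voiture over e1..e5
values = [[1, 0], [2, 1], [5, 1], [0, 2], [1, 5]]   # the, red, car, is, fast

# Weighted sum of value vectors, one output coordinate at a time.
o2 = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(2)]
print([round(x, 2) for x in o2])   # [3.57, 1.35]

# Share of the first coordinate contributed by "car" alone.
car_share = (weights[2] * values[2][0]) / o2[0]
print(round(car_share, 2))
```

Most of the first coordinate comes from the car term (3.20 out of 3.57), which is the "mostly copies from car" claim in numbers.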

How cross-attention differs from self-attention in practice

1) Q comes from the decoder side

The query changes at each target position. So each partially generated token asks a different question.

2) K and V come from the encoder side

The encoder row is already finished. Its badge-board vectors and seat numbers are stable for this example. So we compute encoder K and V once per decoder layer, since each layer has its own W_K and W_V. During decoding, we keep reusing them at every step.
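
A minimal sketch of that reuse, for a single layer, in plain Python. The projection weights are made-up toy values, and `project_encoder`, `cross_attend`, and the `stats` counter are hypothetical names introduced only to show that K and V are computed once while the query changes at every step.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(s - peak) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical toy projection weights for 2-d states.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[1.0, 0.0], [0.0, 1.0]]
W_V = [[0.5, 0.0], [0.0, 0.5]]

stats = {"encoder_projections": 0}

def project_encoder(encoder_out):
    """Compute K and V from the finished encoder output."""
    stats["encoder_projections"] += 1
    return matmul(encoder_out, W_K), matmul(encoder_out, W_V)

def cross_attend(decoder_state, K, V):
    """One decode step: a fresh query scored against cached K, mixing cached V."""
    q = matmul([decoder_state], W_Q)[0]
    scores = [sum(a * b for a, b in zip(q, k)) for k in K]
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]

encoder_out = [[1.0, 0.0], [2.0, 1.0], [5.0, 1.0], [0.0, 2.0], [1.0, 5.0]]
K, V = project_encoder(encoder_out)                      # once, before decoding starts

decoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # one new state per decode step
step_outputs = [cross_attend(state, K, V) for state in decoder_states]

print(len(step_outputs), "decode steps,", stats["encoder_projections"], "encoder projection")
# prints: 3 decode steps, 1 encoder projection
```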

3) No causal mask is needed on the KV side

Why? Because the source sequence is fully known already. The model is not cheating by looking at source token 5. Source token 5 is supposed to be visible. Important nuance: the decoder still needs causal masking inside its own self-attention block. So: no future peeking in the target row, but full visibility over the source row.
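
The two visibility rules can be written as boolean masks. This is a small sketch in plain Python; `causal_mask`, `full_mask`, and `masked_softmax` are hypothetical helpers for illustration, and the scores fed in are arbitrary toy numbers.

```python
import math

def causal_mask(length):
    """True where a query may look: its own position and earlier ones."""
    return [[key <= query for key in range(length)] for query in range(length)]

def full_mask(target_len, source_len):
    """Every target position may look at every source position."""
    return [[True] * source_len for _ in range(target_len)]

def masked_softmax(scores, allowed):
    """Softmax where disallowed positions are set to -inf and get weight 0."""
    hidden = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    peak = max(hidden)
    exps = [math.exp(s - peak) for s in hidden]
    total = sum(exps)
    return [e / total for e in exps]

self_mask = causal_mask(3)      # decoder self-attention: 3 x 3, lower triangle
cross_mask = full_mask(3, 5)    # cross-attention: 3 x 5, everything visible

# d1 inside decoder self-attention: only position 0 is visible.
d1_self = masked_softmax([1.0, 2.0, 3.0], self_mask[0])
# d1 in cross-attention: all five source positions are visible.
d1_cross = masked_softmax([2.2, 0.2, 0.8, -0.5, -0.8], cross_mask[0])

print(d1_self)                        # [1.0, 0.0, 0.0]
print(all(w > 0 for w in d1_cross))   # True
```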

4) The attention matrix is usually rectangular

Self-attention inside one sequence of length L always gives a square L x L matrix. Cross-attention gives:

target_length x source_length

Whenever the two lengths differ, the shape alone tells you two different rows are involved.

Three clean use cases

Translation

Decoder query: "what source word helps me write the next target word?" That is the classic encoder-decoder bridge.

Image captioning

Encoder memory: image patches or visual regions. Decoder query: "which patch matters for the next word in the caption?" So dog may attend to the dog's region. grass may attend somewhere else.

Retrieval-augmented generation

In encoder-decoder RAG, retrieved passages become the encoder memory. The answer decoder asks: which retrieved token or sentence helps me write this next word? So the spotlight beam crosses from answer tokens to retrieved evidence.

One more compact comparison

self-attention
--------------
query token asks its own sequence

cross-attention
---------------
query token asks a different sequence

If the query row and memory row are the same, it is self-attention. If the query row and memory row differ, it is cross-attention. That is the entire test.

Where this lives in the wild

  • Google Translate uses decoder queries over source-language encoder states to write the target sentence.
  • Microsoft Translator uses the same bridge so each target token can consult source words directly.
  • Google T5 / FLAN-T5 decoder blocks cross-attend over encoded input text for summarization and instruction tasks.
  • Salesforce BLIP caption generation attends from text tokens to image features or patches.
  • Encoder-decoder RAG stacks built on BART or T5 cross-attend from answer tokens to retrieved passage encodings.

Interview Q&A

Q: What is the formal difference between self-attention and cross-attention?
A: In self-attention, Q, K, and V come from the same sequence. In cross-attention, Q comes from one sequence while K and V come from another.
Common wrong answer to avoid: "Cross-attention just means a bigger attention matrix." The key change is source separation, not size.

Q: Why is no causal mask needed on the encoder KV side?
A: Because the source sequence is already fully available, so the decoder is allowed to inspect every encoder position.
Common wrong answer to avoid: "Because cross-attention ignores order." No. Order still matters on both sides through their seat numbers.

Q: Why can encoder K and V be reused across decode steps?
A: Because the encoder output stays fixed for the whole source input, while only the decoder queries keep changing.

Q: Give three practical uses of cross-attention.
A: Translation, image captioning, and encoder-decoder retrieval-augmented generation are the cleanest examples.

Apply now (5 min)

Write one source row with 5 tokens. Write one target row with 3 tokens. Now invent a 3 x 5 scorecard. For each target token, mark the highest-weight source token.

Then answer three checks:
- which side produced Q?
- which side produced K and V?
- why is no causal mask needed on the source side?

Sketch from memory:
- the two-row bridge diagram
- the 3 x 5 matrix shape
- decoder block order: masked self-attention -> cross-attention -> FFN


Bridge. Cross-attention bridges two sequences. With self-attention, cross-attention, and all the tokenization machinery, this module is now complete. The next file admits what still feels unsolved. Read 14-honest-admission.md next.