13. Cross-attention — one sequence consulting another¶

Self-attention is room talk. Cross-attention lets one row query a second row.

Built on the ELI5 in 00-eli5.md. The spotlight beam — one token checking who matters — now shines across two different rows instead of staying inside one.

The picture before the math¶

See. Self-attention is one classroom. Cross-attention is two classrooms with a glass window. One row holds the source information. Another row holds the unfinished output. The output row asks questions. The source row supplies memory. So the spotlight beam crosses the room boundary.

encoder row : e1  e2  e3  e4  e5
decoder row : d1  d2  d3
d2 spotlight ----> e1
               ---> e3
               ---> e5

That is the whole picture.

Self-attention vs cross-attention¶

In self-attention, queries, keys, and values all come from the same sequence.

self-attention
Q <- X
K <- X
V <- X

In cross-attention, queries come from one sequence, but keys and values come from another.

cross-attention
Q <- decoder sequence
K <- encoder sequence
V <- encoder sequence

So the scorecard asks a different question. Self-attention asks: "which tokens in my own row matter to me?" Cross-attention asks: "which tokens in that other row matter to me?" Simple, no?
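A minimal sketch of that difference, in plain Python. It assumes identity projections (Q, K, and V are the raw rows themselves) and unscaled dot-product scores, purely to keep it short. The function name `attend` and the toy vectors are invented for illustration. The only thing that changes between the two calls is where the memory row comes from.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query_row, memory_row):
    """Each vector in query_row takes a weighted mix of the vectors in memory_row.

    Assumption: identity projections, so Q = query_row and K = V = memory_row.
    """
    width = len(memory_row[0])
    mixed = []
    for q in query_row:
        # One score per memory token: dot product of the query with that token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in memory_row]
        weights = softmax(scores)
        mixed.append([sum(w * v[c] for w, v in zip(weights, memory_row))
                      for c in range(width)])
    return mixed

# Toy 2-D vectors, invented for this sketch.
decoder_row = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                          # 3 tokens
encoder_row = [[1.0, 0.0], [2.0, 1.0], [0.5, 1.0], [0.0, 2.0], [1.0, 0.5]]  # 5 tokens

self_out = attend(decoder_row, decoder_row)    # self-attention: same row twice
cross_out = attend(decoder_row, encoder_row)   # cross-attention: two different rows
```

Both calls return one mixed vector per query token. What differs is which row supplied the memory.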

Why we need this extra bridge¶

Suppose we are translating. Source sentence: "the red car is fast". The target sentence is still being generated. If the decoder is about to write voiture, it needs the source token car. That clue does not live inside the partially generated target row. It lives in the source row. So masked self-attention alone is not enough. We need a second spotlight beam. That beam goes from decoder to encoder. This is the encoder-decoder bridge.

One clean block picture¶

In a classic decoder block, the order is:

decoder states
   |
   v
masked self-attention
   |
   v
cross-attention over encoder outputs
   |
   v
feed-forward network

Why this order? First, the decoder organizes what it has generated so far. Then, it consults the source memory. Then, it transforms the mixed result further. That is the usual rhythm.

Picture before matrix shapes¶

Keep two rectangles in mind.

encoder output E : 5 x d
decoder state D  : 3 x d

Now project them.

Q = D W_Q   -> 3 x d_k
K = E W_K   -> 5 x d_k
V = E W_V   -> 5 x d_v

Then the scorecard is:

scores = Q K^T -> 3 x 5

Why 3 x 5? Three decoder queries. Five encoder memory slots. Each decoder token scores every encoder token. Then:

weights = row_softmax(scores)
output  = weights V -> 3 x d_v

So each decoder row gets a weighted mix of encoder values.
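The shape bookkeeping above can be checked with a small sketch. Everything numeric here is an assumption: the sizes d = 4, d_k = 3, d_v = 2 and the random weights are picked only so the shapes can be printed. The helper names (`matmul`, `rand_matrix`, `shape`) are made up for this example.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def shape(m):
    return (len(m), len(m[0]))

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, d_k, d_v = 4, 3, 2          # assumed sizes, picked only for the demo

E = rand_matrix(5, d, rng)     # encoder output: 5 x d
D = rand_matrix(3, d, rng)     # decoder state:  3 x d
W_Q = rand_matrix(d, d_k, rng)
W_K = rand_matrix(d, d_k, rng)
W_V = rand_matrix(d, d_v, rng)

Q = matmul(D, W_Q)                           # 3 x d_k
K = matmul(E, W_K)                           # 5 x d_k
V = matmul(E, W_V)                           # 5 x d_v
scores = matmul(Q, transpose(K))             # 3 x 5
weights = [softmax(row) for row in scores]   # row-wise softmax, still 3 x 5
output = matmul(weights, V)                  # 3 x d_v

for name, m in [("Q", Q), ("K", K), ("V", V), ("scores", scores), ("output", output)]:
    print(name, shape(m))
```

The sketch follows the raw Q K^T scores used in this section. The usual scaled dot-product form divides the scores by sqrt(d_k) before the softmax, which changes no shape.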

ASCII diagram — where Q, K, and V come from¶

decoder states (3 x d) --W_Q--> Q (3 x d_k)
                               |
                               v
                        scores QK^T (3 x 5)
                               ^
                               |
encoder states (5 x d) --W_K--> K (5 x d_k)
encoder states (5 x d) --W_V--> V (5 x d_v)
weights = softmax(scores by row)
output  = weights x V

That is cross-attention in one picture.

Worked example — 3 decoder queries, 5 encoder outputs¶

Use a tiny translation-style setup. Encoder tokens:

e1 = the
e2 = red
e3 = car
e4 = is
e5 = fast

Decoder has three query positions:

d1 = la
d2 = voiture
d3 = rapide

Suppose the raw scorecard is:

             e1    e2    e3    e4    e5
d1 "la"      2.2   0.2   0.8  -0.5  -0.8
d2 "voiture" 0.1   0.5   2.3   0.2  -0.4
d3 "rapide" -0.7   0.3   0.2   1.1   2.4

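After row-wise softmax, imagine we get roughly the following. The numbers are hand-rounded for illustration; the exact softmax of the scorecard is a little sharper (about 0.68 on car for d2), but the winners are the same: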

d1 -> [0.63, 0.08, 0.18, 0.06, 0.05]
d2 -> [0.07, 0.11, 0.64, 0.10, 0.08]
d3 -> [0.03, 0.08, 0.07, 0.25, 0.57]

Read the rows slowly. d1 = la looks mostly at the. Good. d2 = voiture looks mostly at car. Very good. d3 = rapide looks mostly at fast, with some help from is. Also sensible. So the scorecard shape is:

3 decoder queries x 5 encoder memories

Not square. That is the first visual clue that this is cross-attention.
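To see where such weights come from, here is a short sketch that runs a row-wise softmax over the scorecard above. The scores and tokens are the ones from this example; the dictionary layout and variable names are just one convenient way to hold them. The exact values come out a bit sharper than the rounded illustrative weights, but each target token still picks the same source token.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

source = ["the", "red", "car", "is", "fast"]
scorecard = {
    "la":      [2.2, 0.2, 0.8, -0.5, -0.8],
    "voiture": [0.1, 0.5, 2.3, 0.2, -0.4],
    "rapide":  [-0.7, 0.3, 0.2, 1.1, 2.4],
}

# One softmax per decoder query, so every row sums to 1.
weights = {target: softmax(scores) for target, scores in scorecard.items()}

# The source token with the highest weight for each target token.
top_source = {target: source[row.index(max(row))] for target, row in weights.items()}

for target, row in weights.items():
    print(target, [round(w, 2) for w in row], "->", top_source[target])
```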

One weighted-sum calculation¶

Now give the encoder values tiny 2D vectors.

the  -> [1, 0]
red  -> [2, 1]
car  -> [5, 1]
is   -> [0, 2]
fast -> [1, 5]

Use the d2 = voiture weights:

[0.07, 0.11, 0.64, 0.10, 0.08]

Compute the mixed value.

o2
= 0.07[1,0] + 0.11[2,1] + 0.64[5,1] + 0.10[0,2] + 0.08[1,5]
= [0.07,0.00] + [0.22,0.11] + [3.20,0.64] + [0.00,0.20] + [0.08,0.40]
= [3.57, 1.35]

See what happened. The output for voiture mostly copies semantic mass from car, not from every source token equally. That is the bridge working.
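The same arithmetic as a few lines of Python, using the value vectors and the d2 weights from above. The variable names are chosen for this sketch only.

```python
# Encoder value vectors in source order: the, red, car, is, fast.
values = [[1, 0], [2, 1], [5, 1], [0, 2], [1, 5]]

# Attention weights for d2 = voiture, as given in the worked example.
w_d2 = [0.07, 0.11, 0.64, 0.10, 0.08]

# Weighted sum, one output component at a time.
o2 = [sum(w * v[c] for w, v in zip(w_d2, values)) for c in range(2)]

print([round(x, 2) for x in o2])  # [3.57, 1.35]
```

The car term alone contributes [3.20, 0.64], which is most of the first component.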

How cross-attention differs from self-attention in practice¶

1) Q comes from the decoder side¶

The query changes at each target position. So each partially generated token asks a different question.

2) K and V come from the encoder side¶

The encoder row is already finished. Its badge-board vectors and seat numbers are stable for this example. So we compute encoder K and V once. During decoding, we keep reusing them.
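A rough sketch of that reuse. The projection weights, the decoder states, and the names `K_cache`, `V_cache`, and `cross_attend_step` are all hypothetical, invented to show the pattern: project the encoder side once, then project only a fresh query at each decode step.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of a (n x m) and b (m x p)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 2 x 2 projection weights, invented for this sketch.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.0], [0.0, 0.5]]
W_V = [[1.0, 0.0], [0.0, 1.0]]

encoder_out = [[1.0, 0.0], [2.0, 1.0], [5.0, 1.0], [0.0, 2.0], [1.0, 5.0]]

# Encoder side: projected once, before decoding starts.
K_cache = matmul(encoder_out, W_K)   # 5 x d_k
V_cache = matmul(encoder_out, W_V)   # 5 x d_v

def cross_attend_step(decoder_state):
    """One decode step: only the query is new; K_cache and V_cache are reused."""
    q = matmul([decoder_state], W_Q)[0]
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K_cache]
    weights = softmax(scores)
    return [sum(w * v[c] for w, v in zip(weights, V_cache))
            for c in range(len(V_cache[0]))]

# Decoder side: one new query per generated position (toy states).
decoder_states = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
outputs = [cross_attend_step(state) for state in decoder_states]
```

Three decode steps, but the encoder projections ran only once. With a long source and many generated tokens, that saving adds up.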

3) No causal mask is needed on the KV side¶

Why? Because the source sequence is fully known already. The model is not cheating by looking at source token 5. Source token 5 is supposed to be visible. Important nuance: the decoder still needs causal masking inside its own self-attention block. So: no future peeking in the target row, but full visibility over the source row.
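The two visibility rules can be sketched as boolean masks. This is an illustrative sketch, not any particular library's mask format: `causal_mask`, `full_mask`, and `masked_softmax` are made-up helpers, and the all-zero scores are chosen so that only the mask shapes the result.

```python
import math

def causal_mask(target_len):
    """True where target position i may look at target position j (j <= i)."""
    return [[j <= i for j in range(target_len)] for i in range(target_len)]

def full_mask(target_len, source_len):
    """Cross-attention: every target position may look at every source position."""
    return [[True] * source_len for _ in range(target_len)]

def masked_softmax(scores, allowed):
    """Softmax over one row, giving zero weight to disallowed positions."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    peak = max(kept)
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

self_mask = causal_mask(3)       # 3 x 3, lower triangular
cross_mask = full_mask(3, 5)     # 3 x 5, everything visible

# Flat scores (all zeros) make the effect of the mask easy to see.
first_self_row = masked_softmax([0.0, 0.0, 0.0], self_mask[0])
first_cross_row = masked_softmax([0.0] * 5, cross_mask[0])

print(first_self_row)    # [1.0, 0.0, 0.0]
print(first_cross_row)   # [0.2, 0.2, 0.2, 0.2, 0.2]
```

The first target token sees only itself in self-attention, yet it sees all five source tokens in cross-attention. Real batches may still mask padding on the source side; that is a separate mask from the causal one.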

4) The attention matrix is usually rectangular¶

Self-attention inside one sequence often gives L x L. Cross-attention gives:

target_length x source_length

Whenever the two lengths differ, the non-square shape alone tells you two different rows are involved.

Three clean use cases¶

Translation¶

Decoder query: "what source word helps me write the next target word?" That is the classic encoder-decoder bridge.

Image captioning¶

Encoder memory: image patches or visual regions. Decoder query: "which patch matters for the next word in the caption?" So dog may attend to the dog's region. grass may attend somewhere else.

Retrieval-augmented generation¶

In encoder-decoder RAG, retrieved passages become the encoder memory. The answer decoder asks: which retrieved token or sentence helps me write this next word? So the spotlight beam crosses from answer tokens to retrieved evidence.

One more compact comparison¶

self-attention
--------------
query token asks its own sequence

cross-attention
---------------
query token asks a different sequence

If the query row and memory row are the same, it is self-attention. If the query row and memory row differ, it is cross-attention. That is the entire test.

Where this lives in the wild¶

Google Translate uses decoder queries over source-language encoder states to write the target sentence.
Microsoft Translator uses the same bridge so each target token can consult source words directly.
Google T5 / FLAN decoder blocks cross-attend over encoded input text for summarization and instruction tasks.
Salesforce BLIP caption generation attends from text tokens to image features or patches.
Encoder-decoder RAG stacks built on BART or T5 cross-attend from answer tokens to retrieved passage encodings.

Interview Q&A¶

Q: What is the formal difference between self-attention and cross-attention?
A: In self-attention, Q, K, and V come from the same sequence. In cross-attention, Q comes from one sequence while K and V come from another.
Common wrong answer to avoid: "Cross-attention just means a bigger attention matrix." The key change is source separation, not size.

Q: Why is no causal mask needed on the encoder KV side?
A: Because the source sequence is already fully available, so the decoder is allowed to inspect every encoder position.
Common wrong answer to avoid: "Because cross-attention ignores order." No. Order still matters on both sides through their seat numbers.

Q: Why can encoder K and V be reused across decode steps?
A: Because the encoder output stays fixed for the whole source input, while only the decoder queries keep changing.

Q: Give three practical uses of cross-attention.
A: Translation, image captioning, and encoder-decoder retrieval-augmented generation are the cleanest examples.

Apply now (5 min)¶

Write one source row with 5 tokens. Write one target row with 3 tokens. Now invent a 3 x 5 scorecard. For each target token, mark the highest-weight source token.

Then answer three checks:
- which side produced Q?
- which side produced K and V?
- why is no causal mask needed on the source side?

Sketch from memory:
- the two-row bridge diagram
- the 3 x 5 matrix shape
- decoder block order: masked self-attention -> cross-attention -> FFN

Bridge. Cross-attention bridges two sequences. With self-attention, cross-attention, and all the tokenization machinery, this module is now complete. The next file admits what still feels unsolved. Read 14-honest-admission.md next.