
09. Encoder, decoder, encoder-decoder — three factory layouts

Three layouts. One assembly line idea. Different visibility rules. See the trade-offs clearly.

Built on the ELI5 in 00-eli5.md. The assembly line — the full transformer — now splits into three factory layouts with three different jobs.


Mental model — three ways to arrange the factory

Same core parts. Tokens. Residual stream. The station. The social bench. The private bench. The shortcut pipe. The quality inspector. What changes is the visibility rule. Who may look at whom? And at what moment? That one choice creates three common layouts.

Encoder-only says every token may look both left and right. So each token becomes a rich representation of the whole sentence. This is excellent for understanding.

Decoder-only says each token may look only left. So the model learns to extend a prefix one token at a time. This is excellent for generation.

Encoder-decoder splits reading and writing. The encoder reads the source fully. The decoder writes the target autoregressively. Cross-attention lets the writer consult the reader.

See the picture first. Then the math feels obvious.

encoder-only
input -> full self-attention stations -> contextual token states

decoder-only
prefix -> masked self-attention stations -> next-token logits

encoder-decoder
source -> encoder -> source memory
prefix -> decoder -> next-token logits
                    ^
                    |
              cross-attention


The three layouts at a glance

Layout            Can a token see?                Best at
-----------------------------------------------------------------
Encoder-only      left + self + right             understanding
Decoder-only      left + self only                generation
Encoder-decoder   source fully, target leftward   source -> target
So what to remember? Encoder-only builds representations. Decoder-only grows continuations. Encoder-decoder performs transformation. That is the shortest correct summary.
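
The table can be turned into a tiny visibility check. This is a minimal sketch: the helper name `self_visible` and the 0-based positions are assumptions made for illustration, not part of any library.

```python
def self_visible(i, n, causal):
    """Indices that position i may attend to within its own sequence of n tokens."""
    return list(range(i + 1)) if causal else list(range(n))

source = ["The", "bank", "will", "close", "at", "five"]
i = source.index("bank")

# Encoder-only: left + self + right.
encoder_view = [source[j] for j in self_visible(i, len(source), causal=False)]
# Decoder-only: left + self only.
decoder_view = [source[j] for j in self_visible(i, len(source), causal=True)]

# Encoder-decoder: the target looks leftward at itself and fully at the source.
target = ["Bank", "closes", "at", "five"]
t = target.index("closes")
target_self_view = [target[j] for j in self_visible(t, len(target), causal=True)]
target_cross_view = list(source)

print(encoder_view)
print(decoder_view)
print(target_self_view, target_cross_view)
```

The only thing that differs between the calls is the `causal` flag and which sequence is being read. That is the whole layout decision in miniature.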


Core formulas — the social bench changes only a little

The base attention rule is still this:

Attention(Q, K, V) = softmax((QK^T / sqrt(d_k)) + M) V
M is the visibility rule. That is where the layouts differ.
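
To make the role of M concrete, here is a minimal pure-Python sketch of the rule. The function names and the toy Q, K, V values are assumptions for illustration; real models use batched tensor libraries, not nested lists.

```python
import math

def softmax(row):
    peak = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V, M):
    """softmax(Q K^T / sqrt(d_k) + M) V for small list-of-lists matrices."""
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        logits = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + M[i][j]
            for j, k in enumerate(K)
        ]
        weights = softmax(logits)
        out.append([
            sum(w * v[c] for w, v in zip(weights, V))
            for c in range(len(V[0]))
        ])
    return out

# Toy inputs (made up): zero queries give equal scores, so only M shapes the result.
Q = [[0.0, 0.0], [0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]

full = attention(Q, K, V, M=[[0.0, 0.0], [0.0, 0.0]])
causal = attention(Q, K, V, M=[[0.0, float("-inf")], [0.0, 0.0]])
print(full)    # both rows average both value vectors
print(causal)  # row 0 can only read value 0
```

Same Q, K, V in both calls. Only M changed, and the first row's output changed with it.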

Encoder-only self-attention

M = 0
No future blocking. No special source-target split. Every token may consult every other token.

Decoder-only self-attention

M[i, j] = 0      if j <= i
M[i, j] = -inf   if j > i
That is the causal mask. Lower triangle allowed. Upper triangle blocked.
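
A small sketch of building that mask for any length, assuming 0-based positions and a plain list-of-lists layout. The helper name `causal_mask` is hypothetical.

```python
NEG_INF = float("-inf")

def causal_mask(n):
    """M[i][j] = 0 if j <= i, else -inf."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]

mask = causal_mask(4)
for row in mask:
    # Print "0" for allowed and "-inf" for blocked so the triangle is easy to see.
    print(["0" if m == 0.0 else "-inf" for m in row])
```

Each printed row allows one more position than the row above it, which is the lower triangle described in the text.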

Encoder-decoder

The decoder has two social benches. First, masked self-attention on the target prefix. Then cross-attention into encoder outputs.

CrossAttention(Q_dec, K_enc, V_enc)
= softmax(Q_dec K_enc^T / sqrt(d_k)) V_enc
Decoder queries come from target tokens. Encoder keys and values come from source tokens. So yes, the third layout is just one more window. That extra window changes the job completely.
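
The shape of that extra window is worth seeing once. The sketch below computes only the cross-attention weights, under the assumption of made-up 2-dimensional queries and keys; the function name is hypothetical.

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_weights(Q_dec, K_enc):
    """One row per target position, one column per source position."""
    d_k = len(K_enc[0])
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K_enc])
        for q in Q_dec
    ]

Q_dec = [[1.0, 0.0], [0.0, 1.0]]              # 2 target positions (toy values)
K_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions (toy values)

W = cross_attention_weights(Q_dec, K_enc)
print(len(W), len(W[0]))  # target length, source length
```

No causal mask is applied here. Every target position gets a nonzero weight on every source position, because the source is already fully known.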


One sentence, three layouts

Use one sentence everywhere:

The bank will close at five.
Now process it three ways.

Encoder-only

The token bank sees close and five. So it can infer "financial institution" instead of "river edge". The final vectors are used for understanding tasks like classification or embeddings.

Decoder-only

At token close, the model has only seen:

[The] [bank] [will] [close]
It has not seen at five yet. So it must predict the future honestly from left context. That is exactly the training game needed for generation.

Encoder-decoder

The encoder reads the full source sentence. Then the decoder may generate a target like:

Bank closes at five.
It could equally be a translation or a summary. The source is fully visible to the encoder. The target is still generated one token at a time. Same text. Three different jobs.
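
"One token at a time" is just a loop. The sketch below assumes a hypothetical lookup table in place of a trained model, plus made-up `<bos>` and `<eos>` markers; it only illustrates the control flow of autoregressive writing.

```python
# Hypothetical stand-in for a trained model: maps the last token to the next one.
TOY_NEXT = {
    "<bos>": "Bank",
    "Bank": "closes",
    "closes": "at",
    "at": "five",
    "five": "<eos>",
}

def generate(next_token_fn, max_len=10):
    prefix = ["<bos>"]
    while len(prefix) < max_len:
        token = next_token_fn(prefix)  # sees only the prefix, never the future
        if token == "<eos>":
            break
        prefix.append(token)
    return prefix[1:]                  # drop the start marker

target = generate(lambda prefix: TOY_NEXT[prefix[-1]])
print(target)
```

A decoder-only model and the decoder half of an encoder-decoder both run this kind of loop. The difference is what the next-token function may consult besides the prefix.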


Worked numerical example — encoder-only

Take three source tokens:

t1 = bank
t2 = close
t3 = five
Suppose the query for bank produces these scores:
scores(bank -> all) = [1.0, 2.0, 2.5]
Because this is encoder-only, all three are legal. Softmax over all three gives:
exp = [2.72, 7.39, 12.18]
sum = 22.29
weights ≈ [0.12, 0.33, 0.55]
ASCII picture:
bank query
  +--> bank   0.12
  +--> close  0.33
  +--> five   0.55
So the representation for bank leans heavily on right context. That is why encoder-only models are strong for understanding.
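
The arithmetic above can be reproduced in a few lines. A minimal check using only the standard library:

```python
import math

scores = [1.0, 2.0, 2.5]                  # bank -> [bank, close, five]
exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

print([round(e, 2) for e in exps])
print(round(total, 2))
print([round(w, 2) for w in weights])
```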


Worked numerical example — decoder-only

Use the same order. Now imagine position 2, token close. Its raw scores are:

scores(close -> [bank, close, five]) = [1.0, 2.0, 4.0]
But five is in the future. So masking changes the logits to:
[1.0, 2.0, -inf]
Softmax now becomes:
exp = [2.72, 7.39, 0]
sum = 10.11
weights ≈ [0.27, 0.73, 0.00]
ASCII picture:
close query
  +--> bank   0.27
  +--> close  0.73
  X--> five   blocked
See. The model cannot peek at the answer token. So it learns genuine next-token prediction.
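
It helps to see what the mask prevented. The sketch below runs softmax on the same raw scores with and without the causal mask; the `allowed` list is an assumption that encodes "position 2 may see positions 1 and 2".

```python
import math

def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

raw = [1.0, 2.0, 4.0]            # close -> [bank, close, five]
allowed = [True, True, False]    # five is in the future

masked = [s if ok else float("-inf") for s, ok in zip(raw, allowed)]

unmasked_weights = softmax(raw)
masked_weights = softmax(masked)

print([round(w, 2) for w in unmasked_weights])
print([round(w, 2) for w in masked_weights])
```

Without the mask, most of the weight would land on five, the very token the model is supposed to predict. With the mask, that weight is exactly zero and the rest is renormalized over the visible prefix.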


Worked numerical example — encoder-decoder

Source side:

[The] [bank] [will] [close] [at] [five]
Target side we want:
[Bank] [closes] [at] [five]
Suppose the decoder has just written closes and now sits at that position, ready to predict at. Masked self-attention reads the target prefix:
[Bank] [closes]
Then cross-attention looks into encoded source states. Toy cross-attention scores are:
[0.2, 1.5, 0.8, 2.4, 1.0, 0.6]
Approximate softmax weights:
≈ [0.05, 0.19, 0.09, 0.47, 0.12, 0.08]
ASCII picture:
closes query
  +--> The    0.05
  +--> bank   0.19
  +--> will   0.09
  +--> close  0.47
  +--> at     0.12
  +--> five   0.08
The decoder is writing target text. But it is reading a fully processed source memory. That is why translation and summarization fit this layout so naturally.
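
Toy cross-attention weights like these are easy to recompute from the scores. A minimal sketch using only the standard library, with the source tokens and scores taken from the example:

```python
import math

source = ["The", "bank", "will", "close", "at", "five"]
scores = [0.2, 1.5, 0.8, 2.4, 1.0, 0.6]   # closes query -> each source token

exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]

for token, w in zip(source, weights):
    print(f"{token:<6}{w:.2f}")

# The source token the decoder leans on most.
top_token = max(zip(weights, source))[1]
print(top_token)
```

Note that there is no mask in this softmax. All six source tokens compete for weight, including the ones to the right of close.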


Why decoder-only dominates modern LLMs

First, the objective scales cleanly. Next-token prediction works on web text, code, chat logs, tool traces, and synthetic rollouts. One pretraining recipe covers many product surfaces.

Second, the architecture is simpler. One stack. One cache story. One deployment loop. No separate encoder tower and decoder tower.

Third, product demand favored open-ended generation. Users want answers, drafts, code, plans, edits, tool calls, and follow-up turns. Decoder-only is built for continuation.

Fourth, fine-tuning stays aligned with pretraining. Instruction tuning still teaches the model to continue a prompt. Preference training still shapes continuations. Same causal interface throughout.

Fifth, every parameter can be spent on the generator itself. An encoder-decoder splits budget across reader, writer, and cross-attention. A decoder-only model spends almost all capacity on the thing users see.

So yes, decoder-only won the general assistant race. Not because the other layouts are obsolete. Because generation became the dominant product requirement.

When encoder-decoder is still better

Some tasks have a clean source and a clean target. There the split reader-writer design is a strength. Typical wins:

  • translation
  • summarization
  • speech-to-text
  • document rewriting
  • data-to-text generation

Why can it help? The encoder can read the full source before the decoder writes anything. Cross-attention gives the decoder direct access to that full source memory. The target-side mask still keeps generation orderly.

So what to do? If the task is broad open-ended continuation, default to decoder-only. If the task is explicit source-to-target transformation, encoder-decoder may still be the cleaner choice.


Parameter sharing and scaling notes

These layouts also differ in how parameters get allocated. Encoder-only spends most capacity on bidirectional understanding blocks. Decoder-only spends most capacity on causal generation blocks. Encoder-decoder splits parameters across two towers and adds cross-attention projections.

Sharing can reduce the cost. Common choices are:

  • tying token embeddings with the output softmax
  • sharing source and target embeddings when vocabularies match
  • using repeated layers only in special designs like ALBERT-style models

T5 is a good example to remember. It shares token embeddings. But encoder blocks and decoder blocks still serve different jobs. Full weight tying between both towers is uncommon.

The trade-off is simple. Sharing saves parameters. Specialization usually gives better behavior. Large production models normally prefer specialization unless memory budgets force otherwise.
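
To put a rough number on what embedding tying saves, here is a back-of-the-envelope sketch. The vocabulary size and model width are hypothetical, chosen only to make the arithmetic concrete, and the count ignores every other part of the model.

```python
def embedding_params(vocab_size, d_model, tied):
    """Parameters in the input embedding plus the output projection."""
    per_matrix = vocab_size * d_model
    return per_matrix if tied else 2 * per_matrix

vocab_size, d_model = 32_000, 512   # hypothetical sizes, not taken from any specific model

untied = embedding_params(vocab_size, d_model, tied=False)
tied = embedding_params(vocab_size, d_model, tied=True)

print(untied, tied, untied - tied)
```

Under these assumed sizes, tying removes one full vocabulary-by-width matrix. Whether that saving matters depends on how large the rest of the model is.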


Where this lives in the wild

  • BERT and related encoder-only models remain strong for search ranking, classification, tagging, and embedding generation because full bidirectional context helps understanding.
  • GPT-style assistants from OpenAI are decoder-only because the shipped product is always prompt continuation over a growing conversation.
  • Claude-style assistants from Anthropic use the same decoder-only pattern for chat, code, and agentic generation.
  • T5 and BART families use encoder-decoder layouts for summarization and source-to-target generation tasks.
  • Machine translation products like Google Translate fit encoder-decoder naturally because source and target are separate streams linked by cross-attention.

Interview Q&A

Q: Why is BERT called encoder-only?
A: Because its blocks are bidirectional self-attention over the input sequence, and the model outputs contextual representations rather than an autoregressive decoding stream.
Common wrong answer to avoid: "Because it reads only left to right." No. BERT reads both left and right context.

Q: Why did decoder-only architectures dominate modern LLMs?
A: Because one causal next-token objective scales across text, code, chat, and tool trajectories, while the same architecture works for pretraining, instruction tuning, and deployment. Simpler stack. Cleaner product fit.

Q: When would you still pick encoder-decoder?
A: When the problem is clearly source-to-target and the full source should be read before generation begins, as in translation, summarization, or speech-to-text.
Common wrong answer to avoid: "Never. Decoder-only is always better." Decoder-only is more general, not universally superior.

Q: What extra mechanism does encoder-decoder add?
A: Cross-attention. The decoder forms queries from the target prefix and reads keys and values from the encoded source states.


Apply now (5 min)

Take one sentence:

The server will restart at midnight.
Do three sketches.

1. Draw encoder-only and mark that every token sees every token.
2. Draw decoder-only and mark the causal mask on restart.
3. Draw encoder-decoder and add the cross-attention arrow from source memory to target decoder.

Then sketch from memory the three formulas:

  • full self-attention
  • masked self-attention
  • cross-attention

If you can explain in one minute why bank is easier to disambiguate in encoder-only but generation is easier in decoder-only, you own this file.


Bridge. Decoder-only training works only because the future-blocking rule keeps it honest. The next file turns that rule into a matrix you can inspect line by line. Read 10-causal-mask.md next.