11. The full pipeline — raw text to contextual vectors¶

Stop holding the pieces separately. Now hold the whole conveyor belt together.

Built on the ELI5 in 00-eli5.md. The splitter — raw text becoming reusable chunks — now connects to the badge board, seat number, and spotlight beam in one flow.

Mental model — one sentence walks through many stations¶

A model never eats raw text directly. The text goes through a pipeline first. Text enters. Vectors leave. Context reshapes those vectors along the way. See the whole belt first.

raw text
  -> the splitter
  -> token pieces
  -> token IDs
  -> the badge board
  -> base vectors
  +  the seat number
  -> position-aware vectors
  -> Q/K/V projections
  -> the spotlight beam
  -> the scorecard
  -> weighted sums
  -> concatenate heads
  -> output projection
  -> contextual vectors

That is the pipeline. Every stage changes representation.

Formula snapshot — the compact version¶

Write the full chain in symbols once.

text -> tokens -> ids
E = embedding(ids)
X = E + P
Q_i = X W_Q^(i)
K_i = X W_K^(i)
V_i = X W_V^(i)
head_i = softmax(Q_i K_i^T / sqrt(d_k)) V_i
Y = Concat(head_1, ..., head_h) W_O

If you remember only one line, remember this.

raw text -> ids -> embeddings + positions -> attention -> contextual vectors

Simple chain. Many moving parts.
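The chain above can be sketched in a few lines of NumPy. This is a minimal illustration, not a real model: the weights are random, the vocabulary size, maximum length, and IDs are arbitrary placeholders, and there is no masking, normalization, or feed-forward layer. The names `embedding`, `positions`, and `multi_head_attention` are made up for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    """X: (L, d_model); W_Q, W_K, W_V: (h, d_model, d_k); W_O: (h*d_k, d_model)."""
    d_k = W_Q.shape[-1]
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # per-head projections
        weights = softmax(Q @ K.T / np.sqrt(d_k))    # (L, L) scorecard
        heads.append(weights @ V)                    # weighted sums, (L, d_k)
    return np.concatenate(heads, axis=-1) @ W_O      # concat heads, project

rng = np.random.default_rng(0)
vocab_size, max_len, d_model, h = 1000, 16, 4, 2     # assumed toy sizes
d_k = d_model // h

embedding = rng.normal(size=(vocab_size, d_model))   # random stand-in table
positions = rng.normal(size=(max_len, d_model))      # random stand-in positions
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_O = rng.normal(size=(h * d_k, d_model))

ids = np.array([11, 305, 812, 772, 991])             # placeholder token IDs
E = embedding[ids]                                   # E = embedding(ids)
X = E + positions[: len(ids)]                        # X = E + P
Y = multi_head_attention(X, W_Q, W_K, W_V, W_O)      # contextual vectors
print(Y.shape)  # (5, 4)
```

One vector goes in per token and one vector comes out per token. Only the content of each vector changes.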

Worked example setup — `The tokenizer reduced cost`¶

Take this sentence:

The tokenizer reduced cost

Step 1 — the splitter¶

Suppose the splitter breaks it as:

The | token | izer | reduced | cost

So the model sees five tokens. Not four words. Five reusable pieces.
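To make the split concrete, here is a toy splitter that uses greedy longest-match against a hand-written vocabulary. It is only a sketch: real tokenizers learn their vocabulary from data and handle whitespace and unknown text differently. The helper names `split_word` and `split_text` and the five-entry `toy_vocab` are assumptions for this example.

```python
def split_word(word, vocab):
    """Greedily take the longest vocabulary piece at each position."""
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No piece matched; a real tokenizer would fall back to bytes or an unknown token.
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return pieces

def split_text(text, vocab):
    """Split on whitespace first, then split each word into pieces."""
    return [p for word in text.split() for p in split_word(word, vocab)]

toy_vocab = {"The", "token", "izer", "reduced", "cost"}  # hypothetical vocabulary
tokens = split_text("The tokenizer reduced cost", toy_vocab)
print(tokens)  # ['The', 'token', 'izer', 'reduced', 'cost']
```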

Step 2 — token IDs¶

Now map each piece to an ID.

The     -> 11
token   -> 305
izer    -> 812
reduced -> 772
cost    -> 991

So the input ID list is:

[11, 305, 812, 772, 991]

That is what actually enters the neural stack.
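The mapping is a plain dictionary lookup. The sketch below uses the toy IDs from this example; the dictionary names are made up, and a real vocabulary would hold tens of thousands of entries.

```python
# Toy vocabulary using the IDs from the worked example.
token_to_id = {"The": 11, "token": 305, "izer": 812, "reduced": 772, "cost": 991}
id_to_token = {i: t for t, i in token_to_id.items()}

tokens = ["The", "token", "izer", "reduced", "cost"]
ids = [token_to_id[t] for t in tokens]
print(ids)  # [11, 305, 812, 772, 991]

# The mapping is reversible, which is how generated IDs become text again.
decoded = [id_to_token[i] for i in ids]
print(decoded)  # ['The', 'token', 'izer', 'reduced', 'cost']
```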

Step 3 — the badge board lookup¶

Use a tiny toy embedding size of d_model = 4. The badge board returns one vector per ID.

E(The)     = [0.2, 0.1, 0.0, 0.3]
E(token)   = [0.8, 0.1, 0.4, 0.2]
E(izer)    = [0.7, 0.2, 0.5, 0.1]
E(reduced) = [0.1, 0.9, 0.3, 0.4]
E(cost)    = [0.0, 0.8, 0.6, 0.2]

ASCII view:

ID 11  -> drawer 11  -> [0.2, 0.1, 0.0, 0.3]
ID 305 -> drawer 305 -> [0.8, 0.1, 0.4, 0.2]
ID 812 -> drawer 812 -> [0.7, 0.2, 0.5, 0.1]
...

These are context-free vectors. They know the token identity. Not the sentence meaning yet.
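In code, the badge board is a matrix and the lookup is row indexing. A small NumPy sketch follows. The table size of 1000 rows is an assumption (just large enough for the toy IDs), and the unused rows are left at zero.

```python
import numpy as np

vocab_size, d_model = 1000, 4           # assumed table size, toy width
badge_board = np.zeros((vocab_size, d_model))

# Fill in only the five drawers used in the example.
badge_board[11]  = [0.2, 0.1, 0.0, 0.3]  # The
badge_board[305] = [0.8, 0.1, 0.4, 0.2]  # token
badge_board[812] = [0.7, 0.2, 0.5, 0.1]  # izer
badge_board[772] = [0.1, 0.9, 0.3, 0.4]  # reduced
badge_board[991] = [0.0, 0.8, 0.6, 0.2]  # cost

ids = np.array([11, 305, 812, 772, 991])
E = badge_board[ids]                     # one row per ID, no arithmetic involved
print(E.shape)  # (5, 4)
```

The same ID always returns the same row, whatever sentence it appears in. That is why these vectors are context-free.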

Step 4 — add the seat numbers¶

Now add positional vectors. Use toy position vectors:

P1 = [0.01, 0.00, 0.00, 0.00]
P2 = [0.00, 0.01, 0.00, 0.00]
P3 = [0.00, 0.00, 0.01, 0.00]
P4 = [0.00, 0.00, 0.00, 0.01]
P5 = [0.01, 0.01, 0.00, 0.00]

Then:

X1 = E(The)     + P1 = [0.21, 0.10, 0.00, 0.30]
X2 = E(token)   + P2 = [0.80, 0.11, 0.40, 0.20]
X3 = E(izer)    + P3 = [0.70, 0.20, 0.51, 0.10]
X4 = E(reduced) + P4 = [0.10, 0.90, 0.30, 0.41]
X5 = E(cost)    + P5 = [0.01, 0.81, 0.60, 0.20]

Now the vectors know identity plus position. That matters. token and izer sit in different seats.
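The addition is elementwise, row by row. A short NumPy check using the toy numbers above:

```python
import numpy as np

# Rows: The, token, izer, reduced, cost
E = np.array([
    [0.2, 0.1, 0.0, 0.3],
    [0.8, 0.1, 0.4, 0.2],
    [0.7, 0.2, 0.5, 0.1],
    [0.1, 0.9, 0.3, 0.4],
    [0.0, 0.8, 0.6, 0.2],
])

# Rows: seats 1 through 5
P = np.array([
    [0.01, 0.00, 0.00, 0.00],
    [0.00, 0.01, 0.00, 0.00],
    [0.00, 0.00, 0.01, 0.00],
    [0.00, 0.00, 0.00, 0.01],
    [0.01, 0.01, 0.00, 0.00],
])

X = E + P  # same shape in, same shape out
print(X.shape)  # (5, 4)
print(np.round(X, 2))
```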

Step 5 — Q, K, V projections per head¶

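Use h = 2 heads, so each head gets d_k = d_model / h = 4 / 2 = 2 in this toy example. Each head projects the same X with its own weight matrices. The numbers below are assumed for illustration rather than computed from a specific W_Q or W_K. For the query token reduced, suppose the projections are: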

head 1:
q1          = [0.9, 0.2]
k1(The)     = [0.2, 0.1]
k1(token)   = [0.7, 0.1]
k1(izer)    = [0.6, 0.2]
k1(reduced) = [0.8, 0.3]
k1(cost)    = [0.1, 0.9]

head 2:
q2          = [0.4, 0.8]
k2(The)     = [0.3, 0.1]
k2(token)   = [0.2, 0.6]
k2(izer)    = [0.1, 0.7]
k2(reduced) = [0.5, 0.6]
k2(cost)    = [0.7, 0.8]

Same token. Different question spaces.
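The projection step itself is one matrix multiply per head. The sketch below shows the shapes only: the weight matrices are random placeholders, so the resulting numbers will not match the hand-picked toy queries and keys in this step.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 4, 2
d_k = d_model // h

# Position-aware vectors from the previous step (rows: The, token, izer, reduced, cost).
X = np.array([
    [0.21, 0.10, 0.00, 0.30],
    [0.80, 0.11, 0.40, 0.20],
    [0.70, 0.20, 0.51, 0.10],
    [0.10, 0.90, 0.30, 0.41],
    [0.01, 0.81, 0.60, 0.20],
])

# One (d_model, d_k) matrix per head; random stand-ins for learned weights.
W_Q = rng.normal(size=(h, d_model, d_k))
W_K = rng.normal(size=(h, d_model, d_k))
W_V = rng.normal(size=(h, d_model, d_k))

# "ld,hdk->hlk": for every head, multiply the (L, d_model) input by that head's matrix.
Q = np.einsum("ld,hdk->hlk", X, W_Q)
K = np.einsum("ld,hdk->hlk", X, W_K)
V = np.einsum("ld,hdk->hlk", X, W_V)
print(Q.shape, K.shape, V.shape)  # (2, 5, 2) (2, 5, 2) (2, 5, 2)

q_reduced_head1 = Q[0, 3]  # query for "reduced" (row 3) in the first head
```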

Step 6 — the spotlight beam plus the scorecard¶

Now compute attention scores for query reduced.

Head 1 scores¶

Use dot products first.

q1·k1(The)     = 0.20
q1·k1(token)   = 0.65
q1·k1(izer)    = 0.58
q1·k1(reduced) = 0.78
q1·k1(cost)    = 0.27

Divide by sqrt(d_k) = sqrt(2) ≈ 1.41. Approximate scaled scores:

[0.14, 0.46, 0.41, 0.55, 0.19]

Softmax gives this scorecard:

head 1 weights for reduced:
The      0.16
token    0.22
izer     0.21
reduced  0.24
cost     0.17

Head 2 scores¶

The same steps give the head 2 dot products:

q2·k2(The)     = 0.20
q2·k2(token)   = 0.56
q2·k2(izer)    = 0.60
q2·k2(reduced) = 0.68
q2·k2(cost)    = 0.92

Divide by sqrt(2) again. Approximate scaled scores:

[0.14, 0.40, 0.42, 0.48, 0.65]

Softmax gives:

head 2 weights for reduced:
The      0.15
token    0.19
izer     0.20
reduced  0.21
cost     0.25

See the difference. Head 1 puts its largest weight on reduced itself. Head 2 leans more toward cost. That fits the phrase reduced cost.

Step 7 — weighted sum of values¶

Give each token a toy value vector per head.

head 1 values:
v1(The)     = [0.1, 0.0]
v1(token)   = [0.7, 0.2]
v1(izer)    = [0.6, 0.3]
v1(reduced) = [0.5, 0.8]
v1(cost)    = [0.2, 0.9]

Now weight and sum them.

o1(reduced)
= 0.16*[0.1,0.0]
+ 0.22*[0.7,0.2]
+ 0.21*[0.6,0.3]
+ 0.24*[0.5,0.8]
+ 0.17*[0.2,0.9]
≈ [0.45, 0.45]

For head 2, use:

v2(The)     = [0.0, 0.1]
v2(token)   = [0.2, 0.6]
v2(izer)    = [0.1, 0.7]
v2(reduced) = [0.6, 0.4]
v2(cost)    = [0.8, 0.5]

Weighted sum:

o2(reduced)
= 0.15*[0.0,0.1]
+ 0.19*[0.2,0.6]
+ 0.20*[0.1,0.7]
+ 0.21*[0.6,0.4]
+ 0.25*[0.8,0.5]
≈ [0.38, 0.48]

ASCII picture:

reduced
  +--> head 1 summary [0.45, 0.45]
  +--> head 2 summary [0.38, 0.48]

Step 8 — concatenate heads and apply the output projection¶

Concatenate the two head outputs.

[o1 || o2] = [0.45, 0.45, 0.38, 0.48]

Now apply the output projection. For a toy picture, let W_O be identity.

Y(reduced) = [0.45, 0.45, 0.38, 0.48]

This is now a contextual vector. It still represents reduced. But now it also carries clues from token, izer, and cost. That is the end-to-end flow.
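Steps 6 to 8 can be run end to end for the query token reduced. The sketch below transcribes the toy queries, keys, and values for both heads and uses an identity W_O, as in the text. The dictionary layout and variable names are choices made for this example only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

d_k = 2
# Per-head query for "reduced"; keys and values have rows The, token, izer, reduced, cost.
q = {1: np.array([0.9, 0.2]), 2: np.array([0.4, 0.8])}
K = {
    1: np.array([[0.2, 0.1], [0.7, 0.1], [0.6, 0.2], [0.8, 0.3], [0.1, 0.9]]),
    2: np.array([[0.3, 0.1], [0.2, 0.6], [0.1, 0.7], [0.5, 0.6], [0.7, 0.8]]),
}
V = {
    1: np.array([[0.1, 0.0], [0.7, 0.2], [0.6, 0.3], [0.5, 0.8], [0.2, 0.9]]),
    2: np.array([[0.0, 0.1], [0.2, 0.6], [0.1, 0.7], [0.6, 0.4], [0.8, 0.5]]),
}

head_outputs = []
for head in (1, 2):
    weights = softmax(K[head] @ q[head] / np.sqrt(d_k))  # scorecard row
    head_outputs.append(weights @ V[head])               # weighted sum of values

concat = np.concatenate(head_outputs)  # [o1 || o2], length h * d_k = 4
W_O = np.eye(4)                        # identity output projection (toy choice)
y_reduced = concat @ W_O

# Contextual vector for "reduced", rounded to two decimals for comparison by hand.
print(np.round(y_reduced, 2))
```

With a learned W_O instead of the identity, the last step would also mix information across heads rather than just placing them side by side.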

Production knobs — the pipeline gets expensive fast¶

Longer prompts cost quadratically¶

Attention score matrices grow like L x L. Double the prompt length. Roughly quadruple raw attention score work.
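A quick count makes the growth visible. The function below only counts score-matrix entries per layer. It is a back-of-the-envelope sketch and ignores everything else a model does. The name `attention_score_entries` is made up for this example.

```python
def attention_score_entries(seq_len, num_heads=1):
    """Number of query-key scores in one layer: one L x L matrix per head."""
    return num_heads * seq_len * seq_len

for seq_len in (1024, 2048, 4096):
    print(seq_len, attention_score_entries(seq_len))

ratio = attention_score_entries(2048) / attention_score_entries(1024)
print(ratio)  # 4.0
```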

KV cache during generation¶

At decode time, cache the keys and values of earlier tokens. Each step then computes the query, key, and value only for the newest token and attends over the cache. That avoids recomputing old keys and values.
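A single-head, single-layer sketch of the idea follows. The weights and inputs are random placeholders, and the names `k_cache` and `v_cache` are illustrative. At each step the new token's query, key, and value are computed, the key and value are appended to the cache, and the query attends over everything cached so far.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, d_k = 4, 2
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
X = rng.normal(size=(5, d_model))  # stand-in vectors, arriving one token per step

k_cache, v_cache, outputs = [], [], []
for x_t in X:
    q_t = x_t @ W_q              # only the newest token needs a query
    k_cache.append(x_t @ W_k)    # its key and value are computed once, then kept
    v_cache.append(x_t @ W_v)
    K = np.stack(k_cache)        # (t, d_k): all keys so far
    V = np.stack(v_cache)        # (t, d_k): all values so far
    weights = softmax(q_t @ K.T / np.sqrt(d_k))
    outputs.append(weights @ V)

# Reference: recompute every key and value from scratch for the last step.
K_full, V_full = X @ W_k, X @ W_v
reference = softmax((X[-1] @ W_q) @ K_full.T / np.sqrt(d_k)) @ V_full
print(np.allclose(outputs[-1], reference))  # True
```

The cached path and the from-scratch path give the same answer. The cache trades memory for skipped recomputation.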

Attention maps are informative, not full explanations¶

The scorecard shows where a head looked. It does not tell the whole causal story of the final output.

Position extension affects usable context¶

If you stretch position handling past the lengths the model was trained on without care, quality degrades. Longer context windows are not free magic.

Where this lives in the wild¶

GitHub Copilot turns raw code text into contextual vectors before predicting the next token.
OpenAI ChatGPT runs this same pipeline on prompts, tool traces, and generated text.
Anthropic Claude applies the chain over long business documents and chat history.
Google Gemini uses the pipeline across text-heavy prompts where position handling matters a lot.
Notion AI relies on contextual vectors to rewrite or summarize text from natural-language input.

Interview Q&A¶

Q: What changes between embeddings and contextual vectors?
A: Embeddings know token identity. Contextual vectors also encode who mattered nearby.
Common wrong answer to avoid: "Embeddings already contain the whole sentence meaning." No. Context comes later through attention.

Q: Why do we add positional information before attention?
A: Because attention alone does not know token order.

Q: Where does the quadratic cost appear?
A: In forming attention scores across many query-key pairs.
Common wrong answer to avoid: "Only the embedding lookup is expensive." Lookup is cheap. Pairwise attention is the heavy part.

Q: What is the shortest end-to-end summary?
A: Split text, map IDs to vectors, add positions, run attention, produce contextual vectors.

Apply now (5 min)¶

Take the sentence The tokenizer reduced cost. Write the token split from memory. Invent five token IDs. Draw a tiny badge board lookup for two tokens. Add toy seat numbers. Then choose one query token. Write one scorecard row and one weighted sum.

Sketch from memory: Draw the full conveyor belt from raw text to contextual vector.

Bridge. We have the clean pipeline with BPE. But BPE is not the only splitter — WordPiece and Unigram take different paths to the same goal. Read 12-wordpiece-unigram.md next.