04. Embeddings — the badge board turns IDs into geometry

The splitter gives IDs. Good. But IDs alone know nothing. This is where the badge board comes in.

Built on the ELI5 in 00-eli5.md. The badge board — the lookup table from token ID to learned vector — is what turns discrete pieces into geometry.


Mental picture

Imagine a large office cabinet. Each drawer has a badge number outside. Inside each drawer sits a learned card with several numeric slots. The badge number is just an address. Drawer 582 is not semantically close to drawer 583 merely because the numbers are adjacent. See. IDs are labels. Vectors are meaning-bearing coordinates. So when the splitter outputs token IDs, the model is not done. It still needs the badge board. The badge board says, "For ID 3, fetch row 3." That row is the embedding vector.

token text -> token ID -> row lookup -> dense vector

Simple, no? The ID is like a library index number. The vector is the actual card you read.

Formula first

Let the embedding table be:

E shape = [V, d_model]

Where:

V = vocabulary size
d_model = embedding width

If token ID is i, then its embedding is:

x_i = E[i]

That is just row lookup. No multiplication is required conceptually. Later layers consume the resulting vectors. If a sequence of IDs is [i1, i2, i3], the output is:

[E[i1], E[i2], E[i3]]

So the badge board converts a list of integers into a matrix of vectors.
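A minimal NumPy sketch of the same row lookup, mainly to check the shapes. The table values here are random placeholders (an assumption for illustration, not trained weights), and the sizes V = 6 and d_model = 4 are arbitrary.

```python
import numpy as np

V, d_model = 6, 4                    # vocabulary size, embedding width
rng = np.random.default_rng(0)       # fixed seed so the run is repeatable
E = rng.normal(size=(V, d_model))    # placeholder values, not trained weights

ids = np.array([1, 2, 3])            # a sequence of three token IDs
X = E[ids]                           # row lookup: one table row per ID

print(E.shape)   # (6, 4)
print(X.shape)   # (3, 4)

# Every ID must be a valid row index, 0 <= i < V.
try:
    E[V]
except IndexError as err:
    print("out of range:", err)
```

The output has one row per input ID, so a sequence of length T becomes a [T, d_model] matrix.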

Why IDs alone are not meaning

Suppose token IDs are these:

dog -> 17
cat -> 402
quantum -> 18

The numbers themselves say nothing useful. 17 and 18 look close numerically. But dog and quantum are not close semantically. Meanwhile dog and cat may be semantically related, even though 17 and 402 are far apart. So meaning cannot live in the integer value alone. It must live in learned vector positions.

Worked example — a tiny badge board

Take a toy vocabulary of size V = 6. Let d_model = 4. So the table shape is:

E in R^(6 x 4)

Use these rows.

ID   token      embedding row
0    [PAD]      [ 0.0,  0.0,  0.0,  0.0]
1    chat       [ 0.8,  0.1, -0.2,  0.4]
2    gpt        [ 0.7,  0.2, -0.1,  0.5]
3    price      [-0.4,  0.9,  0.3,  0.1]
4    token      [ 0.6,  0.0, -0.3,  0.6]
5    rupee      [-0.5,  1.0,  0.2,  0.0]

Now look up IDs:

[1, 2, 3, 4]

The output vectors are:

ID 1 -> [ 0.8,  0.1, -0.2,  0.4]
ID 2 -> [ 0.7,  0.2, -0.1,  0.5]
ID 3 -> [-0.4,  0.9,  0.3,  0.1]
ID 4 -> [ 0.6,  0.0, -0.3,  0.6]

Stacked together, the sequence matrix is:

[ 0.8,  0.1, -0.2,  0.4 ]
[ 0.7,  0.2, -0.1,  0.5 ]
[-0.4,  0.9,  0.3,  0.1 ]
[ 0.6,  0.0, -0.3,  0.6 ]

That matrix is what moves forward. Not the raw IDs.
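The toy lookup above can be reproduced in a few lines. This is a sketch using NumPy as a stand-in for whatever framework holds the real table; the values are the hand-written toy rows from this section, not trained weights.

```python
import numpy as np

tokens = ["[PAD]", "chat", "gpt", "price", "token", "rupee"]

# Toy badge board: row i is the embedding for token ID i.
E = np.array([
    [ 0.0, 0.0,  0.0, 0.0],   # 0 [PAD]
    [ 0.8, 0.1, -0.2, 0.4],   # 1 chat
    [ 0.7, 0.2, -0.1, 0.5],   # 2 gpt
    [-0.4, 0.9,  0.3, 0.1],   # 3 price
    [ 0.6, 0.0, -0.3, 0.6],   # 4 token
    [-0.5, 1.0,  0.2, 0.0],   # 5 rupee
])

ids = [1, 2, 3, 4]
X = E[ids]                    # stacked sequence matrix

for i, row in zip(ids, X):
    print(i, tokens[i], row)

print(X.shape)   # (4, 4)
```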

One-hot equivalence

You may also see embeddings described with one-hot vectors. That view is mathematically neat. Suppose ID 3 means price. Its one-hot vector in a six-token vocabulary is:

[0, 0, 0, 1, 0, 0]

Now multiply by E.

one_hot(3) @ E = E[3]

Why? Every row of E is multiplied by 0 except row 3, which is multiplied by 1. So the one-hot vector selects exactly one table row. See the equivalence. Row lookup and one-hot matrix multiplication produce the same result. The lookup description is just more practical, since it avoids multiplying by all those zeros.

Tiny numerical proof

Take the row for ID 2. One-hot vector:

[0, 0, 1, 0, 0, 0]

Multiply with the table. Only row 2 survives. So:

[0, 0, 1, 0, 0, 0] @ E = [0.7, 0.2, -0.1, 0.5]

Simple, no? The one-hot did not create meaning. It only selected a drawer. The drawer contents carried the information.
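The same check can be run numerically. A small sketch with the toy table; the helper name one_hot is hypothetical, written here only for illustration.

```python
import numpy as np

# Toy badge board from the worked example (hand-written, not trained).
E = np.array([
    [ 0.0, 0.0,  0.0, 0.0],
    [ 0.8, 0.1, -0.2, 0.4],
    [ 0.7, 0.2, -0.1, 0.5],
    [-0.4, 0.9,  0.3, 0.1],
    [ 0.6, 0.0, -0.3, 0.6],
    [-0.5, 1.0,  0.2, 0.0],
])

def one_hot(i, size):
    """Hypothetical helper: length-`size` vector with a 1 at position i."""
    v = np.zeros(size)
    v[i] = 1.0
    return v

v = one_hot(2, E.shape[0])     # [0, 0, 1, 0, 0, 0]
via_matmul = v @ E             # one-hot times table
via_lookup = E[2]              # plain row lookup

print(via_matmul)
print(np.allclose(via_matmul, via_lookup))   # True
```

Both paths give the same row. The matmul does V multiply-adds per output slot to get it; the lookup just reads the row.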

ASCII cabinet picture

+---------------- badge board ----------------+
| drawer 0 -> [PAD]   -> [ 0.0, 0.0, 0.0, 0.0]|
| drawer 1 -> chat    -> [ 0.8, 0.1,-0.2, 0.4]|
| drawer 2 -> gpt     -> [ 0.7, 0.2,-0.1, 0.5]|
| drawer 3 -> price   -> [-0.4, 0.9, 0.3, 0.1]|
| drawer 4 -> token   -> [ 0.6, 0.0,-0.3, 0.6]|
| drawer 5 -> rupee   -> [-0.5, 1.0, 0.2, 0.0]|
+---------------------------------------------+
                 ^
                 |
            token IDs point here

That upward arrow matters. IDs point. Vectors describe.

Why embeddings are learned

At the start of training, rows are usually initialized with small random values. They do not know language yet. During training, gradients update the table. Useful directions get strengthened. Unhelpful directions get corrected. So the drawer contents improve over time.

If two tokens behave similarly across many contexts, their vectors often move closer. If two tokens play very different roles, their vectors often move apart. This is why geometry becomes meaningful. Not because row numbers were clever. Because learning shaped the space.
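To see the mechanics of "gradients update the table", here is a rough sketch with a hand-derived gradient in NumPy. The loss is an assumption made up for illustration: a squared error pulling each looked-up row toward a fixed target, standing in for a real language-modelling loss. The targets, learning rate, and step count are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d_model = 6, 4
E = rng.normal(scale=0.02, size=(V, d_model))   # small random init
E_before = E.copy()

ids = np.array([1, 3, 1])                 # ID 1 appears twice in the sequence
targets = np.ones((len(ids), d_model))    # hypothetical targets for the toy loss

lr = 0.1
for _ in range(50):
    X = E[ids]                            # forward pass: row lookup
    grad_X = X - targets                  # gradient of 0.5 * ||X - targets||^2
    grad_E = np.zeros_like(E)
    np.add.at(grad_E, ids, grad_X)        # scatter gradients back, summing repeats
    E -= lr * grad_E                      # plain gradient descent step

untouched = [0, 2, 4, 5]
print(np.array_equal(E[untouched], E_before[untouched]))   # True
print(np.round(E[1], 3), np.round(E[3], 3))
```

In this toy setup only drawers 1 and 3 move, because only they were looked up. Drawer 1 moves faster because it received gradient from two positions. Real training setups can also touch other rows, for example through weight decay or a tied output layer.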

Nearby vectors and semantic neighborhoods

In a trained model, related tokens often land in nearby regions. chat and gpt may not become identical. But they can have similar values along many dimensions. price and rupee may also become closer than price and [PAD]. Do not make this too mystical. Nearby does not only mean synonym. It can also mean similar usage, syntax, or domain role. Still, the core intuition is right. The badge board turns lookup IDs into geometry that later layers can manipulate.
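One common way to measure "nearby" is cosine similarity. The sketch below applies it to the toy table from the worked example. Those rows were written by hand to look plausible, so this only illustrates the measurement, not what a trained model would produce. The helper names cosine and sim are hypothetical, and returning 0.0 for a zero vector such as [PAD] is a convention chosen here.

```python
import numpy as np

tokens = ["[PAD]", "chat", "gpt", "price", "token", "rupee"]
E = np.array([
    [ 0.0, 0.0,  0.0, 0.0],
    [ 0.8, 0.1, -0.2, 0.4],
    [ 0.7, 0.2, -0.1, 0.5],
    [-0.4, 0.9,  0.3, 0.1],
    [ 0.6, 0.0, -0.3, 0.6],
    [-0.5, 1.0,  0.2, 0.0],
])
idx = {t: i for i, t in enumerate(tokens)}

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0.0 or nb == 0.0:
        return 0.0
    return float(a @ b / (na * nb))

def sim(t1, t2):
    return cosine(E[idx[t1]], E[idx[t2]])

print(round(sim("chat", "gpt"), 2))      # 0.98
print(round(sim("price", "rupee"), 2))   # 0.99
print(round(sim("chat", "price"), 2))    # -0.26
print(sim("price", "[PAD]"))             # 0.0
```

Note that chat is ID 1 and rupee is ID 5, yet the similarity numbers follow the vectors, not the ID gaps.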

One sequence walk-through

Suppose the splitter outputs:

[1, 2, 5, 4]

This might correspond to:

chat | gpt | rupee | token

The badge board returns four rows. After that, the seat number is added. Then attention compares the resulting vectors. So embeddings are the first dense representation stage. Without them, attention would be operating on raw symbols with no geometry. That would not work.
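A rough sketch of why dense vectors make comparison possible. It looks up the four rows from the toy table and takes raw pairwise dot products. This is not real attention: it skips the seat number, the learned query and key projections, the scaling, and the softmax. It only shows that vectors, unlike bare IDs, can be compared at all.

```python
import numpy as np

tokens = ["[PAD]", "chat", "gpt", "price", "token", "rupee"]
E = np.array([
    [ 0.0, 0.0,  0.0, 0.0],
    [ 0.8, 0.1, -0.2, 0.4],
    [ 0.7, 0.2, -0.1, 0.5],
    [-0.4, 0.9,  0.3, 0.1],
    [ 0.6, 0.0, -0.3, 0.6],
    [-0.5, 1.0,  0.2, 0.0],
])

ids = [1, 2, 5, 4]                 # chat | gpt | rupee | token
X = E[ids]                         # four rows from the badge board

scores = X @ X.T                   # raw dot product between every pair of rows
print([tokens[i] for i in ids])
print(np.round(scores, 2))
print(scores.shape)                # (4, 4)
```

In the first row, chat scores higher against gpt than against rupee. With raw IDs there would be nothing comparable to compute.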

Common confusion to remove

Embeddings are not static dictionary definitions. They are trainable parameters. They are also not the final meaning of a token in context. The embedding for bank is one starting point. After attention, the contextual vector for bank can shift toward river or finance. So think of embeddings as the badge card you receive at entry. Contextual representations are the conversation that follows.

Where this lives in the wild

  • OpenAI ChatGPT: token IDs are looked up into dense vectors before any attention step begins.
  • GitHub Copilot: code tokens and symbols get embeddings before the model reasons across a file.
  • Google Translate and Gemini: multilingual tokens need shared geometry, not just integer labels.
  • Amazon and Shopify search ranking models: product text features often start as embedding lookups.
  • Spotify and YouTube recommenders: IDs for items, users, and contexts are often embedded into vector spaces.

Interview Q&A

Q: Why can't we feed token IDs directly into attention?
A: Because raw integers do not encode semantic geometry. Attention needs dense vectors it can compare meaningfully.
Common wrong answer to avoid: "Because transformers only accept floating-point numbers."

Q: What is the shape of an embedding table?
A: Usually [V, d_model], where V is vocabulary size and d_model is embedding width.

Q: What does one_hot @ E represent?
A: It is another way to express row lookup from the embedding table.
Common wrong answer to avoid: "It combines all rows to make a new token meaning."

Q: Are embeddings the final contextual meaning of a token?
A: No. They are the starting dense vectors. Attention and later layers contextualize them further.

Apply now (5 min)

  • Make a toy vocabulary of five tokens from your domain.
  • Choose d_model = 3.
  • Write a tiny embedding table by hand.
  • Pick a sequence of three IDs.
  • Look up the rows and stack them into a matrix.
  • Then show one token as a one-hot vector and verify one_hot @ E selects the same row.

Sketch from memory: draw the cabinet with drawer numbers on the left and vectors on the right.


Bridge. Now every token has a vector, but the model still does not know order. The next missing piece is the seat number. Read 05-positional-encoding.md.