05. Embeddings — index cards for meaning
~12 min read. A user types "refund" into a billing chatbot. The right document says "money back." With a good embedder, those two land in the same neighbourhood. With a bad embedder, the bot returns a page on tax refunds. Same word, wrong meaning, wrong answer. This file explains why the neighbourhood matters more than the word.
Built on the chunking chapter in 04-chunking-strategies.md. The index card from the ELI5 is exactly the embedding — a compact meaning signature attached to each chunk on the bookshelf.
1) The hook — when geometry helps and when it lies
Two real moments.
Where it helps. A Stripe support bot indexes its policy docs with text-embedding-3-small. A user types "can I get my money back?" The query embeds to a vector very close to a chunk that says "refunds are issued within 7 business days." The user never typed the word "refund." Geometry saved the day.
Where it fails. A finance research tool indexes earnings calls. A user asks "how is Apple doing?" The embedder, trained on general web text, places the query near a chunk about Apple Inc's iPhone shipments. Good. But the same embedder also places a chunk about apple orchard yields uncomfortably close to certain queries. The corpus has only one Apple. The embedder still drags fruit-meaning into the geometry. Polysemy did not collapse. It bled in.
Same model. Different corpus. Different outcome. Embeddings encode the world the model was trained on. Not the world your documents live in.
2) What an embedding actually is
An embedding is a vector. A fixed-length list of floating-point numbers. The model reads your text, runs it through a transformer, and emits one vector per chunk.
Typical sizes in 2026.
| Embedder | Default dim | Provider |
|---|---|---|
| OpenAI text-embedding-3-small | 1536 | OpenAI |
| OpenAI text-embedding-3-large | 3072 | OpenAI |
| Cohere embed-v3-english | 1024 | Cohere |
| Voyage AI voyage-3-large | 1024 | Voyage |
| Google text-embedding-005 | 768 | Google |
| Google gemini-embedding-001 | 3072 | Google |
| Mistral Embed | 1024 | Mistral |
| Jina embeddings v3 | 1024 | Jina |
| BAAI bge-m3 | 1024 | BAAI |
| BAAI bge-large-en-v1.5 | 1024 | BAAI |
| Intfloat e5-large-v2 | 1024 | Microsoft Research |
| NV-Embed-v2 | 4096 | NVIDIA |
| Stella v5 1.5B | 1024 | Open source |
| GTE-large-en-v1.5 | 1024 | Alibaba |
See the spread. 768 to 4096. That spread matters.
Think of each dimension as one axis on which the model learned to vary text. Some axes might track topic. Some track formality. Some track tense. Most are not interpretable by humans. The model decides what each dimension means during training. You inherit the geometry.
The card is not the book. Repeat that line. An embedding captures enough meaning to find candidates. It does not preserve every clause, every number, every negation. Numbers blur. Negation often blurs. Rare jargon blurs. That is why retrieval is "approximate" and reranking exists.
Mini-FAQ. "Why 768, 1024, 3072 dimensions and not round numbers like 1000?" Transformer hidden sizes are picked for hardware efficiency: multiples of 64 or 128 map cleanly onto GPU matrix units. 768 is BERT-base's hidden size. 1024 is BERT-large's. 1536 and 3072 are 2x and 4x 768. The numbers are engineering, not magic.
3) The librarian's index card — the metaphor that holds
Return to the library frame from the ELI5.
The chunk is the book page. The vector is the index card stapled to that page. The card carries a meaning-signature. Not the words. A pattern of numbers that places the page on a giant invisible map.
The bookshelf is the vector store. Pages with similar cards sit in the same neighbourhood. The librarian does not read every page. She looks at the user's question, makes a card for it the same way, and walks straight to the matching neighbourhood.
One rule, do not forget. Embeddings are for finding. Original text is for answering. If your prompt sends the vector to the LLM instead of the chunk text, you have lost the book and kept the card. Useless.
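The rule above can be sketched in a few lines. This is a minimal illustration, not a real vector store: the `STORE` list, the 3-dimensional vectors, and the `retrieve` and `build_prompt` helpers are all hypothetical names invented for this example. The point is only that each record keeps the vector and the text together, and that the prompt is built from the text.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Each record keeps the card (vector) AND the page (text).
# The vectors are made up for illustration; a real embedder would produce them.
STORE = [
    {"text": "Enterprise refunds require CFO approval within 10 business days.",
     "vector": [0.9, 0.1, 0.0]},
    {"text": "Our React SDK supports lazy loading of dashboard components.",
     "vector": [0.0, 0.1, 0.9]},
]

def retrieve(query_vector, k=1):
    """Use the vectors only to rank records."""
    ranked = sorted(STORE, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return ranked[:k]

def build_prompt(question, query_vector):
    """Send the original chunk text to the LLM, never the vector."""
    hits = retrieve(query_vector)
    context = "\n".join(hit["text"] for hit in hits)
    return f"Context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("Who approves enterprise refunds?", [0.8, 0.2, 0.1])
print(prompt)
```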
4) The running worked example — four product chunks
Hold these four chunks in your head for the rest of the chapter.
A: "Enterprise refunds require CFO approval within 10 business days."
B: "Finance sign-off is needed before an enterprise refund is processed."
C: "GPU training jobs are billed per minute with no refund window."
D: "Our React SDK supports lazy loading of dashboard components."
A and B paraphrase each other. C shares the word "refund" but lives in a different topic (billing, not policy). D is unrelated.
A good embedder produces something like this cosine matrix.
| Pair | Cosine | Reading |
|---|---|---|
| A vs B | 0.91 | tight paraphrase |
| A vs C | 0.34 | weak share — same word, different topic |
| A vs D | 0.04 | unrelated |
| B vs C | 0.29 | weak share |
| B vs D | 0.05 | unrelated |
| C vs D | 0.11 | unrelated |
See the structure. The model has learned that "CFO approval" and "finance sign-off" are near-synonyms. It has learned that the word "refund" alone is not enough to pull C close to A. Topic dominates the vocabulary overlap.
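For readers who want to see how such a matrix is computed, here is a small sketch. The 3-dimensional `toy_vectors` are invented so that the pattern resembles the table; they are not output from any real embedder, and the exact scores will differ from the table above. Only the ordering (A-B highest, A-D lowest) is the point.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up 3-d vectors standing in for real embeddings of chunks A-D.
toy_vectors = {
    "A": [0.90, 0.30, 0.00],
    "B": [0.85, 0.40, 0.05],
    "C": [0.30, 0.90, 0.10],
    "D": [0.00, 0.10, 1.00],
}

# One score per unordered pair, in the same order as the table rows.
pairwise = {
    (left, right): cosine(toy_vectors[left], toy_vectors[right])
    for left, right in combinations(toy_vectors, 2)
}

for (left, right), score in pairwise.items():
    print(f"{left} vs {right}: {score:.2f}")
```

With a real embedder you would replace `toy_vectors` with the model's output and keep the rest unchanged.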
Plot the cluster centres in 2D (the model uses 1024 dims, but you and I cannot draw 1024).
[Figure: meaning-space, 2D projection. Vertical axis runs from casual (bottom) to formal (top); horizontal axis runs from policy/finance (left) to engineering (right). From top to bottom: A and B sit side by side (refund policy cluster), C sits lower (billing-rate cluster), and D sits lowest (frontend SDK).]
A and B are tight. C is in a neighbouring region — same broad domain (money, products), different sub-topic. D is far. Now imagine the user query: "who has to sign off on a refund for an enterprise customer?" Its embedding lands near the A-B pair. The retriever returns A and B. C and D never enter the prompt. Done well.
5) What embeddings encode — and what they do not
They encode.
- Topic and domain: engineering vs finance vs medical.
- Paraphrase: "money back" ≈ "refund."
- Intent shape: "how do I cancel" sits near "cancellation policy."
- Sentiment and tone, at a coarse level.
- Multilingual alignment, if the model was trained for it: "facture" (French) sits near "invoice."
They struggle with.
- Exact numbers. "30 days" and "60 days" embed almost identically. The model sees "duration token in policy context," not the arithmetic.
- Negation. "Refunds allowed" and "refunds not allowed" can cosine at 0.85 or higher. The "not" is one token among many.
- Rare proper nouns. A new internal product code like "Project Halberd" is just noise to a general embedder. No training signal.
- Polysemy. "Apple" the company vs "apple" the fruit. A generic embedder may place both in a fuzzy middle. A finance-tuned embedder pulls the company meaning into sharper focus.
- Numeric reasoning. "Q3 grew 12%" and "Q3 shrank 12%" embed close. Direction is fragile.
Mini-FAQ. "If embeddings blur numbers and negation, why use them?" Because raw keyword match misses paraphrase, which is most of real user language. The right play is hybrid retrieval — BM25 plus dense embedding, scores fused with RRF. Embeddings cover paraphrase. BM25 covers exact terms and numbers. Together they cover both.
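A minimal sketch of the fusion step, assuming you already have one ranked list of chunk ids from BM25 and one from the dense retriever. Reciprocal Rank Fusion scores each id as the sum of 1 / (k + rank) across lists. The constant `k=60` is the value conventionally used with RRF, and the two input rankings here are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: sum 1 / (k + rank) over every list an id appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical rankings for a refund sign-off query over chunks A-D.
bm25_ranking = ["C", "A", "B"]    # keyword overlap on "refund" pulls C up
dense_ranking = ["B", "A", "D"]   # paraphrase match pulls B up

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

RRF uses only ranks, not raw scores, so the BM25 and cosine scales never need to be calibrated against each other. Chunks that both retrievers agree on (A and B here) rise above chunks that only one retriever liked.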
6) The failure mode you must never cause — model mismatch
Here is the rule that ends careers if broken.
The embedder that indexes your documents must be the same embedder that embeds the query at retrieval time. Same model. Same version. Same dimension. Same pooling.
Why. Because each model learns its own geometry. OpenAI's text-embedding-3-large (3072 dims) and Cohere's embed-v3 (1024 dims) do not even share a dimension, so most stores will reject the mix outright. The dangerous case is two models whose dimensions happen to match. Their vectors still live in different spaces, and cosine similarity across spaces is meaningless. You will get numbers. They will look reasonable. They will be garbage.
[Figure: two side-by-side panels, "text-embedding-3-large space" and "cohere embed-v3 space". Each plots the chunks A, B, C, D, but the layouts do not line up. These two spaces are not rotations of each other. They are unrelated. Cosine between vectors from different spaces = noise.]
Embed once with one model. That is the rule. When you upgrade the embedder, you re-embed the entire corpus. There is no shortcut.
Production stacks that broke this rule and paid for it: teams that switched from ada-002 to text-embedding-3-small midway and forgot to re-index (both emit 1536-dim vectors, so the store accepted the mix without complaint). Teams that A/B tested two embedders against the same vector store. Teams that quietly upgraded a Cohere model version. Each ends up with a retrieval system that looks fine in unit tests and drifts in production.
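One way to make the rule hard to break is to store the embedder's identity next to the index and compare it on every query path. The sketch below is an assumption about how you might do that: `EmbedderSpec`, `EmbedderMismatch`, and `assert_same_embedder` are hypothetical names, and the version strings are placeholders, not real provider version identifiers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbedderSpec:
    """Everything that must match between index time and query time."""
    model: str
    version: str
    dim: int
    pooling: str

class EmbedderMismatch(ValueError):
    """Raised when a query would be embedded with a different model than the index."""

def assert_same_embedder(index_spec, query_spec):
    # Dataclass equality compares every field, not just the dimension.
    if index_spec != query_spec:
        raise EmbedderMismatch(
            f"index built with {index_spec}, query uses {query_spec}"
        )

# Placeholder specs: same dimension, different model.
index_spec = EmbedderSpec("text-embedding-ada-002", "v-placeholder", 1536, "provider-default")
query_spec = EmbedderSpec("text-embedding-3-small", "v-placeholder", 1536, "provider-default")

try:
    assert_same_embedder(index_spec, query_spec)
except EmbedderMismatch as err:
    print("refusing to query:", err)
```

A dimension check alone would have passed here, which is why the comparison covers the model name and version as well.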
Mini-FAQ. "What does normalizing the vector do, and when?" Normalization scales the vector to unit length (L2 norm = 1). After normalization, dot product equals cosine similarity. Stores configured for a cosine metric (Pinecone, Qdrant, pgvector with vector_cosine_ops) generally account for vector length themselves, so normalization matters most when the index uses dot product (inner product). Some models (OpenAI's) return normalized vectors by default. Some (older BGE checkpoints) do not. Check the model card. If you use dot product on un-normalized vectors, chunks whose vectors have larger norms get an unfair advantage: bigger vector, bigger dot product.
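The effect is easy to check by hand. In this sketch the vectors are invented: `long_doc` points in exactly the same direction as `short_doc` but is three times as long, so cosine treats them identically while raw dot product does not.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    """Scale v to unit length (L2 norm = 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

query = [2.0, 1.0, 2.0]
short_doc = [1.0, 2.0, 2.0]
long_doc = [3.0, 6.0, 6.0]   # same direction as short_doc, 3x the norm

# Raw dot product rewards the larger norm.
print("raw dot:", dot(query, short_doc), dot(query, long_doc))

# Cosine ignores the norm, so both documents score the same.
print("cosine: ", cosine(query, short_doc), cosine(query, long_doc))

# After normalization, dot product and cosine agree.
print("unit dot:", dot(l2_normalize(query), l2_normalize(long_doc)))
```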
7) Predict the cosine pattern before reading the strengths and weaknesses
Stop. Before reading further, predict three things.
1. Will reducing 3072 dims to 256 dims preserve retrieval quality for our four example chunks A, B, C, D?
2. If you index docs in English with a multilingual embedder and query in French, will it work?
3. If you change from text-embedding-3-small to text-embedding-3-large, can you keep the same vector store?
Write your guesses. Then continue.
8) Dimension downcasting — Matryoshka and the cliff
Similarity-search compute grows linearly with dimension. Storage too. A 3072-dim vector at fp32 is 12 KB. A million chunks at 3072-dim fp32 = 12 GB before any index overhead. So teams downcast.
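The storage arithmetic is simple enough to keep as a helper. This sketch counts raw vector bytes only, uses decimal gigabytes (1 GB = 1e9 bytes), and ignores index overhead and metadata; `raw_vector_storage_gb` is a hypothetical name.

```python
def raw_vector_storage_gb(num_vectors, dim, bytes_per_value=4):
    """Raw vector storage in decimal GB. fp32 is 4 bytes per value, fp16 is 2."""
    return num_vectors * dim * bytes_per_value / 1e9

for dim in (3072, 1024, 256, 128):
    size = raw_vector_storage_gb(1_000_000, dim)
    print(f"{dim:>4} dims, 1M chunks, fp32: {size:.2f} GB")
```

Passing `bytes_per_value=2` shows what fp16 storage would cost, which halves every figure.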
Two approaches.
Naive truncation. Take the first 256 of the 3072 dimensions. Sometimes works, often fails — the model never promised the first 256 dims were the important ones.
Matryoshka Representation Learning (MRL). Train the model so that the first 256 dims are themselves a good embedding. Then the first 512. Then 1024. Each prefix is usable. OpenAI's text-embedding-3 family supports MRL: pass dimensions=256 and you get a shortened 256-dim vector whose prefix was trained to stand on its own, not an arbitrary slice. Voyage, Jina v3, and Nomic also ship MRL variants.
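Mechanically, both approaches perform the same operation on a stored vector: keep a prefix, then rescale it to unit length. The sketch below shows that operation; `truncate_and_renormalize` is a hypothetical helper. The code cannot tell whether the prefix is meaningful. That depends entirely on whether the model was trained with MRL, so treat this as a sketch of the mechanics, not a guarantee of quality.

```python
import math

def truncate_and_renormalize(vector, dim):
    """Keep the first `dim` values and rescale the prefix to unit L2 norm."""
    if dim <= 0 or dim > len(vector):
        raise ValueError(f"dim must be between 1 and {len(vector)}")
    prefix = vector[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        raise ValueError("prefix is all zeros; cannot normalize")
    return [x / norm for x in prefix]

# A made-up 8-d unit vector standing in for a full-size embedding.
full = [0.5, 0.5, 0.5, 0.5, 0.0, 0.0, 0.0, 0.0]
short = truncate_and_renormalize(full, 2)
print(short)
```

The renormalization step matters: a prefix of a unit vector is generally shorter than unit length, so skipping it would reintroduce the norm bias that normalization removes.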
| Setting | recall@10 | Storage / 1M docs (fp32) |
|---|---|---|
| 3072-dim full | 0.78 (baseline) | 12 GB |
| 1024-dim MRL | 0.76 | 4 GB |
| 256-dim MRL | 0.71 | 1 GB |
| 128-dim MRL | 0.62 | 0.5 GB |
Numbers above are illustrative for MS MARCO-style benchmarks. Real numbers depend on your corpus and your queries. Always measure on your data before committing.
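Measuring on your own data needs only a small labeled set: for each test query, the ids the retriever returned and the ids a human marked as relevant. Here is a minimal sketch of recall@k under that assumption; the function names and the two example runs are hypothetical.

```python
def recall_at_k(retrieved, relevant, k=10):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        raise ValueError("need at least one relevant id per query")
    top_k = set(retrieved[:k])
    return len(top_k & set(relevant)) / len(relevant)

def mean_recall_at_k(runs, k=10):
    """Average recall@k over (retrieved, relevant) pairs, one pair per query."""
    return sum(recall_at_k(retrieved, relevant, k) for retrieved, relevant in runs) / len(runs)

# Two hypothetical queries whose relevant chunks are A and B.
runs = [
    (["A", "B", "C"], {"A", "B"}),   # both relevant chunks in the top 2
    (["D", "C", "A"], {"A", "B"}),   # neither relevant chunk in the top 2
]
print(mean_recall_at_k(runs, k=2))
```

Run the same query set against each candidate dimension and compare the averages before deciding how far to shrink.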
For our running A-B-C-D example, dropping from 3072 to 256 dims keeps A-B close to each other. The cosine might fall from 0.91 to 0.88. Still tight enough. Drop to 64 dims and A-C may start crossing — the model loses the resolution to separate paraphrase from weak overlap. The cliff is real but it is far below where most teams sit.
9) Domain-tuned embedders — when generic is not enough
Generic embedders are trained on Common Crawl, Wikipedia, and assorted web text. They are excellent on general English. They are mediocre on jargon-heavy corpora.
Three families of fixes.
Off-the-shelf domain embedders.
- voyage-code-3 for code-heavy retrieval. Out-performs general embedders on GitHub issues, Stack Overflow, repo search.
- voyage-law-2 for legal documents. Trained on contracts and case law.
- voyage-finance-2 for SEC filings, earnings calls, analyst reports.
- SPECTER2 / SciNCL for academic papers.
- BioBERT and PubMedBERT for biomedical text.
Fine-tuned open-source embedders. Take bge-m3 or e5-large-v2 and fine-tune it on your own query-document pairs with a contrastive loss. Costs a few hundred dollars of GPU time. Often gains 10-20 percentage points of recall@10 on niche domains.
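"Contrastive loss" covers several variants. The sketch below shows one common choice, an InfoNCE-style loss with in-batch negatives: document i is the positive for query i, and every other document in the batch acts as a negative. It only computes the loss value on toy vectors. Real fine-tuning would use a framework with automatic differentiation, and the `temperature` value here is an assumption, not a recommendation.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(query_vecs, doc_vecs, temperature=0.05):
    """Mean cross-entropy over the batch; doc_vecs[i] is the positive for query_vecs[i].

    Assumes all vectors are already unit-normalized, so dot product is cosine.
    """
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [dot(query, doc) / temperature for doc in doc_vecs]
        # Numerically stable log-sum-exp over the whole batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own document
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the wrong document

print("aligned loss:", info_nce_loss(queries, aligned_docs))
print("swapped loss:", info_nce_loss(queries, swapped_docs))
```

Training pushes the loss down, which pulls each query toward its paired document and away from the other documents in the batch.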
Late-interaction models. ColBERT v2 and its modern descendants (ColBERTv2.5, ConstBERT). Instead of one vector per chunk, store one vector per token, and compare token-by-token at query time. Slower. Sharper on jargon. Used in patent search, legal discovery, and some scientific search engines.
When does domain tuning pay off? Roughly when your corpus has more than 30% of tokens that the generic embedder rarely saw during pretraining. Medical, legal, scientific, internal product codes, code itself.
Mini-FAQ. "What is the difference between sentence embeddings and token embeddings?" A token embedding is the vector for one wordpiece — that lives inside the model. A sentence embedding is one vector for the whole input — usually produced by mean-pooling or CLS-pooling the token embeddings, then projecting. RAG retrieval uses sentence-level (or chunk-level) embeddings because you compare whole texts. ColBERT and other late-interaction models break the rule — they keep all token vectors and compare at the token level, which costs more storage and more compute but recovers fine-grained matching.
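The token-level comparison can be sketched as a max-similarity (MaxSim) sum: for each query token vector, take its best match among the document's token vectors, then add those maxima up. The 2-dimensional token vectors below are invented for illustration, and the sketch leaves out everything a real late-interaction system adds on top (compression, candidate generation, batching).

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: sum over query tokens of the best-matching doc token."""
    return sum(
        max(dot(q_vec, d_vec) for d_vec in doc_tokens)
        for q_vec in query_tokens
    )

# Made-up token vectors: the query has two distinct "concepts".
query_tokens = [[1.0, 0.0], [0.0, 1.0]]

# doc_a contains an exact match for each query token.
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
# doc_b only contains tokens that sit between the two concepts.
doc_b = [[0.7, 0.7], [0.7, 0.7]]

print("doc_a:", maxsim_score(query_tokens, doc_a))
print("doc_b:", maxsim_score(query_tokens, doc_b))
```

Storage grows with the number of tokens per chunk rather than the number of chunks, which is where the extra cost comes from.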
10) Numbers that matter — cost, latency, scale
All figures below are list prices as of late 2025 / early 2026. They drift. Confirm before quoting.
| Embedder | $/1M tokens | Latency p50 (per call) | Dims |
|---|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | 80-150 ms | 1536 |
| OpenAI text-embedding-3-large | $0.13 | 150-300 ms | 3072 |
| Cohere embed-v3-english | $0.10 | 100-200 ms | 1024 |
| Voyage voyage-3-large | $0.18 | 200-400 ms | 1024 |
| Google text-embedding-005 | $0.025 | 80-180 ms | 768 |
| Mistral Embed | $0.10 | 100-200 ms | 1024 |
| Jina embeddings v3 (API) | $0.018 | 100-250 ms | 1024 |
| Self-hosted bge-m3 on a T4 GPU | infrastructure only | 5-20 ms (batched) | 1024 |
| Self-hosted e5-large-v2 on A10 | infrastructure only | 4-15 ms (batched) | 1024 |
A few real-world budgets.
- 1M chunks at 500 tokens each = 500M tokens. Indexing with text-embedding-3-small: ~$10. With text-embedding-3-large: ~$65.
- 10,000 queries/day at 50 tokens each = 500K tokens/day = 15M/month. Query-time cost with text-embedding-3-small: ~$0.30/month. Latency dominates the user's experience, not cost.
- For latency-critical chat, self-hosted bge-m3 on a single A10G beats every API embedder on round-trip time. Used by teams at Notion's internal search, Vercel docs search, and many Pinecone customers.
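The budget arithmetic above can be reproduced with a short helper. The prices are the per-million-token figures from this chapter's table, which drift; `embedding_cost_usd` is a hypothetical name, and the 30-day month is an assumption.

```python
def embedding_cost_usd(num_texts, tokens_per_text, price_per_million_tokens):
    """Cost of embedding num_texts texts of tokens_per_text tokens each."""
    total_tokens = num_texts * tokens_per_text
    return total_tokens / 1_000_000 * price_per_million_tokens

# Indexing: 1M chunks at 500 tokens each.
print("index, 3-small:", embedding_cost_usd(1_000_000, 500, 0.02))
print("index, 3-large:", embedding_cost_usd(1_000_000, 500, 0.13))

# Queries: 10,000 per day at 50 tokens each, over an assumed 30-day month.
monthly_queries = 10_000 * 30
print("queries/month, 3-small:", embedding_cost_usd(monthly_queries, 50, 0.02))
```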
11) The embedder slot across shipped stacks
Embedders.
- OpenAI text-embedding-3-small / -large — the default for thousands of RAG stacks on the OpenAI ecosystem.
- Cohere embed-v3 — strong on English and multilingual; built into Cohere Coral and Azure AI.
- Voyage AI voyage-3 / voyage-code-3 / voyage-law-2 — leaderboard-topping on domain benchmarks; used by Anthropic-partner stacks.
- Google text-embedding-005 and gemini-embedding-001 — default in Vertex AI Vector Search and Google's Discovery Engine.
- Mistral Embed — Europe-hosted alternative for sovereignty-sensitive deployments.
- Jina embeddings v3 — long-context (8K) embedder used in Jina's neural search and many open-source stacks.
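- BAAI BGE family (bge-m3, bge-large-en, plus the companion reranker bge-reranker-v2-m3) — the open-source workhorse. Permissively licensed (MIT or Apache 2.0 depending on the checkpoint; check the model card). Used everywhere from Hugging Face Spaces to internal enterprise indexes.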
- E5 family (e5-large-v2, multilingual-e5-large) — Microsoft Research; strong baseline for self-hosted RAG.
- NV-Embed-v2 — NVIDIA's 7B-parameter embedder, top of MTEB for a stretch.
- Stella v5 / GTE-large / SFR-Embedding-Mistral — competitive open-source options on MTEB.
- ColBERT v2 / ColBERTv2.5 / ConstBERT — late-interaction models for precision-sensitive retrieval.
Products that depend on these embedders.
- Notion AI Q&A — workspace search grounded on embedded blocks.
- Pinecone Assistant — managed RAG using OpenAI or Cohere embedders by default.
- Weaviate — bring-your-own-embedder; built-in modules for OpenAI, Cohere, Hugging Face.
- Qdrant — open-source vector DB; embedder-agnostic.
- Vespa — search engine with first-class embedding fields; used by Spotify and Yahoo News.
- Elastic ELSER — Elastic's learned sparse embedder, an alternative to dense retrieval that produces keyword-like vectors and integrates with Elasticsearch BM25.
- pgvector on Postgres — embedder-agnostic; used by Supabase, Neon, and direct Postgres shops.
- MongoDB Atlas Vector Search — Atlas-native vector index; works with any embedder you supply.
- Azure AI Search — vector + semantic ranker; supports OpenAI and Cohere embedders out of the box.
- Vertex AI Vector Search — Google's managed vector DB; defaults to Google's own embedders.
- Perplexity AI — uses internal embedders tuned for web-page passage retrieval.
- Glean — enterprise SaaS search; internal embedder fine-tuned on permissioned corporate corpora.
- GitHub Copilot Chat — code-aware embedders over repository content.
- Cursor and Windsurf — codebase embeddings for editor-side retrieval; OpenAI and Voyage are common.
- Anthropic Claude Projects — user-supplied corpus embedded with Voyage by default.
- Intercom Fin and Zendesk AI — support-knowledge-base embeddings; many use Cohere or OpenAI.
- Hebbia, Harvey, Casetext — legal/finance vertical RAG; lean on Voyage-law-2, custom fine-tunes.
- LangChain and LlamaIndex — orchestration libraries that wrap every major embedder behind a common interface.
- Sentence-Transformers / SBERT — the open-source library that started the dense-retrieval era; still the easiest way to self-host.
The embedder is the load-bearing choice. Pick once, pick well.
12) Recall — embeddings cold, eight questions
- What is the one rule about indexing and querying with embedders?
- Name three things embeddings encode well, and three they blur.
- Why is "Apple Inc" vs "apple fruit" only sometimes a problem?
- What does Matryoshka give you that simple truncation does not?
- When does domain-tuning an embedder actually pay off?
- Why are dense embeddings not strictly better than BM25?
- What is the difference between sentence embeddings and token embeddings?
- Why does normalization matter, and which similarity metric does it serve?
13) Interview Q&A
Q1. What is an embedding, in one sentence?
A. A fixed-length numeric vector produced by a learned model that places semantically related texts at nearby points in vector space.
Common wrong answer to avoid: "A compressed version of the text that can be decoded back."
Q2. Why must the same embedder be used to index documents and to embed queries?
A. Because each model learns its own geometry; vectors from different models live in unrelated spaces and cosine similarity between them is noise.
Common wrong answer to avoid: "As long as the dimensions match, you can mix embedders."
Q3. Are dense embeddings strictly better than BM25? (Trap.)
A. No. Dense embeddings win on paraphrase, synonym, and intent. BM25 wins on rare entities, exact identifiers, and numbers. Production systems combine them with hybrid retrieval (RRF fusion). Anyone who claims dense is strictly better has not shipped a system that handles user queries with product codes or version numbers in them.
Common wrong answer to avoid: "Yes, dense beats sparse in every benchmark."
Q4. Why does dimension 256 sometimes work as well as 1536?
A. With Matryoshka-trained embedders, the first 256 dimensions are themselves a usable embedding. Truncating a non-Matryoshka model usually fails. Even with MRL, very low dimensions lose resolution in dense regions of the meaning map; the cliff appears somewhere between 64 and 128 dims for most general embedders.
Common wrong answer to avoid: "More dimensions are always better."
Q5. How would you handle "refunds allowed" vs "refunds not allowed" looking too similar?
A. Three levers. One, hybrid retrieval: BM25 can match the "not" token, provided your analyzer keeps it instead of stripping it as a stopword. Two, a cross-encoder reranker that reads query and chunk together. Three, structured metadata fields and filtering: if your policy chunks have a boolean refund_allowed flag, retrieval is exact, not semantic.
Common wrong answer to avoid: "Use a bigger embedder and it will figure it out."
Q6. When would you fine-tune an embedder instead of using OpenAI or Cohere off the shelf?
A. When your corpus is dominated by domain-specific tokens the general embedder rarely saw — legal contracts, biomedical text, internal product taxonomies. Fine-tuning a bge-m3 or e5-large on a few thousand query-chunk pairs typically gains 10-20 points of recall@10 on niche domains for a few hundred dollars of GPU time. Skip it for general English.
Common wrong answer to avoid: "Fine-tune the embedder for every project, just to be safe."
Q7. What's the difference between ColBERT-style late-interaction retrieval and standard dense retrieval?
A. Standard dense retrieval stores one vector per chunk and compares query vector to chunk vector. ColBERT stores one vector per token and compares token-by-token via a max-similarity operation at query time. Late interaction is sharper on fine-grained matching (e.g., patents, contracts) but costs 10-50x more storage and is slower at query time. Used where precision beats cost.
Common wrong answer to avoid: "ColBERT is just a bigger BERT."
Q8. You upgrade your embedder from text-embedding-3-small to text-embedding-3-large. What do you do to the existing vector store?
A. Re-embed and re-index the entire corpus. The two models produce vectors in different spaces and at different dimensions (1536 vs 3072). You cannot mix. The right pattern is: build a new index alongside the old, dual-write for a verification window, run side-by-side retrieval evaluation, then cut over. Never upgrade silently.
Common wrong answer to avoid: "Just embed the new docs with the new model — the old ones are fine."
14) Apply now (10 min)
Step 1 — model the example. Take the four chunks A, B, C, D from section 4. Predict the cosine matrix you would get from:
- text-embedding-3-small (1536 dims)
- text-embedding-3-small at 256 dims (MRL)
- bge-m3 (1024 dims, self-hosted)
Write the matrices side by side. Where do you expect A-B to drop first as dimensions shrink? Where do you expect C to start drifting toward A?
Step 2 — your turn, with your data. Pick three real chunks from a product you know — two on the same topic with different wording, one on an unrelated topic. Embed them with two different embedders. Compute pairwise cosines. Write down: which embedder separates topic from paraphrase more cleanly?
Step 3 — sketch from memory. Redraw the 2D meaning-space diagram from section 4. Place A, B, C, D. Then draw the mismatch picture — two unrelated spaces side by side, with a red line crossed through any attempt to compare vectors across them.
What you should remember
This chapter explained what an embedding actually is: a fixed-length numeric vector produced by a learned model that places semantically related text at nearby points in vector space. Each chunk on the bookshelf carries one of these vectors as its index card, and the librarian finds neighbours by comparing query vector to chunk vector. The whole game of dense retrieval is making this geometry honest for your corpus.
You learned the one absolute rule — index and query with the same embedder, because two models produce vectors in different geometries and cosine between them is noise. You learned what embeddings encode well (paraphrase, intent, topic) and what they blur (negation, exact identifiers, rare entities). You learned why dense is not strictly better than BM25, why hybrid retrieval is the production default, and why Matryoshka-trained models let you trade dimensions for cost without retraining.
Carry this diagnostic forward: when retrieval misses on identifiers, version numbers, or "not allowed" phrasings, do not reach for a bigger embedder. Add BM25 to the stack and fuse the scores. On real query mixes, the cheapest dense embedder plus BM25 often beats the most expensive dense embedder alone.
Remember:
- Same embedder for index and query. Mixing geometries silently breaks retrieval.
- Dense wins on paraphrase, sparse wins on rare tokens and exact match. Hybrid wins in production.
- Negation is a known weak spot. Use BM25, a cross-encoder, or structured filters when polarity matters.
- Matryoshka-trained models let you truncate; non-MRL models do not. Test the cliff before committing to a smaller dim.
- Upgrading an embedder means re-indexing the entire corpus. There is no silent upgrade.
Bridge. Embeddings exist. Each chunk has its index card. But how does the librarian decide which cards are "close"? Cosine? Dot product? Euclidean? And once you have a metric, how do you pick the embedder that produces the best geometry for your corpus? The next file pins that down.