03. Chunking trade-offs — the hidden costs of splitting a document

~11 min read. A vendor contract had one clause: "Payment is due within 30 days, except for orders flagged as enterprise pilots." The chunker split that sentence between two chunks. Retrieval pulled the first half. The chatbot told a customer their pilot order was overdue. The exception sat 14 tokens away on the next chunk, unseen. The clause was perfect. The split was the bug.

Built on 02-open-book-answering.md. The librarian, the bookshelf, the reading desk, the answer brief — same four placeholders. Chunking is how the librarian decides what page-spans get filed as a single "book" on the bookshelf.

Why chunking is the most under-respected step in RAG

Most newcomers think the magic is in embeddings or in the LLM. It is not. Chunking decides what the embedding even sees. A bad chunk gives a bad index card. A bad index card lands in the wrong bookshelf neighbourhood. The librarian then carries the wrong book to the reading desk. The model writes a confident, well-cited, wrong answer.

Chunking is the first irreversible decision in the pipeline. Embeddings can be swapped. Rerankers can be added. Prompts can be rewritten. But once you have chunked badly and indexed at scale, you re-index everything. That is hours of compute and dollars of embedding API calls. So get the trade-off right the first time.

There are five forces pulling on every chunking choice:

Precision — does each chunk contain mostly one idea?
Recall — will the relevant chunk be retrievable at all?
Context — does the chunk hold enough surrounding text to be self-explanatory?
Prompt budget — how much of the model's context window does the chunk eat?
Cost and latency — how many chunks must we embed, store, and rerank?

These five never align. Pulling one knob always disturbs another. That is the trade-off.

The running example — one paragraph, three chunkings

Same vendor contract. Same paragraph. We will use this throughout.

Section 4.2 — Refunds
Enterprise renewals are refundable only within 10 business days
of the renewal date. Requests require written CFO approval and
must be submitted via the billing portal. Refunds do not apply
after custom integration work has begun. Standard self-serve
plans follow a separate 30-day rule described in Section 4.5.

The question the user will ask: "When can an enterprise customer get a refund?"

The correct answer needs three facts glued together — the 10-day window, the CFO approval, and the integration-work exception. Lose any one and the answer is wrong but sounds right. Watch what happens at three chunk sizes.

Chunk-size spectrum — visualised

TOO SMALL                 SWEET SPOT              TOO LARGE
~64 tokens                ~512 tokens             ~4000 tokens
┌────────────┐            ┌────────────────┐      ┌─────────────────────┐
│ Enterprise │            │ Enterprise     │      │ ... full chapter    │
│ renewals.. │            │ renewals       │      │ Refunds. Payments.  │
└────────────┘            │ refundable in  │      │ Taxes. Credits.     │
┌────────────┐            │ 10 days. CFO   │      │ Gift cards. Fraud.  │
│ Requests   │            │ approval. No   │      │ Enterprise rule     │
│ require..  │            │ refunds after  │      │ buried inside.      │
└────────────┘            │ integration.   │      │ Self-serve rule too.│
┌────────────┐            │ Self-serve...  │      │ Disputes.           │
│ Refunds do │            └────────────────┘      └─────────────────────┘
│ not apply..│                  ✓ all 3 facts          ✗ signal diluted
└────────────┘                                          ✗ may miss the
   ✗ facts split                                        retrieval rank
   ✗ low recall

A tiny chunk gives precision per chunk but loses the joint meaning. A giant chunk preserves meaning but drowns the relevant lines in unrelated text. The similarity score then ranks the chunk lower than it deserves and the librarian never picks it up. The middle wins because it matches the natural unit of the fact — one rule with its qualifiers.
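
The spectrum can be checked against the refund paragraph itself. The following is a minimal sketch, assuming whitespace-separated words as a stand-in for tokens and three literal marker phrases as a stand-in for "facts". The helper names (`window_chunks`, `best_fact_coverage`) are hypothetical, not taken from any library.

```python
PARAGRAPH = (
    "Enterprise renewals are refundable only within 10 business days "
    "of the renewal date. Requests require written CFO approval and "
    "must be submitted via the billing portal. Refunds do not apply "
    "after custom integration work has begun. Standard self-serve "
    "plans follow a separate 30-day rule described in Section 4.5."
)

# The three facts the refund answer needs, as literal marker phrases (assumption).
FACTS = ["10 business days", "CFO approval", "integration work"]


def window_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` words, with no overlap."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def best_fact_coverage(chunks: list[str]) -> int:
    """Largest number of required facts found together in any single chunk."""
    return max(sum(fact in chunk for fact in FACTS) for chunk in chunks)


small = window_chunks(PARAGRAPH, size=12)   # tiny windows: facts end up apart
medium = window_chunks(PARAGRAPH, size=60)  # one window holds the whole rule

print(len(small), best_fact_coverage(small))
print(len(medium), best_fact_coverage(medium))
```

With 12-word windows, no single chunk carries more than one of the three facts. With a window large enough for the whole rule, all three travel together.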

Mini-FAQ. "Why not just embed the whole document?" Because similarity is computed per item. A 30-page handbook covers 50 topics. A question about refunds matches the refund paragraph strongly, but averaged over 30 pages the score becomes mediocre. Other documents with one tight matching chunk will outrank it. You lose the right answer to a worse document that just happens to be small.
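
The dilution effect can be made concrete with a toy similarity. This is a sketch only: it assumes a bag-of-words vector in place of a real embedder, and the query and filler text are invented for illustration. Real dense embeddings dilute less sharply, but the direction is the same.

```python
import math
from collections import Counter


def bow(text: str) -> Counter:
    """Toy 'embedding': word counts. A stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = bow("enterprise refund window")
focused = "enterprise refund window is 10 business days"
# Hypothetical unrelated handbook text, repeated to mimic many other pages.
unrelated = "shipping taxes gift cards fraud disputes invoices credits " * 20

focused_score = cosine(query, bow(focused))
diluted_score = cosine(query, bow(focused + " " + unrelated))

print(round(focused_score, 3), round(diluted_score, 3))
```

The matching words are identical in both cases. Only the amount of surrounding text changed, and the score fell with it.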

What overlap actually buys you

Overlap is the second knob. Adjacent chunks share a tail-and-head stretch of tokens. The point is not redundancy. The point is boundary insurance.

Imagine the chunker cut right between "CFO approval" and "Refunds do not apply after integration work." Without overlap, the integration exception lives alone in chunk 2, while chunk 1 has the 10-day rule. The query "When can enterprise get a refund?" pulls chunk 1. The exception is missing. The answer omits the integration constraint. Customer complains. Engineer debugs. Bug is in the splitter.

WITHOUT OVERLAP                       WITH 15% OVERLAP
chunk 1: 10-day rule | CFO            chunk 1: 10-day rule | CFO | no refund after integ
chunk 2: no refund after integration  chunk 2: CFO | no refund after integ | self-serve
chunk 3: self-serve 30-day rule       chunk 3: integ | self-serve 30-day rule

question hits chunk 1 → exception     question hits chunk 1 → exception present
missing

Typical overlap is 10–20% of chunk size. For a 512-token chunk that is roughly 50–100 tokens. Below 10%, boundary facts leak. Above 25%, your bookshelf grows by a third or more, your embedding bill grows with it, and reranking sees near-duplicates competing for top-k slots.
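
A sliding window with overlap can be sketched in a few lines. This is a minimal illustration, assuming the input is already a list of tokens. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Slide a window of `size` tokens, re-using the last `overlap` tokens each step."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


tokens = [str(i) for i in range(1000)]  # placeholder token stream

plain = chunk_with_overlap(tokens, size=100, overlap=0)
overlapped = chunk_with_overlap(tokens, size=100, overlap=15)

print(len(plain), len(overlapped))
# The tail of one chunk is the head of the next: that is the boundary insurance.
print(overlapped[0][-15:] == overlapped[1][:15])
```

Any sentence shorter than the overlap that straddles a boundary now appears whole in at least one of the two neighbouring chunks. The price is the extra chunks.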

Mini-FAQ. "If overlap is so good, why not 50%?" Three reasons. Storage doubles. Embedding cost doubles. The retriever returns two near-identical chunks for top-2, so the reading desk holds one fact written twice instead of two facts. You spent a chunk-slot on a duplicate.
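
The storage arithmetic behind that answer is short. As a rough sketch, assuming a long corpus and a fixed overlap fraction, each chunk contributes only `1 - overlap` of new text, so the chunk count grows by `1 / (1 - overlap)`. The function name is hypothetical.

```python
def index_growth(overlap_fraction: float) -> float:
    """Approximate multiplier on chunk count (and embedding calls) for a given overlap."""
    if not 0 <= overlap_fraction < 1:
        raise ValueError("overlap_fraction must be in [0, 1)")
    return 1.0 / (1.0 - overlap_fraction)


for overlap in (0.10, 0.20, 0.50):
    print(f"{overlap:.0%} overlap -> {index_growth(overlap):.2f}x chunks")
```

At 50% the multiplier reaches 2, which is the "storage doubles" in the answer above.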

The five-force trade-off table

Each knob you turn pushes some forces up and pulls others down. Memorise this shape.

Knob	Precision	Recall	Self-contained context	Prompt-budget cost	Storage / embed cost
Shrink chunk size	Up	Down	Down	Down per chunk	Up (more chunks)
Grow chunk size	Down	Up (initially)	Up	Up per chunk	Down
Add more overlap	Same	Up	Up at edges	Up (duplicates in top-k)	Up
Cut overlap to zero	Same	Down at edges	Down at edges	Down	Down
Split on sentence boundaries	Up	Same	Up	Same	Same
Split mid-sentence	Down	Down	Down	Same	Same

Look at row 2 carefully. Recall goes up initially with chunk size, then comes back down. Why? Because at some size the relevant line gets diluted enough that similarity drops below the rank cutoff. The curve is not monotonic. That is why "just make chunks bigger" is naive.

The precision-recall curve — drawn out

Picture how each metric moves as chunk size grows. This is what production teams find when they tune.

score
 1.0 │
     │       precision
     │    ╲
 0.8 │     ╲___
     │         ╲___
 0.6 │             ╲___
     │     recall      ╲___
 0.4 │   ___---───╲          ╲___
     │ ╱            ╲___
 0.2 │                  ╲___
     │
 0.0 └─────────────────────────────────────►  chunk size
     64    256    512   1024   2048    4096  (tokens)
            ▲
            └── sweet spot for most prose

Precision falls smoothly as the chunk grows — more unrelated text dilutes the signal. Recall rises sharply at first as the chunk starts holding complete thoughts, peaks somewhere between 256 and 1024 tokens, then also falls as the relevant line gets too diluted to outrank competing documents. The intersection of the two curves is what you tune toward. Simple, no?

Prompt-budget interaction — the reading desk constraint

Now think one stage downstream. The retriever returns top-k. The reranker re-orders. Then the selector picks top-n to fit on the reading desk. The reading desk size is your prompt budget — typically 4k–32k tokens depending on model and other prompt pieces.

Here is the math nobody draws explicitly.

reading desk budget = 8000 tokens

chunk size = 128 tokens   →   ~60 chunks fit  →  wide coverage, many fragments
chunk size = 512 tokens   →   ~15 chunks fit  →  balanced
chunk size = 2000 tokens  →   ~4 chunks fit   →  deep but narrow
chunk size = 4000 tokens  →   ~2 chunks fit   →  one topic, no comparison

So chunk size is not just an indexing decision. It silently controls how many independent sources the model can see at once. A legal-research question that needs to compare 6 clauses fails at 4000-token chunks because only 2 fit on the desk. A simple FAQ answer needs only 1 chunk and benefits from larger chunks for context.
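
The desk arithmetic is integer division. A minimal sketch follows. The `reserved_tokens` parameter is an assumption added here to model the system prompt and the question, which also sit on the desk.

```python
def chunks_that_fit(budget_tokens: int, chunk_tokens: int, reserved_tokens: int = 0) -> int:
    """How many whole chunks fit in the prompt after reserving space for other parts."""
    return max(0, (budget_tokens - reserved_tokens) // chunk_tokens)


for size in (128, 512, 2000, 4000):
    print(size, chunks_that_fit(8000, size))
```

Run it with a non-zero reserve and the large-chunk rows shrink first. At 4000-token chunks, reserving even one token for the prompt leaves room for a single chunk.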

Mini-FAQ. "Can I just increase the model's context window to 200k tokens and stop worrying?" No. Long-context models suffer from lost-in-the-middle: the relevant fact in chunk 15 of 30 gets under-read compared to chunks 1, 2, 29, 30. More chunks help recall but hurt how reliably the model uses them. Chunking is still a quality lever, not a budget lever.

Failure modes — name them, spot them in logs

Every senior RAG engineer can name these by sight.

Clause-split failure. A sentence with "X, except Y" gets cut at the comma. Retrieval finds X without Y. The contract example at the top of this page is this failure.
Mid-sentence cut. Fixed-window splitters cut at token or character N regardless of content. The chunk starts mid-phrase. The embedder sees a broken sentence and produces a noisy vector.
Topic blending. A chunk straddles two H2 headings. Half is about refunds, half is about shipping. The embedding sits between two neighbourhoods and lands well in neither.
Header orphaning. The chunk contains a table row but not the column headers. Retrieval pulls "30 | yes | manager" with no idea what those columns mean.
Stop-word drift. Tiny chunks of one short sentence get dominated by common words. Their vectors cluster near each other regardless of topic.
List explosion. A bulleted list of 40 items gets cut into 40 tiny chunks. Each item lacks the list's framing sentence. Every chunk reads "Item: pay within 30 days" with no policy name.
Code-comment severance. In code RAG, the docstring and the function it documents get split. Retrieval finds the comment without the implementation or vice versa.
PDF column scramble. Two-column PDFs read left-to-right naively, merging unrelated columns into one chunk. Real text never had that order.

If you see retrieval scores that look fine but answers that are subtly incomplete, suspect chunking before you suspect the embedder.
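
Header orphaning has a simple repair: carry the column names into every row chunk. The sketch below assumes the table is already parsed into a header list and row lists. The column names are invented for the example, since the source row "30 | yes | manager" does not say what its columns are.

```python
def rows_with_headers(header: list[str], rows: list[list[str]]) -> list[str]:
    """Turn each table row into a self-describing chunk by pairing cells with column names."""
    return ["; ".join(f"{name}: {cell}" for name, cell in zip(header, row)) for row in rows]


# Hypothetical column names for the orphaned row from the failure list.
header = ["Refund window (days)", "Refundable", "Approver"]
rows = [["30", "yes", "manager"], ["10", "yes", "CFO"]]

for chunk in rows_with_headers(header, rows):
    print(chunk)
```

Each chunk is longer than the bare row, but it can now be embedded and read on its own.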

Quick recall — three seconds each

Before reading further, answer these out loud. Do not scroll back.

What three facts must travel together for the refund question?
Which chunk size lost the integration exception?
What does overlap insure against?

If those are easy, continue. If not, scroll back to the spectrum diagram.

Numbers, qualified by stack

There is no single right chunk size. There are defaults that work for typical stacks. Always qualify by stack before quoting a number in an interview.

Content type	Typical chunk size	Typical overlap	Where this default holds
Product docs, FAQs	256–512 tokens	10–15%	Common LangChain RecursiveCharacterTextSplitter settings, Pinecone starter guides
Long-form prose, reports	512–1024 tokens	15–20%	LlamaIndex SentenceSplitter, Notion AI workspace docs
Legal contracts, policy	300–600 tokens, clause-aware	0–10% (split on clause)	Hebbia, Harvey, Casetext — clause boundaries beat token windows
Research papers	512–1024 tokens, section-aware	10–15%	Elicit, Consensus, SciSpace — split on section headers
Code	Function or class scope, not tokens	None (AST-bounded)	Cursor, GitHub Copilot Chat, Sourcegraph Cody
Customer support tickets	128–256 tokens	10%	Intercom Fin, Zendesk AI — each turn is small
Chat transcripts	256–512 tokens, turn-aware	0%	Slack AI, Glean — split on speaker change
Tables and structured data	Row-group or row+header pairs	0%	Unstructured.io table partitioner, AWS Textract

Why 512 is the unofficial default. Embedding models differ in their hard limits: OpenAI's text-embedding-3-small and text-embedding-3-large accept up to 8191 tokens, while bge-large and e5-large cap at 512. Many embedders were trained mostly on sequences in the 200–512 range. Where a model accepts them, 4000-token chunks work but tend to produce vectors that under-weight the back half. 512 keeps the chunk inside the sweet spot for all of these.

Why product docs go shorter. Each FAQ entry is already short. A 256-token chunk usually captures one full Q&A. Going larger blends two unrelated questions.

Why prose goes longer. A narrative paragraph builds context across 3–6 sentences. Cutting at 256 tokens breaks that arc. 768–1024 keeps a thought intact.

Mini-FAQ. "What does 'semantic chunking' mean and is it worth the cost?" Semantic chunking splits at points where consecutive sentences become embedding-dissimilar — i.e. where the topic shifts. It needs an embedding pass at indexing time just to decide the splits, then another pass to embed the final chunks. Double embedding cost. Worth it when your content has hard topic shifts (news aggregators, mixed-topic wikis). Wasted on uniformly structured docs (API references, contracts) where structural splitters work fine.
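
The split rule can be sketched without a real model. The code below is an illustration under loud assumptions: `embed` is a bag-of-words stand-in for a sentence embedder, the 0.2 threshold is arbitrary, and the function names are hypothetical. A production splitter would call a dense embedding model at the same place.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Stand-in for a sentence embedder: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    """Start a new chunk wherever two consecutive sentences are dissimilar."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]  # first embedding pass: decide the splits
    groups = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            groups.append([sentence])        # topic shift: open a new chunk
        else:
            groups[-1].append(sentence)      # same topic: extend the current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "Refunds are available within 10 business days.",
    "Refunds require CFO approval within those days.",
    "Shipping takes five weeks by sea.",
    "Shipping by air takes two weeks.",
]
chunks = semantic_chunks(sentences)
print(len(chunks))
```

The final chunks would then be embedded again for the index, which is the second pass the answer above counts as the extra cost.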

Mini-FAQ. "Should chunk size match the embedder's max tokens?" No. Match the embedder's training distribution, which is usually closer to 200–512 tokens even when the stated maximum is 8191, as it is for OpenAI's text-embedding-3 models. The model accepts longer input but represents it less faithfully.

Chunk-size choices in shipped systems

The same trade-off shows up across real products and tools. Each one resolves it differently.

LangChain RecursiveCharacterTextSplitter — recursive splitter that tries paragraph, then line, then word boundaries; the class defaults are 4000 characters with 200 overlap, though tutorials commonly set 1000 with 200.
LlamaIndex SentenceSplitter and SemanticSplitterNodeParser — sentence-aware token windows plus an embedding-based semantic splitter option.
Unstructured.io — partitions PDFs, HTML, and Office docs by element type (title, paragraph, list, table) before token windows are even applied.
Pinecone Assistant — managed chunking with overlap defaulted to ~15%, exposes chunk size as a tunable knob.
Notion AI Q&A — chunks block-by-block since Notion's data model is already block-based; each toggle, callout, and paragraph is a natural unit.
Glean — enterprise search chunks by document section and applies permission-aware filters at retrieval time.
Perplexity AI — web pages split by section headings; citations point to chunk-level passages with span hints.
GitHub Copilot Chat — repository chunking by file and function scope, not token windows.
Cursor codebase indexing — splits code by symbol boundaries (function, class, method) using tree-sitter parsers.
Anthropic Claude Projects — user-uploaded documents chunked with overlap; Anthropic has separately published contextual retrieval, which prepends a short chunk-specific context to each chunk to fight orphaning.
OpenAI Assistants File Search — managed chunking that defaults to 800-token chunks with 400-token overlap; both values can be changed through a static chunking strategy.
Google NotebookLM — source-grounded chunking on uploaded PDFs and docs; chunk boundaries align with document structure.
Vectara — managed RAG with sentence-window chunking and configurable context expansion at retrieval time.
Cohere Embed and Rerank — recommends 512-token chunks for the v3 embeddings family.
Hebbia, Harvey, Casetext — legal RAG stacks split on clause and section boundaries, not token counts.
Elastic + ELSER — chunking inside the search pipeline using ingest processors; sparse embedding per chunk.
Weaviate — hybrid search with chunk-level metadata for filtering; chunking is the user's job.
Vespa — supports per-paragraph indexing with sliding windows for overlap.
Sourcegraph Cody — chunks code by symbol, with separate embeddings for symbols and natural-language descriptions.
Microsoft Copilot for Microsoft 365 — grounds answers in Microsoft Graph data; chunks email and document content with role-based scoping.
Intercom Fin — splits help-centre articles by H2/H3 headings to keep each chunk one answer.
Zendesk AI agents — knowledge-base articles split at section boundaries with citation pointers back to source.
Slack AI search — chunks channel history by turn, with thread context preserved as metadata.
Salesforce Einstein Copilot — CRM record chunking by entity (account, contact, opportunity) instead of token windows.

Across all of these the knobs are the same — size, overlap, boundary rule. The product picks defaults that suit its corpus.
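
The boundary-rule knob is easiest to see with code. The sketch below uses Python's standard `ast` module as a stand-in for the symbol-level parsers the code tools above are described as using. It is not how any of those products is implemented, and the sample source is invented.

```python
import ast

SOURCE = '''
def refund_window(days):
    """Return True when a refund request is inside the window."""
    return days <= 10


class Billing:
    def approve(self, by_cfo):
        return bool(by_cfo)
'''


def chunk_by_definition(source: str) -> list[tuple[str, str]]:
    """One chunk per top-level function or class, named after the symbol."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the exact text spanned by the node.
            chunks.append((node.name, ast.get_source_segment(source, node)))
    return chunks


code_chunks = chunk_by_definition(SOURCE)
for name, text in code_chunks:
    print(name, len(text.splitlines()))
```

The docstring stays in the same chunk as the function it documents, so the code-comment severance failure cannot occur at this boundary.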

What chunking does not fix

Be honest about the limits. Chunking will not save you from these.

Garbage source documents. If the PDF was scanned badly and the OCR is half-broken, no splitter heals that. Fix ingestion first.
Wrong embedder. A general-domain embedder on biomedical text mis-ranks even perfect chunks. Domain match comes before chunk size.
Missing reranker. Even good chunks in top-50 may not surface to top-3 without a cross-encoder. Tune chunking after rerank is in place.
Multi-hop questions. "How does our refund policy compare to our competitor's?" needs synthesis across documents. No chunk size alone solves this. Multi-step retrieval does.
Stale data. A perfect chunk of a policy from 2022 still answers wrong if the policy changed in 2024. Freshness is an indexing problem, not a chunking one.

Chunking is necessary but not sufficient. Get it right, and get the other stages right too.

A worked tuning loop — what a real team does on day three

You ship v1 with 512-token chunks, 15% overlap, recursive character splitter. Users start asking real questions. Now you tune.

Sample 50 queries from production logs that produced bad answers.
For each, inspect the top-3 retrieved chunks alongside the source documents.
Classify the failure: clause-split, header-orphan, topic-blend, dilution, or simply not in corpus.
If 30% or more of failures are clause-splits → increase overlap to 20% or switch to a sentence-aware splitter.
If 30% or more are dilution → drop chunk size to 384 tokens.
If 30% or more are header-orphan → use Unstructured.io to add title metadata to each chunk.
Re-embed only the affected document set if possible. Otherwise budget for a full re-index.

This is the loop. Most teams skip step 2 and tune knobs blind. They report "we tried different sizes." That is gambling, not engineering.
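
Once the failures are labelled by hand, the decision rule is a tally. A minimal sketch follows, assuming the labels come from the manual inspection step. The label strings, the `ACTIONS` mapping, and the sample counts are illustrative, not real data.

```python
from collections import Counter

# Knob to turn for each failure class, taken from the loop above.
ACTIONS = {
    "clause-split": "raise overlap to 20% or switch to a sentence-aware splitter",
    "dilution": "drop chunk size to 384 tokens",
    "header-orphan": "attach title metadata to each chunk",
}


def recommend(labels: list[str], threshold: float = 0.30) -> list[str]:
    """Return the actions for every failure class at or above the threshold share."""
    if not labels:
        return []
    counts = Counter(labels)
    return [
        ACTIONS[label]
        for label, n in counts.most_common()
        if label in ACTIONS and n / len(labels) >= threshold
    ]


# Hypothetical labels for 50 inspected bad answers.
labels = (
    ["clause-split"] * 20
    + ["topic-blend"] * 15
    + ["dilution"] * 10
    + ["not-in-corpus"] * 5
)
print(recommend(labels))
```

Classes with no chunking remedy, such as "not in corpus", are counted but produce no action. They belong to ingestion, not to the splitter.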

One more habit. Version your chunker configuration alongside your embedder version. Save the splitter type, chunk size, overlap, and embedder model into the index metadata. When you re-index next quarter and answers degrade, you will know whether the regression came from a new splitter or a new embedder. Without versioning, you have a haunted system. With it, you have a diff.
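
One way to get that diff is to treat the chunker settings as a small record with a fingerprint. This is a sketch, not a prescribed schema: the class name, field names, and example values are assumptions, and where the record is stored depends on your vector store.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ChunkerConfig:
    splitter: str
    chunk_size: int
    overlap: int
    embedder: str

    def fingerprint(self) -> str:
        """Short stable hash to store next to every indexed chunk."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def config_diff(old: ChunkerConfig, new: ChunkerConfig) -> dict:
    """Fields that changed between two index builds, as (old, new) pairs."""
    a, b = asdict(old), asdict(new)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


v1 = ChunkerConfig("recursive-character", 512, 77, "text-embedding-3-small")
v2 = ChunkerConfig("recursive-character", 512, 102, "text-embedding-3-small")

print(v1.fingerprint() != v2.fingerprint())
print(config_diff(v1, v2))
```

When answers degrade after a re-index, the diff says which knob moved. Here it was only the overlap.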

Recall — can you name the five chunking forces cold?

Name the five forces that pull on every chunking decision.
Why is recall not monotonically increasing in chunk size?
What is the typical overlap range, and what stops you from going higher?
Name three failure modes you can spot in logs and what each looks like.
Why do code repositories chunk by function instead of by token count?
What is the unofficial default chunk size for general prose, and why 512 specifically?
What does "semantic chunking" cost extra, and when is it worth it?
Why is chunking the first irreversible decision in the pipeline?

Interview Q&A

Q1. What chunk size should I use? A. There is no universal answer. Start at 512–1024 tokens for long-form prose, 256–512 for product docs and FAQs, function-scoped for code. Then tune based on retrieval failures in production logs. Always qualify the number with the content type and the embedder's training distribution. Common wrong answer to avoid: "512 tokens is the right answer for everything."

Q2. Why does increasing chunk size sometimes reduce recall? A. Because similarity is computed across the whole chunk. As the chunk grows, the relevant line is averaged with more unrelated text. Past a point the chunk's score drops below the rank cutoff and the librarian never picks it up. Common wrong answer to avoid: "Bigger chunks always have more recall because they contain more information."

Q3. What is the role of overlap? A. Overlap protects facts that sit near chunk boundaries by ensuring the same sentence appears in two adjacent chunks. Typical range is 10–20% of chunk size. Too little leaks boundary facts; too much creates near-duplicate retrievals that waste prompt budget. Common wrong answer to avoid: "Overlap is just for making the database bigger."

Q4. Why not chunk on a fixed character count? A. Because fixed-character splitters cut mid-sentence, mid-word, mid-clause. The embedder then sees a broken phrase and produces a noisier vector. Sentence-aware or token-aware splitters preserve linguistic units and produce cleaner embeddings. Common wrong answer to avoid: "Character splitting is fine because tokens are roughly four characters."

Q5. How would you debug a RAG system where retrieval scores look good but answers are subtly incomplete? A. Pull 20 bad answers, look at the retrieved chunks, and check whether the relevant fact's qualifiers — exceptions, conditions, dates — were on the next chunk. If yes, the failure is a clause-split caused by undersized chunks or insufficient overlap. Fix the splitter before touching the embedder. Common wrong answer to avoid: "I would swap the embedding model."

Q6. Why do products like Cursor and Copilot Chat use function-scoped chunks instead of token windows? A. Because code's natural unit is the function or method. Cutting mid-function produces chunks where the signature is in one chunk and the body in another. Splitting by AST node (function, class, method) keeps semantic units intact and matches how developers reason about code. Common wrong answer to avoid: "Code is the same as prose, you can just use a token splitter."

Q7. What is semantic chunking, and what does it cost? A. Semantic chunking embeds sentences at indexing time and places splits where consecutive sentences become embedding-dissimilar. The benefit is splits that align with topic shifts. The cost is roughly double embedding compute at indexing time and added complexity in the pipeline. Worth it for mixed-topic corpora; wasted on uniformly structured docs. Common wrong answer to avoid: "Semantic chunking is always better than token-based chunking."

Q8. Why is chunking called the first irreversible decision in RAG? A. Because once you have embedded and indexed a corpus, changing the chunker means re-embedding everything. For a 10M-document corpus on a paid embedding API, that is a multi-thousand-dollar operation. Embeddings, rerankers, and prompts can be swapped cheaply. Chunking cannot. Common wrong answer to avoid: "You can just re-chunk on the fly at query time."
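
A back-of-envelope estimate makes the answer to Q8 easier to defend. The sketch below is arithmetic only. The average document length and the price per million tokens are hypothetical inputs, not quotes from any provider, so substitute your own.

```python
def reembed_cost_usd(
    documents: int,
    avg_tokens_per_doc: int,
    price_per_million_tokens: float,
    overlap_fraction: float = 0.0,
) -> float:
    """Rough cost of re-embedding a corpus, including tokens duplicated by overlap."""
    if not 0 <= overlap_fraction < 1:
        raise ValueError("overlap_fraction must be in [0, 1)")
    embedded_tokens = documents * avg_tokens_per_doc / (1.0 - overlap_fraction)
    return embedded_tokens / 1_000_000 * price_per_million_tokens


# 10M documents is from the answer above; the other numbers are assumptions.
cost = reembed_cost_usd(
    documents=10_000_000,
    avg_tokens_per_doc=2000,
    price_per_million_tokens=0.10,
    overlap_fraction=0.15,
)
print(f"${cost:,.2f}")
```

The estimate covers embedding calls only. Re-parsing, vector store writes, and the hours of compute come on top.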

Apply now (10 min)

Step 1 — model the exercise. Here is the trace I would write for our contract paragraph at three chunk sizes.

Chunk size	Overlap	Splitter	Likely top-1 chunk	Likely answer quality
64 tokens	0	Token window	"Enterprise renewals are refundable only within 10 business days."	Incomplete — misses CFO and integration exception
512 tokens	15%	Sentence-aware	All four sentences in one chunk	Complete
4000 tokens	15%	Token window	Full chapter incl. taxes, fraud, gift cards	Retrieval may miss; if retrieved, dilution risk

Step 2 — your turn. Take one real document from your own work. A policy, a wiki page, a README. Pick one realistic question against it. Now chunk it three ways — tiny, medium, huge — and predict which chunks would be retrieved. Write down the missing facts in each version.

Step 3 — name the failure mode. For each of the three versions, name the specific failure mode from the list above (clause-split, dilution, header-orphan, etc.). If you cannot name one, the version is probably fine.

Step 4 — sketch from memory. Draw the spectrum diagram from this page. Three columns: too small, sweet spot, too large. Beside each, the failure mode and one real product that lives there.

What you should remember

This chapter explained why chunk size is the first irreversible decision in a RAG pipeline. Too small and clauses get split from their qualifiers; too large and the relevant sentence drowns in surrounding noise. Five forces pull on every decision (precision, recall, self-contained context, prompt budget, and cost and latency), and no single number satisfies all five.

You learned to spot three of the named failures in production logs: clause-split (the qualifier sits in chunk B), header-orphan (the section title is gone), and dilution (the right paragraph is buried in a 4k-token chunk). You also learned the tuning loop a real team runs: pull 50 bad answers, inspect the retrieved chunks against the source, classify the failure, and adjust the knob the failures point to. Most teams skip the inspection step and tune blind.

Carry this diagnostic forward: when answers look subtly incomplete, suspect a clause-split before suspecting a weak embedder. Read the chunks next to the source paragraph. If the qualifier sentence is one chunk over, the embedder is innocent — the splitter is the bug.

Remember:

Chunking is one decision balancing five forces. There is no universal best size.
Recall is not monotonic in chunk size. Past the sweet spot, dilution drops the chunk's similarity score below the cutoff.
Overlap (10–20%) is insurance against boundary splits, not a free recall lift.
Match chunk size to the embedder's training distribution, not its max-token limit.
Version your chunker config in the index metadata. Without it, every re-index is a haunted regression.

Bridge. The trade-offs are clear now. Small loses context, large dilutes signal, overlap insures boundaries at a storage cost, and the default sizes change by content type. The next file moves from what the trade-offs are to what strategies actually beat naive token windows — recursive splitting, semantic splitting, document-aware splitting, and the contextual retrieval trick Anthropic published.

→ 04-chunking-strategies.md