04. Chunking strategies — the toolkit¶
~12 min read. The previous chapter set the trade-off. This one is the toolkit. Six strategies. Same document. Six different bookshelves.
Builds on 03-chunking-tradeoffs.md. The librarian does not arrange the bookshelf randomly. She picks a strategy. Different strategies cut the same page differently.
Hook — the contract clause that ruins a fixed splitter¶
Picture a legal team using a RAG system over vendor contracts. A user asks, "What is the cap on indirect damages?" The relevant clause reads like this.
"Notwithstanding anything to the contrary, the aggregate liability of either party for indirect, consequential, or punitive damages shall not exceed the fees paid in the twelve months preceding the claim, except in cases of gross negligence or willful misconduct."
The fixed-size splitter cuts at token 512. The cut lands exactly between "shall not exceed" and "the fees paid in the twelve months." Now the retriever finds the first half. The cap number is in the second half. The model answers, "There is no specific cap on indirect damages." Confident. Wrong. Defensible-looking.
A recursive splitter would have stopped at the sentence end. A semantic splitter would have kept the "notwithstanding... except" arc together. A layout-aware splitter would have respected the clause as one numbered block. Same document. Same query. Three strategies. Only the last two get the answer right.
This chapter is how you pick.
The running example — one support article, six chunkings¶
We will use one document throughout. A 5-page support article titled "Refund Policy for Enterprise Customers." Its structure:
H1: Refund Policy for Enterprise Customers
H2: Eligibility
P: Annual plans may request a refund within 30 days of renewal.
P: Custom integration work voids refund eligibility.
H2: Process
P: Submit through the billing portal.
P: CFO approval is required for refunds above $50K.
H2: Exceptions
P: Refunds past 30 days require manager approval...
P: Refunds are issued as credit, not cash.
H2: Regional rules
Table: EU | UK | US | APAC | refund-window | tax-treatment
One source. Same paragraphs. Watch how six strategies cut it.
1) Fixed-size — the ruler¶
The simplest splitter. Every 512 tokens, cut. Slide forward. Repeat. No structure awareness.
Document: ──H1──H2──P1─P2──H2──P3─P4──H2──P5─P6──Table──
Fixed: ──────|─────────|─────────|─────────|─────────
chunk1 chunk2 chunk3 chunk4 chunk5
^ ^ ^
cuts mid-paragraph, mid-row, mid-clause
When it wins. Uniform text. Clean prose. No headings to respect. Logs. Transcripts. Chat history. Benchmarks where you need a reproducible baseline.
Failure modes. - Cuts mid-sentence, mid-clause, mid-table-row. - Header lands at the end of one chunk; its content lands at the start of the next. - Code functions get severed mid-body.
Numbers. Default 512 tokens with 50-token overlap. Cheapest. ~0 ms per cut. Storage matches token count exactly.
Mini-FAQ. "If fixed-size is this dumb, why is it everywhere?" Because it is fast, deterministic, and "good enough" for clean prose. Many production stacks start here and only graduate when retrieval evals show structure-aware methods winning. Never assume; measure.
2) Recursive character splitter — the fallback ladder¶
Tries the largest natural separator first. If the resulting chunk is still too big, it falls back to a smaller separator. Then smaller again. The LangChain RecursiveCharacterTextSplitter is the canonical implementation.
Separator priority: "\n\n" → "\n" → ". " → " " → char
Document: ──H1\n\nH2\n\nP1\n\nP2\n\nH2\n\nP3...
Recursive: ──────|──────|──────|──────|──────|
splits at "\n\n" first
if a paragraph > 512, split at "\n"
if still big, split at ". "
only chars as the last resort
When it wins. Markdown, plain prose, technical docs with clean paragraph breaks. The author already encoded structure with newlines; you exploit it.
Failure modes.
- Long paragraphs with no internal newlines still get raw-character cuts.
- Lists with - bullets are not separators by default — bullet items can fuse.
- Sentence-final punctuation in abbreviations ("e.g.") confuses the ". " rule.
Numbers. Same 512/50 defaults as fixed-size. Adds maybe 1–2 ms per document for separator scanning. Free, basically.
Mini-FAQ. "Recursive splitter vs sentence splitter — what's the difference in practice?" Recursive falls back through a separator hierarchy and may stop at paragraph or even mid-paragraph if chunks are too big. Sentence splitter goes straight to sentence boundaries using a parser like spaCy or NLTK. Sentence is sharper for QA over prose; recursive is more general-purpose.
3) Sentence-aware splitter — the careful reader¶
Uses a real sentence tokenizer (spaCy, NLTK Punkt, syntok) to find sentence ends. Then groups N sentences per chunk, with a few sentences of overlap. LlamaIndex SentenceSplitter is the reference.
Document: S1. S2. S3. S4. S5. S6. S7. S8. S9. S10. S11. S12.
Sentence: [S1 S2 S3 S4 S5] [S5 S6 S7 S8 S9] [S9 S10 S11 S12]
chunk 1 chunk 2 chunk 3
^ shared sentence overlap ^
Notice the overlap is now measured in sentences, not tokens. That keeps boundary facts intact and avoids cutting a clause.
When it wins. FAQs, knowledge-base articles, customer-support content, narrative prose. Anywhere a single sentence often holds one fact.
Failure modes. - Long compound sentences become single huge chunks if you set sentences-per-chunk too low. - Multi-language documents — tokenizers tuned for English mis-segment Devanagari, Arabic, or Chinese. - Tables and code embedded in prose break sentence detection.
Numbers. ~5–20 ms per document for tokenization. Storage close to source. Common defaults: 5 sentences per chunk, 1 sentence overlap.
4) Semantic chunker — the topic detector¶
Embed each sentence. Compute similarity between adjacent sentences. When similarity drops below a threshold, declare a boundary. Greg Kamradt popularized this; LlamaIndex SemanticSplitterNodeParser ships it.
S1 S2 S3 || S4 S5 || S6 S7 S8 || S9 S10
sim 0.91 0.88 0.42 0.85 0.39 0.87 0.90 0.31 0.84
^topic ^topic ^topic
shift shift shift
Result: [S1 S2 S3] [S4 S5] [S6 S7 S8] [S9 S10]
refunds process exceptions regional
The cuts land where meaning actually shifts, not where punctuation happens to fall.
When it wins. Long prose with weak headings. Transcripts. Meeting notes. Old wiki dumps. The contract clause from the hook — semantic chunking keeps the "notwithstanding... except" arc whole.
Failure modes. - Threshold tuning is fragile — too high, every sentence is its own chunk; too low, chunks become huge. - Costs an extra embedding pass at indexing time: 1 embedding per sentence on top of 1 per final chunk. Roughly 2× the embedding spend. - For 1M sentences at \(0.02/1M tokens with text-embedding-3-small (~80 tokens/sentence average): an extra ~\)1.60 just for boundary detection. Cheap. With LLM-based "agentic" chunkers that call GPT-4o-mini per boundary candidate, the cost jumps 100–1000×. - Subtle topic drift gets missed; abrupt topic returns get falsely split.
Mini-FAQ. "Is semantic chunking always better?" No. On structured docs (markdown with good headings, code, tables), document-aware beats it for free. Semantic shines on long unstructured prose where headings are missing or unreliable. Measure on your corpus; do not assume.
5) Hierarchical / parent-document — two zoom levels¶
Index small children for sharp retrieval. Keep the parent for surrounding context. At query time, retrieve children, then return their parent (or both) to the LLM. LangChain calls this ParentDocumentRetriever; LlamaIndex calls it auto-merging retrieval.
Parent (H2: Exceptions, ~800 tokens)
├── child 1: "Refunds past 30 days require manager approval."
├── child 2: "Refunds are issued as credit, not cash."
├── child 3: "Disputed charges follow a separate process."
└── child 4: "Fraud cases are escalated to legal."
Retriever matches child 1. Reading desk receives: parent block.
^ child 1 is inside, plus its siblings
When it wins. Long policy docs. Legal contracts. Technical manuals. Anywhere a tight match needs surrounding context to be answerable. The matched sentence alone is too thin; the full section is just right.
Failure modes. - Storage doubles or triples (children + parents indexed separately). - Picking the wrong parent granularity — chapter-level parents are too big; sentence-level parents collapse to flat retrieval. - De-duplication gets harder when many children point to the same parent.
Numbers. Storage cost ~2–3× flat chunking. Common shapes: parent 1500–2000 tokens, child 200–400 tokens.
Mid-chapter checkpoint — three strategies, three signatures¶
Without scrolling up, answer these three.
- Which strategy explicitly pays an embedding cost per sentence at indexing time?
- Which strategy returns more context than it matches on?
- Which strategy is fast, deterministic, and ignores all structure?
If you can answer cleanly, continue. If not, re-read sections 1, 4, and 5.
6) Layout-aware — the document parser¶
PDFs, scanned forms, slides, tables, multi-column papers. Plain text extraction destroys them. Layout-aware splitters use a parser that reconstructs structure: tables stay as tables, figures get captions, multi-column flow gets re-stitched in reading order.
PDF page:
┌───────────────────────┬───────────────────────┐
│ Column 1: refund text │ Column 2: tax notice │
│ continues here... │ continues here... │
├───────────────────────┴───────────────────────┤
│ Table: region | window | tax │
│ EU | 30d | VAT inclusive │
│ US | 30d | sales tax extra │
└───────────────────────────────────────────────┘
Naive text extract: "Column 1: refund text Column 2: tax notice continues here..."
^ wrong column order, table flattened
Layout-aware extract: ["Column 1 paragraph", "Column 2 paragraph",
"<table>region|window|tax | EU|30d|VAT...</table>"]
^ table kept as a unit
Markdown gets the same treatment — split at H1/H2/H3 boundaries. Code gets AST-based splits: one function or class per chunk, with imports preserved at the top.
When it wins. Financial filings. Scientific papers with figures and tables. Insurance forms. Legal contracts with numbered clauses. Codebases. Anything the layout itself carries meaning.
Failure modes. - Parser quality is the ceiling — bad OCR on scanned PDFs cascades into bad chunks. - Tables larger than chunk size still need a strategy (row-grouped? summary + rows?). - Code splitters that ignore call graphs miss cross-function context.
Numbers. Heavy. LlamaParse charges per page; Unstructured.io and Docling are free but slow (200 ms – 2 s per page). For a 100K-page corpus, that is real money or real wall-clock time.
Mini-FAQ. "When does layout-aware actually matter?" When the visible structure carries semantic weight. A 10-K filing where revenue lives inside a table. A medical record where the "Medications" section is structurally separate from "History." A contract where clause 7.3.2 is a distinct legal unit. If your retrieval gets the right text but wrong row-of-the-table, you need layout-aware.
Decision tree — which strategy when¶
START
│
▼
Is the file a PDF, scanned doc, or has tables/figures?
│
├── yes → LAYOUT-AWARE (Unstructured, LlamaParse, Docling)
│
└── no
│
▼
Is it code?
│
├── yes → AST-AWARE (tree-sitter, language-aware splitters)
│
└── no
│
▼
Are headings/paragraphs reliable?
│
├── yes → RECURSIVE or SENTENCE-AWARE
│ (LangChain recursive / LlamaIndex SentenceSplitter)
│
└── no
│
▼
Is it long prose with topic drift?
│
├── yes → SEMANTIC (extra embedding cost)
│
└── no → FIXED-SIZE (baseline)
ALWAYS consider HIERARCHICAL on top when context > match-precision matters.
The rule is not "use the smartest strategy." The rule is "match the strategy to the structure of the source." A semantic chunker on clean markdown is wasted spend. A fixed splitter on a 10-K filing is malpractice.
Cost and latency, at a glance¶
| Strategy | Indexing latency | Indexing cost | Storage | Retrieval quality on clean prose | On layout-heavy docs |
|---|---|---|---|---|---|
| Fixed-size | ~0 ms/doc | $0 | 1× | OK | Poor |
| Recursive | ~1 ms/doc | $0 | 1× | Good | Poor |
| Sentence-aware | ~10 ms/doc | $0 | 1× | Good | Poor |
| Semantic | 50–500 ms/doc | ~2× embedding | 1× | Best | Mediocre |
| Hierarchical | depends on base | +child indexing | 2–3× | Good | Good |
| Layout-aware | 200 ms – 2 s/page | $$ parser fees | 1× | OK | Best |
Always qualify by stack. LlamaParse pricing differs from Unstructured self-hosted, which differs from running a Docling container yourself.
Chunking-strategy choices across the toolkit¶
- LangChain
RecursiveCharacterTextSplitter— default recursive splitter shipped with the LangChain framework. - LlamaIndex
SentenceSplitter— sentence-aware default in LlamaIndex pipelines. - LlamaIndex
SemanticSplitterNodeParser— built-in semantic chunker using sentence embeddings. - LlamaIndex
HierarchicalNodeParserand auto-merging retriever — parent-document retrieval, packaged. - LangChain
ParentDocumentRetriever— parent-document pattern in LangChain. - Unstructured.io — production document parser with layout-aware chunking for PDFs, HTML, DOCX, PPTX.
- LlamaParse — managed PDF parser from LlamaIndex, layout and table aware, per-page pricing.
- Docling (IBM Research) — open-source layout parser with strong table extraction.
- Marker — open-source PDF-to-markdown converter with layout preservation.
- Nougat (Meta AI) — scientific PDF parser tuned for equations and tables.
- Anthropic Projects — internal chunking of user-uploaded files for grounded retrieval.
- OpenAI Assistants File Search — managed chunking and retrieval over attached files.
- Pinecone Assistant — managed RAG with built-in chunking and reranking.
- AWS Bedrock Knowledge Bases — configurable chunking strategies (fixed, hierarchical, semantic) as a managed service.
- Google Vertex AI Search — built-in document chunking for grounded generative answers.
- Azure AI Search — integrated vectorization with built-in text-splitting skills.
- Cohere Compass — multimodal retrieval platform with structure-aware chunking.
- Notion AI Q&A — block-aware splitting that respects Notion's block model.
- Glean — enterprise search with permission-aware, source-aware chunking.
- Cursor — code-aware retrieval with file-symbol-level chunking for repositories.
- GitHub Copilot Chat — code retrieval over repository context with symbol-aware splits.
- Google Document AI Layout Parser — managed layout extraction for forms and contracts.
- Microsoft Syntex / Copilot for M365 — Graph-aware retrieval respecting document structure.
- Vectara — RAG-as-a-service with built-in chunking and faithfulness scoring.
- Hebbia / Harvey / Casetext — domain RAG for finance and law with clause-aware chunking.
- NotebookLM — Google's source-grounded QA with chunking tuned for user-uploaded notes.
- Slack AI — channel-history chunking aware of thread structure.
Different stacks. Same toolkit. The names rotate; the strategies do not.
Recall — match strategy to corpus, cold¶
- On our refund-policy document, which strategy keeps the H2-section "Exceptions" intact as one unit?
- Which strategy pays an extra embedding pass at indexing time, and roughly how much extra cost?
- When does sentence-aware lose to semantic, and when does it win?
- What is the storage cost of hierarchical chunking compared to flat chunking?
- Why is layout-aware mandatory for 10-K filings but overkill for a blog post?
- What goes wrong if you run a fixed-size splitter on a Python file?
- Name one failure mode unique to semantic chunking that recursive does not have.
- Why is "always pick the smartest strategy" the wrong rule?
Interview Q&A¶
Q1. When would you pick recursive over semantic chunking? A. When the document already has reliable structural separators — markdown headings, clean paragraphs, well-formatted HTML. Recursive captures the author's intent for free; semantic spends extra embedding compute to rediscover boundaries the author already marked. Common wrong answer to avoid: "Semantic is always better; recursive is just a baseline."
Q2. What does parent-document retrieval solve that flat chunking does not? A. The precision-versus-context trade-off. Small chunks match sharply but lack surrounding context; large chunks have context but match weakly. Parent-document retrieval matches on small children and returns the larger parent, so you get both. Common wrong answer to avoid: "It just stores duplicate copies for redundancy."
Q3. What is the indexing cost of semantic chunking? A. Roughly 2× the embedding cost of a non-semantic strategy because every sentence gets embedded twice — once for boundary detection during chunking, once for the final chunk vector. If LLM-based boundary detection is used instead of embeddings, the cost can be 100–1000× higher per document. Common wrong answer to avoid: "It's free because embeddings are cheap."
Q4. Why does fixed-size chunking destroy contracts and 10-K filings? A. Because legal and financial documents encode meaning in clause numbers, table rows, and section boundaries. A token-counter ruler ignores all of that and cuts mid-clause or mid-row, separating a value from its label or a rule from its exception. The retrieval finds half a fact. Common wrong answer to avoid: "It works fine; you just need more overlap."
Q5. When is layout-aware chunking overkill? A. On clean prose where layout carries no semantic weight — blog posts, articles, transcripts, support documentation in markdown. Paying parser fees and accepting 200 ms – 2 s per page is wasted spend when recursive or sentence-aware would have done the job. Common wrong answer to avoid: "Layout-aware is always better; it just costs more."
Q6. Your retrieval gets the right document but wrong row in a table. Which strategy fixes this? A. Layout-aware chunking that treats the table as a structural unit and either keeps it whole as one chunk or splits it row-by-row with column headers preserved on each row. Plain text extraction flattens tables and destroys the column-to-cell relationship the retriever needs. Common wrong answer to avoid: "Smaller chunks; the row will become its own chunk eventually."
Q7. How would you chunk a Python codebase for a code-assistance RAG? A. Use an AST-aware splitter (tree-sitter, language-server) that splits at function and class boundaries while preserving imports at the top of each chunk. Add file path and symbol name as metadata. Avoid token-based splitters that sever function bodies mid-line. Common wrong answer to avoid: "Run the recursive splitter; it handles code fine."
Q8. Trap question — semantic chunking gave worse retrieval than recursive. Why might that happen? A. Several reasons: (1) the corpus is structured markdown where recursive already captured boundaries perfectly; (2) semantic threshold was mistuned, producing too-small or too-large chunks; (3) the embedding model used for boundary detection is domain-mismatched (legal text scored by a generic embedder); (4) the corpus has frequent topic returns that semantic falsely splits. Common wrong answer to avoid: "Impossible — semantic is theoretically superior."
Apply now (10 min)¶
Step 1 — model the exercise. Here is the trace I would write for one document type, picking a strategy with a reason.
| Doc type | Chosen strategy | Why | One failure I would log |
|---|---|---|---|
| Markdown product docs | Recursive (LangChain) | Headings and paragraphs are reliable | Long code-block paragraphs forced character splits |
| Vendor contracts | Layout-aware + hierarchical | Clause numbering matters; context > match | Mis-parsed table rows in scanned scans |
| Long support transcripts | Semantic | No headings, topic drift across turns | Topic-return false splits |
| Python codebase | AST-aware (tree-sitter) | Function/class are natural units | Cross-function calls missing context |
| 10-K filings | LlamaParse + hierarchical | Tables and section structure dominate | Table cells flattened on bad pages |
Step 2 — your turn. Pick three document types from your own product. For each, write the strategy you would use, one sentence of justification, and one failure you would set up monitoring for.
Step 3 — sketch from memory. Redraw the decision tree from section "Decision tree — which strategy when" without looking back. Beside each leaf, write one real product that defaults to that strategy. If you can do this cold, you have the toolkit.
What you should remember¶
This chapter explained why "what splitter should I use?" is the wrong question. The right question is what does the source structure already encode? If headings are reliable, recursive captures the author's intent for free. If the layout itself carries meaning — tables in a 10-K, clauses in a contract, multi-column papers — layout-aware is the only honest choice. If topic drift is the dominant pattern, semantic chunking earns its extra embedding cost. Code is its own world and wants an AST splitter.
You learned the decision tree as a sequence of structural questions, not a ranking of "smart vs dumb" strategies. You also learned the often-missed trick: hierarchical (parent-document) sits on top of any base strategy and resolves the precision-vs-context tension by matching on small children and returning the larger parent. The cost is 2–3× storage, which is usually cheap insurance.
Carry this diagnostic forward: when answers are subtly incomplete on a structured corpus, do not blame the embedder. Ask whether the splitter respected the document's structure. A fixed splitter on a contract is malpractice; a semantic splitter on clean markdown is wasted spend.
Remember:
- Match the strategy to the source structure, not to a "best practice" list.
- Recursive is the workhorse for headed prose; sentence-aware for cleaner paragraph splits.
- Layout-aware is mandatory whenever tables, columns, or numbered clauses carry meaning.
- Hierarchical/parent-document is the cleanest fix for "small chunks match sharply but lack context."
- Semantic chunking pays double embedding cost. Use it for topic-drift corpora where boundaries are not already marked.
Bridge. The bookshelf is now arranged. Six strategies, one corpus, six different layouts. But every chunk on every shelf still needs an index card — a vector representation of its meaning. Without that card, the librarian cannot find anything. The next chapter unpacks embeddings.