12. Chunking handoff: what ingestion owes the chunker
~10 min read. The contract between this module and module 13.
[Stub — to be written]
Outline:
- The principle: ingestion produces a structured document tree, the chunker decides how to slice it
- The tree shape: sections, paragraphs, tables, figures, lists — each tagged
- Why "give me one big string" is the wrong handoff
- Preserving boundaries the chunker should respect (do not split a table, do not split a code block)
- Soft hints vs hard rules (preferred chunk boundaries, mandatory ones)
- Markdown as a lingua franca handoff format — strengths and limits
- JSON-shaped element streams (Unstructured-style) as the alternative
- When the chunker should re-call ingestion (e.g., zoom into a figure to caption it)
- Cross-reference forward to chunking strategies in module 13
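
To make the outline concrete, here is a minimal sketch of what a tagged tree with hard and soft boundaries might look like, and how a chunker could consume it. Everything in it is an assumption for illustration: the `Node` class, the `atomic` and `break_before` fields, and the `pack` function with its `min_chars` parameter are hypothetical names, not the interface of module 13 or of any particular library. The greedy packing is only a stand-in for the real strategies covered in module 13.

```python
import textwrap
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """One tagged element of the (hypothetical) document tree."""
    kind: str                      # e.g. "section", "heading", "paragraph", "table", "code"
    text: str = ""                 # leaf content; atomic nodes carry their full content here
    children: List["Node"] = field(default_factory=list)
    atomic: bool = False           # hard rule: the chunker must never split this node
    break_before: bool = False     # soft hint: a chunk boundary here is preferred


def units(node: Node) -> Iterator[Node]:
    """Yield leaves in document order; atomic nodes are yielded whole."""
    if node.atomic or not node.children:
        yield node
    else:
        for child in node.children:
            yield from units(child)


def pack(root: Node, max_chars: int, min_chars: int = 0) -> List[str]:
    """Greedily pack units into chunks of roughly max_chars text characters."""
    chunks: List[str] = []
    current: List[str] = []
    size = 0  # text characters in the current chunk (joining newlines not counted)

    def flush() -> None:
        nonlocal current, size
        if current:
            chunks.append("\n".join(current))
        current, size = [], 0

    for unit in units(root):
        if not unit.text:
            continue
        # Soft hint: honored only if the current chunk already has min_chars.
        if unit.break_before and size >= min_chars:
            flush()
        # Hard rule: atomic text stays in one piece, even if it is over budget.
        # Other text may be re-wrapped on whitespace to fit the budget.
        pieces = [unit.text] if unit.atomic else textwrap.wrap(unit.text, max_chars)
        for piece in pieces:
            if current and size + len(piece) > max_chars:
                flush()
            current.append(piece)
            size += len(piece)
    flush()
    return chunks


# A tiny tree of the kind ingestion might hand over.
doc = Node("document", children=[
    Node("section", children=[
        Node("heading", "Setup", break_before=True),
        Node("paragraph", "Install the package and run the tests."),
        Node("code", "pip install example\npytest -q", atomic=True),
    ]),
    Node("section", children=[
        Node("heading", "Results", break_before=True),
        Node("table", "| run | score |\n| 1 | 0.9 |\n| 2 | 0.8 |", atomic=True),
    ]),
])

chunks = pack(doc, max_chars=60)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)
```

The design point is that the two kinds of boundary behave differently under pressure. Shrinking `max_chars` re-wraps the paragraph into smaller pieces, but the code block and the table still come out whole, while raising `min_chars` lets the chunker ignore a `break_before` hint when the current chunk is still short. A single flat string gives the chunker neither signal.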