12. Chunking handoff — what ingestion owes the chunker

~10 min read. The contract between this module and module 13.


[Stub — to be written]

Outline:

  • The principle: ingestion produces a structured document tree; the chunker decides how to slice it
  • The tree shape: sections, paragraphs, tables, figures, lists — each tagged
  • Why "give me one big string" is the wrong handoff
  • Preserving boundaries the chunker should respect (do not split a table, do not split a code block)
  • Soft hints vs hard rules (preferred chunk boundaries vs mandatory ones)
  • Markdown as a lingua franca handoff format — strengths and limits
  • JSON-shaped element streams (Unstructured-style) as the alternative
  • When the chunker should re-call ingestion (e.g., zoom into a figure to caption it)
  • Cross-reference forward to chunking strategies in module 13
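
Until the section is written, here is a minimal sketch of the handoff the outline describes, under stated assumptions. The `Element` class, the `ATOMIC_KINDS` set, the `flatten` function, and the field names `atomic` and `soft_break_before` are all hypothetical, chosen for illustration; they are not the API of Unstructured or of any other library. The sketch shows a tagged document tree being flattened into a JSON-shaped element stream in which each element carries its section path, a hard rule (`atomic`: never split inside this element), and a soft hint (`soft_break_before`: a preferred place to start a new chunk).

```python
from __future__ import annotations

import json
from dataclasses import dataclass, field
from typing import Iterator

# Assumption: these kinds must never be split by the chunker (hard rule).
ATOMIC_KINDS = {"table", "figure", "code"}


@dataclass
class Element:
    """One node of the hypothetical ingestion tree."""
    kind: str  # "section", "paragraph", "table", "figure", "list", "code"
    text: str = ""  # heading text for sections, content otherwise
    children: list["Element"] = field(default_factory=list)

    @property
    def atomic(self) -> bool:
        return self.kind in ATOMIC_KINDS


def flatten(node: Element, path: tuple[str, ...] = ()) -> Iterator[dict]:
    """Yield leaf elements in reading order as JSON-ready records."""
    if node.kind != "section":
        yield {
            "kind": node.kind,
            "text": node.text,
            "section_path": list(path),
            "atomic": node.atomic,  # hard rule
            "soft_break_before": False,  # soft hint, set by the parent section
        }
        return

    path = path + (node.text,)
    first = True
    for child in node.children:
        for record in flatten(child, path):
            if first:
                # The first leaf under a heading is a preferred chunk start.
                record["soft_break_before"] = True
                first = False
            yield record


# A tiny example document: one section containing a nested section.
doc = Element("section", "Guide", [
    Element("paragraph", "Intro."),
    Element("section", "Data", [
        Element("table", "a | b"),
        Element("paragraph", "After the table."),
    ]),
])

stream = list(flatten(doc))
for record in stream:
    print(json.dumps(record))
```

The chunker can then treat `atomic` as a constraint and `soft_break_before` as a preference, without knowing how the source file was parsed. A single concatenated string would carry neither signal, which is the problem with the "one big string" handoff named in the outline.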