
11. Metadata preservation — the boring part that wins production

~10 min read. The difference between "I found an answer" and "I found an answer and can cite it."


[Stub — to be written]

Outline:

  • Source attribution: document name, page, section, paragraph
  • The page number paradox — PDF pages, logical pages, printed pages
  • Heading hierarchy: preserving H1/H2/H3 so chunks know their context
  • Authorship and document date — for recency-weighted retrieval and trust signals
  • Version metadata — same document, different revisions
  • Access-control metadata — which users can see which chunks (cross-ref module 27)
  • Language tags for multi-lingual corpora
  • Quality scores from ingestion — "this chunk was OCR'd at 78% confidence"
  • The metadata schema as a contract between ingestion and retrieval
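
One way to picture the last outline item, the schema as a contract, is a typed record that ingestion must fill in and retrieval can rely on. The sketch below is a minimal Python illustration under stated assumptions, not this module's final schema: `ChunkMetadata`, its field names, and the `citation` and `visible_to` helpers are all hypothetical, chosen only to mirror the outline items above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical per-chunk metadata record; all field names are illustrative."""

    # Source attribution
    document_name: str
    pdf_page: int                      # 1-based position in the PDF file
    printed_page: Optional[str]        # label printed on the page, e.g. "iv" or "12"
    heading_path: tuple[str, ...]      # H1 > H2 > H3 headings above the chunk
    paragraph: int
    # Trust and recency signals
    author: Optional[str] = None
    document_date: Optional[date] = None
    version: Optional[str] = None
    # Access control, language, ingestion quality
    allowed_groups: frozenset[str] = frozenset()
    language: str = "en"
    ocr_confidence: Optional[float] = None  # 0.0 to 1.0; None if not OCR'd

    def __post_init__(self) -> None:
        # Reject records that break the contract at ingestion time,
        # instead of letting retrieval discover them later.
        if self.pdf_page < 1:
            raise ValueError("pdf_page must be 1 or greater")
        if self.ocr_confidence is not None and not 0.0 <= self.ocr_confidence <= 1.0:
            raise ValueError("ocr_confidence must be between 0.0 and 1.0")

    def citation(self) -> str:
        """Build a human-readable citation, preferring the printed page label."""
        parts = [self.document_name, f"p. {self.printed_page or self.pdf_page}"]
        if self.heading_path:
            parts.append(" > ".join(self.heading_path))
        parts.append(f"para. {self.paragraph}")
        return ", ".join(parts)

    def visible_to(self, user_groups: set[str]) -> bool:
        """True if the user shares at least one group with the chunk."""
        return bool(self.allowed_groups & user_groups)


meta = ChunkMetadata(
    document_name="handbook.pdf",
    pdf_page=14,
    printed_page="2",
    heading_path=("Safety", "Lockout"),
    paragraph=3,
    allowed_groups=frozenset({"ops"}),
    ocr_confidence=0.78,
)
print(meta.citation())
```

Storing both `pdf_page` and `printed_page` is one possible answer to the page number paradox: the PDF index stays stable for tooling, while the printed label is what a reader would expect to see in a citation. Validating in `__post_init__` is a design choice that pushes contract failures to ingestion, where they are cheaper to fix.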