11. Metadata preservation: the boring part that wins production
~10 min read. The difference between "I found an answer" and "I found an answer and can cite it."
[Stub — to be written]
Outline:
- Source attribution: document name, page, section, paragraph
- The page number paradox — PDF pages, logical pages, printed pages
- Heading hierarchy: preserving H1/H2/H3 so chunks know their context
- Authorship and document date — for recency-weighted retrieval and trust signals
- Version metadata — same document, different revisions
- Access-control metadata — which users can see which chunks (cross-ref module 27)
- Language tags for multi-lingual corpora
- Quality scores from ingestion — "this chunk was OCR'd at 78% confidence"
- The metadata schema as a contract between ingestion and retrieval
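
To make the last outline item concrete, here is a minimal sketch of what such a contract could look like if each chunk carried one typed metadata record. Every name in it (`ChunkMetadata`, its fields, `citation`, `visible_to`, and the sample values) is hypothetical and chosen only for illustration; the module does not prescribe this schema, and a real pipeline would likely need more fields. It assumes Python 3.9 or later.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ChunkMetadata:
    """Hypothetical per-chunk record written by ingestion and read by retrieval."""

    # Source attribution
    document_name: str
    pdf_page: int                    # 1-based physical page position in the file
    printed_page: Optional[str]      # label printed on the page, e.g. "iv" or "12"
    heading_path: tuple[str, ...]    # H1 > H2 > H3 ancestry of the chunk
    paragraph: int                   # 1-based paragraph index within the section
    # Trust and recency signals
    author: Optional[str]
    document_date: Optional[date]
    version: str
    # Access control, language, ingestion quality
    allowed_groups: frozenset[str]   # groups permitted to see this chunk
    language: str                    # language tag such as "en"
    ocr_confidence: Optional[float]  # 0.0 to 1.0; None if the text was not OCR'd

    def __post_init__(self) -> None:
        # Reject records that break the contract at ingestion time,
        # so retrieval never has to guess about malformed values.
        if self.pdf_page < 1:
            raise ValueError("pdf_page must be >= 1")
        if self.ocr_confidence is not None and not 0.0 <= self.ocr_confidence <= 1.0:
            raise ValueError("ocr_confidence must be between 0.0 and 1.0")

    def citation(self) -> str:
        """Build a human-readable citation, preferring the printed page label."""
        if self.printed_page is not None:
            page = f"p. {self.printed_page}"
        else:
            page = f"PDF page {self.pdf_page}"
        section = " > ".join(self.heading_path)
        return f"{self.document_name} (v{self.version}), {page}, {section}, para. {self.paragraph}"

    def visible_to(self, user_groups: set[str]) -> bool:
        """Return True if the user shares at least one group with the chunk."""
        return not self.allowed_groups.isdisjoint(user_groups)


# Illustrative values only; none of this comes from a real corpus.
example = ChunkMetadata(
    document_name="handbook.pdf",
    pdf_page=14,
    printed_page="2",
    heading_path=("Leave policy", "Parental leave"),
    paragraph=3,
    author=None,
    document_date=date(2024, 1, 15),
    version="3",
    allowed_groups=frozenset({"hr", "managers"}),
    language="en",
    ocr_confidence=0.78,
)

print(example.citation())
print(example.visible_to({"hr"}))
```

With these sample values, `citation()` returns `handbook.pdf (v3), p. 2, Leave policy > Parental leave, para. 3`. Storing `pdf_page` and `printed_page` separately is one way to sidestep the page number paradox: the physical index stays stable for locating the chunk in the file, while the printed label is what a reader expects to see in a citation. Validating in `__post_init__` is a design choice that pushes errors to ingestion rather than letting them surface at query time.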