00. Document Ingestion — The Five-Year-Old Version

Before retrieval can be smart, the text must be clean. This module is about turning real-world files into clean text.

[Stub — to be written]

Imagine a library where every book arrives in a different language, some with pages glued together, some written upside down, some photographed instead of typed.

The librarian's first job is not to find books. It is to make the books readable at all.

A PDF is a printer's instruction sheet, not a document. A scanned form is a picture, not text. A slide deck is a collage. A spreadsheet is a grid. An email is a thread.

If you feed any of these directly into a chunker, you get garbage chunks. Garbage chunks make garbage embeddings. Garbage embeddings make a confidently wrong RAG system.
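
To make that chain concrete, here is a minimal sketch in Python. The page content is a hand-written, simplified imitation of a PDF content stream (real ones are usually compressed and far messier), and `naive_chunks` and `extract_shown_strings` are hypothetical helpers invented for this illustration, not part of any library. The sketch compares chunking the raw drawing instructions with chunking the text those instructions put on the page.

```python
import re

# Hand-written, simplified imitation of a PDF content stream (illustrative only).
# BT/ET open and close a text object, Tf selects a font, Td moves the cursor,
# and Tj draws the string given in parentheses.
RAW_PAGE = (
    "BT /F1 12 Tf 72 720 Td (Invoice total:) Tj ET\n"
    "BT /F1 12 Tf 300 720 Td (1,250 EUR) Tj ET\n"
)


def naive_chunks(text: str, size: int) -> list:
    """Hypothetical fixed-size chunker: split every `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def extract_shown_strings(stream: str) -> list:
    """Toy extractor: collect the strings drawn by Tj.

    Assumes plain (...) strings with no escapes or nesting, which real
    PDFs do not guarantee.
    """
    return re.findall(r"\(([^()]*)\)\s*Tj", stream)


# Chunking the raw file content: the chunks are mostly drawing operators.
raw_chunks = naive_chunks(RAW_PAGE, 40)

# Extracting first: the text a reader would actually see on the page.
clean_text = " ".join(extract_shown_strings(RAW_PAGE))

for chunk in raw_chunks:
    print(repr(chunk))
print(clean_text)
```

The raw chunks are dominated by font and positioning operators, and the label and its value end up in different chunks. The toy extractor only works because the example is so simple, which is the point: real files need far more care than one regular expression.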

This module is the librarian's apprenticeship. Before module 13 (RAG fundamentals) teaches what to do with clean text, this module teaches how to make text clean.

Module outline:

The opening failure: the PDF that broke a working RAG demo
The file format zoo: what each format actually is
PDF parsing strategies — native text vs reconstructed text
OCR for scanned documents
Layout-aware extraction (Unstructured, Marker, Nougat, Docling, LlamaParse)
Table extraction
Vision-LLMs as universal parsers
Office formats (docx, pptx, xlsx)
HTML and web content
Image and figure handling
Metadata preservation (page numbers, headings, source attribution)
Handoff to the chunker
Evaluating ingestion quality
Honest admission
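
The outline items on metadata preservation and handoff to the chunker can be pictured with a small sketch. Everything in it is an assumption made for illustration: the `ExtractedBlock` record, its field names, the `to_chunker_input` helper, and the sample values are hypothetical and are not taken from this module or from any library. It shows one possible way to keep page numbers, headings, and source attribution attached to each piece of clean text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExtractedBlock:
    """One piece of clean text plus where it came from (hypothetical shape)."""
    text: str
    source: str                    # file name or URL the text came from
    page: Optional[int] = None     # 1-based page number, if the format has pages
    headings: List[str] = field(default_factory=list)  # heading path above the text


def to_chunker_input(blocks: List[ExtractedBlock]) -> List[dict]:
    """Drop whitespace-only blocks and keep metadata next to each text."""
    return [
        {
            "text": block.text.strip(),
            "metadata": {
                "source": block.source,
                "page": block.page,
                "headings": block.headings,
            },
        }
        for block in blocks
        if block.text.strip()
    ]


# Made-up sample data, for illustration only.
blocks = [
    ExtractedBlock("  Refunds are issued within 30 days. ", "policy.pdf", 3, ["Returns"]),
    ExtractedBlock("   ", "policy.pdf", 4),  # empty page, should be dropped
]
records = to_chunker_input(blocks)
print(records)
```

Keeping the metadata in the same record as the text means a later chunk can still say which file and page it came from, whatever the chunker does to the text itself.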