
00. Document Ingestion — The Five-Year-Old Version

Before retrieval can be smart, the text must be clean. This module is about turning real-world files into clean text.


[Stub — to be written]

Imagine a library where every book arrives in a different language, some with pages glued together, some written upside down, some photographed instead of typed.

The librarian's first job is not to find books. It is to make the books readable at all.

A PDF is a printer's instruction sheet, not a document. A scanned form is a picture, not text. A slide deck is a collage. A spreadsheet is a grid. An email is a thread.
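One way to make "each format is a different kind of thing" concrete is to look at the first few bytes of a file before choosing a parser. The sketch below is illustrative only: the function names, labels, and routing descriptions are hypothetical and not part of this module. The byte signatures themselves are the standard ones for PDF, PNG, JPEG, and ZIP.

```python
# Leading "magic bytes" for a few common formats.
# The labels on the right are hypothetical names chosen for this sketch.
MAGIC_PREFIXES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "image"),   # PNG
    (b"\xff\xd8\xff", "image"),        # JPEG
    # docx, pptx, and xlsx are ZIP containers. A plain .zip matches too,
    # so this label only means "ZIP container, possibly an office file".
    (b"PK\x03\x04", "zip-container"),
]

# Hypothetical routing table: a description of what to try for each label.
ROUTES = {
    "pdf": "try native text extraction, fall back to OCR if the pages are scans",
    "image": "run OCR or a vision model",
    "zip-container": "open the container and read its XML parts",
    "unknown": "hold for manual review",
}


def sniff_format(head: bytes) -> str:
    """Return a coarse format label based on the first bytes of a file."""
    for prefix, label in MAGIC_PREFIXES:
        if head.startswith(prefix):
            return label
    return "unknown"


def plan_for(head: bytes) -> str:
    """Map the sniffed label to a parsing plan."""
    return ROUTES[sniff_format(head)]
```

In use, you would read a small header such as `open(path, "rb").read(16)` and pass it to `plan_for`. Sniffing bytes instead of trusting the file extension is a design choice: a file named `report.pdf` can still be a renamed image.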

If you feed any of these directly into a chunker, you get garbage chunks. Garbage chunks make garbage embeddings. Garbage embeddings make a confidently wrong RAG system.
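A cheap guard against this chain is to check extracted text before it reaches the chunker. The sketch below is a minimal heuristic, not a full quality metric. The function names and both thresholds (`max_garbage`, `min_chars`) are assumptions made for illustration and would need tuning on real documents.

```python
import unicodedata


def garbage_ratio(text: str) -> float:
    """Fraction of characters that look like extraction debris.

    Counts U+FFFD replacement characters plus control, private-use,
    and unassigned code points. Newlines, tabs, and carriage returns
    are treated as normal. Empty input is scored as fully garbage.
    """
    if not text:
        return 1.0
    bad = 0
    for ch in text:
        if ch in "\n\t\r":
            continue
        if ch == "\ufffd" or unicodedata.category(ch) in ("Cc", "Co", "Cn"):
            bad += 1
    return bad / len(text)


def ready_for_chunking(text: str, max_garbage: float = 0.05, min_chars: int = 20) -> bool:
    """Hypothetical gate: long enough and mostly free of debris.

    The default thresholds are placeholders, not recommended values.
    """
    stripped = text.strip()
    return len(stripped) >= min_chars and garbage_ratio(stripped) <= max_garbage
```

A page that fails this gate is a candidate for a different extraction strategy, such as OCR, and should not be chunked as-is. The check catches only obvious encoding debris. It will not notice text that is clean but in the wrong reading order.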

This module is the librarian's apprenticeship. Before module 13 (RAG fundamentals) teaches what to do with clean text, this module teaches how to make text clean.


Module outline:

  • The opening failure: the PDF that broke a working RAG demo
  • The file format zoo: what each format actually is
  • PDF parsing strategies — native text vs reconstructed text
  • OCR for scanned documents
  • Layout-aware extraction (Unstructured, Marker, Nougat, Docling, LlamaParse)
  • Table extraction
  • Vision-LLMs as universal parsers
  • Office formats (docx, pptx, xlsx)
  • HTML and web content
  • Image and figure handling
  • Metadata preservation (page numbers, headings, source attribution)
  • Handoff to the chunker
  • Evaluating ingestion quality
  • Honest admission