01. The PDF that broke a working RAG demo

~12 min read. The most common production-RAG failure has nothing to do with retrieval.

Built on the ELI5 in 00-eli5.md. Before the librarian can shelve a book, the book must actually be readable.


[Stub — to be written]

Outline:

  • The story: a RAG demo that worked beautifully on clean markdown, then collapsed when given the client's real PDFs
  • Three failure modes inside one PDF: column reading order, embedded tables, scanned appendix pages
  • Why "just use PyPDF" produces silently broken chunks
  • The hidden truth: PDF is a layout language, not a content language
  • The cost shape: ingestion failures look like retrieval failures during debugging, which sends teams chasing the wrong fix
  • What this module fixes vs what it cannot fix (e.g., genuinely bad source content)
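
The column reading order failure named in the outline can be previewed with a toy sketch. It is an illustration only, not the module's eventual method or any real PDF library's API: the `Fragment` type, the `PAGE` data, the top-down `y` convention, and the fixed `column_split_x` value are all assumptions made for the example. The idea it shows is that a PDF stores positioned runs of text, so a reader that sorts purely by position can interleave two columns into sentences that look fluent but are wrong.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A positioned run of text, roughly how a PDF page stores content."""
    x: float  # left edge of the run, in points from the left of the page
    y: float  # distance from the top of the page (assumed convention)
    text: str


# Hypothetical two-column page: left column at x=50, right column at x=320.
PAGE = [
    Fragment(50, 100, "Refunds are issued"),
    Fragment(320, 100, "Shipping takes"),
    Fragment(50, 115, "within 30 days."),
    Fragment(320, 115, "5 business days."),
]


def naive_order(fragments):
    """Read top-to-bottom, then left-to-right, ignoring columns."""
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))
    return " ".join(f.text for f in ordered)


def column_aware_order(fragments, column_split_x):
    """Read the whole left column first, then the whole right column."""
    left = [f for f in fragments if f.x < column_split_x]
    right = [f for f in fragments if f.x >= column_split_x]
    ordered = sorted(left, key=lambda f: f.y) + sorted(right, key=lambda f: f.y)
    return " ".join(f.text for f in ordered)


naive_text = naive_order(PAGE)
fixed_text = column_aware_order(PAGE, column_split_x=200)

print(naive_text)
print(fixed_text)
```

The naive ordering yields text containing "Shipping takes within 30 days.", a grammatical sentence that the source never states. A chunk built from it would embed and retrieve without raising any error, which is why this kind of ingestion failure tends to surface later as an apparent retrieval problem. Real documents rarely have a known, fixed column boundary, so the hard-coded split here stands in for whatever layout analysis an actual pipeline would need.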