01. The PDF that broke a working RAG demo
~12 min read. The most common production-RAG failure has nothing to do with retrieval.
Built on the ELI5 in 00-eli5.md. Before the librarian can shelve a book, the book must actually be readable.
[Stub — to be written]
Outline:
- The story: a RAG demo that worked beautifully on clean markdown, then collapsed when given the client's real PDFs
- Three failure modes inside one PDF: column reading order, embedded tables, scanned appendix pages
- Why "just use PyPDF" produces silently broken chunks
- The hidden truth: PDF is a layout language, not a content language
- The cost shape: ingestion failures look like retrieval failures during debugging, which sends teams chasing the wrong fix
- What this module fixes vs what it cannot fix (e.g., genuinely bad source content)
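
As a rough illustration of why extraction failures stay silent, the sketch below audits per-page text after extraction and flags pages that came back nearly empty, which is what a scanned appendix page typically looks like to a text-only extractor. It assumes the pages have already been extracted into plain strings by whatever library is in use. The names `PageReport` and `audit_pages`, the sample strings, and the 50-character threshold are all hypothetical and chosen for illustration. The check covers only the scanned-page failure mode; scrambled column order and flattened tables still produce plenty of characters and would pass it.

```python
from dataclasses import dataclass


@dataclass
class PageReport:
    page_number: int   # 1-based page index
    char_count: int    # non-whitespace characters extracted from the page
    suspicious: bool   # True when the page is probably scanned or empty


def audit_pages(page_texts: list[str], min_chars: int = 50) -> list[PageReport]:
    """Flag pages whose extracted text is too short to be real content.

    min_chars is an arbitrary illustrative threshold, not a recommended value.
    """
    reports = []
    for number, text in enumerate(page_texts, start=1):
        char_count = sum(1 for ch in text if not ch.isspace())
        reports.append(PageReport(number, char_count, char_count < min_chars))
    return reports


# Hypothetical extractor output: two text pages and one scanned appendix page.
pages = [
    "Quarterly revenue grew in every region, led by the enterprise segment. " * 3,
    "Table 2 lists headcount by department for the current fiscal year. " * 3,
    "",  # a scanned page typically yields little or no text
]

flagged = [r.page_number for r in audit_pages(pages) if r.suspicious]
print(flagged)  # [3]
```

Running a check like this at ingestion time, before chunking and embedding, is one way to surface a bad page as an ingestion problem instead of discovering it later as a retrieval miss.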