01. The PDF that broke a working RAG demo

~12 min read. The most common production-RAG failure has nothing to do with retrieval.

Builds on the ELI5 in 00-eli5.md. Before the librarian can shelve a book, the book must actually be readable.

[Stub — to be written]

Outline:

The story: a RAG demo that worked beautifully on clean markdown, then collapsed when given the client's real PDFs
Three failure modes inside one PDF: column reading order, embedded tables, scanned appendix pages
Why "just use PyPDF" produces silently broken chunks
The hidden truth: PDF is a layout language, not a content language
The cost shape: ingestion failures look like retrieval failures during debugging, which sends teams chasing the wrong fix
What this module fixes vs what it cannot fix (e.g., genuinely bad source content)
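
To make the column reading order item concrete before the full write-up exists, here is a toy sketch in plain Python. It does not call any PDF library. The `Fragment` type, the coordinates, and the `gutter_x` value are all hypothetical, chosen only to illustrate how a page stored as positioned text fragments can be read in the wrong order when an extractor goes row by row across a two-column layout.

```python
from typing import NamedTuple


class Fragment(NamedTuple):
    """A piece of text plus where it sits on the page (hypothetical model)."""
    text: str
    x: float  # left edge of the fragment
    y: float  # distance from the top of the page


# Hypothetical two-column page: left column starts at x=50, right at x=320.
FRAGMENTS = [
    Fragment("Refunds are issued", 50, 100),
    Fragment("Shipping takes", 320, 100),
    Fragment("within 30 days.", 50, 115),
    Fragment("five business days.", 320, 115),
]


def naive_order(fragments):
    """Read row by row across the full page width, ignoring columns."""
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))
    return " ".join(f.text for f in ordered)


def column_aware_order(fragments, gutter_x):
    """Read the left column top to bottom, then the right column.

    gutter_x is an assumed, already-known column boundary; detecting it
    on real documents is the hard part and is not shown here.
    """
    left = sorted((f for f in fragments if f.x < gutter_x), key=lambda f: (f.y, f.x))
    right = sorted((f for f in fragments if f.x >= gutter_x), key=lambda f: (f.y, f.x))
    return " ".join(f.text for f in left + right)


if __name__ == "__main__":
    print(naive_order(FRAGMENTS))
    print(column_aware_order(FRAGMENTS, gutter_x=300))
```

The naive ordering interleaves the two columns into one run of text that still looks like plausible English, which is why a chunk built from it can pass a casual inspection and still mislead retrieval. The column-aware ordering keeps each sentence intact, but only because the gutter position was handed to it.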