06. Table extraction: the hardest part of PDF ingestion
~12 min read. Tables carry the numbers. Numbers are usually why someone built the RAG pipeline in the first place.
[Stub — to be written]
Outline:
- Why tables break naive text extraction: a row is not a paragraph
- pdfplumber's table detection — when it works, when it doesn't
- camelot vs tabula — the two classical extractors
- Layout-model approaches: Microsoft Table Transformer, TableNet
- Vision-LLM table extraction — surprisingly strong, often the new default
- Output formats: HTML, markdown, CSV, JSON — what downstream wants
- Merged cells, multi-row headers, nested tables
- Preserving the table's surrounding context (caption, footnote)
- How tables flow into chunking: keep whole or split rows
- Numeric-fidelity check: a hallucinated digit is worse than a missing table
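
The numeric-fidelity item in the outline can be sketched in a few lines. This is a minimal illustration, not a settled design: it assumes the PDF page has a text layer to compare against (a scanned page would not), and the names `numeric_tokens` and `unsupported_numbers` are hypothetical helpers invented for this example. The idea is to collect every number in the extracted table and flag any that do not appear in the page's raw text, which is the kind of error a vision-LLM extractor can introduce.

```python
import re
from collections import Counter

# Digits with optional thousands separators and an optional decimal part.
# Signs, currency symbols, and units are ignored in this sketch.
NUMBER_RE = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def numeric_tokens(text):
    """Return the numbers found in `text`, with thousands separators removed."""
    return [m.group().replace(",", "") for m in NUMBER_RE.finditer(text)]


def unsupported_numbers(extracted_cells, page_text):
    """Return numbers in the extracted table that the page text does not support.

    `extracted_cells` is a list of rows, each row a list of cell strings.
    `page_text` is the raw text layer of the same page.
    Comparison is by multiset, so a number that appears twice in the table
    but only once on the page is also flagged.
    """
    in_source = Counter(numeric_tokens(page_text))
    in_table = Counter()
    for row in extracted_cells:
        for cell in row:
            in_table.update(numeric_tokens(cell or ""))
    # Counter subtraction keeps only positive counts: numbers the table has
    # more often than the source does.
    return sorted((in_table - in_source).elements())


page_text = "Revenue 1,234.5 1,310.0\nCosts 980 1,002"
cells = [
    ["Revenue", "1,234.5", "1,310.0"],
    ["Costs", "980", "1,092"],  # last cell has one wrong digit
]
print(unsupported_numbers(cells, page_text))
```

A non-empty result is a reason to drop the table or send it for review rather than index it, in line with the point that a hallucinated digit is worse than a missing table. The check only works in one direction: it catches numbers that were invented, not numbers that were placed in the wrong cell.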