
06. Table extraction — the hardest part of PDF ingestion

~12 min read. Tables carry the numbers. Numbers are usually why someone built the RAG pipeline in the first place.


[Stub — to be written]

Outline:

  • Why tables break naive text extraction: a row is not a paragraph
  • pdfplumber's table detection — when it works, when it doesn't
  • camelot vs tabula — the two classical extractors
  • Layout-model approaches: Microsoft Table Transformer, TableNet
  • Vision-LLM table extraction — surprisingly strong, often the new default
  • Output formats: HTML, markdown, CSV, JSON — what downstream wants
  • Merged cells, multi-row headers, nested tables
  • Preserving the table's surrounding context (caption, footnote)
  • How tables flow into chunking: keep whole or split rows
  • Numeric-fidelity check: a hallucinated digit is worse than a missing table
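The numeric-fidelity check in the last outline item can be sketched with the standard library alone. This is a minimal illustration, not a definitive implementation: it assumes the PDF has a text layer for the table region (passed in as `source_text`) and that the extractor's output is available as a flat list of cell strings (`extracted_cells`). Both names, and the helper functions, are hypothetical. Signs and parenthesized negatives are ignored here for simplicity.

```python
import re
from collections import Counter

# Matches integers, decimals, and thousands-separated numbers, with an
# optional trailing percent sign. Signs are deliberately not captured.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?%?")


def numeric_tokens(text: str) -> Counter:
    """Return a multiset of numeric tokens in `text`, with commas removed."""
    return Counter(m.replace(",", "") for m in NUMBER_RE.findall(text))


def unsupported_numbers(source_text: str, extracted_cells: list[str]) -> Counter:
    """Numbers that appear in the extracted table more often than in the source.

    A non-empty result means at least one digit string in the extraction has
    no counterpart in the PDF text layer and should be treated as suspect.
    """
    source = numeric_tokens(source_text)
    extracted = numeric_tokens(" ".join(extracted_cells))
    # Counter subtraction keeps only tokens with a positive remaining count.
    return extracted - source


# Hypothetical example: the extractor misread 1,301.0 as 1,307.0.
source_text = "Revenue 1,234.5 1,301.0 Margin 12% 14%"
extracted_cells = ["Revenue", "1,234.5", "1,307.0", "Margin", "12%", "14%"]
print(dict(unsupported_numbers(source_text, extracted_cells)))  # {'1307.0': 1}
```

A check like this only catches numbers that the source never contained; it does not detect a correct number placed in the wrong cell, so it complements rather than replaces a structural comparison. For scanned PDFs with no text layer there is nothing to compare against, and a second independent extraction would have to play the role of `source_text`.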