AI Engineering Playbook

06. Table extraction — the hardest part of PDF ingestion

~12 min read. Tables carry the numbers, and the numbers are usually why someone built the RAG pipeline in the first place.

[Stub — to be written]

Outline:

Why tables break naive text extraction: a row is not a paragraph
pdfplumber's table detection — when it works, when it doesn't
camelot vs tabula — the two classical extractors
Layout-model approaches: Microsoft Table Transformer, TableNet
Vision-LLM table extraction — surprisingly strong, often the new default
Output formats: HTML, markdown, CSV, JSON — what downstream wants
Merged cells, multi-row headers, nested tables
Preserving the table's surrounding context (caption, footnote)
How tables flow into chunking: keep whole or split rows
Numeric-fidelity check: a hallucinated digit is worse than a missing table
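
As a placeholder until the section is written, the numeric-fidelity item in the outline could be sketched roughly as follows. This is a minimal illustration, not a settled method: it assumes the PDF has a trustworthy text layer to compare against, and the names `numeric_tokens` and `unsupported_numbers` are hypothetical helpers invented for this example. The idea is that every number in an extracted table should also occur in the page's raw text, so any surplus number is a candidate hallucination.

```python
import re
from collections import Counter

# Assumption: numbers look like 1204, 1,204.5, -3.2 or 22%.
# The lookbehind skips digits glued to letters (e.g. "Q3") and keeps
# the hyphen in ranges like "2020-2021" from being read as a minus sign.
_NUMBER = re.compile(r"(?<![\w.])-?\d[\d,]*(?:\.\d+)?%?")


def numeric_tokens(text: str) -> Counter:
    """Count numeric tokens in text, ignoring thousands separators."""
    return Counter(tok.replace(",", "") for tok in _NUMBER.findall(text))


def unsupported_numbers(source_text: str, extracted_table: str) -> list[str]:
    """Return numbers that occur more often in the table than in the source.

    An empty list does not prove the table is correct (digits can land in
    the wrong cell); a non-empty list is a strong signal to reject or
    re-extract the table.
    """
    surplus = numeric_tokens(extracted_table) - numeric_tokens(source_text)
    return sorted(surplus.elements())


if __name__ == "__main__":
    page_text = "Revenue was 1,204.5 in 2023 and 987 in 2022, up 22%."
    table_md = "| Year | Revenue |\n| 2023 | 1204.5 |\n| 2022 | 981 |"
    print(unsupported_numbers(page_text, table_md))
```

The check is deliberately one-directional: it flags numbers the extractor produced without support in the source, which matches the outline's point that a hallucinated digit is worse than a missing table. Comma-separated lists such as "1,2,3" would be merged into one token by this pattern, so a real pipeline would likely need locale-aware parsing.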