05. Layout-aware extraction — the modern toolkit
~14 min read. The libraries that turn a PDF into a structured document tree, not a string.
[Stub — to be written]
Outline:
- Why layout matters: a heading is not body text, a footnote is not a citation
- Unstructured.io — the swiss-army knife of ingestion
- Marker — academic-paper-grade PDF to markdown
- Nougat (Meta) — math-aware, OCR-style for papers
- Docling (IBM) — newer entrant with strong layout models
- LlamaParse (LlamaIndex) — managed service for hard PDFs
- The DocLayNet / PubLayNet dataset lineage behind the layout models powering these
- Output shapes: markdown, JSON tree, structured elements
- Cost and latency comparison
- When each tool is the right choice
- The "good enough" decision: 80% recall fast vs 95% recall slow
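
To make the "document tree, not a string" idea concrete before the sections above are written, here is a minimal sketch in plain Python. It assumes a layout model has already labelled each block; the `Element` and `Section` classes, the label names (`heading`, `paragraph`, `footnote`), and the `build_tree` helper are hypothetical and are not the API of any tool listed in the outline. The same tree is then emitted in two of the output shapes mentioned above: markdown and a JSON tree.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Element:
    kind: str        # hypothetical labels: "heading", "paragraph", "footnote"
    text: str
    level: int = 0   # heading depth (1 = top level); 0 for non-headings


@dataclass
class Section:
    title: str
    level: int
    body: list[str] = field(default_factory=list)
    footnotes: list[str] = field(default_factory=list)
    children: list["Section"] = field(default_factory=list)


def build_tree(elements: list[Element]) -> Section:
    """Nest a flat, labelled element stream under its headings."""
    root = Section(title="<root>", level=0)
    stack = [root]
    for el in elements:
        if el.kind == "heading":
            # Pop back to the nearest section that is shallower than this heading.
            while len(stack) > 1 and stack[-1].level >= el.level:
                stack.pop()
            node = Section(title=el.text, level=el.level)
            stack[-1].children.append(node)
            stack.append(node)
        elif el.kind == "footnote":
            # Footnotes stay attached to their section but out of the body text.
            stack[-1].footnotes.append(el.text)
        else:
            stack[-1].body.append(el.text)
    return root


def to_markdown(section: Section) -> str:
    """Render headings and body text; footnotes are deliberately left out."""
    lines: list[str] = []

    def walk(node: Section) -> None:
        if node.level > 0:
            lines.append("#" * node.level + " " + node.title)
        lines.extend(node.body)
        for child in node.children:
            walk(child)

    walk(section)
    return "\n\n".join(lines)


def to_json(section: Section) -> str:
    """Serialise the full tree, including footnotes, as a JSON string."""
    return json.dumps(asdict(section), indent=2)
```

With this shape, a downstream chunker can index `body` text while keeping `footnotes` out of the main flow, which is the distinction the first outline bullet points at. Real tools expose richer element types (tables, captions, list items) and their own schemas.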