AI Engineering Playbook

05. Layout-aware extraction — the modern toolkit

~14 min read. The libraries that turn a PDF into a structured document tree, not a string.

[Stub — to be written]

Outline:

Why layout matters: a heading is not body text, a footnote is not a citation
Unstructured.io — the swiss-army knife of ingestion
Marker — academic-paper-grade PDF to markdown
Nougat (Meta) — math-aware, OCR-style for papers
Docling (IBM) — newer entrant with strong layout models
LlamaParse (LlamaIndex) — managed service for hard PDFs
The DocLayNet / PubLayNet model lineage powering these
Output shapes: markdown, JSON tree, structured elements
Cost and latency comparison
When each tool is the right choice
The "good enough" decision: 80% recall fast vs 95% recall slow
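
To make the "document tree, not a string" idea and the output-shapes item concrete, here is a minimal sketch in plain Python. It does not use any of the libraries listed above; the `Element` class, its `kind` labels, and the `build_tree` and `to_markdown` helpers are hypothetical names chosen for illustration, and real tools define their own element types. It assumes the extractor has already labelled each block with a layout role, and shows how the same labelled elements can be nested under headings and then emitted as either markdown or a JSON tree.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Element:
    """One layout-labelled block (hypothetical schema, for illustration only)."""
    kind: str                      # "document", "heading", "paragraph", "footnote"
    text: str
    level: int = 0                 # heading depth; 0 for non-headings
    children: list["Element"] = field(default_factory=list)


def build_tree(flat: list[Element]) -> Element:
    """Nest a flat, reading-order list of elements under their headings."""
    root = Element(kind="document", text="")
    stack = [root]                 # open headings, from outermost to innermost
    for el in flat:
        if el.kind == "heading":
            # Close every open heading at the same or a deeper level.
            while len(stack) > 1 and stack[-1].level >= el.level:
                stack.pop()
            stack[-1].children.append(el)
            stack.append(el)
        else:
            # Body text and footnotes attach to the innermost open heading.
            stack[-1].children.append(el)
    return root


def to_markdown(node: Element) -> list[str]:
    """Render the tree depth-first as a list of markdown lines."""
    lines = []
    if node.kind == "heading":
        lines.append("#" * node.level + " " + node.text)
    elif node.kind == "footnote":
        lines.append("[footnote] " + node.text)
    elif node.kind == "paragraph":
        lines.append(node.text)
    for child in node.children:
        lines.extend(to_markdown(child))
    return lines


# Flat list of structured elements, as a layout-aware extractor might emit them.
flat = [
    Element("heading", "Intro", level=1),
    Element("paragraph", "Body text."),
    Element("heading", "Details", level=2),
    Element("paragraph", "More text."),
    Element("footnote", "See appendix."),
    Element("heading", "Next", level=1),
]

tree = build_tree(flat)
markdown = "\n".join(to_markdown(tree))       # markdown output shape
json_tree = json.dumps(asdict(tree), indent=2)  # JSON tree output shape
```

The three shapes in the outline map onto `flat` (structured elements), `json_tree` (JSON tree), and `markdown` (markdown). Keeping the footnote as its own `kind` is what lets a downstream step treat it differently from body text, which a plain string would not allow.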