05. Layout-aware extraction — the modern toolkit

~14 min read. The libraries that turn a PDF into a structured document tree, not a string.


[Stub — to be written]

Outline:

  • Why layout matters: a heading is not body text, a footnote is not a citation
  • Unstructured.io — the Swiss Army knife of ingestion
  • Marker — academic-paper-grade PDF to markdown
  • Nougat (Meta) — math-aware, OCR-style for papers
  • Docling (IBM) — newer entrant with strong layout models
  • LlamaParse (LlamaIndex) — managed service for hard PDFs
  • The DocLayNet / PubLayNet model lineage powering these
  • Output shapes: markdown, JSON tree, structured elements
  • Cost and latency comparison
  • When each tool is the right choice
  • The "good enough" decision: 80% recall fast vs 95% recall slow
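
To make the "output shapes" item concrete, here is a minimal sketch of how the three shapes relate: a flat list of structured elements, a JSON tree derived from heading levels, and a markdown rendering. The `Element` and `Node` classes, the `kind` labels, and the footnote rendering are all hypothetical and chosen for illustration; they are not the schema of any tool listed above.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class Element:
    """One layout element. The labels used here are hypothetical."""
    kind: str       # "heading", "paragraph", or "footnote"
    text: str
    level: int = 0  # heading depth (1 = top level); 0 for non-headings


@dataclass
class Node:
    """Tree node wrapping an element and its nested children."""
    element: Element
    children: list = field(default_factory=list)


def build_tree(elements):
    """Nest a flat element list under its headings, by heading level."""
    root = Node(Element("root", ""))
    stack = [root]  # path from the root to the currently open heading
    for el in elements:
        node = Node(el)
        if el.kind == "heading":
            # Close any open headings at the same or a deeper level.
            while len(stack) > 1 and stack[-1].element.level >= el.level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Body text and footnotes attach to the innermost open heading.
            stack[-1].children.append(node)
    return root


def to_markdown(elements):
    """Render the flat element list as markdown, one block per element."""
    blocks = []
    for el in elements:
        if el.kind == "heading":
            blocks.append("#" * el.level + " " + el.text)
        elif el.kind == "footnote":
            blocks.append("> Footnote: " + el.text)
        else:
            blocks.append(el.text)
    return "\n\n".join(blocks)


# Shape 1: structured elements (a flat, typed list).
sample = [
    Element("heading", "Intro", 1),
    Element("paragraph", "Body text."),
    Element("heading", "Details", 2),
    Element("footnote", "See appendix."),
    Element("heading", "Next", 1),
]

# Shape 2: a JSON tree built from the same elements.
tree = build_tree(sample)
tree_json = json.dumps(asdict(tree), indent=2)

# Shape 3: markdown.
markdown = to_markdown(sample)
```

The flat list keeps the type of each element, so a heading stays distinguishable from body text and a footnote from a paragraph. The tree adds containment, and markdown is the most lossy of the three because the footnote becomes ordinary quoted text.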