03. PDF parsing strategies — native text vs reconstructed text

~15 min read. The biggest production file type. The hardest to parse well.


[Stub — to be written]

Outline:

  • The two PDF families: native (text layer present) vs scanned (image-only)
  • Detecting which family a PDF belongs to — and why the detection itself can be unreliable
  • PyPDF / pdfplumber / pdfminer.six — what each is best at
  • pymupdf (Fitz) — the speed/quality default for native PDFs
  • Reading order: why text comes out scrambled and how to fix it
  • Multi-column documents: detection heuristics
  • Headers, footers, page numbers — stripping vs preserving
  • Hyperlinks and references inside the document
  • When to give up on native parsing and escalate to layout-aware (chapter 05) or vision (chapter 07)
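
A minimal sketch of the detection step in the outline, to show why the native-vs-scanned call can go wrong. It assumes you have already measured, per page, how many characters text extraction returned and what fraction of the page is covered by images; how you collect those numbers depends on the library you use. The `PageStats` type, both thresholds, and the label names are hypothetical and only illustrate the idea. The case worth noticing is a scan that carries an invisible OCR text layer: a text-only check would call it native.

```python
from dataclasses import dataclass


@dataclass
class PageStats:
    """Per-page measurements (hypothetical type, filled in by your PDF library)."""
    char_count: int        # characters returned by text extraction
    image_coverage: float  # fraction of the page area covered by images, 0.0 to 1.0


# Assumed thresholds for illustration only; tune them on your own corpus.
MIN_CHARS = 50
FULL_PAGE_IMAGE = 0.85


def classify_page(page: PageStats) -> str:
    """Label one page as native, scanned, scanned_with_ocr_layer, or uncertain."""
    has_text = page.char_count >= MIN_CHARS
    full_image = page.image_coverage >= FULL_PAGE_IMAGE
    if has_text and full_image:
        # Text sits on top of a full-page image: likely a scan with OCR text.
        return "scanned_with_ocr_layer"
    if has_text:
        return "native"
    if full_image:
        return "scanned"
    # Blank page, vector-only drawing, or too little evidence either way.
    return "uncertain"


def classify_document(pages: list[PageStats]) -> str:
    """Combine page labels; disagreement between pages yields 'mixed'."""
    labels = {classify_page(p) for p in pages}
    labels.discard("uncertain")
    if not labels:
        return "uncertain"
    if len(labels) == 1:
        return labels.pop()
    return "mixed"
```

Classifying per page rather than per file is a deliberate choice here: a single document can combine native pages with scanned inserts, and a "mixed" result lets each page be routed separately.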