03. PDF parsing strategies: native text vs reconstructed text
~15 min read. PDF is the most common file type in production, and the hardest to parse well.
[Stub — to be written]
Outline:
- The two PDF families: native (text layer present) vs scanned (image-only)
- Detecting which family a PDF belongs to — and why the detection itself can be unreliable
- pypdf / pdfplumber / pdfminer.six: what each is best at
- PyMuPDF (fitz): the speed/quality default for native PDFs
- Reading order: why text comes out scrambled and how to fix it
- Multi-column documents: detection heuristics
- Headers, footers, page numbers — stripping vs preserving
- Hyperlinks and references inside the document
- When to give up on native parsing and escalate to layout-aware (chapter 05) or vision (chapter 07)
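
For the detection item in the outline, one common heuristic is to count the characters a text extractor returns for each page and treat a page as image-only when the count is below a threshold. The sketch below is an illustration under assumptions, not this chapter's final method: `classify_pdf`, `min_chars`, and `min_native_ratio` are hypothetical names, and the default thresholds are arbitrary. It takes page texts that have already been extracted, so it does not depend on any particular library.

```python
from typing import List


def classify_pdf(page_texts: List[str],
                 min_chars: int = 50,
                 min_native_ratio: float = 0.8) -> str:
    """Classify a PDF as 'native', 'scanned', or 'mixed'.

    page_texts holds one extracted-text string per page.
    Thresholds are illustrative assumptions, not tuned values.
    """
    if not page_texts:
        # No pages with text at all: treat as scanned.
        return "scanned"

    # Count pages with enough non-whitespace characters to look like real text.
    native_pages = sum(
        1 for text in page_texts
        if len("".join(text.split())) >= min_chars
    )
    ratio = native_pages / len(page_texts)

    if ratio >= min_native_ratio:
        return "native"
    if ratio == 0:
        return "scanned"
    return "mixed"
```

The heuristic also shows why detection can be unreliable: a scanned PDF that carries a poor OCR text layer passes as native, and a native page that contains only a figure and a page number looks scanned.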
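
For the reading-order and multi-column items, a simple heuristic is to look for a vertical gutter (a horizontal band near the middle of the page that no word box crosses) and, if one exists, read the left column top to bottom before the right one. The sketch below assumes words arrive as `(x0, y0, x1, y1, text)` tuples with the origin at the top-left, which is roughly the shape of word-level output from extractors such as PyMuPDF or pdfplumber after light adaptation. `find_gutter`, `reading_order`, and all thresholds are hypothetical.

```python
from typing import List, Optional, Tuple

# x0, y0, x1, y1, text; origin at the top-left of the page (assumed).
Word = Tuple[float, float, float, float, str]


def find_gutter(words: List[Word], page_width: float,
                min_gutter: float = 15.0) -> Optional[float]:
    """Return the x position of the widest central gap, or None."""
    spans = sorted((w[0], w[2]) for w in words)
    lo, hi = 0.2 * page_width, 0.8 * page_width  # only accept central gaps
    best = None   # (gap_width, gap_center)
    reach = None  # rightmost x covered by the spans seen so far
    for x0, x1 in spans:
        if reach is not None and x0 > reach:
            width, center = x0 - reach, (reach + x0) / 2
            if width >= min_gutter and lo <= center <= hi:
                if best is None or width > best[0]:
                    best = (width, center)
        reach = x1 if reach is None else max(reach, x1)
    return best[1] if best else None


def reading_order(words: List[Word], page_width: float,
                  line_tol: float = 3.0) -> List[str]:
    """Order words column by column, then by line, then left to right."""
    gutter = find_gutter(words, page_width)
    if gutter is None:
        columns = [list(words)]
    else:
        columns = [
            [w for w in words if (w[0] + w[2]) / 2 < gutter],
            [w for w in words if (w[0] + w[2]) / 2 >= gutter],
        ]
    ordered: List[str] = []
    for column in columns:
        # Bucket y into lines so small baseline jitter does not reorder words.
        column.sort(key=lambda w: (round(w[1] / line_tol), w[0]))
        ordered.extend(w[4] for w in column)
    return ordered
```

A full-width title or table that spans both columns covers the gutter and defeats this check, which is one of the cases where escalating to a layout-aware parser is the safer choice.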
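
For the headers, footers, and page numbers item, one way to strip them is to find lines that recur at the top or bottom of most pages, with digits normalized so that "Page 3" and "Page 4" count as the same line. The sketch below is a minimal version of that idea. `strip_repeated_edges`, `edge_lines`, and `min_share` are hypothetical names with arbitrary defaults, and the input is assumed to be a list of pages, each a list of text lines in reading order.

```python
import re
from collections import Counter
from typing import List


def _normalize(line: str) -> str:
    # Collapse digit runs so page numbers compare equal across pages.
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_edges(pages: List[List[str]],
                         edge_lines: int = 2,
                         min_share: float = 0.6) -> List[List[str]]:
    """Drop lines that repeat in the first/last edge_lines of most pages.

    edge_lines must be at least 1.
    """
    counts: Counter = Counter()
    for lines in pages:
        edge = lines[:edge_lines] + lines[-edge_lines:]
        # Use a set so a line is counted at most once per page.
        counts.update({_normalize(l) for l in edge if l.strip()})

    # Require at least two occurrences so single-page files are left alone.
    threshold = max(2, min_share * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}

    cleaned: List[List[str]] = []
    for lines in pages:
        last = len(lines) - edge_lines
        cleaned.append([
            line for i, line in enumerate(lines)
            if not ((i < edge_lines or i >= last)
                    and _normalize(line) in repeated)
        ])
    return cleaned
```

If the headers need to be preserved rather than discarded (for example, a running section title that helps with chunk metadata), the same `repeated` set can be used to tag lines instead of removing them.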