03. PDF parsing strategies: native text vs reconstructed text

~15 min read. The biggest production file type. The hardest to parse well.

[Stub — to be written]

Outline:

The two PDF families: native (text layer present) vs scanned (image-only)
Detecting which family a PDF belongs to — and why the detection itself can be unreliable
pypdf / pdfplumber / pdfminer.six: what each is best at
PyMuPDF (fitz): the speed/quality default for native PDFs
Reading order: why text comes out scrambled and how to fix it
Multi-column documents: detection heuristics
Headers, footers, page numbers — stripping vs preserving
Hyperlinks and references inside the document
When to give up on native parsing and escalate to layout-aware (chapter 05) or vision (chapter 07)
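
Until the chapter is written, here is a rough sketch of what the reading-order and multi-column items in the outline refer to. It assumes text blocks have already been extracted as bounding boxes with y growing downward from the top of the page (libraries differ on this, so check the one you use). The `Block` type, `find_gutter`, `reading_order`, and the `min_gap` threshold are all hypothetical names and values chosen for illustration, not part of any library.

```python
from typing import List, NamedTuple, Optional


class Block(NamedTuple):
    """A text block and its bounding box (assumed origin: top-left, y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def find_gutter(blocks: List[Block], page_width: float,
                min_gap: float = 10.0) -> Optional[float]:
    """Return the x position of an empty vertical strip near the page centre, or None.

    Heuristic only: a strip of width `min_gap` counts as a gutter when no block
    overlaps it and there is at least one block on each side.
    """
    half = min_gap / 2
    x = page_width / 3                     # scan the middle third of the page
    while x <= 2 * page_width / 3:
        overlaps = any(b.x0 < x + half and b.x1 > x - half for b in blocks)
        has_left = any(b.x1 <= x - half for b in blocks)
        has_right = any(b.x0 >= x + half for b in blocks)
        if not overlaps and has_left and has_right:
            return x
        x += 1.0
    return None


def reading_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Sort blocks top-to-bottom, reading the left column before the right one."""
    def top_down(b: Block):
        return (b.y0, b.x0)

    gutter = find_gutter(blocks, page_width)
    if gutter is None:
        # Single column (or no clear gutter): plain top-to-bottom order.
        return sorted(blocks, key=top_down)
    left = [b for b in blocks if b.x1 <= gutter]
    right = [b for b in blocks if b.x0 >= gutter]
    return sorted(left, key=top_down) + sorted(right, key=top_down)


# Blocks arrive in an arbitrary order, as they often do from a PDF content stream.
scrambled = [
    Block(320, 100, 550, 180, "R1"),
    Block(50, 200, 280, 280, "L2"),
    Block(50, 100, 280, 180, "L1"),
    Block(320, 200, 550, 280, "R2"),
]
ordered = [b.text for b in reading_order(scrambled, page_width=600)]
print(ordered)
```

A full-width title or figure that crosses the gutter defeats this heuristic, because the strip is no longer empty and the page is then treated as single-column. Pages like that are one case for the escalation path named in the last outline item.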