Evidence Data Pipelines

The chapters in this module, in reading order.

#	Chapter
00	Data and Document Engineering for AI — The Five-Year-Old Version
01	Raw documents break AI systems — when the source file is not really text
02	Format boundaries and source contracts — when every file type lies differently
03	PDF, OCR, and layout parsing — when text order is reconstructed, not read
04	Tables, figures, and multimodal extraction — when the answer lives outside paragraphs
05	Metadata, lineage, and access controls — when clean text loses who, where, and whether it is allowed
06	Chunking handoff to retrieval — when ingestion owes retrieval more than text
07	Labeling and annotation operations — when messy reality becomes training rows
08	Label quality and consensus — when three humans disagree for good reasons
09	Data versioning and dataset releases — when experiments cannot be replayed
10	Pipelines, validation, and schemas — when silent data drift reaches training or retrieval
11	Synthetic data generation — when generated rows help coverage but can poison reality
12	Synthetic data quality and contamination — when fake data looks useful but teaches shortcuts
13	Active learning and data flywheels — when production misses should buy the next best label
14	Privacy, compliance, and governance — when useful data can violate trust
15	Data debugging and slice forensics — when overall metrics hide the broken patch
16	Honest admission — what data engineering still cannot guarantee