Evidence Data Pipelines

The chapters in this module, in reading order.

# Chapter
00 Data and Document Engineering for AI — The Five-Year-Old Version
01 Raw documents break AI systems — when the source file is not really text
02 Format boundaries and source contracts — when every file type lies differently
03 PDF, OCR, and layout parsing — when text order is reconstructed, not read
04 Tables, figures, and multimodal extraction — when the answer lives outside paragraphs
05 Metadata, lineage, and access controls — when clean text loses who, where, and whether it is allowed
06 Chunking handoff to retrieval — when ingestion owes retrieval more than text
07 Labeling and annotation operations — when messy reality becomes training rows
08 Label quality and consensus — when three humans disagree for good reasons
09 Data versioning and dataset releases — when experiments cannot be replayed
10 Pipelines, validation, and schemas — when silent data drift reaches training or retrieval
11 Synthetic data generation — when generated rows help coverage but can poison reality
12 Synthetic data quality and contamination — when fake data looks useful but teaches shortcuts
13 Active learning and data flywheels — when production misses should buy the next best label
14 Privacy, compliance, and governance — when useful data can violate trust
15 Data debugging and slice forensics — when overall metrics hide the broken patch
16 Honest admission — what data engineering still cannot guarantee