Evidence Data Pipelines
The chapters in this module, in reading order.
| # | Chapter |
| --- | --- |
| 00 | Data and Document Engineering for AI — The Five-Year-Old Version |
| 01 | Raw documents break AI systems — when the source file is not really text |
| 02 | Format boundaries and source contracts — when every file type lies differently |
| 03 | PDF, OCR, and layout parsing — when text order is reconstructed, not read |
| 04 | Tables, figures, and multimodal extraction — when the answer lives outside paragraphs |
| 05 | Metadata, lineage, and access controls — when clean text loses who, where, and whether it is allowed |
| 06 | Chunking handoff to retrieval — when ingestion owes retrieval more than text |
| 07 | Labeling and annotation operations — when messy reality becomes training rows |
| 08 | Label quality and consensus — when three humans disagree for good reasons |
| 09 | Data versioning and dataset releases — when experiments cannot be replayed |
| 10 | Pipelines, validation, and schemas — when silent data drift reaches training or retrieval |
| 11 | Synthetic data generation — when generated rows help coverage but can poison reality |
| 12 | Synthetic data quality and contamination — when fake data looks useful but teaches shortcuts |
| 13 | Active learning and data flywheels — when production misses should buy the next best label |
| 14 | Privacy, compliance, and governance — when useful data can violate trust |
| 15 | Data debugging and slice forensics — when overall metrics hide the broken patch |
| 16 | Honest admission — what data engineering still cannot guarantee |