01. Raw documents break AI systems: when the source file is not really text
~14 min read. Raw customer files contain layout, scans, tables, figures, permissions, and stale copies that a model cannot repair later. This page uses a source-room trace showing file type, text layer, layout blocks, tables, metadata, permissions, and downstream chunk quality to show the failure before it reaches the model.
Built on 00-eli5.md. The source room is the recurring object here: the part of the evidence supply chain that makes this failure observable.
1) The wall: when the source file is not really text
The running system looks normal even when raw customer files carry layout, scans, tables, figures, permissions, and stale copies that a model cannot repair later. The model may still produce a fluent answer, and the retriever may still return something that looks related, but the evidence substrate is already compromised.
The artifact to inspect is a source-room trace showing file type, text layer, layout blocks, tables, metadata, permissions, and downstream chunk quality. If that artifact does not exist, the team is debugging from the final answer backward, which is the slowest possible path.
The tempting repair is to feed every file to the same parser and debug retrieval later. That can work for demo data. It fails on production inputs because the source, label, metadata, or validation boundary has already lost information.
Concrete case: A vendor PDF has two columns, a scanned appendix, and a pricing table. The parser emits sentences in the wrong order, skips the appendix, and flattens the table into prose. Retrieval looks bad, but ingestion broke first.
Root cause: source truth must be made machine-readable before retrieval can be trusted.
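One way to surface the concrete case at ingestion time is to triage each page before any parser output is trusted. The sketch below is illustrative only: `PageProbe`, its fields, and the `min_chars` threshold are hypothetical names, and the probe values are assumed to come from whatever PDF tooling the team already runs.

```python
from dataclasses import dataclass


@dataclass
class PageProbe:
    """Hypothetical per-page facts gathered before parsing."""
    page: int
    text_chars: int     # characters found in the embedded text layer
    image_count: int    # raster images on the page
    column_count: int   # columns detected by a layout pass
    has_table: bool


def triage_page(probe: PageProbe, min_chars: int = 50) -> list[str]:
    """Return the ingestion risks for one page (empty list means plain text)."""
    risks = []
    if probe.text_chars < min_chars and probe.image_count > 0:
        risks.append("scanned_needs_ocr")
    if probe.column_count > 1:
        risks.append("multi_column_reading_order")
    if probe.has_table:
        risks.append("table_needs_structured_extraction")
    return risks


# The vendor PDF from the concrete case, reduced to three illustrative pages.
vendor_pdf = [
    PageProbe(page=1, text_chars=2400, image_count=0, column_count=2, has_table=False),
    PageProbe(page=2, text_chars=900, image_count=0, column_count=1, has_table=True),
    PageProbe(page=3, text_chars=0, image_count=1, column_count=1, has_table=False),
]
report = {p.page: triage_page(p) for p in vendor_pdf}
```

Here each of the three pages carries a different risk, which is the argument against sending the whole file through one parser.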
Mini-FAQ. "Why not fix this in the prompt?" Because the prompt sees the evidence after this failure has already happened. Prompts can ask for caution; they cannot recover source structure, lineage, labels, or permissions that were never preserved.
2) The core visual: source files become evidence only after extraction
The useful mental model is a supply chain, not a folder of files.
raw source
│
▼
parse / label / validate
│
▼
evidence artifact
│
▼
training, retrieval, eval, or serving
For this chapter, the load-bearing artifact is a source-room trace showing file type, text layer, layout blocks, tables, metadata, permissions, and downstream chunk quality. It is the object a senior engineer asks for during a review because it proves whether the data path is trustworthy.
Without it, the data path becomes a belief. With it, the system has something to test, diff, release, and roll back.
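To make the artifact concrete, the trace can be sketched as a small record type. This is an assumed shape, not a fixed schema: the class name `SourceRoomTrace`, every field name, and the example path are hypothetical and chosen only to mirror the items listed above.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class SourceRoomTrace:
    """One record per ingested file; every field name here is illustrative."""
    source_uri: str
    file_type: str
    has_text_layer: bool
    layout_blocks: int
    tables_found: int
    metadata: dict = field(default_factory=dict)
    allowed_groups: list = field(default_factory=list)  # permissions
    chunk_quality: float = 0.0  # hypothetical 0..1 score from a downstream check

    def missing_fields(self) -> list:
        """Names of fields left empty, which a reviewer would flag."""
        missing = []
        if not self.metadata:
            missing.append("metadata")
        if not self.allowed_groups:
            missing.append("allowed_groups")
        return missing


trace = SourceRoomTrace(
    source_uri="files/vendor-contract.pdf",  # illustrative path
    file_type="pdf",
    has_text_layer=True,
    layout_blocks=42,
    tables_found=1,
    metadata={"owner": "procurement", "modified": "2024-01-15"},
)

# Serialising the trace is what makes it diffable between data releases.
serialized = json.dumps(asdict(trace), sort_keys=True)
print(trace.missing_fields())  # -> ['allowed_groups']
```

A record like this can be stored next to the extracted text, so a review can ask what was known about the file at ingestion time instead of inferring it from the chunks.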
3) How the mechanism works in the pipeline
First, identify the boundary where the raw world becomes machine evidence. In this chapter that boundary is the source room.
Second, preserve the intermediate state. Do not only store the final text, row, label, or chunk. Store the parser, version, source pointer, quality status, and decision that produced it.
Third, attach a quality signal before downstream systems consume the result. A retriever, trainer, or eval harness should not have to guess whether the data is safe.
Fourth, make the failure replayable. When a bad answer appears in production, the team should be able to walk backward from answer to chunk to source to parser or label decision.
That replay path is the difference between data engineering and data hope.
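The replay path can be sketched with three lookups. Everything here is a stand-in: the dictionaries play the role of whatever stores a real pipeline uses, and the identifiers, parser name, and version string are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """Intermediate state preserved at the source-room boundary."""
    artifact_id: str
    source_uri: str
    parser: str
    parser_version: str
    quality_status: str  # e.g. "ok", "suspect", "failed"


# In-memory stand-ins for the stores a real pipeline would query.
ARTIFACTS = {
    "art-7": Artifact("art-7", "files/vendor.pdf", "pdf_text", "1.4.2", "suspect"),
}
CHUNKS = {"chunk-31": "art-7"}        # chunk id -> artifact id
ANSWERS = {"ans-900": ["chunk-31"]}   # answer id -> cited chunk ids


def replay(answer_id: str) -> list[dict]:
    """Walk backward: answer -> chunk -> artifact -> source and parser."""
    steps = []
    for chunk_id in ANSWERS[answer_id]:
        artifact = ARTIFACTS[CHUNKS[chunk_id]]
        steps.append({
            "chunk_id": chunk_id,
            "source_uri": artifact.source_uri,
            "parser": artifact.parser,
            "parser_version": artifact.parser_version,
            "quality_status": artifact.quality_status,
        })
    return steps


trail = replay("ans-900")
```

If any of the three lookups cannot be answered for a production answer, that gap is the first thing to fix, because no later debugging step can work around it.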
4) The worked example: inspect a source-room trace
| Step | What the team records | What can break |
|---|---|---|
| Source | original file, row, owner, timestamp | wrong source, stale source, missing permission |
| Transform | parser, label spec, validation rule, version | silent extraction or labeling change |
| Artifact | a source-room trace showing file type, text layer, layout blocks, tables, metadata, permissions, and downstream chunk quality | missing lineage or quality signal |
| Consumer | retrieval, training, eval, serving | downstream system trusts bad evidence |
Now run the example.
The naive path says: feed every file to the same parser and debug retrieval later.
The inspected path asks for a source-room trace showing file type, text layer, layout blocks, tables, metadata, permissions, and downstream chunk quality, checks whether bad-answer tickets trace back to parser artifacts instead of retriever settings, and blocks or routes the data when the signal fails.
That is why the first metric is ingestion-defect rate. It measures the data failure before it is disguised as a model failure.
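The "blocks or routes" step of the inspected path might look like the gate below. It is a sketch under stated assumptions: the trace is a plain dict, the key names and route labels are hypothetical, and the `min_chunk_quality` default is a placeholder that would need calibrating on real traffic.

```python
def gate(trace: dict, min_chunk_quality: float = 0.8) -> str:
    """Decide what happens to one file based on its source-room trace."""
    if not trace.get("permissions"):
        return "block"                    # never index a file with no access rule
    if not trace.get("has_text_layer", False):
        return "route:ocr"                # scanned content needs OCR first
    if trace.get("tables", 0) > 0 and not trace.get("tables_structured", False):
        return "route:table_extractor"    # tables were flattened into prose
    if trace.get("chunk_quality", 0.0) < min_chunk_quality:
        return "route:human_review"
    return "pass"


# Pricing table detected but not extracted as structured rows.
pricing_pdf = {
    "permissions": ["finance"],
    "has_text_layer": True,
    "tables": 1,
    "tables_structured": False,
    "chunk_quality": 0.9,
}
decision = gate(pricing_pdf)
```

The order of the checks is a design choice: permission failures block outright, while extraction problems are routed to a more suitable path instead of being dropped.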
5) Failure modes: how this data control breaks
- The artifact is not logged, so the bad answer cannot be traced back to source data.
- The data looks clean in aggregate but fails on one format, tenant, annotator cohort, or time window.
- The pipeline stores final text but drops layout, source identity, permissions, or label policy.
- A quality threshold is copied from a demo corpus and never calibrated on production traffic.
- The downstream model improves on easy examples while the target slice regresses.
- Operators debug embeddings, prompts, or model choice when the upstream evidence is already broken.
6) Production rules that hold up
Store the intermediate artifact, not just the final output.
Version the rule that created it.
Attach a quality signal close to the source of failure.
Sample real production misses, not only benchmark rows.
Design rollback around data releases, not only model releases.
If ingestion-defect rate gets worse, stop the release before the failure reaches training or retrieval.
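The last rule can be turned into a release check. The sketch assumes a definition the chapter does not spell out: ingestion-defect rate is taken here as defective documents divided by documents ingested. The function names and the `tolerance` value are hypothetical.

```python
def ingestion_defect_rate(defective: int, total: int) -> float:
    """Assumed definition: share of ingested documents flagged as defective."""
    if total <= 0:
        raise ValueError("total must be positive")
    return defective / total


def release_allowed(baseline_rate: float, candidate_rate: float,
                    tolerance: float = 0.005) -> bool:
    """Allow a data release only if the defect rate has not worsened
    beyond a small tolerance relative to the last accepted release."""
    return candidate_rate <= baseline_rate + tolerance


baseline = ingestion_defect_rate(defective=12, total=1000)    # last release
regressed = ingestion_defect_rate(defective=31, total=1000)   # new parser version
steady = ingestion_defect_rate(defective=14, total=1000)

blocked = not release_allowed(baseline, regressed)
shipped = release_allowed(baseline, steady)
```

Treating the data release as the unit that passes or fails is what makes rollback around data releases possible.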
7) Why not tune embeddings on broken chunks under this workload
Tuning embeddings on top of whatever chunks the parser produced is attractive because it keeps the pipeline smaller. It is often acceptable for low-risk, clean, internal data, where the chunks are not actually broken.
It becomes unsafe once raw customer files bring layout, scans, tables, figures, permissions, and stale copies that a model cannot repair later. At that point, the system needs an explicit artifact and quality signal, not another downstream patch.
| Option | Works when | Fails when | Cost moves to |
|---|---|---|---|
| tuning embeddings on broken chunks | inputs are clean and low-risk | raw customer files contain layout, scans, tables, figures, permissions, and stale copies that a model cannot repair later | model debugging and user trust |
| this control | the failure is visible before consumption | artifacts are unlogged or ignored | pipeline design, validation, review |
8) Production signals: know whether this data control is working
Healthy behavior: bad-answer tickets trace back to parser artifacts instead of retriever settings.
First metric to watch: ingestion-defect rate.
Misleading metric: final answer fluency or aggregate accuracy. Both can look fine while one source slice is corrupt.
Expert graph: failure rate by source type, parser version, label policy, dataset release, and consumer route.
bad answer
-> evidence artifact
-> source / transform / quality signal
-> responsible owner
-> data fix or honest escalation
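The expert graph described above is a grouped defect rate. The sketch below uses invented records to show how an aggregate figure can look healthy while one source slice is mostly corrupt; the key names and the `defect_rate_by` helper are assumptions for illustration.

```python
from collections import defaultdict


def defect_rate_by(records: list[dict], keys: list[str]) -> dict:
    """Defect rate per slice, where a slice is a tuple of the given keys."""
    totals = defaultdict(int)
    defects = defaultdict(int)
    for record in records:
        slice_key = tuple(record[k] for k in keys)
        totals[slice_key] += 1
        defects[slice_key] += int(record["defect"])
    return {k: defects[k] / totals[k] for k in totals}


# Invented ingestion log: 95 clean digital PDFs, 5 scanned PDFs of which 4 failed.
records = (
    [{"source_type": "digital_pdf", "parser_version": "1.4", "defect": False}] * 95
    + [{"source_type": "scanned_pdf", "parser_version": "1.4", "defect": True}] * 4
    + [{"source_type": "scanned_pdf", "parser_version": "1.4", "defect": False}] * 1
)

overall = sum(r["defect"] for r in records) / len(records)
by_slice = defect_rate_by(records, ["source_type", "parser_version"])
worst_slice = max(by_slice, key=by_slice.get)
```

In this invented log the overall rate is 4 in 100, while the scanned slice fails 4 times in 5, which is the pattern an aggregate dashboard hides.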
9) Boundary: where this control helps, hurts, or wastes budget
Strong fit: the failure can be observed before data reaches training, retrieval, eval, or serving.
Weak fit: the source truth is missing, contradictory, private, or controlled by another organization.
Pathology: the team adds more checks but no owner can act on failed checks.
Scale limit: every check spends latency, money, storage, and review attention. Route expensive inspection to high-risk sources and sample the rest.
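Routing expensive inspection to high-risk sources and sampling the rest can be sketched with a deterministic hash, so the same file always gets the same decision across reruns. The function name, the `high_risk` flag, and the default `sample_rate` are assumptions; how a source earns the high-risk label is left to the team.

```python
import hashlib


def should_inspect(source_id: str, high_risk: bool, sample_rate: float = 0.05) -> bool:
    """Inspect every high-risk source; sample the rest deterministically."""
    if high_risk:
        return True
    digest = hashlib.sha256(source_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


source_ids = [f"doc-{i}" for i in range(1000)]
sampled = [s for s in source_ids if should_inspect(s, high_risk=False)]
```

Hash-based sampling avoids a random seed and makes it possible to explain later why a particular file was or was not inspected.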
10) Wrong model: data quality is just cleaning text
The wrong model says data engineering is a cleanup step before the real AI work begins.
The better model says data engineering is the evidence operating system. It decides what is allowed to become training signal, retrieval evidence, eval ground truth, or production feedback.
If the source room does not create a decision, it is documentation, not engineering.
11) Failure taxonomy for this data control
- Source failure — the raw file, row, or system of record is missing or stale.
- Extraction failure — text, layout, table, image, or metadata is corrupted.
- Label failure — policy, instructions, reviewer quality, or ambiguity is unresolved.
- Validation failure — bad data passes a gate because the gate is missing or too weak.
- Governance failure — permission, consent, retention, or privacy boundaries are violated.
- Feedback failure — production errors are collected but never become prioritized data work.
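The taxonomy becomes more useful when incident tickets are tagged with it and counted. A minimal sketch, assuming a hypothetical ticket list; the enum simply mirrors the six categories above and the ticket ids are invented.

```python
from collections import Counter
from enum import Enum


class FailureKind(Enum):
    SOURCE = "source"
    EXTRACTION = "extraction"
    LABEL = "label"
    VALIDATION = "validation"
    GOVERNANCE = "governance"
    FEEDBACK = "feedback"


# Invented tickets, each tagged with one failure kind during triage.
tickets = [
    ("T-101", FailureKind.EXTRACTION),
    ("T-102", FailureKind.EXTRACTION),
    ("T-103", FailureKind.SOURCE),
    ("T-104", FailureKind.GOVERNANCE),
    ("T-105", FailureKind.EXTRACTION),
]

counts = Counter(kind for _, kind in tickets)
top_kind, top_count = counts.most_common(1)[0]
```

A tally like this points to where data work should be prioritized, which is how collected production errors avoid becoming a feedback failure themselves.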
12) Pattern transfer: same pressure, different system
- Retrieval has the same shape: the top result is only useful if the evidence path is trustworthy.
- MLOps has the same shape: releases need versioned inputs, not only versioned models.
- Incident response has the same shape: debugging starts from a symptom and walks backward through artifacts.
- Security has the same shape: boundaries must be enforced before untrusted data enters a trusted path.
13) Design review checklist
- What source or data boundary does this chapter protect?
- What artifact proves the boundary was handled correctly?
- Why is tuning embeddings on broken chunks weaker under this workload?
- Which dashboard would show ingestion-defect rate getting worse?
- Who owns the fix when the quality gate fails?
- What downstream system should be blocked, rerouted, or warned?
Where this lives in the wild
Each of the following uses this pattern when evidence must be trustworthy before model consumption:
- Enterprise RAG platforms
- Legal discovery tools
- Healthcare document assistants
- Financial research copilots
- Insurance claim pipelines
- Customer support analytics
- Internal search systems
- Fine-tuning data platforms
- Human annotation vendors
- Synthetic data generators
- Data catalogs
- Feature stores
- Data validation frameworks
- Privacy review systems
- Incident debugging notebooks
Recall checkpoint
- What concrete pressure makes this mechanism necessary?
- What artifact would you ask for during a production debug review?
- Why does tuning embeddings on broken chunks fail on the worked example?
- Which slice or source type is most likely to hide this failure?
- What does ingestion-defect rate tell you before model metrics move?
- When should the system stop instead of passing data downstream?
Interview Q&A
Q: Why is this an AI-engineering problem instead of a generic data problem? A: Because the failure changes model behavior, retrieval evidence, eval truth, or production trust. AI systems amplify upstream data mistakes.
Common wrong answer to avoid: "The model should handle it." — The model only sees what the data path preserved and allowed through.
Q: What artifact would you inspect first? A: I would inspect a source-room trace showing file type, text layer, layout blocks, tables, metadata, permissions, and downstream chunk quality, then walk backward to source, transform, version, and owner.
Q: What is the common misleading fix? A: Teams often try to feed every file to the same parser and debug retrieval later, but that treats the symptom downstream.
Q: How do you decide whether the extra pipeline cost is worth it? A: Use the risk and frequency of the failure, then track ingestion-defect rate by source slice.
Q: What makes the mechanism production-ready? A: It is logged, versioned, sampled, monitored, owned, and tied to a release or rollback decision.
Q: When should the system admit uncertainty? A: When the source truth is missing, contradictory, private, or not recoverable by more processing.
Apply now (10 min)
- Pick one real data source and sketch the artifact you would need here.
- List three failure slices and one metric for each. Start with ingestion-defect rate.
- Reproduce from memory: explain the source pressure, mechanism, artifact, quality signal, and boundary in five sentences.
What you should remember
This chapter exists because raw customer files contain layout, scans, tables, figures, permissions, and stale copies that a model cannot repair later. It turns a vague “data quality” worry into an artifact, metric, owner, and decision.
The artifact to inspect is a source-room trace showing file type, text layer, layout blocks, tables, metadata, permissions, and downstream chunk quality. If that artifact does not exist, the team will debug the model when it should debug the evidence path.
Remember:
- Source truth must be made machine-readable before retrieval can be trusted.
- Watch ingestion-defect rate before trusting downstream model metrics.
- Store intermediate artifacts, not only final text or rows.
- Version data releases the same way you version model releases.
- Data engineering is the evidence operating system for AI.
Bridge. This chapter made the failure inspectable. Next we move to format boundaries and source contracts, where the same evidence supply chain faces a new constraint. → 02-format-boundaries-and-source-contracts.md