Skip to content

04. Full chain of evidence — reproducibility is a stack, not a wish

~17 min read. If you cannot rebuild the result, you cannot truly trust it.

Built on the ELI5 in 00-eli5.md. The warehouse — the formal record of approved assets — is strongest when lineage proves how each asset was made.


Reproducibility is five layers deep

Teams often say, we can reproduce it. Then they mean only one thing. They still have the code. That is much too shallow.

Real reproducibility is a stack. Five layers matter. Code. Data. Environment. Execution. Evaluation.

Start with code. You need the exact commit or equivalent snapshot. Not almost the same notebook. Not a branch that later moved. The exact training logic matters.

Then data. Which dataset snapshot was used? Which filters ran? Which labels existed then? Which rows were excluded? Without data versioning, code alone cannot rescue you.

Then environment. Library versions change behavior. CUDA stacks change behavior. Tokenizers change behavior. Small dependency shifts can move results quietly. Look, environment drift is real drift too.

Then execution. Which command ran? Which config values were passed? Which seed was set? Which machine or cluster shape executed the job? Execution details explain why nominally same ingredients still differ.

Then evaluation. What metrics were computed? On which test split? With which prompts, thresholds, or post-processing rules? A model is not reproducible if its reported score cannot be rebuilt. Simple, no?

When teams miss one layer, they often notice only during incidents. That is too late. The assembly line depends on reproducibility. The quality gate depends on comparable evidence. The warehouse depends on traceable inputs.

Picture first: what lineage actually connects

┌──────────────┐ │ data snapshot│ └──────┬───────┘ │ ┌──────▼───────┐ │ code commit │ └──────┬───────┘ │ ┌──────▼───────┐ │ environment │ │ container │ └──────┬───────┘ │ ┌──────▼───────┐ │ execution │ │ run record │ └──────┬───────┘ │ ┌──────▼───────┐ │ evaluation │ │ evidence │ └──────┬───────┘ │ ▼ ┌──────────────┐ │ model asset │ │ in warehouse │ └──────────────┘

That full path is lineage. Lineage is not one link. It is the chain from input material to approved asset. If one link is weak, incident response slows down. If several links are weak, trust becomes theater.

Version four things together, not separately in isolation

Many teams version code and stop there. Some version code and model. Better, but still incomplete. The safer discipline is versioning four things together. Code. Data. Features. Model.

Why features explicitly? Because features are where train-serve mismatch loves to hide. A join logic change can alter meaning without changing column names. A scaling rule can shift values silently. A backfill policy can leak future information. The feature layer deserves first-class treatment.

So what to do? When a run is created, capture references for all four. Which code commit? Which data snapshot? Which feature definition version? Which resulting model artifact version? That quartet should travel together.

This is the backbone of lineage. The warehouse later stores the approved model version. But the warehouse should point back to the other three components too. Otherwise you know what shipped, but not what created it.

The quality gate also benefits. A metric regression may come from code. Or from data. Or from features. Or from evaluation rules. With versioned components, investigation narrows fast. Without them, people argue from intuition.

DVC and dataset discipline

Data versioning is often the weakest layer. Files move. Queries change. Labels refresh. People assume someone else kept the exact snapshot. Often, nobody did.

Tools like DVC help because they treat data changes seriously. DVC can point code revisions to specific dataset states. It can connect storage and reproducible pipelines. It is not magic, but it builds discipline. See, discipline is the real product here.

You may use DVC. You may use lakehouse snapshotting. You may use managed dataset version features. The exact tool matters less than the rule. The rule is this. Training data must be referenceable later.

That also means documenting extraction logic. If your dataset came from a SQL job, store the query or pipeline version. If labels came from human review, capture the label snapshot date. If sampling rules changed, record them. Data lineage without transformation lineage is half lineage.

Do not forget feature data. If features were materialized, version that materialization. If features were computed on demand, version the transformation code and config. Look, same raw data plus changed feature logic is different training input. Simple, no?

Artifact storage rules prevent future pain

Artifact storage sounds boring. Good. Boring rules save teams.

Store artifacts in durable, access-controlled locations. Do not depend on one engineer's laptop. Do not depend on unnamed folders. Do not depend on mutable latest paths. Immutability is your friend here.

Give each artifact a stable identifier. Link that identifier to the run. Link the run to evaluation evidence. Link approved artifacts to the warehouse. That is how the chain stays navigable.

Retain the right bundle. Model weights. Tokenizer or vocabulary. Feature schema. Preprocessing assets. Evaluation outputs. Config files. If serving depends on it, store it.

Package environment definitions too. Container image reference. Dependency lock file. Hardware notes when relevant. The assembly line should be able to rebuild or validate from stored definitions. If rebuild is impossible, promotion should be harder.

The production monitor also benefits from clean storage rules. When an alert fires, responders should jump from live version to exact artifacts quickly. That speed reduces downtime and blame-driven guessing. Yes?

Reproducibility enables automation instead of blocking it

Some people hear reproducibility and imagine paperwork. That is the wrong picture. Reproducibility is what lets automation be safe.

The assembly line can retrain predictably only when inputs are pinned. The quality gate can compare fairly only when evidence is consistent. The warehouse can govern honestly only when artifact identity is stable. The upgrade without downtime is safer when rollback targets are exact. See how every placeholder leans on lineage.

This is why reproducibility is not an afterthought. It is a platform capability. It is a debugging capability. It is a governance capability. It is a speed capability in disguise.

Look, unreliable rebuilding makes every later step manual. Reliable rebuilding lets the team automate without fear. That is the real payoff. Not philosophical purity. Operational confidence.


Where this lives in the wild

  • Demand forecasting platform — Data engineer Pins dataset snapshots and retrains from versioned pipelines during incident review.

  • Search ranking team — ML engineer Tracks feature definitions separately to catch train-serve mismatch early.

  • LLM evaluation platform — AI platform engineer Versions prompts, containers, and judge configs alongside model artifacts.

  • Industrial defect detection — MLOps engineer Stores image dataset revisions and preprocessing assets for auditability.

  • Credit risk service — Model governance lead Requires lineage links before any asset can enter the warehouse.


Pause and recall

What are the five layers of reproducibility? Why should code, data, features, and model be versioned together? How does lineage help the quality gate and the warehouse? What problem does DVC or similar tooling solve?


Interview Q&A

Q: Why say reproducibility is a stack, not a wish? A: Because reliable rebuilding needs aligned code, data, environment, execution, and evaluation layers. A: Missing one layer can break the outcome. Common wrong answer to avoid: If we saved the training script, reproducibility is covered. Why wrong: Scripts without pinned data, environment, and evaluation still leave major uncertainty.

Q: What is lineage in practical terms? A: Lineage is the end-to-end chain linking data snapshot, code, environment, run, evidence, and resulting model artifact. A: It explains how a shipped asset came to exist. Common wrong answer to avoid: Lineage is just another word for file versioning. Why wrong: File versioning is only one piece of the broader evidence chain.

Q: Why version features separately from data and code? A: Feature logic often changes meaning without obvious surface changes, creating train-serve mismatch risk. A: Separate versioning makes those changes visible. Common wrong answer to avoid: Feature versioning is redundant because features come from the data anyway. Why wrong: Transformation logic is its own moving part and can alter behavior materially.

Q: What role does DVC play in an MLOps stack? A: DVC helps teams reference, version, and reproduce dataset states and pipelines in a disciplined way. A: It strengthens the data layer of lineage. Common wrong answer to avoid: DVC stores everything automatically so process discipline is no longer needed. Why wrong: Tooling helps, but teams still need clear rules for snapshots, metadata, and retention.


Apply now (5 min)

Exercise: Pick one model already in use or under review. List the five reproducibility layers for it. Mark the weakest layer honestly. Then write one rule that would strengthen that layer this week.

Sketch from memory: Draw the lineage chain from data snapshot to warehouse entry. Add the assembly line above it. Add the quality gate before approval. Underline the link your current team would struggle to reconstruct today.


Bridge. Reproducibility makes automation trustworthy. Next, we move from evidence to motion with CI and CD for ML. → 05-cicd-for-ml.md