03. The warehouse that holds approved models: experiments are not deployable by default

~16 min read. Good experiments are plentiful. Trusted assets are rare.

Built on the ELI5 in 00-eli5.md. The warehouse — the formal home for approved models — keeps experiments separate from deployable assets.

Why the warehouse exists at all

Experiment tracking gives memory. That is necessary. It is not sufficient. A team can have many tracked runs and still deploy chaos.

Why? Because tracked runs are a workbench. They include baselines, dead ends, ablations, and half-successes. A production system needs something narrower. It needs approved assets. It needs clear status. It needs traceable ownership.

This is the job of the warehouse. The warehouse is not a random folder. The warehouse is not a naming convention in cloud storage. The warehouse is not a spreadsheet with good intentions. The warehouse is a control point.

It answers simple but critical questions. Which model is allowed to serve? Which candidate is under evaluation? Which one was retired? Who approved promotion? Where is the evidence?

See the boundary here. Experiment tracking stores what happened. The warehouse stores what is allowed. Experiment tracking is broad memory. The warehouse is operational judgment made concrete. Simple, no?

This separation protects teams from accidental deployment. Without the warehouse, someone can ship the wrong artifact. Without the warehouse, rollback becomes treasure hunting. Without the warehouse, audit questions become painful.

Picture first: stage flow inside the warehouse

┌──────────────┐
│    Draft     │
│  candidate   │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Staging    │
│ under checks │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│  Production  │
│  live asset  │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│   Archived   │
│   retired    │
└──────────────┘

These stage names are simple. Their meaning must be strict. Draft means not approved for users. Staging means under controlled validation. Production means currently trusted for live use. Archived means retained for history, not active service.

The quality gate usually guards each of these moves. Draft to Staging may require offline evidence. Staging to Production may require stronger checks. Production to Archived may happen after replacement or policy retirement. The warehouse makes these transitions explicit.
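The stage rules above can be sketched as a tiny state machine. This is a minimal sketch, not a real registry API: `Stage`, `ALLOWED_TRANSITIONS`, `can_transition`, and `transition` are hypothetical names, and letting a rejected Draft or Staging candidate move straight to Archived is an assumption the text does not state.

```python
from enum import Enum


class Stage(Enum):
    DRAFT = "Draft"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


# Allowed moves, following the stage flow in the diagram.
# Assumption: rejected Draft or Staging candidates may be archived directly.
ALLOWED_TRANSITIONS = {
    Stage.DRAFT: {Stage.STAGING, Stage.ARCHIVED},
    Stage.STAGING: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),
}


def can_transition(current: Stage, target: Stage) -> bool:
    """Return True only if the move is explicitly listed as allowed."""
    return target in ALLOWED_TRANSITIONS[current]


def transition(current: Stage, target: Stage) -> Stage:
    """Return the new stage, or raise if the move skips or reverses the flow."""
    if not can_transition(current, target):
        raise ValueError(f"Illegal move: {current.value} -> {target.value}")
    return target
```

The point of the table is that anything not listed is refused. A Draft cannot jump to Production, and an Archived version cannot quietly come back.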

Promotion is evidence, not enthusiasm

Many teams promote models with vibes. A good plot. A convincing demo. A confident presenter. Look, that is not enough.

Promotion should require evidence. The quality gate should read metrics automatically. It should compare against baseline. It should inspect important slices. It should check latency and cost limits. It should verify artifact completeness.
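The checks listed above can be sketched as one function that returns every reason a candidate fails. This is a minimal sketch under stated assumptions: `Evidence`, `GatePolicy`, and `evaluate_gate` are hypothetical names, every threshold is an illustrative placeholder, and all scores are assumed to be "higher is better".

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    metrics: dict[str, float]        # overall metrics, higher is better (assumption)
    slice_metrics: dict[str, float]  # one score per slice, higher is better
    p95_latency_ms: float
    cost_per_1k_requests: float
    artifacts: frozenset[str]        # names of the artifacts that were produced


@dataclass
class GatePolicy:
    # All defaults below are illustrative placeholders, not recommendations.
    primary_metric: str = "auc"
    min_improvement: float = 0.0
    max_slice_drop: float = 0.01
    max_p95_latency_ms: float = 200.0
    max_cost_per_1k_requests: float = 0.50
    required_artifacts: frozenset[str] = frozenset(
        {"model", "eval_report", "model_card"}
    )


def evaluate_gate(candidate: Evidence, baseline: Evidence, policy: GatePolicy) -> list[str]:
    """Return a list of failure reasons; an empty list means the gate passed."""
    failures: list[str] = []

    # Compare against baseline on the primary metric.
    metric = policy.primary_metric
    gain = candidate.metrics[metric] - baseline.metrics[metric]
    if gain < policy.min_improvement:
        failures.append(f"{metric} gain {gain:+.4f} is below the required minimum")

    # Inspect every slice the baseline was evaluated on.
    for name, base_score in baseline.slice_metrics.items():
        cand_score = candidate.slice_metrics.get(name)
        if cand_score is None:
            failures.append(f"slice '{name}' was not evaluated")
        elif base_score - cand_score > policy.max_slice_drop:
            failures.append(f"slice '{name}' dropped from {base_score:.3f} to {cand_score:.3f}")

    # Check latency and cost limits.
    if candidate.p95_latency_ms > policy.max_p95_latency_ms:
        failures.append(f"p95 latency {candidate.p95_latency_ms:.0f} ms is over budget")
    if candidate.cost_per_1k_requests > policy.max_cost_per_1k_requests:
        failures.append(f"cost {candidate.cost_per_1k_requests:.2f} per 1k requests is over budget")

    # Verify artifact completeness.
    missing = policy.required_artifacts - candidate.artifacts
    if missing:
        failures.append(f"missing artifacts: {sorted(missing)}")

    return failures
```

Returning all failures at once, rather than stopping at the first, gives the reviewer the full picture in one pass.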

A human may still approve the final move. That is fine. But human approval should sit on top of evidence. Not instead of evidence. The warehouse is strongest when it combines automation and accountability.
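"Approval on top of evidence" can be made literal in the decision record. The sketch below assumes `gate_failures` is a list of failure messages from whatever automated gate the team runs; `PromotionDecision` and `decide_promotion` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional, Sequence


@dataclass(frozen=True)
class PromotionDecision:
    model_version: str
    target_stage: str
    approved: bool
    approver: Optional[str]
    gate_failures: tuple[str, ...]
    decided_at: datetime


def decide_promotion(
    model_version: str,
    target_stage: str,
    gate_failures: Sequence[str],
    approver: Optional[str],
) -> PromotionDecision:
    """Approve only when the automated gate passed AND a named human signed off."""
    approved = len(gate_failures) == 0 and bool(approver)
    return PromotionDecision(
        model_version=model_version,
        target_stage=target_stage,
        approved=approved,
        approver=approver,
        gate_failures=tuple(gate_failures),
        decided_at=datetime.now(timezone.utc),
    )
```

A rejected decision is still recorded. That keeps the history of what was tried, by whom, and why it was refused.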

So what to do? Define promotion policy clearly. What metrics must improve? Which regressions are unacceptable? Which slices must not worsen? What documentation is required? Which owners must sign off?

If the answers live only in people's heads, the warehouse becomes decorative. If the answers live in policy and tooling, the warehouse becomes protective. This is the difference between storage and governance. Yes?

The assembly line can enforce these moves. A candidate arrives from tracked runs. The quality gate evaluates it. The warehouse records the promotion decision. The upgrade without downtime then controls the live rollout. See how the pieces cooperate.

Model cards make the asset understandable

A model in the warehouse needs a readable summary. That summary is often called a model card. Think of it as the asset's operating sheet. It turns technical output into team knowledge.

What should it contain? Start with the task. What does the model do? Who uses it? What decision does it influence? What is out of scope?

Add data summary. Where did training data come from? How recent was it? What important filters were used? Which user groups or regions are thinly represented? This prevents false confidence.

Add evaluation results. Overall metrics matter. Slice behavior matters more than many teams expect. Which cohorts look weaker? Which languages, geographies, or device classes perform worse? The warehouse should expose this clearly.

Add known failures. What input patterns confuse the model? What bad behavior has already been observed? What assumptions must hold upstream? Look, hiding known weakness does not remove weakness. It only makes incidents uglier.

Add latency and cost profile. How fast is inference? What hardware does it need? What does a request or a batch cost? A model that is slightly more accurate but wildly more expensive may not deserve promotion.

This card helps many roles. Engineers use it. Reviewers use it. Product leaders use it. Incident responders use it. Future you will use it most gratefully.
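The card sections above can be sketched as a structured record with a completeness check. This is one possible shape, not a standard schema: `ModelCard`, its field names, and `missing_fields` are hypothetical. Treating an empty `known_failures` list as missing is an assumption; a team with nothing observed yet would write that down explicitly.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class ModelCard:
    # Task and scope
    task: str = ""
    intended_users: str = ""
    out_of_scope: str = ""
    # Data summary
    data_summary: str = ""
    # Evaluation results, overall and per slice
    overall_metrics: dict[str, float] = field(default_factory=dict)
    slice_metrics: dict[str, float] = field(default_factory=dict)
    # Known failures
    known_failures: list[str] = field(default_factory=list)
    # Latency and cost profile
    p95_latency_ms: Optional[float] = None
    hardware: str = ""
    cost_per_1k_requests: Optional[float] = None


def _is_empty(value: object) -> bool:
    """A field counts as empty if it is None or a zero-length str, dict, or list."""
    return value is None or (isinstance(value, (str, dict, list)) and len(value) == 0)


def missing_fields(card: ModelCard) -> list[str]:
    """Return the names of card fields that were left empty, in declaration order."""
    return [f.name for f in fields(card) if _is_empty(getattr(card, f.name))]
```

A promotion step could refuse any version whose `missing_fields` list is not empty, so the card cannot be skipped under deadline pressure.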

Registry design is control, not filing

Some teams think a registry is just a nicer bucket browser. No. That mindset creates weak systems. A registry should control behavior.

Control starts with identity. Every registered model version should have a stable identifier. It should link back to source runs. It should link to artifacts. It should link to evaluation reports. It should link to approvers.

Control continues with immutability. Do not mutate a registered production artifact quietly. Create a new version. Record a new decision. Protect history. The warehouse is trustworthy only when history cannot be rewritten casually.
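Identity and immutability can be sketched together as an append-only record store. This is a minimal in-memory sketch: `ModelVersion`, `Registry`, and the `name:vN` identifier format are hypothetical, and hashing the artifact bytes with SHA-256 is one assumed way to detect a quiet swap.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class ModelVersion:
    name: str
    version: int
    artifact_sha256: str
    source_run_ids: tuple[str, ...]
    evaluation_report_uri: str
    approvers: tuple[str, ...]

    @property
    def identifier(self) -> str:
        """Stable identifier, for example 'fraud-scorer:v2' (format is an assumption)."""
        return f"{self.name}:v{self.version}"


class Registry:
    """Append-only store: registering again creates a new version, never an overwrite."""

    def __init__(self) -> None:
        self._versions: dict[str, list[ModelVersion]] = {}

    def register(
        self,
        name: str,
        artifact_bytes: bytes,
        source_run_ids: tuple[str, ...],
        evaluation_report_uri: str,
        approvers: tuple[str, ...] = (),
    ) -> ModelVersion:
        history = self._versions.setdefault(name, [])
        record = ModelVersion(
            name=name,
            version=len(history) + 1,
            artifact_sha256=hashlib.sha256(artifact_bytes).hexdigest(),
            source_run_ids=source_run_ids,
            evaluation_report_uri=evaluation_report_uri,
            approvers=approvers,
        )
        history.append(record)
        return record

    def history(self, name: str) -> tuple[ModelVersion, ...]:
        """Return all versions as a tuple so callers cannot edit the stored list."""
        return tuple(self._versions.get(name, ()))
```

A real registry would enforce this in storage and access control, not only in Python objects. The sketch only shows the shape: new bytes mean a new version with a new hash, and old records stay as they were.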

Control also means permissions. Who can register? Who can promote? Who can archive? Who can edit documentation? These are small questions until the wrong model ships.
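The four permission questions above map naturally onto a role table. The roles and their grants below are purely illustrative assumptions; `Action`, `ROLE_PERMISSIONS`, `is_allowed`, and `require` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    REGISTER = "register"
    PROMOTE = "promote"
    ARCHIVE = "archive"
    EDIT_DOCS = "edit_docs"


# Illustrative role table (assumption): adapt roles and grants to your team.
ROLE_PERMISSIONS = {
    "data_scientist": {Action.REGISTER, Action.EDIT_DOCS},
    "reviewer": {Action.EDIT_DOCS},
    "release_owner": {Action.REGISTER, Action.PROMOTE, Action.ARCHIVE, Action.EDIT_DOCS},
}


def is_allowed(role: str, action: Action) -> bool:
    """Unknown roles get no permissions at all."""
    return action in ROLE_PERMISSIONS.get(role, set())


def require(role: str, action: Action) -> None:
    """Raise PermissionError unless the role is explicitly granted the action."""
    if not is_allowed(role, action):
        raise PermissionError(f"role '{role}' may not {action.value}")
```

Denying by default matters here. A role that is missing from the table can do nothing, which is safer than a role that can do everything.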

The warehouse should also integrate with the production monitor. When a live issue appears, responders should jump from alert to production version quickly. From there they should reach run evidence, model card, and rollback target. That is operational speed created by structure.
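The jump from alert to evidence can be sketched as a single lookup. This assumes each record stores which version it replaced in Production; `VersionRecord`, its fields, and `incident_context` are hypothetical names, and the sample identifiers in the usage note are made up.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class VersionRecord:
    model_name: str
    identifier: str
    stage: str
    source_run_id: str
    model_card_uri: str
    replaced: Optional[str] = None  # identifier of the version this one replaced


def incident_context(
    records: Sequence[VersionRecord], model_name: str
) -> dict[str, Optional[str]]:
    """Collect what a responder needs, starting from the live Production version."""
    live = next(
        (r for r in records if r.model_name == model_name and r.stage == "Production"),
        None,
    )
    if live is None:
        raise LookupError(f"no Production version registered for '{model_name}'")
    return {
        "production_version": live.identifier,
        "source_run": live.source_run_id,
        "model_card": live.model_card_uri,
        "rollback_target": live.replaced,
    }
```

With two records, `fraud-scorer:v3` in Archived and `fraud-scorer:v4` in Production with `replaced="fraud-scorer:v3"`, the lookup for `"fraud-scorer"` returns v4 as the live version and v3 as the rollback target.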

Simple, no? A good warehouse reduces cognitive load. It stops the team from debating where truth lives. Truth lives in the controlled record.

A practical promotion checklist

Before promoting from Draft to Staging, check reproducibility. Before promoting from Staging to Production, check business readiness. Both steps matter.

A practical checklist can include these items. Tracked source run exists. Artifacts are complete. Evaluation report is attached. Slice regressions are reviewed. Latency is within budget. Cost is within budget. Known failures are documented. Owner is assigned. Rollback target exists.
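The checklist above can be kept as data, so the same list drives both the gate and the audit trail. This is a minimal sketch: `ChecklistItem`, `unmet`, `ready_for_promotion`, and `manual_items` are hypothetical names, and which items count as automated is an assumption each team decides for itself.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ChecklistItem:
    name: str
    passed: bool
    automated: bool  # True if tooling verified it, False if a person did


def unmet(items: Sequence[ChecklistItem]) -> list[str]:
    """Names of items that have not passed yet, in checklist order."""
    return [item.name for item in items if not item.passed]


def ready_for_promotion(items: Sequence[ChecklistItem]) -> bool:
    """Ready only when every single item has passed."""
    return not unmet(items)


def manual_items(items: Sequence[ChecklistItem]) -> list[str]:
    """Names of items still checked by hand: candidates for future automation."""
    return [item.name for item in items if not item.automated]
```

Listing the manual items separately shows how much of the gate still depends on someone remembering to check.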

Notice the shape. This is not only ML performance. This is product readiness. The warehouse turns that broader view into a repeatable gate. The quality gate automates parts of it. The warehouse records the final state.

When teams skip this, they deploy candidates. When teams follow this, they deploy assets. That difference sounds small. Operationally, it is huge.

Where this lives in the wild

Fraud detection platform (ML platform engineer): uses stage transitions so only approved models can score payments live.
Search relevance stack (relevance lead): compares candidates in staging before promoting one search ranker to production.
Medical imaging assist (responsible AI reviewer): reads model cards for slice behavior and known failure modes.
Customer support copilot (product engineer): checks latency and cost profile before approving a larger model version.
Logistics forecasting service (data science manager): uses the warehouse to archive retired seasonal models safely.

Pause and recall

Why is the warehouse different from experiment tracking? What do the stages Draft, Staging, Production, and Archived mean? Why should promotion require evidence instead of enthusiasm? What must a model card include beyond headline accuracy?

Interview Q&A

Q: Why do we need a model registry if we already track experiments?
A: Experiment tracking stores many attempts, while the warehouse records which assets are approved for operational use. The registry separates memory from controlled deployment state.
Common wrong answer to avoid: The registry is just a backup folder for model files.
Why wrong: A real registry manages stages, evidence, ownership, and promotion control.

Q: What is the purpose of stages like Draft and Production?
A: Stages express operational status and gate what actions are allowed. They make promotion and rollback explicit.
Common wrong answer to avoid: Stages are mostly labels for convenience.
Why wrong: If stages do not drive policy, the registry cannot protect production.

Q: What belongs in a strong model card?
A: Include task, users, data summary, evaluation, slice behavior, known failures, and latency and cost profile. The card should help others understand risk and fit for use.
Common wrong answer to avoid: Only the best metric and a short description are enough.
Why wrong: Missing context hides operational and user-facing risk.

Q: Why call the registry a control point?
A: Because it governs promotion, identity, permissions, and the evidence required for deployment. It shapes process, not just storage.
Common wrong answer to avoid: Once files are versioned, control is solved automatically.
Why wrong: File versioning without policy still allows accidental or unjustified promotion.

Apply now (5 min)

Exercise: Take one model your team might ship. Write the minimum fields for its warehouse entry. Then list the evidence required before moving it to Production. Mark which evidence is automated today and which is manual.

Sketch from memory: Draw Draft to Staging to Production to Archived. Place the quality gate between stage changes. Write one model card field beside each stage. Circle the stage where your current team is least disciplined.

Bridge. The warehouse stores what was approved. Next, we study lineage that proves how the approved asset came to exist. → 04-reproducibility-lineage.md