02. Memory for your training runs — the run is the unit, not the file¶
~16 min read. If you cannot name the run, you do not own the result.
Built on the ELI5 in 00-eli5.md. The assembly line — the repeatable path for change — starts with a run record the line can trust.
Start with the run, not the model file¶
Teams often treat the model file as the star. That is backwards. The model file is only the final artifact. The real unit of work is the run.
A run is one attempt. One dataset snapshot. One code state. One environment. One set of features. One hyperparameter choice. One bundle of metrics. One owner.
Why does this matter? Because a model file answers almost nothing alone. It says little about how it was made. It says little about what was tried nearby. It says little about whether it was an accident.
The run gives context. The run explains the result. The run shows comparison. The run makes later automation possible. See, the assembly line cannot reason from a random file in a folder. The assembly line needs structured memory.
When a teammate asks which version to trust, the answer should not be a chat screenshot. It should be a run URL. It should include evidence. It should include artifacts. It should include ownership. Simple, no?
This is also how the quality gate gets sharper. The quality gate evaluates candidates. Candidates come from runs. If run metadata is weak, the quality gate becomes blind. If run metadata is rich, the quality gate can compare honestly.
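A minimal sketch of that idea: a gate that refuses to compare a candidate whose run metadata is incomplete. The field names (`code_commit`, `dataset_version`, `owner`), the metric name `auc`, the `min_gain` threshold, and all values are illustrative assumptions, not the API of any particular tool.

```python
# Assumed minimum provenance a candidate must carry before it is compared.
REQUIRED_FIELDS = ("code_commit", "dataset_version", "owner")


def gate_decision(candidate: dict, baseline: dict, metric: str = "auc",
                  min_gain: float = 0.0) -> str:
    """Return 'reject-incomplete', 'reject-no-gain', or 'pass'."""
    # Without provenance the comparison is not honest, so stop early.
    missing = [f for f in REQUIRED_FIELDS if not candidate.get(f)]
    if missing or metric not in candidate.get("metrics", {}):
        return "reject-incomplete"
    gain = candidate["metrics"][metric] - baseline["metrics"][metric]
    return "pass" if gain > min_gain else "reject-no-gain"


# Placeholder runs for illustration only.
baseline = {"code_commit": "abc123", "dataset_version": "v1", "owner": "ana",
            "metrics": {"auc": 0.80}}
tracked = {"code_commit": "def456", "dataset_version": "v1", "owner": "ana",
           "metrics": {"auc": 0.83}}
orphan = {"metrics": {"auc": 0.90}}  # best number, but no provenance

print(gate_decision(tracked, baseline))
print(gate_decision(orphan, baseline))
```

The orphan has the highest score and still fails. That is the point: the gate judges evidence, not just a number.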
What every run should log¶
Start with identity. Log the run name. Log the owner. Log the timestamp. Log the purpose. Was this a baseline, a tuning run, an ablation, or a recovery?
Log the code commit. Without the commit, you are already guessing. Log the branch if useful. Log important config files. Log pipeline version if training is orchestrated.
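One way to capture the commit automatically, sketched with the standard `git rev-parse HEAD` and `git status --porcelain` commands. The function names are hypothetical helpers, and the sketch assumes training starts from inside a git checkout; outside one, both helpers return `None` instead of guessing.

```python
import subprocess
from typing import Optional


def current_commit(repo_dir: str = ".") -> Optional[str]:
    """Return the full hash of HEAD, or None if git or the repo is unavailable."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip() or None


def has_uncommitted_changes(repo_dir: str = ".") -> Optional[bool]:
    """True if the working tree differs from HEAD; None if git is unavailable."""
    try:
        out = subprocess.run(
            ["git", "status", "--porcelain"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return None
    return bool(out.stdout.strip())


# Record both: a commit hash with a dirty tree is still a guess.
provenance = {"code_commit": current_commit(), "dirty": has_uncommitted_changes()}
```

Logging the dirty flag matters because a commit hash says nothing about local edits that were never committed.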
Log the dataset version. This is non-negotiable. Which snapshot was used? How large was it? What filters were applied? Were labels refreshed? Was there leakage screening?
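If no data versioning tool is in place yet, a content hash is a workable stand-in. The sketch below is one possible approach, not a prescribed one: `dataset_fingerprint` and its field names are hypothetical, and it assumes the snapshot is a set of local files with unique file names.

```python
import hashlib
import json
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large snapshots do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_fingerprint(files: list[Path], filters: dict, row_count: int) -> dict:
    """Summarise which snapshot was used, how large it was, and what was filtered."""
    # Assumes file names are unique within the snapshot.
    per_file = {p.name: file_sha256(p) for p in sorted(files)}
    # Serialise with sorted keys so the id does not depend on argument order.
    manifest = json.dumps(
        {"files": per_file, "filters": filters, "rows": row_count}, sort_keys=True
    )
    return {
        "dataset_id": hashlib.sha256(manifest.encode("utf-8")).hexdigest()[:16],
        "files": per_file,
        "filters": filters,
        "row_count": row_count,
    }
```

Calling `dataset_fingerprint([Path("train.csv")], {"country": "DE"}, row_count=1000)` would return a record whose `dataset_id` changes whenever the file content, the filters, or the row count change. The file name and filter here are made up for illustration.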
Log the feature definition. Which columns entered training? Which joins were used? Which transformations ran? Which feature store version, if any, supplied them? Look, same feature names do not guarantee same feature meaning.
Log hyperparameters. Learning rate. Batch size. Context length. Regularization. Seed. Sampling temperature when relevant. These details look boring until they save you.
Log the environment. Container image. Python version. Library versions. CUDA stack when needed. Hardware type. If the environment changes, the result may move.
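Part of the environment can be captured from inside the process with the standard library alone. This is a sketch under assumptions: `capture_environment` is a hypothetical helper, and `CONTAINER_IMAGE` is an assumed environment variable that your orchestrator would have to set. CUDA stack and hardware type are not covered here and would need their own source.

```python
import os
import platform
from importlib import metadata


def capture_environment(packages: list[str]) -> dict:
    """Collect interpreter, OS, and selected library versions for the run record."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # record the absence instead of failing the run
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),
        "machine": platform.machine(),
        # Assumed variable name; None when the orchestrator does not provide it.
        "container_image": os.environ.get("CONTAINER_IMAGE"),
        "packages": versions,
    }


env = capture_environment(["numpy", "torch"])
```

Listing packages explicitly keeps the record short. Dumping every installed package is also reasonable if storage is cheap.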
Log metrics. Overall metrics matter. Slice metrics matter more than many teams admit. Latency from offline inference tests may matter too. Cost per batch or per request can matter. The future warehouse should receive this evidence cleanly.
Log artifacts. Model weights. Tokenizer. Feature schema. Evaluation report. Confusion matrix or ranking plots. Prompt templates when your system uses them. Artifacts make the run inspectable, not mythical.
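Put together, the fields above could live in one structured record. The dataclass below is a sketch, not a standard schema: the class name, field names, and every value in the example are assumptions chosen for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class RunRecord:
    # Identity
    run_name: str
    owner: str
    purpose: str  # baseline | tuning | ablation | recovery
    # Provenance
    code_commit: str
    dataset_version: str
    features: list[str]
    hyperparameters: dict
    environment: dict
    # Evidence
    metrics: dict = field(default_factory=dict)
    slice_metrics: dict = field(default_factory=dict)
    artifacts: dict = field(default_factory=dict)  # name -> path or URI
    started_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        """Serialise with sorted keys so records diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True, indent=2)


# Placeholder values only; none of these are real results.
record = RunRecord(
    run_name="churn-baseline-01",
    owner="ana",
    purpose="baseline",
    code_commit="abc123",
    dataset_version="snapshot-2024-01",
    features=["tenure_days", "plan_type"],
    hyperparameters={"learning_rate": 0.001, "batch_size": 64, "seed": 7},
    environment={"python": "3.11"},
    metrics={"auc": 0.0},
    artifacts={"model": "runs/churn-baseline-01/model.bin"},
)
print(record.to_json())
```

Fields without defaults are the ones the record cannot exist without. Metrics and artifacts default to empty so a run can be registered at start and filled in as it progresses.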
Picture first: what a tracked run actually connects¶
┌──────────────┐
│ code commit  │
└──────┬───────┘
       │
┌──────▼───────┐
│ data version │
└──────┬───────┘
       │
┌──────▼───────┐
│ train run    │
├──────────────┤
│ params       │
│ metrics      │
│ artifacts    │
│ owner        │
└──────┬───────┘
       │
       ▼
┌──────────────┐
│ warehouse    │
└──────────────┘
That middle box is the key. Not the file alone. The tracked run is the object you can debate. The tracked run is the object the assembly line can move. The tracked run is the object the warehouse can later reference.
Tool choices: MLflow, W&B, and managed platforms¶
MLflow is a practical default for many teams. It is open source and familiar. It covers runs, artifacts, and the warehouse (its model registry) well. It fits nicely when you want control over hosting.
Weights & Biases is strong when collaboration needs more polish. Dashboards are rich. Reports are easy to share. Comparison across many runs is smooth. It often feels friendlier for fast-moving experiment teams.
Managed options matter too. Vertex AI Experiments can fit teams already on Google Cloud. SageMaker Experiments fits AWS-heavy shops. Azure ML tracks runs inside its broader platform. These options reduce some setup work. They also pull you deeper into one cloud shape.
So what to do? Choose based on operating context. Pick the tool your team will actually use daily. A mediocre tool used consistently beats a strong tool ignored casually. Yes?
Ask five boring questions before choosing. Who will own the platform? How much customization do you need? How important are rich dashboards? How tightly do you want to bind to one cloud? How easily can the tool feed your quality gate and warehouse?
Do not turn this into a religious debate. The goal is durable memory. The goal is searchable evidence. The goal is faster debugging. The goal is a healthier assembly line.
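One way to keep the tool choice reversible is to let training code depend on a small interface instead of a vendor SDK. The sketch below is an assumption about how a team might do that: `Tracker`, `InMemoryTracker`, and `run_training` are hypothetical names, and an adapter for MLflow, W&B, or a managed platform would implement the same three methods.

```python
from typing import Protocol


class Tracker(Protocol):
    """Hypothetical minimal interface the training code depends on."""

    def log_params(self, params: dict) -> None: ...
    def log_metrics(self, metrics: dict) -> None: ...
    def log_artifact(self, name: str, uri: str) -> None: ...


class InMemoryTracker:
    """Stand-in backend for tests; a real adapter would forward to your tool."""

    def __init__(self) -> None:
        self.params: dict = {}
        self.metrics: dict = {}
        self.artifacts: dict = {}

    def log_params(self, params: dict) -> None:
        self.params.update(params)

    def log_metrics(self, metrics: dict) -> None:
        self.metrics.update(metrics)

    def log_artifact(self, name: str, uri: str) -> None:
        self.artifacts[name] = uri


def run_training(tracker: Tracker, learning_rate: float, batch_size: int) -> None:
    """Training code talks only to the interface, never to a vendor SDK."""
    tracker.log_params({"learning_rate": learning_rate, "batch_size": batch_size})
    # ... real training would happen here ...
    tracker.log_metrics({"val_loss": 0.0})  # placeholder value, not a real result
    tracker.log_artifact("model", "runs/example/model.bin")  # placeholder URI


tracker = InMemoryTracker()
run_training(tracker, learning_rate=3e-4, batch_size=32)
```

The trade-off is real: a thin interface hides tool-specific features such as rich dashboards. It suits teams that value portability over polish.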
Log failed runs too¶
This habit is underrated. Many teams log only pretty results. That is a mistake. Wisdom hides in failures.
A failed run may reveal instability. A failed run may reveal a bad data assumption. A failed run may reveal a broken feature join. A failed run may reveal a cost explosion. A failed run may reveal that the metric was easy to game.
When failures disappear, the team repeats them. New members retry dead ends. Incident reviews become shallow. Hyperparameter search looks cleaner than reality. Risk feels smaller than it really is.
Failed runs also calibrate the quality gate. The quality gate should know what weak looks like. The quality gate should know what suspicious improvement looks like. Without failed examples, the decision boundary becomes naive.
Look, factories keep defect records. They do not store only perfect batches. The same logic applies here. Your assembly line improves by seeing misses clearly. Your warehouse becomes more trustworthy when promotion history is grounded.
Log failure reason plainly. Use labels like data-error, infra-error, metric-regression, timeout, or bad-schema. Write one sentence on what happened. That single sentence will save hours later. Simple, no?
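The labels above fit naturally into a small enum, so typos cannot create new categories by accident. This is a sketch: `FailureReason` and `mark_failed` are hypothetical names, and the run is assumed to be a plain dict.

```python
from enum import Enum


class FailureReason(str, Enum):
    DATA_ERROR = "data-error"
    INFRA_ERROR = "infra-error"
    METRIC_REGRESSION = "metric-regression"
    TIMEOUT = "timeout"
    BAD_SCHEMA = "bad-schema"


def mark_failed(run: dict, reason: str, note: str) -> dict:
    """Return a copy of the run marked as failed, with a label and one sentence."""
    label = FailureReason(reason)  # raises ValueError for an unknown label
    if not note.strip():
        raise ValueError("a one-sentence note on what happened is required")
    return {
        **run,
        "status": "failed",
        "failure_reason": label.value,
        "failure_note": note.strip(),
    }


# Placeholder run and note for illustration.
failed = mark_failed(
    {"run_name": "churn-tuning-07"},
    "bad-schema",
    "Feature join dropped the plan_type column after an upstream rename.",
)
```

Requiring the note is deliberate. The label makes failures countable; the sentence makes them understandable.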
Minimum team habit for experiment tracking¶
Create a run for every training attempt that could influence a decision. That is the floor. Not the ceiling.
Require links in code review or model review. If someone proposes a model, they must attach the run. If someone asks for promotion, they must attach the run. If someone says it is better, they must attach the run.
Standardize the schema. Do not let every person invent fields weekly. Pick mandatory metadata. Pick allowed status values. Pick naming rules. Pick artifact locations. Consistency makes search useful.
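A standardized schema is easiest to keep when a check enforces it. The sketch below shows one possible validator. The mandatory fields, status values, naming rule, and artifact root are all assumed conventions; your team would pick its own.

```python
import re

# Assumed team conventions, for illustration only.
MANDATORY_FIELDS = {"run_name", "owner", "code_commit", "dataset_version", "status"}
ALLOWED_STATUS = {"running", "finished", "failed"}
NAME_RULE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")  # e.g. churn-baseline-01
ARTIFACT_ROOT = "runs/"  # artifacts are expected under runs/<run_name>/...


def validate_run(run: dict) -> list[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems = [
        f"missing field: {name}" for name in sorted(MANDATORY_FIELDS - set(run))
    ]
    if "status" in run and run["status"] not in ALLOWED_STATUS:
        problems.append(f"status not allowed: {run['status']}")
    if "run_name" in run and not NAME_RULE.match(str(run["run_name"])):
        problems.append(f"run_name breaks naming rule: {run['run_name']}")
    for name, uri in run.get("artifacts", {}).items():
        if not str(uri).startswith(ARTIFACT_ROOT):
            problems.append(f"artifact outside {ARTIFACT_ROOT}: {name}")
    return problems


# Placeholder record that breaks several rules at once.
print(validate_run({"run_name": "My Run", "status": "done"}))
```

Returning every problem at once, instead of raising on the first, lets a reviewer fix a record in one pass.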
Connect run tracking to the warehouse. Promotion should point back to one or more approved runs. Connect run tracking to the quality gate. Promotion checks should read run metrics automatically. Connect run tracking to the production monitor. When live behavior changes, start investigation from the source run.
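The link from a promoted model back to its runs can be as simple as storing run ids on the model version. The sketch below is a toy in-memory warehouse under assumed names (`Warehouse`, `ModelVersion`, the `approved` status); a real registry would persist this and enforce it through its own API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: int
    source_run_ids: tuple[str, ...]


class Warehouse:
    """Toy registry: every promoted version points back to approved runs."""

    def __init__(self) -> None:
        self._versions: dict[tuple[str, int], ModelVersion] = {}

    def promote(self, name: str, runs: list[dict]) -> ModelVersion:
        if not runs:
            raise ValueError("promotion needs at least one source run")
        unapproved = [r["run_id"] for r in runs if r.get("status") != "approved"]
        if unapproved:
            raise ValueError(f"runs not approved: {unapproved}")
        # Next version number for this model name, starting at 1.
        latest = max((v for (n, v) in self._versions if n == name), default=0)
        model = ModelVersion(name, latest + 1, tuple(r["run_id"] for r in runs))
        self._versions[(name, model.version)] = model
        return model

    def source_runs(self, name: str, version: int) -> tuple[str, ...]:
        """Where an investigation starts when live behavior changes."""
        return self._versions[(name, version)].source_run_ids


# Placeholder ids for illustration.
warehouse = Warehouse()
v1 = warehouse.promote("churn", [{"run_id": "run-041", "status": "approved"}])
print(warehouse.source_runs("churn", v1.version))
```

When the production monitor flags a model version, `source_runs` gives the run ids to open first.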
That is the memory loop. Run memory helps launch. Run memory helps debug. Run memory helps explain. Run memory helps recover. See, boring memory creates fast teams.
Where this lives in the wild¶
- Recommendation systems (applied scientist): Compares ranking runs and links the winner to offline and online evidence.
- Credit scoring platform (ML engineer): Tracks dataset version, features, and fairness slices for every model attempt.
- LLM product assistant (AI engineer): Logs prompt templates, retrieval settings, latency, and cost per run.
- Computer vision inspection line (MLOps engineer): Stores training artifacts and failed runs to debug seasonal drift.
- Ads bidding stack (ML platform lead): Uses run metadata to automate promotion through the quality gate.
Pause and recall¶
- Why is the run a better unit of work than the model file?
- Which metadata fields are non-negotiable for a tracked run?
- Why should failed runs stay visible?
- How do the assembly line and the warehouse depend on run records?
Interview Q&A¶
Q: What should be the primary object in experiment tracking?
A: The run should be primary because it captures context, evidence, and artifacts together. The model file is just one output of that run.
Common wrong answer to avoid: The saved weights file is enough if we keep the latest one.
Why wrong: Latest weights without context cannot support comparison, debugging, or promotion.

Q: What must every serious run log?
A: At minimum log code commit, dataset version, features, hyperparameters, environment, metrics, artifacts, and owner. Those fields form a usable chain of evidence.
Common wrong answer to avoid: Only final accuracy and the model path matter.
Why wrong: Accuracy alone cannot explain how the result was produced or whether it is repeatable.

Q: How would you compare MLflow and W&B briefly?
A: MLflow is a solid open workflow for tracking, artifacts, and registry needs. W&B is strong for collaboration, reporting, and polished experiment comparison.
Common wrong answer to avoid: One tool is objectively best for every team.
Why wrong: Tool fit depends on team workflow, platform ownership, and integration needs.

Q: Why log failed runs if they will never ship?
A: Failed runs capture negative knowledge, reveal weak assumptions, and prevent repeated mistakes. They also help sharpen future quality gate logic.
Common wrong answer to avoid: Failed runs just create clutter.
Why wrong: Clean-looking history without failures often produces expensive amnesia.
Apply now (5 min)¶
Exercise: Pick one recent training attempt from memory. Write the minimum run record it should have had. Include code, data, features, metrics, artifacts, and owner. Then mark what you would be unable to reconstruct today.
Sketch from memory: Draw the assembly line feeding from tracked runs. Add the quality gate in front of promotion. Add the warehouse after approval. Circle the metadata field your current team forgets most often.
Bridge. Tracking runs is not enough. Approved models need a formal home and controlled promotion path. → 03-model-registry.md