01. ML lifecycle overview

⏱️ Estimated time: 18 min | Level: intermediate

ELI5 callback: In our restaurant chain, the kitchen trains models, the prep station prepares features, the recipe book stores model versions, the serving counter serves predictions, and the quality inspector checks results after launch. Same restaurant chain, different platform layer.

Start with the loop, not the tool

Start with the business question, not the benchmark. Ask which decision improves when the model becomes available. Map the user moment, the data source, and the failure cost. See. A lifecycle is a loop, not a one-way project plan. The kitchen matters only after the use case is sharp. The prep station exists because raw columns rarely help directly. The recipe book prevents model files from becoming mystery objects. The serving counter turns scores into a product experience. The quality inspector closes the loop after launch. So what to do? Define owners for each stage early. One team can own many stages, but ownership must stay visible. Without that clarity, handoffs become the real bottleneck. Simple, no? The loop is technical and operational together.

┌──────────┐   ┌──────────┐   ┌──────────┐
│ Problem  │→→│  Data    │→→│ Training │
└──────────┘   └──────────┘   └──────────┘
       ↑                         ↓
┌──────────┐   ┌──────────┐   ┌──────────┐
│ Monitor  │←←│ Deploy   │←←│ Evaluate │
└──────────┘   └──────────┘   └──────────┘

Good platforms compress this loop safely.
Weak platforms make every iteration manual.
Fast iteration beats heroic one-time training.
Document entry and exit criteria for each stage.
Keep cost, latency, and quality visible from day one.
Now watch. Every next file zooms into one box.
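
The advice on owners and entry/exit criteria can be written down as data. The sketch below is one possible shape, not a platform API: the `Stage` class, the team names, and the criteria strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    name: str
    owner: Optional[str]    # accountable team (hypothetical names below)
    entry_criteria: str     # what must be true before the stage starts
    exit_criteria: str      # what must be true before handing off


# The six boxes from the diagram, in loop order. Owners and criteria are made up.
LIFECYCLE = [
    Stage("problem", "product", "decision identified", "target, action, cost written"),
    Stage("data", "data-eng", "sources listed", "data contract trusted"),
    Stage("training", "ml-eng", "training set frozen", "run tracked and repeatable"),
    Stage("evaluate", "ml-eng", "candidate registered", "beats baseline on slices"),
    Stage("deploy", None, "readiness review passed", "rollback tested"),
    Stage("monitor", "sre", "alerts tied to owners", "feedback logged"),
]


def next_stage(name: str) -> str:
    """Return the stage after `name`; monitor wraps back to problem."""
    names = [s.name for s in LIFECYCLE]
    return names[(names.index(name) + 1) % len(names)]


def unowned(stages: List[Stage]) -> List[str]:
    """Stages with no visible owner: likely handoff bottlenecks."""
    return [s.name for s in stages if not s.owner]


print(next_stage("monitor"))   # problem
print(unowned(LIFECYCLE))      # ['deploy']
```

The wrap-around in `next_stage` is the whole point: monitoring hands evidence back to problem framing, so the list behaves as a loop, not a pipeline.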

Frame the problem before touching data

A recommendation model and a fraud model need different framing. Recommendation optimizes delight, engagement, and freshness together. Fraud optimization must price false positives very carefully. Write the prediction target in one plain sentence. Write the action taken after prediction in one plain sentence. Write the cost of a wrong prediction in one plain sentence. See. These three sentences remove half the confusion. Now separate offline proxy metrics from business outcomes. AUC may rise while conversion falls. It happens. So what to do? Keep a business metric beside every model metric. Also define freshness needs early. A daily batch predictor and a millisecond scorer are different systems. Do not promise real time when batch is enough.

┌───────────────┐
│ User decision │
└──────┬────────┘
       │ needs
       v
┌───────────────┐
│ Prediction job│
└───────────────┘

Clarify objective, action, and cost together.
Pick one primary metric and a few guardrails.
Define latency and freshness before architecture.
Reject use cases with unclear decision hooks.
Good framing saves more money than bigger models.
See. Problem framing is architecture work too.
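
The three plain sentences, the metric choice, and the latency and freshness needs fit in one small record. This is a minimal sketch under assumptions: the `ProblemFrame` class, its field names, and the fraud example values are illustrative, not taken from a real system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ProblemFrame:
    target: str             # what is predicted, one plain sentence
    action: str             # what happens after the prediction
    wrong_cost: str         # what a wrong prediction costs
    primary_metric: str     # the one metric the model must move
    guardrails: List[str] = field(default_factory=list)
    max_latency_ms: int = 0     # 0 means "not decided yet"
    freshness: str = ""         # e.g. "daily batch" or "per request"

    def gaps(self) -> List[str]:
        """List what is still missing before architecture work should start."""
        text_fields = ("target", "action", "wrong_cost", "primary_metric", "freshness")
        missing = [name for name in text_fields if not getattr(self, name).strip()]
        if not self.guardrails:
            missing.append("guardrails")
        if self.max_latency_ms <= 0:
            missing.append("max_latency_ms")
        return missing


# Hypothetical, fully framed use case.
fraud = ProblemFrame(
    target="Will this card payment be charged back?",
    action="Hold payments above the risk threshold for manual review.",
    wrong_cost="A false positive blocks a good customer; a false negative loses the payment.",
    primary_metric="net fraud loss per 1,000 payments",
    guardrails=["false-positive rate", "manual review queue size"],
    max_latency_ms=150,
    freshness="per request",
)

# Hypothetical use case with no decision hook yet.
vague = ProblemFrame(target="Predict churn", action="", wrong_cost="", primary_metric="AUC")

print(fraud.gaps())   # []
print(vague.gaps())   # ['action', 'wrong_cost', 'freshness', 'guardrails', 'max_latency_ms']
```

A non-empty `gaps()` list is a cheap way to reject a use case with an unclear decision hook before anyone touches data.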

Data, features, and training must align

Training data should mirror production reality as closely as possible. Label generation deserves the same care as model code. Leaked labels create fake confidence and painful rollbacks. Feature definitions must be stable across time windows. Point-in-time correctness matters whenever labels lag events. Now watch. The same SQL can be right and still be wrong. If you join future values into past examples, evaluation becomes fiction. Training starts only after data contracts are trusted. Then experiments compare architectures, parameters, and sampling choices. Distributed training helps when models or datasets become large. Experiment tracking helps when humans forget run details. Both are infrastructure, not luxury add-ons. See. Reproducibility begins before the first GPU spin-up.
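
Point-in-time correctness is easiest to see on a toy join. The sketch below uses only the standard library and made-up integer timestamps. The helper names `as_of` and `leaky` are hypothetical. On real tables the same idea is usually done with an as-of join in the warehouse or dataframe library.

```python
from bisect import bisect_right

# Hypothetical feature history for one user: (timestamp, value), sorted by time.
feature_history = [(1, 2), (5, 3), (9, 7)]

# Label events: (event_time, label). The label arrives later, but the
# features must be the ones that were visible at event_time.
label_events = [(4, 0), (6, 1), (0, 0)]


def as_of(history, t):
    """Latest feature value with timestamp <= t, or None if none existed yet."""
    times = [ts for ts, _ in history]
    i = bisect_right(times, t)
    return history[i - 1][1] if i else None


def leaky(history, t):
    """Wrong on purpose: always takes the newest value, even from the future."""
    return history[-1][1]


correct = [(t, as_of(feature_history, t), y) for t, y in label_events]
wrong = [(t, leaky(feature_history, t), y) for t, y in label_events]

print(correct)  # [(4, 2, 0), (6, 3, 1), (0, None, 0)]
print(wrong)    # [(4, 7, 0), (6, 7, 1), (0, 7, 0)]
```

Both joins return a value for every row, so nothing crashes. Only the first one matches what the model would have seen at prediction time.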

raw events → clean tables → feature views
      │             │             │
      └─────────────┼─────────────┘
                    v
               training set
                    │
                    v

Treat labels as products, not leftovers.
Keep feature definitions versioned and reviewable.
Track code, data snapshot, and hyperparameters together.
Prefer repeatable pipelines over notebook heroics.
Training quality depends on upstream discipline.
Simple, no? Garbage joins defeat fancy models every time.
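
Tracking code, data snapshot, and hyperparameters together can start as one fingerprint per run. This is a sketch, not an experiment tracker: `run_fingerprint` and the example commit and snapshot names are assumptions for illustration.

```python
import hashlib
import json


def run_fingerprint(code_version: str, data_snapshot: str, hyperparams: dict) -> str:
    """Short stable ID derived from everything needed to repeat a run."""
    record = {
        "code_version": code_version,      # e.g. a git commit hash
        "data_snapshot": data_snapshot,    # e.g. a dated table or file version
        "hyperparams": hyperparams,
    }
    # sort_keys makes the JSON text independent of dict insertion order.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


a = run_fingerprint("3f2a1c9", "events_2024_05_01", {"lr": 0.01, "depth": 6})
b = run_fingerprint("3f2a1c9", "events_2024_05_01", {"depth": 6, "lr": 0.01})
c = run_fingerprint("3f2a1c9", "events_2024_05_02", {"lr": 0.01, "depth": 6})

print(a == b)  # True: same inputs, different key order
print(a == c)  # False: the data snapshot changed
```

Two runs with the same fingerprint should be comparable. Two runs with different fingerprints changed at least one of code, data, or parameters.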

Evaluation decides whether deployment is justified

Offline evaluation is the first gate, not the final truth. Use holdout sets, slices, and calibration checks. Slice metrics by region, cohort, device, or language. See. Average accuracy can hide broken minority segments. Compare the candidate against the current production baseline. Do not compare only against a paper result. Then ask whether the metric change matters economically. A tiny lift may not repay extra serving cost. A modest lift may still matter in huge traffic systems. So what to do? Bring finance thinking into evaluation. Also review robustness under missing fields and delayed events. The model should fail softly, not theatrically. Now watch. Promotion decisions need human and automated checks.
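
A few lines show how an average can hide a broken segment. The rows below are invented for illustration, and `accuracy_by_slice` is a hypothetical helper, not a library function.

```python
from collections import defaultdict

# Hypothetical evaluation rows: (region, true_label, predicted_label).
rows = (
    [("north", 1, 1)] * 90 + [("north", 1, 0)] * 5
    + [("south", 1, 1)] * 2 + [("south", 1, 0)] * 3
)


def accuracy(items):
    """Share of rows where the prediction matches the label."""
    return sum(1 for _, y, p in items if y == p) / len(items)


def accuracy_by_slice(items):
    """Accuracy per region, rounded for display."""
    groups = defaultdict(list)
    for row in items:
        groups[row[0]].append(row)
    return {key: round(accuracy(group), 3) for key, group in groups.items()}


print(round(accuracy(rows), 3))   # 0.92
print(accuracy_by_slice(rows))    # {'north': 0.947, 'south': 0.4}
```

Overall accuracy looks healthy at 0.92, yet the small "south" slice is right only 2 times out of 5.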

candidate model
      │
      ├── offline metrics
      ├── slice checks
      ├── cost estimate
      └── readiness review
            ↓

Evaluate quality, cost, fairness, and robustness together.
Keep one baseline model always available for comparison.
Review bad examples, not only aggregate charts.
Promote only when gains survive practical scrutiny.
Good evaluation protects production trust.
See. Better no launch than a blind launch.
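
The finance question can be turned into an automated gate that sits before the human review. The sketch below is a simplified model under stated assumptions: value scales linearly with conversions, serving cost scales linearly with requests, and every number and parameter name is hypothetical.

```python
def promotion_decision(
    baseline_rate: float,          # conversion rate of the production model
    candidate_rate: float,         # conversion rate estimated for the candidate
    requests_per_month: int,
    value_per_conversion: float,
    extra_cost_per_1k: float,      # added serving cost per 1,000 requests
    worst_slice_drop: float,       # largest metric drop on any slice (>= 0)
    max_slice_drop: float = 0.02,  # assumed tolerance, tune per use case
) -> dict:
    """Automated half of the gate; a human readiness review still follows."""
    extra_value = (candidate_rate - baseline_rate) * requests_per_month * value_per_conversion
    extra_cost = requests_per_month / 1000 * extra_cost_per_1k
    net = extra_value - extra_cost
    promote = net > 0 and worst_slice_drop <= max_slice_drop
    return {"net_per_month": round(net, 2), "promote": promote}


# Tiny lift: the extra serving cost is larger than the gain.
small = promotion_decision(0.050, 0.0505, 200_000, 4.0, 2.50, 0.01)
# Modest lift at huge traffic: the gain clearly repays the cost.
big = promotion_decision(0.050, 0.052, 50_000_000, 4.0, 2.50, 0.01)
# Same lift, but one slice degrades badly: blocked despite positive net value.
risky = promotion_decision(0.050, 0.052, 50_000_000, 4.0, 2.50, 0.08)

print(small["promote"])  # False
print(big["promote"])    # True
print(risky["promote"])  # False
```

The gate compares against the production baseline, not a paper result, and it refuses a profitable candidate that breaks a slice.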

Deployment and monitoring complete the loop

Deployment means packaging the model with its runtime contract. The contract includes inputs, outputs, latency budget, and owner. Batch deployment and online deployment need different runbooks. Online serving needs rollback plans before traffic starts. Monitoring begins on day zero, not after incident one. Track request volume, latency, cost, input drift, and business outcome. See. Monitoring is not only dashboards; it is decision support. Alerts should trigger clear actions, not panic. Some alerts page humans. Some create retraining jobs. So what to do? Tie every alert to an owner and playbook. Close the loop by learning from production feedback. That feedback updates data, features, thresholds, or model choice. Now the lifecycle restarts with better evidence.
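
The runtime contract can live next to the model as a small checked object. This is a minimal sketch: the `ServingContract` class, the ETA model name, and the 50 ms budget are assumptions, not a serving framework's API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ServingContract:
    model_name: str
    owner: str                   # team responsible for this endpoint
    inputs: Dict[str, type]      # required field name -> expected type
    output_name: str
    latency_budget_ms: int

    def violations(self, request: dict, latency_ms: float) -> List[str]:
        """Return human-readable contract violations for one request."""
        problems = []
        for name, expected in self.inputs.items():
            if name not in request:
                problems.append(f"missing input: {name}")
            elif not isinstance(request[name], expected):
                problems.append(f"wrong type for {name}")
        if latency_ms > self.latency_budget_ms:
            problems.append("latency budget exceeded")
        return problems


# Hypothetical contract for an ETA model.
contract = ServingContract(
    model_name="eta-model",
    owner="logistics-ml",
    inputs={"distance_km": float, "hour_of_day": int},
    output_name="eta_minutes",
    latency_budget_ms=50,
)

ok = contract.violations({"distance_km": 3.2, "hour_of_day": 18}, latency_ms=20.0)
bad = contract.violations({"distance_km": "3.2"}, latency_ms=80.0)

print(ok)   # []
print(bad)  # ['wrong type for distance_km', 'missing input: hour_of_day', 'latency budget exceeded']
```

Logging these violations per request gives monitoring something concrete to count from day zero.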

deploy → serve → observe → decide
   ↑                        │
   └──── retrain / retune ──┘
            │
            v
        next release
            │

Deployment without monitoring is unfinished work.
Monitoring without action is expensive decoration.
Retraining is a business decision, not a ritual.
Keep rollback fast and boring.
Mature teams treat the loop as one platform.
Simple, no? Production is where the syllabus becomes real.
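
Tying every alert to an owner, a playbook, and an action can be expressed as a rule table. The sketch below is illustrative only: the metric names, thresholds, team names, and playbook paths are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AlertRule:
    metric: str
    breached: Callable[[float], bool]   # returns True when the value is bad
    action: str                         # "page" a human or open a "retrain" job
    owner: str
    playbook: str                       # where the response steps live


RULES = [
    AlertRule("p95_latency_ms", lambda v: v > 200, "page", "serving-oncall", "playbooks/latency"),
    AlertRule("input_drift_score", lambda v: v > 0.25, "retrain", "ml-eng", "playbooks/drift"),
    AlertRule("conversion_rate", lambda v: v < 0.03, "page", "product-analytics", "playbooks/outcome"),
]


def evaluate(metrics: Dict[str, float], rules: List[AlertRule]) -> List[str]:
    """Return one routed action per breached rule; missing metrics are skipped."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and rule.breached(value):
            fired.append(f"{rule.action}:{rule.metric} -> {rule.owner} ({rule.playbook})")
    return fired


snapshot = {"p95_latency_ms": 120, "input_drift_score": 0.4, "conversion_rate": 0.05}
print(evaluate(snapshot, RULES))
# ['retrain:input_drift_score -> ml-eng (playbooks/drift)']
```

A rule with no owner or playbook cannot be written in this table, which is the intent: no alert without a clear next step.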

Where this lives in the wild

A marketplace team scores listings, watches drift, and retrains ranking models weekly.
A bank frames fraud prediction with strict false-positive budgets and rollback rules.
A logistics platform updates ETA models as road patterns shift across cities.
A healthcare workflow team separates offline validation from clinical rollout approvals.
A consumer app team ties recommender metrics directly to retention and cost dashboards.

Pause and recall

Why must problem framing happen before model selection?
Why can offline metric gains still fail in production?
What makes point-in-time correctness so important for training data?
Why is monitoring part of the lifecycle instead of a postscript?

Interview Q&A

Q: Walk me through the ML lifecycle in production.
A: Start with problem framing, move through data and training, then evaluation, deployment, and monitoring. Explain how production feedback restarts the loop.
Common wrong answer to avoid: It is just data collection, model training, and deployment. Monitoring is optional later.

Q: Why is offline evaluation not enough?
A: Because offline data may miss production latency, traffic mix, user feedback loops, and real business costs. It is necessary but incomplete.
Common wrong answer to avoid: If test accuracy is high, production success is already proven.

Q: What is the biggest lifecycle mistake teams make?
A: They optimize model quality while leaving ownership, data contracts, and rollback paths vague. Platform gaps, not math gaps, cause the pain.
Common wrong answer to avoid: They choose the wrong neural network library.

Q: When should retraining happen?
A: Retrain when monitored evidence shows drift, degraded outcomes, or materially better data. Use a trigger policy, not superstition.
Common wrong answer to avoid: Retrain every day because more training is always safer.

Apply now (5 min)

Pick one prediction use case from your company or portfolio. Write the user decision, the primary metric, and one guardrail metric. Then draw the six-stage lifecycle on paper. Mark which stage currently has the weakest owner. Mark where rollback would fail today. Finally, name one metric your monitoring must page on. This takes five minutes and exposes platform gaps fast.

Bridge. Lifecycle mapped. The kitchen is the heart, so let's build training infra. → 02