05. CI/CD for ML: make retraining boring on purpose
~14 min read. Good ML teams ship models like a calm factory shift.
Built on the ELI5 in 00-eli5.md. The assembly line — the repeatable path from raw parts to finished goods — is how ML changes move safely into production.
Why software CI/CD alone misses the ML problem
See, normal software CI/CD assumes code is the main moving part. That assumption breaks for ML systems.
Model behavior also changes when data changes. Feature logic and labels change behavior too.
Even without new code, a new data slice can move precision, recall, and calibration.
So one clean pull request is not enough evidence for safe promotion.
The assembly line for ML must watch more than source code.
It must watch data contracts, feature generation, training settings, and evaluation evidence.
It must also watch whether the candidate is eligible for production traffic.
Look at the picture first.
source change / data change / feature change
     │
     ▼
┌──────────┐  ┌──────────┐  ┌────────────┐
│ validate │─→│  train   │─→│  evaluate  │
└────┬─────┘  └────┬─────┘  └─────┬──────┘
     │             │              │
     │             ▼              ▼
     │      model artifact  quality report
     │                            │
     │                            ▼
     │                      ┌────────────┐
     └─────────────────────→│quality gate│
                            └─────┬──────┘
                                  │
                         approve  │  hold / reject
                                  ▼
                             ┌──────────┐
                             │ registry │
                             └────┬─────┘
                                  ▼
                               deploy
Simple, no?
If your CI says, "tests passed," that still proves very little about model behavior.
A passing unit test does not prove recall stayed healthy on critical slices.
A green Docker build does not prove latency or cost stayed controlled.
So what to do?
Make the pipeline responsible for model evidence, not only packaging and release automation.
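One piece of that evidence is recall per slice, not only overall recall. Here is a minimal sketch, assuming predictions arrive as `(slice, actual label, predicted label)` triples. The function name `recall_by_slice`, the slice names, and the sample records are all made up for illustration.

```python
from collections import defaultdict


def recall_by_slice(records):
    """Return {slice_name: recall} from (slice, y_true, y_pred) triples.

    Recall = true positives / actual positives, computed per slice.
    Slices with no actual positives are omitted instead of reported as 0.
    """
    true_pos = defaultdict(int)
    actual_pos = defaultdict(int)
    for slice_name, y_true, y_pred in records:
        if y_true == 1:
            actual_pos[slice_name] += 1
            if y_pred == 1:
                true_pos[slice_name] += 1
    return {name: true_pos[name] / actual_pos[name] for name in actual_pos}


# Hypothetical labeled predictions: (slice, actual label, predicted label).
records = [
    ("standard", 1, 1), ("standard", 1, 1), ("standard", 1, 1), ("standard", 1, 1),
    ("standard", 0, 0),
    ("high_value", 1, 1), ("high_value", 1, 0),
]

per_slice = recall_by_slice(records)
# Reuse the same function for the overall number by putting every row in one slice.
overall = recall_by_slice([("all", t, p) for _, t, p in records])["all"]
print(per_slice, round(overall, 3))
```

In this toy data the overall recall looks acceptable while the `high_value` slice catches only half of its positives. A unit test on the code would never show that.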
That is why ML CI/CD looks heavier. It is operational honesty, not bureaucracy.
What the assembly line actually handles end to end
The ML pipeline usually starts with a trigger. That trigger may be code, fresh data, a schedule, or a manual promotion request.
Before expensive work starts, the pipeline should validate inputs and assumptions.
Are schemas still compatible? Are feature definitions versioned correctly and available everywhere?
Did training data land fully? Are labels fresh enough for this run?
Only after that should training begin. Otherwise you burn compute on broken inputs.
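The pre-training checks above can be sketched as one small function. This is an illustration under assumptions, not a real validation library: the name `validate_inputs`, the schema-as-dict representation, the row-count floor, and the label-age limit are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def validate_inputs(expected_schema, actual_schema, row_count, min_rows,
                    newest_label_at, max_label_age, now):
    """Return a list of problems; an empty list means it is safe to start training."""
    problems = []
    # Schema compatibility: every expected column must exist with the same type.
    for column, dtype in expected_schema.items():
        if column not in actual_schema:
            problems.append(f"missing column: {column}")
        elif actual_schema[column] != dtype:
            problems.append(f"type changed for {column}: {dtype} -> {actual_schema[column]}")
    # Completeness: did the training data land fully?
    if row_count < min_rows:
        problems.append(f"incomplete data: {row_count} rows, expected at least {min_rows}")
    # Freshness: are the newest labels recent enough for this run?
    if now - newest_label_at > max_label_age:
        problems.append("stale labels")
    return problems


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
problems = validate_inputs(
    expected_schema={"amount": "float", "merchant_id": "str"},
    actual_schema={"amount": "float"},
    row_count=900,
    min_rows=1000,
    newest_label_at=datetime(2024, 1, 2, tzinfo=timezone.utc),
    max_label_age=timedelta(days=3),
    now=now,
)
for problem in problems:
    print("BLOCKED:", problem)
```

Returning every problem at once, instead of stopping at the first, gives the on-call person the full picture before any compute is spent.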
The pipeline then trains a candidate model and saves metadata beside the artifact.
Which commit produced it? Which feature set fed it? Which data window trained it?
Which hyperparameters were used? Without this metadata, debugging becomes guesswork.
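Those four questions map naturally onto a small record saved next to the artifact. A minimal sketch, assuming a JSON sidecar; the class name `RunMetadata`, the field names, and the sample values are illustrative, not a standard format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunMetadata:
    git_commit: str           # which commit produced the model
    feature_set_version: str  # which feature set fed it
    data_window_start: str    # which data window trained it
    data_window_end: str
    hyperparameters: dict     # which hyperparameters were used


def metadata_json(meta):
    """Serialize with sorted keys so two runs can be diffed line by line."""
    return json.dumps(asdict(meta), sort_keys=True, indent=2)


# Hypothetical values for one training run.
meta = RunMetadata(
    git_commit="abc1234",
    feature_set_version="fraud-features-v7",
    data_window_start="2024-01-01",
    data_window_end="2024-01-31",
    hyperparameters={"learning_rate": 0.05, "max_depth": 6},
)
print(metadata_json(meta))
```

In a real pipeline this string would be written beside the model file, so the artifact and its lineage travel together.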
After training comes evaluation. Now the assembly line stops being a build system and becomes a decision system.
The candidate may be compared against fixed thresholds, the current champion, or both.
Then comes the quality gate. If evidence is strong, the artifact enters the registry.
If not, the run is held or rejected. Only approved artifacts should be deployable.
Here is the lane in one compact view.
commit or schedule
   │
   ▼
validate inputs
   │
   ▼
train candidate
   │
   ▼
evaluate metrics
   │
   ▼
quality gate decision
   ├── reject
   ├── hold for review
   └── approve
         │
         ▼
   push to registry
         │
         ▼
   deploy to serving stack
Yes?
Notice one important thing. Training and deployment are different jobs.
Many teams combine them too early. That creates fear and hidden coupling.
A bad training run should never become production traffic by accident.
The assembly line must make those steps explicit so rollback stays simple.
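One way to make the separation explicit is to let deployment read only from the registry, and let the registry accept only approved candidates. The sketch below uses an in-memory stand-in; `Registry`, `train`, and `deploy` are hypothetical names, not the API of any real registry product.

```python
class Registry:
    """In-memory stand-in for a model registry that stores approved artifacts only."""

    def __init__(self):
        self._approved = set()

    def push(self, model_id, gate_decision):
        # The gate decision travels with the push, so nothing enters unapproved.
        if gate_decision != "approve":
            raise ValueError(f"{model_id} was not approved (decision: {gate_decision})")
        self._approved.add(model_id)

    def __contains__(self, model_id):
        return model_id in self._approved


def train(run_id):
    """Training job: produces a candidate identifier and nothing else."""
    return f"model-{run_id}"


def deploy(registry, model_id):
    """Deployment job: refuses anything that is not in the registry."""
    if model_id not in registry:
        raise PermissionError(f"{model_id} is not in the registry")
    return f"serving {model_id}"


registry = Registry()
good, bad = train("001"), train("002")
registry.push(good, "approve")   # passed the gate
print(deploy(registry, good))    # allowed
try:
    deploy(registry, bad)        # trained, but never approved
except PermissionError as err:
    print("blocked:", err)
```

Because `train` never touches the registry, a successful training run cannot reach traffic on its own.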
A tiny numerical example of why ML CI/CD needs evaluation
Suppose yesterday's fraud model had these results. Overall recall was 0.91, high-value transaction recall was 0.88, and latency p95 was 48 ms.
Now a new training run finishes. Overall recall becomes 0.92, which looks good at first glance.
But high-value transaction recall falls to 0.74. Latency p95 rises to 71 ms.
If software CI only checks that the training job succeeded, it ships this regression with full confidence.
If the ML pipeline checks real evidence, it blocks the regression before customers see it.
Look at the compact table.
| metric | champion | candidate | decision |
| --- | --- | --- | --- |
| overall recall | 0.91 | 0.92 | okay |
| high-value txn recall | 0.88 | 0.74 | fail |
| latency p95 | 48 ms | 71 ms | fail |
So what to do?
Teach the pipeline which failures are unacceptable, and encode that as policy.
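Here is one way such a policy could look in code, using the numbers from the fraud example. The thresholds are assumptions made for this sketch: the text does not state them, so the 0.85 recall floor and the 60 ms latency ceiling are hypothetical, as are the names `POLICY` and `evaluate_gate`.

```python
CHAMPION = {"overall_recall": 0.91, "high_value_recall": 0.88, "latency_p95_ms": 48}
CANDIDATE = {"overall_recall": 0.92, "high_value_recall": 0.74, "latency_p95_ms": 71}

# Hypothetical policy values; a real team would choose and review its own.
POLICY = {
    "high_value_recall_floor": 0.85,
    "latency_p95_ceiling_ms": 60,
}


def evaluate_gate(champion, candidate, policy):
    """Return (decision, per-metric checks). Any failed check rejects the candidate."""
    checks = {
        # Relative check: the candidate must not be worse than the champion overall.
        "overall_recall": candidate["overall_recall"] >= champion["overall_recall"],
        # Absolute checks: fixed limits on the critical slice and on latency.
        "high_value_recall": candidate["high_value_recall"] >= policy["high_value_recall_floor"],
        "latency_p95_ms": candidate["latency_p95_ms"] <= policy["latency_p95_ceiling_ms"],
    }
    decision = "approve" if all(checks.values()) else "reject"
    return decision, checks


decision, checks = evaluate_gate(CHAMPION, CANDIDATE, POLICY)
print(decision, checks)
```

Under these assumed limits the candidate is rejected even though its overall recall improved, which matches the decision column in the example.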
The model build is not the finish line. The decision after evaluation is the finish line.
This is why software CI/CD alone is not enough for ML delivery.
You are shipping behavior, not only binaries.
Make retraining boring, then choose orchestration carefully
Training pipelines are meant to feel boring. That is praise.
When retraining feels dramatic, your process is weak and your team becomes superstitious.
A boring run means inputs were versioned, checks ran automatically, and failures were obvious.
It means artifacts were captured cleanly and roll-forward or rollback choices stayed simple.
See how healthy that sounds.
You do not want heroics every Tuesday. You want a calm assembly line that repeats safely.
Now which tools help? GitHub Actions works well for repo-native automation and smaller platform needs.
Airflow helps when DAG scheduling, backfills, and data dependency timing dominate the workflow.
Kubeflow fits teams running Kubernetes-heavy ML platforms with containerized steps and platform engineering support.
SageMaker Pipelines fits AWS-centric teams wanting managed training, evaluation, packaging, and promotion flow.
None of these tools magically gives good judgment. They give orchestration structure.
Your checks still matter more than the brand name. Simple, no?
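To see what "orchestration structure" means without picking a vendor, here is a tool-agnostic sketch that only resolves step order from declared dependencies. It uses Python's standard `graphlib` module (Python 3.9 or newer). The step names mirror the assembly-line picture; the `PIPELINE` dict is an illustrative assumption, not the configuration format of Airflow, Kubeflow, or any other tool.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it can start.
PIPELINE = {
    "validate": set(),
    "train": {"validate"},
    "evaluate": {"train"},
    "quality_gate": {"validate", "evaluate"},
    "registry": {"quality_gate"},
    "deploy": {"registry"},
}

# static_order() yields the steps so that every dependency comes first.
order = list(TopologicalSorter(PIPELINE).static_order())
print(" -> ".join(order))
```

An orchestrator adds scheduling, retries, and logs on top of this ordering. What each step checks is still your decision.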
The hardest question is automated retraining. Sometimes it is safe, and sometimes it is reckless.
Safe means the task is stable, labels arrive reliably, evaluation is trusted, and rollback is easy.
Reckless means data is noisy, ground truth arrives late, or a silent failure could harm users.
For example, daily retraining for ad ranking may be normal. Daily retraining for medical risk scoring may be careless.
Even in safe cases, promotion should not be blind. Let retraining be automatic, and let promotion stay conditional.
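"Automatic retraining, conditional promotion" can be sketched as a loop that always trains but promotes only through a gate. The function `scheduled_retrain` and its four callables are hypothetical placeholders for real jobs, and the 0.85 floor in the usage below is an assumed threshold.

```python
def scheduled_retrain(train_fn, evaluate_fn, gate_fn, promote_fn):
    """Always retrain; promote only when the gate accepts the evaluation report."""
    candidate = train_fn()
    report = evaluate_fn(candidate)
    if gate_fn(report):
        promote_fn(candidate)
        return "promoted"
    return "held"


def gate(report):
    # Assumed floor on the critical slice, for illustration only.
    return report["high_value_recall"] >= 0.85


promoted = []
outcome_good = scheduled_retrain(
    train_fn=lambda: "candidate-a",
    evaluate_fn=lambda model: {"high_value_recall": 0.90},
    gate_fn=gate,
    promote_fn=promoted.append,
)
outcome_bad = scheduled_retrain(
    train_fn=lambda: "candidate-b",
    evaluate_fn=lambda model: {"high_value_recall": 0.74},
    gate_fn=gate,
    promote_fn=promoted.append,
)
print(outcome_good, outcome_bad, promoted)
```

Both runs trained on schedule, but only the first one reached promotion. The schedule controls when you train; the gate controls what ships.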
That is mature practice. Yes?
Where this lives in the wild
- Netflix recommendations — ML platform engineer: runs scheduled training and evaluation pipelines before candidate recommenders reach member traffic.
- Uber Eats ranking — machine learning engineer: combines feature updates, retraining jobs, and guarded rollout steps for marketplace models.
- Stripe Radar — risk platform engineer: treats data drift and threshold checks as part of promotion, not only code tests.
- Duolingo personalization — MLOps engineer: uses orchestrated training jobs and approval logic before new learner models go live.
- Amazon demand forecasting — applied scientist: relies on repeatable retraining workflows so forecast refreshes do not become manual fire drills.
Pause and recall
- Why is normal software CI/CD insufficient for ML systems?
- What stages sit between training a model and safely deploying it?
- Why is "trained" different from "trusted"?
- When is automated retraining useful, and when is it reckless?
Interview Q&A
Q: Why can a passing software CI pipeline still hide an ML regression?
A: Software CI mainly proves the code builds and tests pass. ML behavior also depends on data, features, and evaluation results, so the model can regress even when the application package is healthy.
Common wrong answer to avoid: "Because CI tools cannot run Python." The issue is missing model evidence, not language support.
Q: Why should training and deployment be separate stages in ML delivery?
A: Training creates a candidate artifact, while deployment should happen only after evaluation and approval. Separating them prevents every successful training run from becoming production traffic automatically.
Common wrong answer to avoid: "Because deployment always needs a human." Humans are optional; explicit evidence gates are not.
Q: When is automated retraining a good idea?
A: It works when labels are reliable, evaluation is trusted, failure is reversible, and business harm from silent drift is bounded. In that setting, automation reduces toil without hiding too much risk.
Common wrong answer to avoid: "Always schedule retraining daily." Frequency alone says nothing about safety.
Q: How do orchestration tools like Airflow or GitHub Actions help ML CI/CD?
A: They coordinate repeatable steps, dependencies, triggers, and artifact flow across training and evaluation. They provide structure, but your quality criteria still decide whether the system is safe.
Common wrong answer to avoid: "The tool itself guarantees model quality." Orchestration is plumbing, not judgment.
Apply now (5 min)
Exercise. Take one model you know and write a six-step pipeline for it. Include trigger, validation, training, evaluation, gate, and deployment.
Then list three checks that software CI would miss.
Sketch from memory. Draw the assembly line from source change to deploy.
Label where training ends, where the quality gate decides, and where the registry stores approved artifacts.
Bridge. A fast pipeline is useful only when it can stop bad candidates. Next we study the quality gate that keeps the assembly line honest. → 06-quality-gates.md