03. Week 17 — Study Material
Theme
Lifecycle management, safe promotion, serving infrastructure, monitoring, and rollback. Use this file as the lookup-sheet companion to 02_explainer.md.
How to use this file
- Read 02_explainer.md first for the story.
- Use this file for tool comparisons, checklists, and interview phrasing.
- Revisit 04_daily_recall.md after each section.
1. Lifecycle management essentials
Experiment tracking checklist
Every serious run should log:
- commit SHA,
- dataset snapshot,
- feature version,
- hyperparameters,
- seeds,
- hardware / container image,
- metrics,
- artifact links,
- owner and approval status.
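A minimal sketch of enforcing the checklist above before a run is accepted. The field names and the `missing_run_fields` helper are hypothetical and do not come from any tracking tool's API; map them onto whatever your tracker actually stores.

```python
# Hypothetical field names mirroring the checklist; adapt them to your tracker.
REQUIRED_RUN_FIELDS = (
    "commit_sha",
    "dataset_snapshot",
    "feature_version",
    "hyperparameters",
    "seeds",
    "container_image",
    "metrics",
    "artifact_links",
    "owner",
    "approval_status",
)


def missing_run_fields(run: dict) -> list[str]:
    """Return checklist fields that are absent or empty in a run record."""
    return [f for f in REQUIRED_RUN_FIELDS if run.get(f) in (None, "", [], {})]


# Illustrative values only.
run = {
    "commit_sha": "abc1234",
    "dataset_snapshot": "snapshot-01",
    "hyperparameters": {"lr": 0.001},
    "seeds": [0],
    "metrics": {"auc": 0.9},
    "owner": "example-owner",
}
print(missing_run_fields(run))
```

A check like this fits naturally at the end of a training job: a non-empty result blocks registration instead of leaving a run with broken lineage.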
Registry stages
| Stage | Meaning | Action |
|---|---|---|
| Draft | Experiment exists, not trusted | Keep iterating |
| Staging | Candidate passed initial checks | Validate further |
| Production | Approved champion | Serve it |
| Archived | Retained for audit or rollback | Do not route new traffic |
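The stage table can be read as a small state machine. The sketch below encodes one possible promotion policy; the allowed transitions are an assumption for illustration, not the behavior of any particular registry product.

```python
from enum import Enum


class Stage(Enum):
    DRAFT = "Draft"
    STAGING = "Staging"
    PRODUCTION = "Production"
    ARCHIVED = "Archived"


# Assumed policy (not taken from a registry product): which moves are legal.
ALLOWED_TRANSITIONS = {
    Stage.DRAFT: {Stage.STAGING},
    Stage.STAGING: {Stage.PRODUCTION, Stage.DRAFT},
    Stage.PRODUCTION: {Stage.ARCHIVED},
    Stage.ARCHIVED: {Stage.PRODUCTION},  # rollback path
}


def can_transition(current: Stage, target: Stage) -> bool:
    """True if the assumed policy permits moving from current to target."""
    return target in ALLOWED_TRANSITIONS[current]


print(can_transition(Stage.DRAFT, Stage.PRODUCTION))     # Draft must pass Staging first
print(can_transition(Stage.ARCHIVED, Stage.PRODUCTION))  # rollback to an archived model
```

Keeping the policy in one table makes "no skipping Staging" a checkable rule rather than a team habit.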
Good model-card fields
- intended use,
- dataset summary,
- eval metrics and slices,
- failure modes,
- latency profile,
- cost profile,
- safety notes.
Cross-ref: see 02_explainer.md §2.2-§2.10.
2. Tool comparison — lifecycle stack
| Problem | Common tools | Notes |
|---|---|---|
| Run tracking | MLflow, W&B, Neptune | Habit matters more than vendor |
| Registry | MLflow Registry, SageMaker Model Registry, Vertex Model Registry | Promotion evidence is key |
| Artifact storage | S3, GCS, Azure Blob | Use immutable paths |
| Data versioning | DVC, LakeFS, Delta/Iceberg patterns | Git alone is not enough |
| Feature management | Feast, Tecton, managed cloud feature stores | Strongest for tabular ML |
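One way to approach the "use immutable paths" note in the table is to derive the storage key from the artifact's content hash, so that changed bytes can never reuse an old path. The key layout and the `immutable_artifact_key` helper are hypothetical; this sketch does not prevent overwrites by itself, which still depends on bucket policy.

```python
import hashlib


def immutable_artifact_key(model_name: str, payload: bytes) -> str:
    """Build a storage key that embeds the SHA-256 of the artifact bytes."""
    digest = hashlib.sha256(payload).hexdigest()
    # Hypothetical layout; any scheme works if the digest is part of the key.
    return f"models/{model_name}/sha256={digest}/model.bin"


key_a = immutable_artifact_key("churn", b"weights-v1")
key_b = immutable_artifact_key("churn", b"weights-v2")
print(key_a)
print(key_a != key_b)  # different bytes always map to different keys
```

Because the key is a pure function of the bytes, a registry entry that stores it also doubles as an integrity check at load time.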
Minimum viable setup for a small team
- GitHub + MLflow + S3 + DVC + Grafana.
- Add a feature store only when train-serve skew or feature reuse hurts.
- Prefer boring reliability over theoretical perfection.
Cross-ref: see 02_explainer.md §2.4-§2.12.
3. CI/CD for ML
Pipeline stages
code/data change
↓
validate data
↓
build features
↓
train
↓
evaluate
↓
promote or reject
↓
deploy safely
What belongs in the quality gate
- headline metric threshold,
- slice metrics,
- regression against champion,
- latency ceiling,
- cost guardrail,
- safety / policy checks,
- schema sanity.
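The gate items above can be sketched as one function that returns every failed check. All thresholds, field names, and the `EvalReport` type are hypothetical, and the headline metric is assumed to be higher-is-better.

```python
from dataclasses import dataclass


@dataclass
class EvalReport:
    headline_metric: float            # higher is better in this sketch
    slice_metrics: dict[str, float]
    p95_latency_ms: float
    cost_per_1k_requests: float
    safety_checks_passed: bool
    schema_valid: bool


# Hypothetical thresholds; real values come from your own requirements.
MIN_HEADLINE = 0.80
MIN_SLICE = 0.70
MAX_REGRESSION = 0.01
MAX_P95_LATENCY_MS = 300.0
MAX_COST_PER_1K = 2.00


def quality_gate(candidate: EvalReport, champion: EvalReport) -> list[str]:
    """Return the failed checks; an empty list means the candidate may be promoted."""
    failures = []
    if candidate.headline_metric < MIN_HEADLINE:
        failures.append("headline metric below threshold")
    weak = [s for s, v in candidate.slice_metrics.items() if v < MIN_SLICE]
    if weak:
        failures.append(f"weak slices: {sorted(weak)}")
    if candidate.headline_metric < champion.headline_metric - MAX_REGRESSION:
        failures.append("regression against champion")
    if candidate.p95_latency_ms > MAX_P95_LATENCY_MS:
        failures.append("latency ceiling exceeded")
    if candidate.cost_per_1k_requests > MAX_COST_PER_1K:
        failures.append("cost guardrail exceeded")
    if not candidate.safety_checks_passed:
        failures.append("safety / policy checks failed")
    if not candidate.schema_valid:
        failures.append("schema check failed")
    return failures


champion = EvalReport(0.85, {"new_users": 0.80}, 220.0, 1.50, True, True)
candidate = EvalReport(0.86, {"new_users": 0.65}, 240.0, 1.40, True, True)
print(quality_gate(candidate, champion))
```

In this example the candidate beats the champion on the headline metric but is still rejected because one slice falls below the floor, which is the reason slice metrics sit in the gate at all.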
Automation guidance
| Situation | Automate? | Why |
|---|---|---|
| Stable weekly retraining with clean labels | Yes, mostly | Predictable cadence |
| Noisy labels, high-risk domain | Partially | Human promotion still needed |
| Unlabeled drift only | No direct promotion | Detection is not proof of improvement |
Pipeline tool map
| Tool | When to use it |
|---|---|
| GitHub Actions | Small team, light orchestration |
| Airflow | Batch scheduling across many steps |
| Kubeflow Pipelines | K8s-heavy ML platform |
| SageMaker / Vertex Pipelines | Managed cloud-first org |
Cross-ref: see 02_explainer.md §3.1-§3.12.
4. Serving infrastructure quick-reference
Serving stack comparison
| Stack | Strongest point | Watch-out |
|---|---|---|
| vLLM | Throughput for open-weight LLMs | LLM-focused |
| TGI | Hugging Face ecosystem fit | Often lower throughput than vLLM; benchmark on your own workload |
| Triton | Multi-framework flexibility | More ops work |
| Managed endpoints | Less infrastructure burden | Higher cost, less control |
Performance levers
- dynamic or continuous batching,
- prompt / prefix caching,
- queue-aware autoscaling,
- route simple tasks to cheaper models,
- cap output length,
- choose GPU by utilization profile.
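For the last lever, a short worked calculation shows why utilization matters when choosing a GPU. The prices and throughputs below are made-up numbers, not vendor quotes, and `cost_per_million_tokens` is a hypothetical helper.

```python
def cost_per_million_tokens(
    hourly_price: float, peak_tokens_per_sec: float, utilization: float
) -> float:
    """Effective cost per 1M generated tokens at a given average utilization."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return hourly_price / tokens_per_hour * 1_000_000


# Hypothetical numbers: a large GPU that sits mostly idle vs. a small busy one.
big = cost_per_million_tokens(hourly_price=4.0, peak_tokens_per_sec=2000, utilization=0.25)
small = cost_per_million_tokens(hourly_price=1.0, peak_tokens_per_sec=500, utilization=0.80)
print(round(big, 2), round(small, 2))  # 2.22 0.69
```

At full utilization both hypothetical cards would cost the same per token, so the whole gap in this example comes from how busy each one is kept.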
Deployment strategies
| Strategy | What it gives |
|---|---|
| Shadow | Safe realism without user impact |
| Canary | Small live exposure |
| Blue-green | Instant cutover and rollback |
| Percentage rollout | Controlled confidence ramp |
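Canary and percentage rollouts both need a stable way to decide which users see the candidate. A minimal sketch using hash-based bucketing follows; the salt, the function names, and the `user-N` identifiers are illustrative assumptions.

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "rollout-v1") -> int:
    """Map a user to a stable bucket in [0, 100) via a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def pick_model(user_id: str, candidate_percent: int) -> str:
    """Send users whose bucket is below candidate_percent to the candidate."""
    if not 0 <= candidate_percent <= 100:
        raise ValueError("candidate_percent must be between 0 and 100")
    return "candidate" if rollout_bucket(user_id) < candidate_percent else "champion"


users = [f"user-{i}" for i in range(10_000)]
share = sum(pick_model(u, 10) == "candidate" for u in users) / len(users)
print(f"candidate share at 10%: {share:.3f}")
```

Because buckets are fixed per user, raising the percentage only adds users to the candidate group, and setting it to 0 returns everyone to the champion.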
Cross-ref: see 02_explainer.md §4.1-§4.12.
5. Monitoring, drift, and maintenance
Monitor four layers together
- System.
- Data.
- Model.
- Business.
Drift cheat sheet
| Drift | Signal | First question |
|---|---|---|
| Data drift | PSI (population stability index), schema shifts, null spikes | Did inputs change? |
| Model drift | Quality decline | Did predictions worsen? |
| Concept drift | Labels or business logic changed | Is the world different now? |
| Vendor drift | Same prompt, new answer | Did the provider change under us? |
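PSI, the data-drift signal in the table, compares binned proportions of a baseline sample against a recent one: PSI = Σ (actual_i − expected_i) · ln(actual_i / expected_i). The sketch below is a plain-Python version under stated assumptions: bin edges are supplied by the caller, and empty bins are floored at a small `eps` to avoid division by zero.

```python
import math


def psi(expected: list[float], actual: list[float], edges: list[float],
        eps: float = 1e-6) -> float:
    """Population Stability Index over bins defined by sorted inner edges."""

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # index of the bin v falls into
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]   # synthetic shift for illustration
edges = [0.25, 0.5, 0.75]
print(psi(baseline, baseline, edges))   # identical samples give 0.0
print(psi(baseline, shifted, edges) > 0)
```

A commonly quoted rule of thumb treats values above roughly 0.1 as worth a look and above roughly 0.25 as a significant shift, but those cutoffs are conventions, so calibrate them against your own history.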
LLM production metrics
- TTFT (time to first token) and total latency,
- tokens in/out,
- cache hit rate,
- cost per request,
- refusal and fallback rate,
- judged quality or user feedback.
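Most of the metrics above fall out of a per-request trace. This is a minimal sketch, assuming a hypothetical `RequestTrace` record and made-up per-1k-token prices; judged quality and user feedback need their own pipeline and are left out.

```python
from dataclasses import dataclass


@dataclass
class RequestTrace:
    sent_at: float          # seconds, e.g. from time.monotonic()
    first_token_at: float
    finished_at: float
    tokens_in: int
    tokens_out: int
    cache_hit: bool
    fell_back: bool


def summarize(traces: list[RequestTrace], price_in_per_1k: float,
              price_out_per_1k: float) -> dict[str, float]:
    """Aggregate latency, cache, fallback, and cost metrics over a batch of traces."""
    n = len(traces)
    cost = sum(
        t.tokens_in / 1000 * price_in_per_1k + t.tokens_out / 1000 * price_out_per_1k
        for t in traces
    )
    return {
        "mean_ttft_s": sum(t.first_token_at - t.sent_at for t in traces) / n,
        "mean_total_latency_s": sum(t.finished_at - t.sent_at for t in traces) / n,
        "cache_hit_rate": sum(t.cache_hit for t in traces) / n,
        "fallback_rate": sum(t.fell_back for t in traces) / n,
        "cost_per_request": cost / n,
    }


traces = [
    RequestTrace(0.0, 0.5, 2.0, 1000, 500, cache_hit=True, fell_back=False),
    RequestTrace(10.0, 10.25, 11.0, 3000, 1500, cache_hit=False, fell_back=False),
]
# Hypothetical prices per 1k tokens.
summary = summarize(traces, price_in_per_1k=0.5, price_out_per_1k=1.5)
print(summary)
```

Means are used here only to keep the sketch short; production dashboards usually track percentiles such as p95 for latency.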
Incident-response skeleton
- Detect.
- Scope.
- Mitigate.
- Roll back if needed.
- Verify recovery.
- Write postmortem.
Cross-ref: see 02_explainer.md §5.1-§5.13.
6. Interview frame and production vocabulary
Useful answer starters
- “I would start by restoring lineage before touching retraining.”
- “I treat the registry as a promotion control plane, not just storage.”
- “For serving, I would separate latency, throughput, and utilization decisions.”
- “My monitoring design includes system, data, model, and business signals together.”
- “Rollback has to cover weights, prompts, indexes, and routing rules.”
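The rollback point above is easier to defend with a concrete shape in mind. One possible sketch pins weights, prompts, indexes, and routing rules in a single immutable manifest, so that rolling back means switching manifests rather than reverting four things separately. Every name and value here is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReleaseManifest:
    release_id: str
    weights_uri: str
    prompt_version: str
    index_snapshot: str
    routing_rules_version: str


class ReleaseHistory:
    """Ordered list of deployed manifests; the last one is live."""

    def __init__(self) -> None:
        self._releases: list[ReleaseManifest] = []

    def deploy(self, manifest: ReleaseManifest) -> None:
        self._releases.append(manifest)

    @property
    def live(self) -> ReleaseManifest:
        return self._releases[-1]

    def rollback(self) -> ReleaseManifest:
        """Drop the live release and return the previous one, all four parts at once."""
        if len(self._releases) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._releases.pop()
        return self.live


history = ReleaseHistory()
r1 = ReleaseManifest("r1", "models/a/model.bin", "prompt-v1", "index-01", "routes-v1")
r2 = ReleaseManifest("r2", "models/b/model.bin", "prompt-v2", "index-02", "routes-v2")
history.deploy(r1)
history.deploy(r2)
restored = history.rollback()
print(restored)
```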
Specific tool-and-cost vocabulary to sound credible
- “MLflow + S3 is a practical small-team baseline.”
- “vLLM usually wins when throughput matters for open weights.”
- “Managed endpoints buy speed, but you pay for it in higher cost and less control.”
- “GPU economics depend more on utilization than sticker price.”
7. Health check
- [ ] I can explain the warehouse, the quality gate, and the production monitor.
- [ ] I can list the minimum run metadata from memory.
- [ ] I can describe when automated retraining is unsafe.
- [ ] I can compare shadow, canary, and blue-green clearly.
- [ ] I can move into 05_hands_on_lab.md without confusion.