
03. Week 17 — Study Material

Theme

Lifecycle management, safe promotion, serving infrastructure, monitoring, and rollback. Use this file as the lookup-sheet companion to 02_explainer.md.

How to use this file

1. Lifecycle management essentials

Experiment tracking checklist

Every serious run should log:

  • commit SHA,
  • dataset snapshot,
  • feature version,
  • hyperparameters,
  • seeds,
  • hardware / container image,
  • metrics,
  • artifact links,
  • owner and approval status.
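One way to make the checklist above enforceable is to refuse to record a run that leaves any field empty. The sketch below is a minimal standard-library illustration, not the schema of MLflow or any other tracker; `RunRecord`, `missing_fields`, and all field names are hypothetical.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class RunRecord:
    """Hypothetical container mirroring the checklist fields."""
    commit_sha: str = ""
    dataset_snapshot: str = ""
    feature_version: str = ""
    hyperparameters: dict = field(default_factory=dict)
    seeds: list = field(default_factory=list)
    hardware_or_image: str = ""
    metrics: dict = field(default_factory=dict)
    artifact_links: list = field(default_factory=list)
    owner: str = ""
    approval_status: str = ""


def missing_fields(run: RunRecord) -> list:
    """Return the names of checklist fields that are still empty."""
    return [name for name, value in asdict(run).items() if not value]


# Usage: a partially filled run reports what is still missing.
run = RunRecord(commit_sha="abc1234", owner="alice", seeds=[0])
print(missing_fields(run))
```

A training script could call `missing_fields` just before logging and fail fast if the list is not empty.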

Registry stages

| Stage | Meaning | Action |
|---|---|---|
| Draft | Experiment exists, not trusted | Keep iterating |
| Staging | Candidate passed initial checks | Validate further |
| Production | Approved champion | Serve it |
| Archived | Retained for audit or rollback | Do not route new traffic |
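The stages above behave like a small state machine. The sketch below shows one possible way to guard transitions. The `ALLOWED` map is an assumption (the notes name the stages but not the permitted moves), and `promote` is a hypothetical helper, not a registry API.

```python
# Hypothetical transition map: the notes list the stages, not the allowed moves.
ALLOWED = {
    "Draft": {"Staging", "Archived"},
    "Staging": {"Production", "Draft", "Archived"},
    "Production": {"Archived"},
    "Archived": {"Production"},  # assumed rollback path
}


def promote(current: str, target: str, evidence: dict) -> str:
    """Return `target` if the move is allowed and, for Production, evidence is attached."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"transition {current} -> {target} is not allowed")
    if target == "Production" and not evidence:
        raise ValueError("promotion to Production requires evidence")
    return target


# Usage: a Staging candidate with attached evaluation evidence can be promoted.
print(promote("Staging", "Production", {"eval_report": "link-to-report"}))
```

Requiring evidence at the Production boundary is what turns the registry into a control point rather than plain storage.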

Good model-card fields

  • intended use,
  • dataset summary,
  • eval metrics and slices,
  • failure modes,
  • latency profile,
  • cost profile,
  • safety notes.

Cross-ref: see 02_explainer.md §2.2-§2.10.

2. Tool comparison — lifecycle stack

| Problem | Common tools | Notes |
|---|---|---|
| Run tracking | MLflow, W&B, Neptune | Habit matters more than vendor |
| Registry | MLflow Registry, SageMaker Model Registry, Vertex Model Registry | Promotion evidence is key |
| Artifact storage | S3, GCS, Azure Blob | Use immutable paths |
| Data versioning | DVC, LakeFS, Delta/Iceberg patterns | Git alone is not enough |
| Feature management | Feast, Tecton, managed cloud feature stores | Strongest for tabular ML |
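The "use immutable paths" note for artifact storage can be illustrated with content-addressed keys: the storage key is derived from the artifact bytes, so a changed artifact can never silently overwrite an old one. This is a sketch under assumptions; `immutable_key` and the key layout are hypothetical and not tied to any specific object store.

```python
import hashlib


def immutable_key(model_name: str, payload: bytes) -> str:
    """Build an object-store key that changes whenever the artifact bytes change."""
    digest = hashlib.sha256(payload).hexdigest()
    # Hypothetical layout: the digest is part of the path, so keys are never reused.
    return f"models/{model_name}/sha256={digest}/model.bin"


# Usage: identical bytes map to the same key, different bytes to a different key.
key_a = immutable_key("churn", b"weights-v1")
key_b = immutable_key("churn", b"weights-v2")
print(key_a != key_b)
```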

Minimum viable setup for a small team

  • GitHub + MLflow + S3 + DVC + Grafana.
  • Add a feature store only when train-serve skew or feature reuse hurts.
  • Prefer boring reliability over theoretical perfection.

Cross-ref: see 02_explainer.md §2.4-§2.12.

3. CI/CD for ML

Pipeline stages

code/data change
  -> validate data
  -> build features
  -> train
  -> evaluate
  -> promote or reject
  -> deploy safely

What belongs in the quality gate

  • headline metric threshold,
  • slice metrics,
  • regression against champion,
  • latency ceiling,
  • cost guardrail,
  • safety / policy checks,
  • schema sanity.
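The gate items above can be sketched as a single function that returns every reason a candidate fails, so a rejected promotion is explainable. This is a minimal illustration, not a standard API: `Candidate`, `quality_gate`, and the threshold keys are hypothetical, and the code assumes higher metric values are better.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    headline_metric: float
    slice_metrics: dict          # slice name -> metric value
    p95_latency_ms: float
    cost_per_1k_requests: float
    safety_checks_passed: bool
    schema_valid: bool


def quality_gate(candidate, champion_metric, thresholds):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    if candidate.headline_metric < thresholds["min_headline"]:
        failures.append("headline metric below threshold")
    for name, value in candidate.slice_metrics.items():
        if value < thresholds["min_slice"]:
            failures.append(f"slice '{name}' below threshold")
    # Allow a small tolerated drop relative to the current champion.
    if candidate.headline_metric < champion_metric - thresholds["max_regression"]:
        failures.append("regression against champion")
    if candidate.p95_latency_ms > thresholds["max_p95_latency_ms"]:
        failures.append("latency ceiling exceeded")
    if candidate.cost_per_1k_requests > thresholds["max_cost_per_1k"]:
        failures.append("cost guardrail exceeded")
    if not candidate.safety_checks_passed:
        failures.append("safety / policy checks failed")
    if not candidate.schema_valid:
        failures.append("schema check failed")
    return failures
```

Returning all failures, rather than stopping at the first, gives reviewers the full picture in one pipeline run.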

Automation guidance

| Situation | Automate? | Why |
|---|---|---|
| Stable weekly retraining with clean labels | Yes, mostly | Predictable cadence |
| Noisy labels, high-risk domain | Partially | Human promotion still needed |
| Unlabeled drift only | No direct promotion | Detection is not proof of improvement |

Pipeline tool map

| Tool | When to use it |
|---|---|
| GitHub Actions | Small team, light orchestration |
| Airflow | Batch scheduling across many steps |
| Kubeflow Pipelines | K8s-heavy ML platform |
| SageMaker / Vertex Pipelines | Managed cloud-first org |

Cross-ref: see 02_explainer.md §3.1-§3.12.

4. Serving infrastructure quick-reference

Serving stack comparison

| Stack | Strongest point | Watch-out |
|---|---|---|
| vLLM | Throughput for open-weight LLMs | LLM-focused |
| TGI | Hugging Face ecosystem fit | Usually a bit slower than vLLM |
| Triton | Multi-framework flexibility | More ops work |
| Managed endpoints | Less infrastructure burden | Higher cost, less control |

Performance levers

  • dynamic or continuous batching,
  • prompt / prefix caching,
  • queue-aware autoscaling,
  • route simple tasks to cheaper models,
  • cap output length,
  • choose GPU by utilization profile.
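As an illustration of the queue-aware autoscaling lever above, the sketch below sizes the fleet from queue depth instead of CPU or GPU load. It is a simplified model under stated assumptions: `desired_replicas` is a hypothetical helper, per-replica throughput is treated as constant, and the default replica bounds are arbitrary.

```python
import math


def desired_replicas(queue_depth, per_replica_throughput, target_drain_seconds,
                     min_replicas=1, max_replicas=8):
    """Replicas needed to drain the current queue within the target time.

    queue_depth: requests currently waiting
    per_replica_throughput: requests per second one replica can serve (assumed constant)
    target_drain_seconds: how quickly the backlog should be cleared
    """
    needed = math.ceil(queue_depth / (per_replica_throughput * target_drain_seconds))
    # Clamp to the configured floor and ceiling.
    return max(min_replicas, min(max_replicas, needed))


# Usage: 120 queued requests, 4 req/s per replica, 10 s drain target.
print(desired_replicas(120, 4, 10))  # → 3
```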

Deployment strategies

| Strategy | What it gives |
|---|---|
| Shadow | Safe realism without user impact |
| Canary | Small live exposure |
| Blue-green | Instant cutover and rollback |
| Percentage rollout | Controlled confidence ramp |
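Canary and percentage rollouts both need a stable way to decide which users see the new model. A common approach is to hash a user identifier into a bucket. The sketch below assumes that approach; `route` and the labels "canary" and "champion" are hypothetical names.

```python
import hashlib


def route(user_id: str, canary_percent: int) -> str:
    """Deterministically send roughly `canary_percent`% of users to the canary."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "canary" if bucket < canary_percent else "champion"


# Usage: the same user always lands on the same side for a given percentage.
print(route("user-42", 5) == route("user-42", 5))  # → True
```

Because each user's bucket is fixed, raising `canary_percent` only adds users to the canary group and never flips an existing canary user back, which is what makes a percentage ramp controlled.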

Cross-ref: see 02_explainer.md §4.1-§4.12.

5. Monitoring, drift, and maintenance

Monitor four layers together

  1. System.
  2. Data.
  3. Model.
  4. Business.

Drift cheat sheet

| Drift | Signal | First question |
|---|---|---|
| Data drift | PSI, schema shifts, null spikes | Did inputs change? |
| Model drift | Quality decline | Did predictions worsen? |
| Concept drift | Labels or business logic changed | Is the world different now? |
| Vendor drift | Same prompt, new answer | Did the provider change under us? |
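PSI (population stability index) in the cheat sheet above compares the binned distribution of a feature at training time (expected share e_i per bin) with its distribution in production (actual share a_i), as the sum over bins of (a_i - e_i) * ln(a_i / e_i). The sketch below is a plain-Python illustration; the function name `psi`, the fixed bin edges, and the `eps` floor for empty bins are assumptions of this example.

```python
import math


def psi(expected, actual, edges, eps=1e-6):
    """PSI between two samples using shared bin edges (half-open bins [lo, hi))."""
    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            for i in range(len(edges) - 1):
                if edges[i] <= v < edges[i + 1]:
                    counts[i] += 1
                    break
        total = sum(counts)
        if total == 0:
            raise ValueError("no values fall inside the bin edges")
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Usage: identical samples give zero; a shifted sample gives a positive value.
train = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7]
print(psi(train, train, edges=[0.0, 0.5, 1.0]))  # → 0.0
```

Every term in the sum is non-negative, so PSI grows as the two distributions diverge. The alert threshold is a team decision and is not specified here.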

LLM production metrics

  • TTFT and total latency,
  • tokens in/out,
  • cache hit rate,
  • cost per request,
  • refusal and fallback rate,
  • judged quality or user feedback.
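Several of the metrics above can be derived from one per-request log record. The sketch below shows the arithmetic for TTFT, total latency, and cost per request; `RequestLog`, `summarize`, and the per-1k-token price parameters are hypothetical, and real prices depend on the provider.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    sent_at: float          # seconds: request sent
    first_token_at: float   # seconds: first token received
    finished_at: float      # seconds: last token received
    tokens_in: int
    tokens_out: int


def summarize(log, price_in_per_1k, price_out_per_1k):
    """Derive TTFT, total latency, and cost for a single request."""
    return {
        "ttft_s": log.first_token_at - log.sent_at,
        "total_latency_s": log.finished_at - log.sent_at,
        "cost": log.tokens_in / 1000 * price_in_per_1k
                + log.tokens_out / 1000 * price_out_per_1k,
    }


# Usage with made-up timestamps and prices.
log = RequestLog(sent_at=0.0, first_token_at=0.5, finished_at=2.0,
                 tokens_in=1000, tokens_out=500)
print(summarize(log, price_in_per_1k=1.0, price_out_per_1k=2.0))
```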

Incident-response skeleton

  1. Detect.
  2. Scope.
  3. Mitigate.
  4. Roll back if needed.
  5. Verify recovery.
  6. Write postmortem.

Cross-ref: see 02_explainer.md §5.1-§5.13.

6. Interview frame and production vocabulary

Useful answer starters

  • “I would start by restoring lineage before touching retraining.”
  • “I treat the registry as a promotion control plane, not just storage.”
  • “For serving, I would separate latency, throughput, and utilization decisions.”
  • “My monitoring design includes system, data, model, and business signals together.”
  • “Rollback has to cover weights, prompts, indexes, and routing rules.”

Specific tool-and-cost vocabulary to sound credible

  • “MLflow + S3 is a practical small-team baseline.”
  • “vLLM usually wins when throughput matters for open weights.”
  • “Managed endpoints buy speed at the expense of cost and control.”
  • “GPU economics depend more on utilization than sticker price.”
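The utilization point can be made concrete with simple arithmetic: dividing the hourly price by the fraction of time the GPU does useful work gives a cost per useful hour. The sketch below is illustrative only; `cost_per_useful_hour` is a hypothetical helper and the prices and utilization figures are made up.

```python
def cost_per_useful_hour(hourly_price: float, utilization: float) -> float:
    """Hourly price divided by the fraction of time spent on useful work."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return hourly_price / utilization


# Made-up numbers: a cheaper GPU that sits mostly idle vs. a pricier, busier one.
cheap_idle = cost_per_useful_hour(hourly_price=2.0, utilization=0.25)
pricey_busy = cost_per_useful_hour(hourly_price=3.0, utilization=0.5)
print(cheap_idle, pricey_busy)  # → 8.0 6.0
```

In this made-up case the GPU with the higher sticker price is cheaper per useful hour because it is kept busier.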

7. Health check

  • [ ] I can explain the warehouse, the quality gate, and the production monitor.
  • [ ] I can list the minimum run metadata from memory.
  • [ ] I can describe when automated retraining is unsafe.
  • [ ] I can compare shadow, canary, and blue-green clearly.
  • [ ] I can move into 05_hands_on_lab.md without confusion.