03. Week 17: Study Material

Theme

Lifecycle management, safe promotion, serving infrastructure, monitoring, and rollback. Use this file as the lookup-sheet companion to 02_explainer.md.

How to use this file

Read 02_explainer.md first for the story.
Use this file for tool comparisons, checklists, and interview phrasing.
Revisit 04_daily_recall.md after each section.

1. Lifecycle management essentials

Experiment tracking checklist

Every serious run should log:

commit SHA,
dataset snapshot,
feature version,
hyperparameters,
seeds,
hardware / container image,
metrics,
artifact links,
owner and approval status.
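The checklist above can be turned into a record that refuses to go unnoticed when a field is empty. This is a minimal sketch: the class `RunRecord`, its field names, and the helper `missing_fields` are hypothetical and simply mirror the checklist; they are not the schema of MLflow or any other tracker.

```python
from dataclasses import dataclass, fields


@dataclass
class RunRecord:
    """One tracked run. Field names are hypothetical and mirror the checklist."""
    commit_sha: str
    dataset_snapshot: str
    feature_version: str
    hyperparameters: dict
    seeds: list
    container_image: str
    metrics: dict
    artifact_links: list
    owner: str
    approval_status: str = "pending"


def missing_fields(record: RunRecord) -> list:
    """Return the names of fields that are None or empty, in declaration order."""
    empty_values = (None, "", {}, [])
    return [f.name for f in fields(record) if getattr(record, f.name) in empty_values]
```

A run could be blocked from registration whenever `missing_fields(record)` returns a non-empty list.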

Registry stages

Stage	Meaning	Action
Draft	Experiment exists, not trusted	Keep iterating
Staging	Candidate passed initial checks	Validate further
Production	Approved champion	Serve it
Archived	Retained for audit or rollback	Do not route new traffic
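The stage table can be read as a small state machine. The sketch below encodes one plausible set of allowed moves; the transition map itself is an assumption (in particular, treating Archived to Production as the rollback path), and the names `ALLOWED`, `can_transition`, and `promote` are hypothetical.

```python
# Assumed transition map; adjust to your own promotion policy.
ALLOWED = {
    "Draft": {"Staging", "Archived"},
    "Staging": {"Production", "Draft", "Archived"},
    "Production": {"Archived"},
    "Archived": {"Production"},  # assumption: rollback re-promotes an archived model
}


def can_transition(current: str, target: str) -> bool:
    """True if the assumed policy allows moving from current to target."""
    return target in ALLOWED.get(current, set())


def promote(current: str, target: str) -> str:
    """Return the new stage, or raise if the move skips a required step."""
    if not can_transition(current, target):
        raise ValueError(f"transition {current} -> {target} is not allowed")
    return target
```

Under this map a Draft model cannot jump straight to Production; it has to pass through Staging first.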

Good model-card fields

intended use,
dataset summary,
eval metrics and slices,
failure modes,
latency profile,
cost profile,
safety notes.

Cross-ref: see 02_explainer.md §2.2-§2.10.

2. Tool comparison: lifecycle stack

Problem	Common tools	Notes
Run tracking	MLflow, W&B, Neptune	Habit matters more than vendor
Registry	MLflow Registry, SageMaker Model Registry, Vertex Model Registry	Promotion evidence is key
Artifact storage	S3, GCS, Azure Blob	Use immutable paths
Data versioning	DVC, LakeFS, Delta/Iceberg patterns	Git alone is not enough
Feature management	Feast, Tecton, managed cloud feature stores	Strongest for tabular ML
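The "use immutable paths" note for artifact storage can be illustrated with a content-addressed key: the path is derived from the bytes, so a different artifact can never overwrite an existing one. This is a sketch only; the function name `immutable_key` and the `models/<name>/sha256=<digest>/` layout are assumptions for illustration, not a convention of S3, GCS, or Azure Blob.

```python
import hashlib


def immutable_key(model_name: str, payload: bytes) -> str:
    """Build a storage key that changes whenever the artifact bytes change."""
    digest = hashlib.sha256(payload).hexdigest()
    # Hypothetical layout: the digest in the path makes overwrites impossible by construction.
    return f"models/{model_name}/sha256={digest}/model.bin"
```

Identical bytes always map to the same key, and any change to the artifact produces a new key, which keeps registry entries pointing at exactly what was evaluated.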

Minimum viable setup for a small team

GitHub + MLflow + S3 + DVC + Grafana.
Add a feature store only when train-serve skew or feature reuse hurts.
Prefer boring reliability over theoretical perfection.

Cross-ref: see 02_explainer.md §2.4-§2.12.

3. CI/CD for ML

Pipeline stages

code/data change
  ↓
validate data
  ↓
build features
  ↓
train
  ↓
evaluate
  ↓
promote or reject
  ↓
deploy safely
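The stage flow above can be sketched as an ordered list of steps that stops at the first failure, so a rejected candidate never reaches deployment. The runner `run_pipeline` and the idea that each stage is a function returning True or False are assumptions for illustration; real orchestrators model this differently.

```python
def run_pipeline(stages, context):
    """Run (name, fn) stages in order; stop at the first stage whose fn returns False."""
    completed = []
    for name, fn in stages:
        if not fn(context):
            # Later stages, including deployment, are never executed.
            return {"status": "rejected", "failed_stage": name, "completed": completed}
        completed.append(name)
    return {"status": "completed", "completed": completed}
```

For example, with stages named after the diagram and an "evaluate" step that returns False, the result reports `failed_stage` as "evaluate" and "deploy safely" is never run.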

What belongs in the quality gate

headline metric threshold,
slice metrics,
regression against champion,
latency ceiling,
cost guardrail,
safety / policy checks,
schema sanity.
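The gate items above can be combined into a single check that returns every failure rather than only the first. This is a minimal sketch assuming higher metric values are better; all dictionary keys (`headline_metric`, `p95_latency_ms`, `max_regression`, and so on) and the threshold values a caller passes in are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    passed: bool
    failures: list


def quality_gate(candidate: dict, champion: dict, limits: dict) -> GateResult:
    """Check a candidate against every gate item and collect all failures."""
    failures = []
    if candidate["headline_metric"] < limits["min_headline_metric"]:
        failures.append("headline metric below threshold")
    for name, value in candidate["slice_metrics"].items():
        if value < limits["min_slice_metric"]:
            failures.append(f"slice '{name}' below threshold")
    if candidate["headline_metric"] < champion["headline_metric"] - limits["max_regression"]:
        failures.append("regression against champion")
    if candidate["p95_latency_ms"] > limits["max_p95_latency_ms"]:
        failures.append("latency ceiling exceeded")
    if candidate["cost_per_1k_requests"] > limits["max_cost_per_1k_requests"]:
        failures.append("cost guardrail exceeded")
    if not candidate["safety_checks_passed"]:
        failures.append("safety / policy checks failed")
    if set(candidate["output_schema"]) != set(limits["expected_schema"]):
        failures.append("schema mismatch")
    return GateResult(passed=not failures, failures=failures)
```

Returning the full failure list gives the reviewer promotion evidence in one place instead of forcing repeated reruns.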

Automation guidance

Situation	Automate?	Why
Stable weekly retraining with clean labels	Yes, mostly	Predictable cadence
Noisy labels, high-risk domain	Partially	Human promotion still needed
Unlabeled drift only	No direct promotion	Detection is not proof of improvement

Pipeline tool map

Tool	When to use it
GitHub Actions	Small team, light orchestration
Airflow	Batch scheduling across many steps
Kubeflow Pipelines	K8s-heavy ML platform
SageMaker / Vertex Pipelines	Managed cloud-first org

Cross-ref: see 02_explainer.md §3.1-§3.12.

4. Serving infrastructure quick-reference

Serving stack comparison

Stack	Strongest point	Watch-out
vLLM	Throughput for open-weight LLMs	LLM-focused
TGI	Hugging Face ecosystem fit	Often lower throughput than vLLM; benchmark on your own workload
Triton	Multi-framework flexibility	More ops work
Managed endpoints	Less infrastructure burden	Higher cost, less control

Performance levers

dynamic or continuous batching,
prompt / prefix caching,
queue-aware autoscaling,
route simple tasks to cheaper models,
cap output length,
choose GPU by utilization profile.
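The last lever, choosing a GPU by utilization profile, comes down to simple arithmetic: the effective cost per token is the hourly price divided by the tokens actually served in that hour. The sketch below assumes throughput scales linearly with utilization, and every number used with it is hypothetical rather than a real price or benchmark.

```python
def cost_per_million_tokens(hourly_price: float, peak_tokens_per_sec: float,
                            utilization: float) -> float:
    """Effective cost per 1M tokens, given the fraction of peak throughput actually used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = peak_tokens_per_sec * utilization * 3600
    return hourly_price / tokens_per_hour * 1_000_000


# Hypothetical GPUs: A costs twice as much and is twice as fast at peak.
gpu_a = cost_per_million_tokens(hourly_price=4.0, peak_tokens_per_sec=2000, utilization=0.3)
gpu_b = cost_per_million_tokens(hourly_price=2.0, peak_tokens_per_sec=1000, utilization=0.8)
```

With these made-up figures the two GPUs cost the same per token at equal utilization, yet the cheaper card is well under half the cost per token once it runs at 80% while the larger one idles at 30%.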

Deployment strategies

Strategy	What it gives
Shadow	Safe realism without user impact
Canary	Small live exposure
Blue-green	Instant cutover and rollback
Percentage rollout	Controlled confidence ramp
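Canary and percentage rollouts both need a stable way to decide which requests see the new version. One common approach, sketched here under assumptions, is to hash a user identifier into 100 buckets; the function names `bucket` and `route` and the use of a user ID as the routing key are hypothetical choices for illustration.

```python
import hashlib


def bucket(user_id: str, buckets: int = 100) -> int:
    """Map a user ID to a stable bucket in [0, buckets) using SHA-256."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def route(user_id: str, canary_percent: int) -> str:
    """Send users whose bucket falls below canary_percent to the canary."""
    return "canary" if bucket(user_id) < canary_percent else "stable"
```

Because the bucket is deterministic, a user stays on the same version between requests, and raising `canary_percent` only adds users to the canary without moving anyone back. Setting it to 0 acts as an immediate rollback of traffic.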

Cross-ref: see 02_explainer.md §4.1-§4.12.

5. Monitoring, drift, and maintenance

Monitor four layers together

System.
Data.
Model.
Business.

Drift cheat sheet

Drift	Signal	First question
Data drift	PSI (population stability index), schema shifts, null spikes	Did inputs change?
Model drift	Quality decline	Did predictions worsen?
Concept drift	The input-to-label relationship changed (new labeling rules or business logic)	Is the world different now?
Vendor drift	Same prompt, new answer	Did the provider change under us?
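For the data-drift row, PSI (population stability index) compares the binned distribution of a feature at training time with what production sees now. Below is a minimal sketch using the standard formula, the sum over bins of (actual - expected) * ln(actual / expected); the function name `psi` and the `eps` floor for empty bins are assumptions of this example.

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population stability index between two binned distributions (same bin edges)."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both inputs must use the same bins")
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Floor each proportion at eps so an empty bin does not cause log(0) or division by zero.
        e_pct = max(e / e_total, eps)
        a_pct = max(a / a_total, eps)
        total += (a_pct - e_pct) * math.log(a_pct / e_pct)
    return total
```

Identical distributions give 0. A commonly quoted rule of thumb treats values below 0.1 as stable and above 0.25 as a significant shift, but those cutoffs are conventions to calibrate against your own data, not guarantees.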

LLM production metrics

TTFT (time to first token) and total latency,
tokens in/out,
cache hit rate,
cost per request,
refusal and fallback rate,
judged quality or user feedback.
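Most of the metrics above can be derived from one per-request log line. The sketch below aggregates such logs into a summary; the `RequestLog` fields, the `summarize` helper, and the per-1k-token prices passed in are all hypothetical, and judged quality is left out because it needs a separate evaluation step.

```python
from dataclasses import dataclass


@dataclass
class RequestLog:
    ttft_ms: float           # time to first token
    total_latency_ms: float
    tokens_in: int
    tokens_out: int
    cache_hit: bool
    refused: bool
    fell_back: bool


def summarize(logs, price_in_per_1k: float, price_out_per_1k: float) -> dict:
    """Aggregate per-request logs into the production metrics listed above."""
    n = len(logs)
    if n == 0:
        raise ValueError("no requests to summarize")
    cost = sum(
        r.tokens_in / 1000 * price_in_per_1k + r.tokens_out / 1000 * price_out_per_1k
        for r in logs
    )
    return {
        "mean_ttft_ms": sum(r.ttft_ms for r in logs) / n,
        "mean_total_latency_ms": sum(r.total_latency_ms for r in logs) / n,
        "tokens_in": sum(r.tokens_in for r in logs),
        "tokens_out": sum(r.tokens_out for r in logs),
        "cache_hit_rate": sum(r.cache_hit for r in logs) / n,
        "cost_per_request": cost / n,
        "refusal_rate": sum(r.refused for r in logs) / n,
        "fallback_rate": sum(r.fell_back for r in logs) / n,
    }
```

In practice percentiles such as p95 are usually more informative than means for latency; means are used here only to keep the sketch short.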

Incident-response skeleton

Detect.
Scope.
Mitigate.
Roll back if needed.
Verify recovery.
Write postmortem.

Cross-ref: see 02_explainer.md §5.1-§5.13.

6. Interview frame and production vocabulary

Useful answer starters

“I would start by restoring lineage before touching retraining.”
“I treat the registry as a promotion control plane, not just storage.”
“For serving, I would separate latency, throughput, and utilization decisions.”
“My monitoring design includes system, data, model, and business signals together.”
“Rollback has to cover weights, prompts, indexes, and routing rules.”

Specific tool-and-cost vocabulary to sound credible

“MLflow + S3 is a practical small-team baseline.”
“vLLM usually wins when throughput matters for open weights.”
“Managed endpoints buy speed of setup, but you pay for it in higher cost and less control.”
“GPU economics depend more on utilization than sticker price.”

7. Health check

[ ] I can explain the warehouse, the quality gate, and the production monitor.
[ ] I can list the minimum run metadata from memory.
[ ] I can describe when automated retraining is unsafe.
[ ] I can compare shadow, canary, and blue-green clearly.
[ ] I can move into 05_hands_on_lab.md without confusion.