03. Week 16: Engineering Principles

Companion to 02_explainer.md. Read this after the narrative if you want the compressed version.

1. The Lead AI Engineer lens

A senior engineer solves difficult local problems. A Lead AI Engineer improves the decision system around those problems.

Use this captain vocabulary throughout the week:

Term	Meaning
the course	Technical direction
the compass	Decision framework
the crew	Team and stakeholders
the weather check	Risk assessment
the ship's log	Documentation, ADRs, runbooks

2. Decision framework basics

Start with the decision question, not the tool. Use these filters in order:

1. What job must the system do?
2. What constraints are non-negotiable?
3. What is the cheapest reversible path?
4. What evidence would make us change later?
5. What new operational burden comes with this choice?

Build vs buy vs fine-tune

Option	Best when	Main downside	Default posture
Buy API	Uncertain use case, need speed	Less control, vendor dependence	Start here
Prompt + workflow	Model is capable, task framing is weak	Can grow messy without discipline	Usually next
RAG	Domain knowledge is missing	Retrieval quality becomes critical	Before fine-tune
Fine-tune	Stable pattern, strong labeled data, repeated volume	Slow, expensive, extra ops	Later, with evidence
Self-host	Privacy, cost curve, or platform leverage matter	Highest ops burden	Only when justified
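
One way to read the table is as an escalation ladder: start with the lightest option and add a heavier one only when its "best when" condition holds. The sketch below encodes that reading. The `Signals` fields and the `escalation_path` function are hypothetical names invented for illustration, and real decisions will need more nuance than a handful of booleans.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Hypothetical yes/no signals about a use case (names are illustrative)."""
    task_framing_is_weak: bool = False
    domain_knowledge_missing: bool = False
    stable_pattern: bool = False
    strong_labeled_data: bool = False
    repeated_volume: bool = False
    self_hosting_justified: bool = False  # privacy, cost curve, or platform leverage


def escalation_path(s: Signals) -> list[str]:
    """Return the options worth trying, lightest first, per the table's default postures."""
    path = ["buy api"]  # "Start here"
    if s.task_framing_is_weak:
        path.append("prompt + workflow")  # "Usually next"
    if s.domain_knowledge_missing:
        path.append("rag")  # "Before fine-tune"
    if s.stable_pattern and s.strong_labeled_data and s.repeated_volume:
        path.append("fine-tune")  # "Later, with evidence"
    if s.self_hosting_justified:
        path.append("self-host")  # "Only when justified"
    return path
```

Fine-tuning is added only when all three of its conditions hold, which mirrors the "later, with evidence" posture.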

Reversibility

Decision type	Example	Process level
Two-way door	Prompt wording, small routing tweak	Decide fast, monitor
One-way-ish door	Vendor contract, retention policy	ADR + review
One-way door	Self-hosting stack, org-wide platform commitment	RFC + ADR + staged rollout
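
The process levels in the table can be treated as a checklist per door type. The following is a minimal sketch under that assumption. The `Door` enum, `REQUIRED_PROCESS` mapping, and `missing_steps` helper are hypothetical names, not part of any existing tool.

```python
from enum import Enum


class Door(Enum):
    TWO_WAY = "two-way"
    ONE_WAY_ISH = "one-way-ish"
    ONE_WAY = "one-way"


# Process level per decision type, copied from the table above.
REQUIRED_PROCESS = {
    Door.TWO_WAY: ["decide fast", "monitor"],
    Door.ONE_WAY_ISH: ["ADR", "review"],
    Door.ONE_WAY: ["RFC", "ADR", "staged rollout"],
}


def missing_steps(door: Door, completed: set[str]) -> list[str]:
    """List the process steps still outstanding for a decision of this type."""
    return [step for step in REQUIRED_PROCESS[door] if step not in completed]
```

For example, a self-hosting decision that so far only has an ADR would still be missing the RFC and the staged rollout.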

3. Decision records and documentation habits

Use Architecture Decision Records (ADRs) for high-coupling choices, meaning decisions that other components or teams will come to depend on. Keep them short and useful.

Minimum ADR template

- Title
- Status
- Context
- Options considered
- Decision
- Consequences
- Trigger to revisit
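
To keep ADRs uniform, the template fields can be captured in a small data structure and rendered to text. This is a sketch only: the `ADR` class and its `render` method are hypothetical, and the example record's content is invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ADR:
    """One field per entry in the minimum ADR template."""
    title: str
    status: str
    context: str
    options_considered: list[str]
    decision: str
    consequences: str
    trigger_to_revisit: str

    def render(self) -> str:
        """Render the record as plain text, one headed section per field."""
        options = "\n".join(f"- {option}" for option in self.options_considered)
        sections = [
            ("Title", self.title),
            ("Status", self.status),
            ("Context", self.context),
            ("Options considered", options),
            ("Decision", self.decision),
            ("Consequences", self.consequences),
            ("Trigger to revisit", self.trigger_to_revisit),
        ]
        return "\n\n".join(f"{name}\n{body}" for name, body in sections)


# Illustrative content only; not taken from a real project.
example = ADR(
    title="Use RAG before fine-tuning for support answers",
    status="Accepted",
    context="Answers lack product-specific knowledge.",
    options_considered=["Buy API only", "RAG", "Fine-tune"],
    decision="Add retrieval over the help-center corpus.",
    consequences="Retrieval quality becomes a monitored dependency.",
    trigger_to_revisit="Faithfulness evals stay flat after retrieval tuning.",
)
```

Making "Trigger to revisit" a required field means an ADR cannot be created without stating when the decision should be reopened.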

Minimum system docs

Document	Why it exists
README	Understand the system fast
ADR	Preserve technical reasoning
Eval spec	Show how quality is measured
Runbook	Make incidents survivable
Prompt/model card	Track versions, limits, risks

4. Code and system quality for AI

AI quality has three layers.

Layer	What it answers	Examples
Unit tests	Does deterministic glue behave correctly?	Prompt builder, parser, tool router, fallback logic
Evals	Is model behavior good enough for the task?	Accuracy, faithfulness, safety, usefulness
Monitoring	Is the live system healthy right now?	Latency, cost, errors, drift, user feedback

What to unit test

Prompt templates and required variables.
Retrieval filters and tenant isolation.
Structured output parsing and schema validation.
Tool permission checks.
Timeout and fallback logic.
Cost accounting functions.
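
As an illustration of the first item, prompt templates and their required variables are deterministic glue and can be tested without calling a model. The sketch below assumes templates use Python `str.format` placeholders. `TEMPLATE`, `required_variables`, and `build_prompt` are hypothetical names.

```python
import string

# Hypothetical template; the placeholders are the required variables.
TEMPLATE = (
    "Answer the question using only the context.\n"
    "Context: {context}\n"
    "Question: {question}"
)


def required_variables(template: str) -> set[str]:
    """Extract placeholder names from a str.format-style template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def build_prompt(template: str, **variables: str) -> str:
    """Fill the template, failing loudly if a required variable is missing."""
    missing = required_variables(template) - set(variables)
    if missing:
        raise ValueError(f"missing prompt variables: {sorted(missing)}")
    return template.format(**variables)


def test_build_prompt_fills_all_variables():
    prompt = build_prompt(TEMPLATE, context="Refunds take 5 days.", question="How long?")
    assert "Refunds take 5 days." in prompt
    assert "{" not in prompt


def test_build_prompt_rejects_missing_variable():
    try:
        build_prompt(TEMPLATE, context="Refunds take 5 days.")
    except ValueError as exc:
        assert "question" in str(exc)
    else:
        raise AssertionError("expected ValueError for missing variable")
```

The two `test_` functions use plain assertions, so they should run under a test runner such as pytest or be called directly.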

What to evaluate

Task accuracy.
Faithfulness to retrieved context.
Safety and policy compliance.
Refusal correctness.
Latency-quality tradeoff.
Cost-quality tradeoff.
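
Unlike unit tests, evals score model behavior over a set of cases and compare the score to a threshold. The following is a minimal sketch for task accuracy only, assuming exact-match scoring is appropriate for the task (it often is not). The `system` callable stands in for whatever calls the model, and all names are hypothetical.

```python
from typing import Callable


def exact_match_accuracy(
    system: Callable[[str], str],
    cases: list[tuple[str, str]],
) -> float:
    """Fraction of cases where the system output matches the expected answer."""
    if not cases:
        raise ValueError("eval set is empty")
    hits = sum(
        system(question).strip().lower() == expected.strip().lower()
        for question, expected in cases
    )
    return hits / len(cases)


def passes_gate(score: float, threshold: float) -> bool:
    """An eval gate: the change ships only if the score meets the threshold."""
    return score >= threshold


# Stand-in for a model call, so the sketch runs without any API.
def fake_system(question: str) -> str:
    canned = {"2+2?": "4", "Capital of France?": "Paris"}
    return canned.get(question, "I don't know")


CASES = [("2+2?", "4"), ("Capital of France?", "paris"), ("Largest ocean?", "Pacific")]
```

Faithfulness, safety, and refusal correctness usually need richer scorers than exact match, but the shape (cases in, score out, threshold check) stays the same.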

Observability baseline

Log, at minimum:

- Request and trace IDs.
- Model version.
- Prompt or workflow version.
- Retrieval sources.
- Latency buckets.
- Token usage and cost.
- Error class.
- Online quality signal if available.
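
The baseline fields above map naturally onto one structured log record per request. This is a sketch under assumptions: the field names, the bucket boundaries, and the `RequestLog` class are all hypothetical, and a real system would emit the record through its own logging pipeline.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from typing import Optional


def latency_bucket(latency_ms: float) -> str:
    """Map a latency to a coarse bucket; the bounds are illustrative."""
    for bound in (100, 300, 1000, 3000):
        if latency_ms < bound:
            return f"<{bound}ms"
    return ">=3000ms"


@dataclass
class RequestLog:
    model_version: str
    prompt_version: str
    retrieval_sources: list[str]
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float
    error_class: Optional[str] = None       # None when the request succeeded
    quality_signal: Optional[float] = None  # online feedback, if available
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        """Serialize to one JSON line, adding the derived latency bucket."""
        record = asdict(self)
        record["latency_bucket"] = latency_bucket(self.latency_ms)
        return json.dumps(record, sort_keys=True)
```

Logging the prompt version next to the model version lets a regression be traced to either change.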

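Error budgets

Error budgets turn “we should improve reliability” into a forcing function. An error budget is the amount of unreliability a reliability target permits: a 99.9% success target leaves a 0.1% budget. If the budget burns too quickly, risky launches pause. That principle applies even before a full MLOps stack is in place.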
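
The arithmetic behind an error budget is short. Below is a sketch, assuming a request-based success target measured over some window. The function names and the "pause when fully consumed" rule are illustrative assumptions, and teams often pause earlier based on burn rate.

```python
def error_budget_consumed(
    slo_target: float, total_requests: int, failed_requests: int
) -> float:
    """Fraction of the window's error budget already used (1.0 = fully spent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures <= 0:
        raise ValueError("SLO target and traffic leave no error budget")
    return failed_requests / allowed_failures


def risky_launches_paused(consumed: float, pause_at: float = 1.0) -> bool:
    """Pause risky launches once the consumed fraction reaches the limit."""
    return consumed >= pause_at


# Worked example: a 99% target over 10,000 requests allows about 100 failures.
half_spent = error_budget_consumed(0.99, 10_000, 50)    # roughly 0.5
overspent = error_budget_consumed(0.99, 10_000, 150)    # roughly 1.5
```

With 50 failures about half the budget is used and launches continue. With 150 the budget is overspent and risky launches pause.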

5. Technical debt in AI systems

Debt type	Example	Cost later
Prompt debt	Giant patched prompt	Hidden regressions
Eval debt	Tiny stale benchmark	False confidence
Data debt	Weak labels, unclear provenance	Misleading fine-tuning
Workflow debt	Agent loop without stop rules	Cost spikes and unsafe behavior
Ops debt	No rollback or emergency disable	Long incidents
Knowledge debt	Context trapped in one engineer	Slow onboarding
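
The workflow-debt row is the easiest to make concrete: an agent loop should carry explicit stop rules. The following is a minimal sketch, assuming each step reports an optional final answer and its cost. `run_agent` and the `step` callable are hypothetical, not a real agent framework's API.

```python
from typing import Callable, Optional

# A step takes the step index and returns (final answer or None, cost in USD).
Step = Callable[[int], tuple[Optional[str], float]]


def run_agent(
    step: Step, max_steps: int, max_cost_usd: float
) -> tuple[Optional[str], str]:
    """Run steps until an answer appears or a stop rule fires.

    Returns (answer, stop_reason) so callers can log why the loop ended.
    """
    spent = 0.0
    for index in range(max_steps):
        answer, cost = step(index)
        spent += cost
        if answer is not None:
            return answer, "done"
        if spent >= max_cost_usd:
            return None, "cost_budget_exhausted"
    return None, "max_steps_reached"
```

Returning the stop reason alongside the answer makes runaway loops visible in logs rather than only on the invoice.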

6. Team and process

Code review for AI code

Review behavior, not only syntax. Ask:

- What changed in prompts, retrieval, tools, or models?
- Which evals cover this change?
- What is the fallback path?
- What new risks appear?
- What is the latency or cost impact?

Sprint planning for research-heavy work

Plan around questions and exit criteria. Examples:

- “Does retrieval improve answer quality by 10 points?”
- “Can we keep p95 latency under 3 seconds with citations enabled?”

Do not plan research like deterministic CRUD work, but do not hide behind ambiguity forever either: each question needs an exit criterion that forces a decision.
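
Exit criteria like the two examples in this section become easier to enforce when they are written as checks. The sketch below assumes quality is scored on a 0 to 100 scale and uses a nearest-rank p95. The function names and thresholds are illustrative, not a prescribed method.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def retrieval_helps(
    baseline_score: float, candidate_score: float, min_gain_points: float = 10.0
) -> bool:
    """Exit criterion: retrieval must lift quality by at least min_gain_points."""
    return candidate_score - baseline_score >= min_gain_points


def latency_ok(latencies_s: list[float], limit_s: float = 3.0) -> bool:
    """Exit criterion: p95 latency stays under the limit."""
    return p95(latencies_s) < limit_s
```

Either check returns a plain yes or no, so the sprint ends in a decision instead of an open-ended investigation.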

When to automate vs manual

Signal	Manual-first	Automate sooner
Volume	Low	High
Risk	High, unclear edge cases	Lower, well-understood
Observability	Weak	Strong
Response-time pressure	Low	High
Task stability	Unstable	Stable and repeated
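
The table can be used as a quick tally of which signals point each way. This sketch simplifies each signal to a single label (for instance, risk is reduced to "high" or "low"). The `AUTOMATE_SOONER` mapping and `split_signals` helper are hypothetical, and the result is an input to judgment rather than a verdict.

```python
# Values that favor automating sooner, simplified from the table above.
AUTOMATE_SOONER = {
    "volume": "high",
    "risk": "low",
    "observability": "strong",
    "response_time_pressure": "high",
    "task_stability": "stable",
}


def split_signals(observed: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (signals favoring automation, signals favoring manual-first)."""
    automate: list[str] = []
    manual: list[str] = []
    for signal, favorable in AUTOMATE_SOONER.items():
        # A missing signal raises KeyError: every row of the table must be assessed.
        if observed[signal] == favorable:
            automate.append(signal)
        else:
            manual.append(signal)
    return automate, manual


# Illustrative assessment of a hypothetical task.
observed = {
    "volume": "high",
    "risk": "high",
    "observability": "weak",
    "response_time_pressure": "high",
    "task_stability": "stable",
}
```

In this example, high risk and weak observability both argue for staying manual-first despite the volume, which suggests improving observability before automating.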

7. Communication and influence

Translate the same technical decision differently for each audience.

Audience	Emphasis
Product	User value, scope, confidence band
Legal/compliance	Data handling, controls, auditability
Finance	Cost curve, break-even, budget risk
Executives	Strategic leverage and downside
Ops/support	Failure handling and ownership

RFC skeleton

Problem.
Constraints.
Options.
Recommendation.
Tradeoffs.
Risks.
Rollout and revisit plan.

8. Foundation-gap audit for Module 17

Module 17 assumes you already know:

Decision framework basics — you can defend a technical choice with criteria.
When to automate vs manual — you can stage automation instead of forcing it.
Risk assessment — you think in blast radius, rollback, and user harm.
Documentation habits — you keep ADRs, eval notes, and runbooks current enough.

If these are missing, MLOps feels like tooling trivia. If these are solid, MLOps feels like principled infrastructure.

9. Bridge forward

The next module, 04_ml_platform_operations, turns these principles into concrete infrastructure: CI/CD for ML, model registries, monitoring, and the platform that makes good engineering automatic.

10. Study order

1. Read 02_explainer.md for intuition.
2. Revisit this file for the compressed version.
3. Use 04_daily_recall.md daily.
4. Finish 05_hands_on_lab.md.
5. Close with 06_revision.md.