
03. Week 16 — Engineering Principles

Companion to 02_explainer.md. Read this after the narrative if you want the compressed version.

1. The Lead AI Engineer lens

A senior engineer solves difficult local problems. A Lead AI Engineer improves the decision system around those problems.

Use this captain vocabulary throughout the week:

| Term | Meaning |
| --- | --- |
| the course | Technical direction |
| the compass | Decision framework |
| the crew | Team and stakeholders |
| the weather check | Risk assessment |
| the ship's log | Documentation, ADRs, runbooks |

2. Decision framework basics

Start with the decision question, not the tool. Use these filters in order:

  1. What job must the system do?
  2. What constraints are non-negotiable?
  3. What is the cheapest reversible path?
  4. What evidence would make us change later?
  5. What new operational burden comes with this choice?

Build vs buy vs fine-tune

| Option | Best when | Main downside | Default posture |
| --- | --- | --- | --- |
| Buy API | Uncertain use case, need speed | Less control, vendor dependence | Start here |
| Prompt + workflow | Model is capable, task framing is weak | Can grow messy without discipline | Usually next |
| RAG | Domain knowledge is missing | Retrieval quality becomes critical | Before fine-tune |
| Fine-tune | Stable pattern, strong labeled data, repeated volume | Slow, expensive, extra ops | Later, with evidence |
| Self-host | Privacy, cost curve, or platform leverage matter | Highest ops burden | Only when justified |

Reversibility

| Decision type | Example | Process level |
| --- | --- | --- |
| Two-way door | Prompt wording, small routing tweak | Decide fast, monitor |
| One-way-ish door | Vendor contract, retention policy | ADR + review |
| One-way door | Self-hosting stack, org-wide platform commitment | RFC + ADR + staged rollout |
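The reversibility table can be encoded as a small lookup, so a review checklist or bot can suggest the expected process for a proposed change. This is a minimal sketch; the `Door` enum and `required_process` function are hypothetical names, not part of any course tooling.

```python
from enum import Enum


class Door(Enum):
    """Reversibility classes from the table above."""
    TWO_WAY = "two-way door"
    ONE_WAY_ISH = "one-way-ish door"
    ONE_WAY = "one-way door"


# Process steps per reversibility class, copied from the table.
PROCESS_LEVEL = {
    Door.TWO_WAY: ["decide fast", "monitor"],
    Door.ONE_WAY_ISH: ["ADR", "review"],
    Door.ONE_WAY: ["RFC", "ADR", "staged rollout"],
}


def required_process(door: Door) -> list[str]:
    """Return the process steps expected for a decision of this type."""
    return list(PROCESS_LEVEL[door])


# A vendor contract is hard to undo, so it gets an ADR and a review.
print(required_process(Door.ONE_WAY_ISH))
```

Classifying the decision is still a human judgment; the lookup only makes the agreed process explicit.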

3. Decision records and documentation habits

Use ADRs for high-coupling choices. Keep them short and useful.

Minimum ADR template

  • Title
  • Status
  • Context
  • Options considered
  • Decision
  • Consequences
  • Trigger to revisit
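One way to keep ADRs short and uniform is to treat the template as a data structure and render it. The sketch below assumes nothing beyond the seven template fields; the `ADR` class, its `render` method, and the example content are all illustrative.

```python
from dataclasses import dataclass


@dataclass
class ADR:
    """One decision record; fields follow the minimum ADR template."""
    title: str
    status: str
    context: str
    options_considered: list[str]
    decision: str
    consequences: str
    trigger_to_revisit: str

    def render(self) -> str:
        """Render the record as plain text, one template field per section."""
        options = "\n".join(f"  - {option}" for option in self.options_considered)
        return "\n".join([
            f"Title: {self.title}",
            f"Status: {self.status}",
            f"Context: {self.context}",
            "Options considered:",
            options,
            f"Decision: {self.decision}",
            f"Consequences: {self.consequences}",
            f"Trigger to revisit: {self.trigger_to_revisit}",
        ])


# Illustrative content only; not a decision taken from the course material.
example = ADR(
    title="Start with a hosted model API",
    status="Accepted",
    context="Use case is still uncertain and we need to ship quickly.",
    options_considered=["Buy API", "Fine-tune", "Self-host"],
    decision="Buy API behind a thin internal client.",
    consequences="Less control and vendor dependence; low ops burden.",
    trigger_to_revisit="Monthly spend or privacy requirements change materially.",
)
print(example.render())
```

Making "Trigger to revisit" a required field is the useful part: the record cannot be created without stating what evidence would reopen the decision.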

Minimum system docs

| Document | Why it exists |
| --- | --- |
| README | Understand the system fast |
| ADR | Preserve technical reasoning |
| Eval spec | Show how quality is measured |
| Runbook | Make incidents survivable |
| Prompt/model card | Track versions, limits, risks |

4. Code and system quality for AI

AI quality has three layers.

| Layer | What it answers | Examples |
| --- | --- | --- |
| Unit tests | Does deterministic glue behave correctly? | Prompt builder, parser, tool router, fallback logic |
| Evals | Is model behavior good enough for the task? | Accuracy, faithfulness, safety, usefulness |
| Monitoring | Is the live system healthy right now? | Latency, cost, errors, drift, user feedback |

What to unit test

  • Prompt templates and required variables.
  • Retrieval filters and tenant isolation.
  • Structured output parsing and schema validation.
  • Tool permission checks.
  • Timeout and fallback logic.
  • Cost accounting functions.
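The first and third items in the list above can be sketched as plain deterministic tests with no model call involved. The template text, the `answer`/`sources` schema, and all function names here are assumptions made for illustration.

```python
import json
import string

# Hypothetical template; the real prompt lives wherever your project keeps it.
PROMPT_TEMPLATE = (
    "Answer using only the context.\n"
    "Context: {context}\n"
    "Question: {question}"
)


def required_variables(template: str) -> set[str]:
    """Collect the named placeholders a format-style template expects."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def build_prompt(template: str, **variables: str) -> str:
    """Fill the template, failing loudly if a required variable is missing."""
    missing = required_variables(template) - set(variables)
    if missing:
        raise ValueError(f"missing prompt variables: {sorted(missing)}")
    return template.format(**variables)


def parse_answer(raw: str) -> dict:
    """Parse model output and check the assumed schema: answer + sources."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict) or not isinstance(data.get("answer"), str):
        raise ValueError("expected an object with a string 'answer'")
    sources = data.get("sources")
    if not isinstance(sources, list) or not all(isinstance(s, str) for s in sources):
        raise ValueError("'sources' must be a list of strings")
    return data


def test_prompt_requires_all_variables():
    try:
        build_prompt(PROMPT_TEMPLATE, question="What is an ADR?")
    except ValueError as exc:
        assert "context" in str(exc)
    else:
        raise AssertionError("expected ValueError for missing 'context'")


def test_parser_rejects_schema_violations():
    good = parse_answer('{"answer": "42", "sources": ["doc-1"]}')
    assert good["sources"] == ["doc-1"]
    for bad in ('not json', '{"answer": 42, "sources": []}', '{"answer": "x"}'):
        try:
            parse_answer(bad)
        except ValueError:
            continue
        raise AssertionError(f"parser accepted invalid output: {bad}")
```

The functions follow the `test_` naming convention, so a runner such as pytest should discover them, but they also run as ordinary function calls.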

What to evaluate

  • Task accuracy.
  • Faithfulness to retrieved context.
  • Safety and policy compliance.
  • Refusal correctness.
  • Latency-quality tradeoff.
  • Cost-quality tradeoff.
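Unlike unit tests, evals score model behavior over a set of cases. The sketch below covers only two of the dimensions listed above, task accuracy and refusal correctness, using exact-match scoring and a stubbed model. `EvalCase`, `run_eval`, the `REFUSAL` marker, and the sample cases are hypothetical; real evals usually need richer scoring than exact match.

```python
from dataclasses import dataclass
from typing import Callable

REFUSAL = "<refuse>"  # assumed marker for "the system should decline"


@dataclass
class EvalCase:
    prompt: str
    expected: str  # expected answer, or REFUSAL when declining is correct


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Score exact-match accuracy and the share of must-refuse cases refused."""
    answerable = [c for c in cases if c.expected != REFUSAL]
    must_refuse = [c for c in cases if c.expected == REFUSAL]
    correct = sum(
        model(c.prompt).strip().lower() == c.expected.strip().lower()
        for c in answerable
    )
    refused = sum(model(c.prompt) == REFUSAL for c in must_refuse)
    return {
        "task_accuracy": correct / len(answerable) if answerable else 0.0,
        "refusal_correctness": refused / len(must_refuse) if must_refuse else 0.0,
    }


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; returns canned answers."""
    canned = {
        "capital of France?": "Paris",
        "2 + 2?": "5",  # deliberately wrong
        "share another tenant's data": REFUSAL,
    }
    return canned.get(prompt, REFUSAL)


cases = [
    EvalCase("capital of France?", "paris"),
    EvalCase("2 + 2?", "4"),
    EvalCase("share another tenant's data", REFUSAL),
]
scores = run_eval(stub_model, cases)
print(scores)
```

Passing the model in as a callable keeps the harness independent of any vendor SDK, so the same cases can compare a bought API against a fine-tuned or self-hosted option.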

Observability baseline

Log, at minimum:

  • Request and trace IDs.
  • Model version.
  • Prompt or workflow version.
  • Retrieval sources.
  • Latency buckets.
  • Token usage and cost.
  • Error class.
  • Online quality signal if available.
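The baseline fields can be emitted as one structured record per request. This is a sketch using only the standard library; the field names, bucket edges, logger name, and example values are assumptions, not a required schema.

```python
import json
import logging
import uuid
from typing import Optional

logger = logging.getLogger("llm.requests")  # hypothetical logger name

# Illustrative bucket edges in milliseconds; tune to your own latency targets.
LATENCY_BUCKETS_MS = (500, 1000, 3000)


def latency_bucket(latency_ms: float) -> str:
    """Map a raw latency to a coarse bucket label for dashboards."""
    for edge in LATENCY_BUCKETS_MS:
        if latency_ms < edge:
            return f"<{edge}ms"
    return f">={LATENCY_BUCKETS_MS[-1]}ms"


def build_log_record(
    *,
    model_version: str,
    prompt_version: str,
    retrieval_sources: list[str],
    latency_ms: float,
    total_tokens: int,
    cost_usd: float,
    error_class: Optional[str] = None,
    quality_signal: Optional[float] = None,
    request_id: Optional[str] = None,
    trace_id: Optional[str] = None,
) -> dict:
    """Assemble one log record covering the observability baseline."""
    return {
        "request_id": request_id or str(uuid.uuid4()),
        "trace_id": trace_id or str(uuid.uuid4()),
        "model_version": model_version,
        "prompt_version": prompt_version,
        "retrieval_sources": retrieval_sources,
        "latency_ms": latency_ms,
        "latency_bucket": latency_bucket(latency_ms),
        "total_tokens": total_tokens,
        "cost_usd": cost_usd,
        "error_class": error_class,        # None when the request succeeded
        "quality_signal": quality_signal,  # e.g. a thumbs-up rate, if available
    }


# Example values are made up for illustration.
record = build_log_record(
    model_version="model-2024-01",
    prompt_version="support-answer-v7",
    retrieval_sources=["kb/refunds.md"],
    latency_ms=840.0,
    total_tokens=1320,
    cost_usd=0.0041,
)
logger.info(json.dumps(record))
```

Logging the prompt and model versions on every request is what later lets a regression be traced to a specific change.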

Error budgets

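An error budget is the amount of failure a reliability target allows over a given window; a 99% success target, for example, leaves a 1% budget. It turns “we should improve reliability” into a forcing function: if the budget burns too quickly, risky launches pause. The principle applies even before a full MLOps setup exists.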
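The arithmetic behind "burns too quickly" can be sketched as a burn-rate check. Burn rate here is the observed error rate divided by the error rate the target allows; a sustained rate of 1.0 uses up the budget exactly at the end of the window. The function name, the pause threshold, and the example numbers are assumptions, and the consumed-budget estimate assumes roughly uniform traffic.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    burn_rate: float           # observed error rate / allowed error rate
    budget_consumed: float     # share of the window's budget already used
    pause_risky_launches: bool


def error_budget_status(
    slo_target: float,
    total_requests: int,
    failed_requests: int,
    window_elapsed_fraction: float,
    max_burn_rate: float = 1.0,  # assumed policy threshold
) -> BudgetStatus:
    """Compare observed failures against what the reliability target allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.99")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_error_rate = 1.0 - slo_target
    observed_error_rate = failed_requests / total_requests
    burn_rate = observed_error_rate / allowed_error_rate
    budget_consumed = burn_rate * window_elapsed_fraction
    return BudgetStatus(burn_rate, budget_consumed, burn_rate > max_burn_rate)


# Made-up numbers: 99% target, 250 failures in 10,000 requests, half the window gone.
status = error_budget_status(0.99, 10_000, 250, window_elapsed_fraction=0.5)
print(status)  # burn rate is about 2.5, so risky launches should pause
```

In this example the budget is already overspent halfway through the window, which is the signal to stop shipping risky changes and spend the time on reliability instead.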

5. Technical debt in AI systems

| Debt type | Example | Cost later |
| --- | --- | --- |
| Prompt debt | Giant patched prompt | Hidden regressions |
| Eval debt | Tiny stale benchmark | False confidence |
| Data debt | Weak labels, unclear provenance | Misleading fine-tuning |
| Workflow debt | Agent loop without stop rules | Cost spikes and unsafe behavior |
| Ops debt | No rollback or emergency disable | Long incidents |
| Knowledge debt | Context trapped in one engineer | Slow onboarding |
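As an illustration of paying down workflow debt, the sketch below gives an agent loop two explicit stop rules: a step limit and a cost budget. The `run_agent` function, the `step` callback contract, and the default limits are hypothetical.

```python
from typing import Callable


def run_agent(
    step: Callable[[int], tuple[bool, float]],
    max_steps: int = 8,          # assumed limit
    max_cost_usd: float = 0.50,  # assumed budget
) -> dict:
    """Run step(i) until it reports done, or until a stop rule fires.

    step(i) performs one agent iteration and returns (done, cost_usd).
    """
    spent = 0.0
    for i in range(max_steps):
        done, cost = step(i)
        spent += cost
        if done:
            return {"stopped_by": "task_complete", "steps": i + 1, "cost_usd": spent}
        if spent >= max_cost_usd:
            return {"stopped_by": "cost_budget", "steps": i + 1, "cost_usd": spent}
    return {"stopped_by": "step_limit", "steps": max_steps, "cost_usd": spent}


# A step that never finishes: the step limit ends the loop.
print(run_agent(lambda i: (False, 0.01)))

# An expensive step: the cost budget ends the loop after two iterations.
print(run_agent(lambda i: (False, 0.30)))
```

Returning the reason the loop stopped makes it loggable, so cost spikes show up in monitoring rather than on the invoice.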

6. Team and process

Code review for AI code

Review behavior, not only syntax. Ask:

  • What changed in prompts, retrieval, tools, or models?
  • Which evals cover this change?
  • What is the fallback path?
  • What new risks appear?
  • What is the latency or cost impact?

Sprint planning for research-heavy work

Plan around questions and exit criteria. Examples:

  • “Does retrieval improve answer quality by 10 points?”
  • “Can we keep p95 latency under 3 seconds with citations enabled?”

Do not plan research like deterministic CRUD work. Also do not hide behind ambiguity forever.
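Exit criteria like the two examples above become easier to hold a sprint to when they are written as a check. The sketch below uses a nearest-rank p95 and treats "10 points" as a difference between two scores on the same eval scale; the function names, thresholds as parameters, and sample numbers are assumptions.

```python
def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile: the smallest value covering 95% of samples."""
    if not values:
        raise ValueError("need at least one latency sample")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def exit_criteria_met(
    baseline_score: float,
    candidate_score: float,
    latencies_s: list[float],
    min_gain_points: float = 10.0,
    max_p95_s: float = 3.0,
) -> dict:
    """Answer both sprint questions with a yes/no and the measured values."""
    gain = candidate_score - baseline_score
    latency = p95(latencies_s)
    return {
        "quality_gain_points": gain,
        "quality_gain_ok": gain >= min_gain_points,
        "p95_latency_s": latency,
        "latency_ok": latency < max_p95_s,
    }


# Made-up measurements: 19 fast requests and one slow outlier.
result = exit_criteria_met(
    baseline_score=62.0,
    candidate_score=73.0,
    latencies_s=[1.0] * 19 + [4.0],
)
print(result)
```

Either answer ends the research question: a pass moves the work into delivery, and a fail is evidence to stop or change approach rather than extend the sprint indefinitely.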

When to automate vs manual

| Signal | Manual-first | Automate sooner |
| --- | --- | --- |
| Volume | Low | High |
| Risk | High, unclear edge cases | Lower, well-understood |
| Observability | Weak | Strong |
| Response-time pressure | Low | High |
| Task stability | Unstable | Stable and repeated |

7. Communication and influence

Translate the same technical decision differently for each audience.

| Audience | Emphasis |
| --- | --- |
| Product | User value, scope, confidence band |
| Legal/compliance | Data handling, controls, auditability |
| Finance | Cost curve, break-even, budget risk |
| Executives | Strategic leverage and downside |
| Ops/support | Failure handling and ownership |

RFC skeleton

  1. Problem.
  2. Constraints.
  3. Options.
  4. Recommendation.
  5. Tradeoffs.
  6. Risks.
  7. Rollout and revisit plan.

8. Foundation-gap audit for Module 17

Module 17 assumes you already know:

  1. Decision framework basics — you can defend a technical choice with criteria.
  2. When to automate vs manual — you can stage automation instead of forcing it.
  3. Risk assessment — you think in blast radius, rollback, and user harm.
  4. Documentation habits — you keep ADRs, eval notes, and runbooks current enough.

If these are missing, MLOps feels like tooling trivia. If these are solid, MLOps feels like principled infrastructure.

9. Bridge forward

Next module — 04_ml_platform_operations — operationalizes these principles into concrete infrastructure: CI/CD for ML, model registries, monitoring, and the platform that makes good engineering automatic.

10. Study order

  1. Read 02_explainer.md for intuition.
  2. Revisit this file for compression.
  3. Use 04_daily_recall.md daily.
  4. Finish 05_hands_on_lab.md.
  5. Close with 06_revision.md.
