Practical Take-Home Prompts — Interview Questions¶

The "you have 2-6 hours, build a working slice and write it up" round. Different from ml-coding-rounds.md (from-scratch ML primitives) and classic-algo.md (DSA-flavored). This file is the realistic miniature project interview: small but end-to-end, requires real judgment on architecture, error handling, prompts, evals, and tradeoffs.

The senior tell is not the cleverness of the code. It's a short, honest README — what you built, what you punted, what you'd do next, what the test cases are, what the failure modes are. Treat the writeup as part of the deliverable; many teams weight it equally with the code.

The takehomes here are real archetypes from 2025-2026 AI engineer loops. Each one includes: prompt, scoring criteria, an outline of what a strong submission looks like, common traps, and how to talk about it in the followup interview ("walk me through your README").

Universal advice¶

What gets you a strong signal¶

Working code with a one-command run: pip install -r requirements.txt && python main.py. Or a Dockerfile. Anything more than two steps loses signal.
A short README (~1-2 pages) covering: problem framing, design decisions, what you punted and why, how to test, sample outputs, what you'd do with more time.
Test cases that demonstrate the system handling: golden case, edge case, adversarial case. Even 3-5 tests is enough.
One eval of any kind. Even a 10-example handmade golden set with a simple correctness function beats no eval.
Latency / cost honesty. Add a paragraph: "this costs $X per call, p95 latency is Y, here's where time is spent". Demonstrates production thinking.
Naming model versions explicitly. gpt-4o-2024-08-06, not gpt-4o. Mention why.
Acknowledge what you didn't do: "no auth, no rate limiting, no DB — would add for prod". Better than leaving the interviewer to wonder.

What loses signal¶

800-line main.py doing everything. Even simple modularity (retrieval.py, llm.py, prompts.py, evals.py) signals seniority.
No prompts version-controlled. Prompts inline in code, no comments, no rationale.
No mention of cost, latency, or eval. You shipped without thinking about production.
Mocked the LLM entirely. Even one real call with a real provider beats a stubbed transformer.
Untested error paths. Show what happens on retrieval miss, model timeout, malformed output.
"I would have done X if I had more time" — but X is the heart of the problem. If asked to build a RAG pipeline, don't punt retrieval.
Over-engineering. The 4-hour take-home with 12 files, a custom logging framework, and a config DSL signals poor judgment about scope.

How to talk about it in the follow-up¶

Walk through the README, not the code. The interviewer wants to hear your thinking.
Be specific about decisions: "I picked top-K=5 because at K=10 I saw context-stuffing hurt faithfulness on my evals."
Volunteer the tradeoffs you considered and rejected: "I considered hybrid retrieval; for the corpus size it wasn't worth the complexity."
Say what's broken. Volunteering known issues earns trust.
Quantify when you can. "On 50 test queries, faithfulness was 0.78."

Archetype 1: RAG over a small corpus¶

Q: "Build a Q&A system over the included 50 PDFs. We'll ask questions; you return cited answers."¶

Tags: senior · very-common · coding · source: 2025-2026 AI engineer take-homes; standard RAG archetype

Scoring criteria the loop will use: - Ingestion: how documents become chunks. Do you choose chunk size sensibly? Strip headers/footers? Preserve doc/page metadata? - Retrieval: which embeddings, which DB, what top-K, any reranking? Why? - Generation: prompt design, citation format, refusal behavior when no good context. - Quality: are answers grounded? Cited? Does it refuse on out-of-corpus queries? - Evals: do you have any? Even a small handmade set with faithfulness scoring counts. - Engineering: code structure, README clarity, repeatability.

Outline of a strong submission: - Stack: pgvector or Qdrant (or even FAISS if local-only), text-embedding-3-small, gpt-4o-mini for generation, judge-model gpt-4o for evals. - Ingestion (ingest.py): - PDF → text via pypdf or pdfplumber. - Chunk to ~500-800 tokens with 50-100 token overlap. Why: most papers have ~300-700 token paragraphs; this keeps full paragraphs together while preserving cross-chunk context. - Strip headers/footers via heuristic (lines appearing >X% of pages). - Preserve (doc_id, page_num, chunk_idx) metadata on every chunk. - Embed and upsert. - Retrieval (retrieval.py): - Top-K dense retrieval (K=5-10). - Optional: cross-encoder reranker if quality demands; mention bge-reranker-base or cohere-rerank-3. - Mention you considered BM25/hybrid; decided not worth complexity at this corpus size (or did include — explain why). - Generation (answer.py): - Prompt template includes: system rules ("answer only from context, cite source"), retrieved chunks with [doc_id p.N] markers, user question. - Citation format: [doc.pdf p.3] inline. Parse and verify all citations refer to chunks actually in context. - Refusal: if top-K retrieval scores are all below threshold, return "I don't have enough info in the corpus" with no citations. - Evals (eval.py): - 10-20 hand-crafted Q&A pairs with reference answers. - Metrics: faithfulness (LLM-judge: does the answer follow from the context?), citation precision (does each cited chunk actually support its claim?), recall on the reference (similarity). - Print a table: per-question score + aggregate. - README: - One paragraph each on the four design decisions (chunking, retrieval, prompting, eval). - Sample question + cited answer. - Cost: "$X per query at K=5, embedding cost $Y for full ingest". - Latency: "p50 ~1.2s, p95 ~2.5s". - Known limitations: handles only English text PDFs; tables/figures ignored. - With more time: hybrid retrieval, query expansion, multi-hop, caching.

Common follow-ups in the interview: - "Why these chunk sizes?" - "Walk me through a failing query and how you'd fix it." - "What's your eval missing?" - "Suppose the corpus grows to 10M docs — what changes?"

Traps: - No metadata on chunks → citations are vague or fabricated. - No refusal behavior → model confabulates on out-of-corpus questions. - 5000-token chunks → poor retrieval precision, blown context budget. - Zero evals. Even 10 manual examples is fine; zero is a signal.

Related cross-cutting: Architecture choices, Evaluation & quality Related module: learning/01_ai_engineering/08_rag_system_design/, learning/01_ai_engineering/14_retrieval_ranking/

Archetype 2: LLM + structured output¶

Q: "Build a service that takes free-text customer support tickets and returns structured JSON: {category, priority, suggested_response}."¶

Tags: senior · very-common · coding · source: 2025-2026 AI engineer take-homes; classic structured-output archetype

Scoring criteria: - Reliability of structured output: does it always return valid JSON in the declared schema? - Prompt design: clear instructions, few-shot examples, edge case handling. - Error handling: model returns malformed JSON, model hallucinates a new category, schema validation fails. - Eval: do you measure accuracy on category / priority? Have a golden set? - API design: clean endpoint, sensible status codes, batch endpoint.

Outline of a strong submission: - Schema (Pydantic):

class TicketAnalysis(BaseModel):
    category: Literal["billing", "technical", "account", "general", "other"]
    priority: Literal["low", "medium", "high", "urgent"]
    suggested_response: str
    confidence: float = Field(ge=0, le=1)

- Prompt (prompts.py): - System: define categories, priority rubric, output schema with one-shot example. - User: the ticket text. - Reinforce: "Return only valid JSON conforming to the schema." - Generation: - Use the provider's structured-output mode (OpenAI's response_format={"type": "json_schema", ...}, Anthropic tools). - Fallback: if structured mode unavailable, prompt with the schema and validate with Pydantic; retry on validation failure (1-2 retries with the error in the prompt). - API (server.py): - FastAPI or Flask. Endpoints: POST /analyze (single), POST /analyze/batch (list). - Retries on transient model errors with exponential backoff. - Cost / latency in response metadata (or only in logs). - Eval: - 30-50 labeled tickets. Compute per-class precision/recall on category; ordinal accuracy on priority (off-by-one ≤ exact, off-by-two penalized more). - LLM-judge on suggested_response: helpful, accurate, polite, addresses the issue. - README: - Schema rationale: why these categories? Why this priority rubric? - Prompt design notes: what made it stable? - Failure modes: what kinds of tickets does it get wrong? - Cost: "$X per ticket, batch endpoint amortizes".

Common follow-ups: - "What happens when the model returns invalid JSON?" - "How do you handle a new category that's not in the schema?" - "Walk me through your eval setup."

Traps: - No retry on schema validation failure. - Greedy prompt + no structured-output mode → frequent JSON breaks. - No eval on the suggested_response — that's the part users see. - One-class accuracy reported as the only metric. Need per-class.

Related cross-cutting: Production patterns, Evaluation & quality Related module: learning/01_ai_engineering/05_prompting_patterns/, learning/01_ai_engineering/22_evals_production/

Archetype 3: Web crawler + LLM enrichment¶

Q: "Given a list of company URLs, crawl each homepage and use an LLM to extract: {company_name, industry, product_summary, employee_count_estimate}. Output a CSV."¶

Tags: senior · common · coding · source: 2025-2026 AI engineer take-homes; data-extraction archetype

Scoring criteria: - Crawling: respect for robots.txt, sensible concurrency, error handling on dead URLs. - Extraction: prompt design, schema, hallucination control (especially employee_count when not on page). - Concurrency: async / threaded fan-out, rate limiting, retry logic. - Output: clean CSV, including failed rows with reason. - Evals: did you spot-check 10 outputs? Any precision claim?

Outline of a strong submission: - Crawler (crawl.py): - aiohttp + asyncio for concurrent fetch (or httpx.AsyncClient). - Respect robots.txt. Cap concurrency (~10-20 simultaneous). User-Agent set explicitly. - Timeout per request (10-15s). Retry once on 5xx / connection error. - Strip HTML via trafilatura or readability-lxml — main content only, not nav/footer junk. - LLM extraction (extract.py): - Pydantic schema:

class CompanyInfo(BaseModel):
    company_name: str
    industry: Literal[...]
    product_summary: str  # 1-2 sentences
    employee_count_estimate: Optional[Literal["1-10", "11-50", "51-200", "201-1k", "1k+"]] = None
    employee_count_source: Optional[str] = None  # "page text" or "inferred from copy"

- Critical: employee_count_estimate is Optional. If not on the page, return None — don't guess. Make this explicit in the prompt. - Few-shot the prompt with one extraction example. - Pipeline (main.py): - Read URL list, fan out crawl + extract via asyncio.gather with semaphore. - Write CSV per row as it completes. Failed rows go in with error_reason column populated. - Evals: - Spot-check 20 random rows; report your manual judgment as "extraction quality 17/20 correct, 2 partial, 1 wrong (URL was 404)". - README: - Crawl ethics: robots.txt, rate limiting, user-agent. - Hallucination control: how you prevent the model from making up employee counts. - Failure stats: of N URLs, M succeeded, K failed (broken out by reason). - Cost: "$X total for N URLs, dominated by LLM calls". - With more time: LinkedIn lookup for employee count, page-2 crawl for richer signal, ANN dedup of similar companies.

Common follow-ups: - "How do you stop the model from inventing employee counts?" - "What happens if a page is JavaScript-rendered?" - "How would you scale this to 1M URLs?"

Traps: - No robots.txt respect → instant red flag (this happens in interviews and matters). - Synchronous crawl → too slow even for the take-home dataset. - LLM is allowed to guess everything. No None fallbacks for ungrounded fields. - No CSV row for failed URLs. Disappears silently.

Related cross-cutting: Architecture choices, Production patterns Related module: learning/01_ai_engineering/05_prompting_patterns/

Archetype 4: Mini-agent with tools¶

Q: "Build an agent that can answer questions requiring web search and arithmetic. Provide a CLI."¶

Tags: senior · very-common · coding · source: 2025-2026 AI engineer take-homes; agent archetype

Scoring criteria: - Tool design: clear schemas, good descriptions, useful return values. - Loop control: termination condition, max iterations, error handling on tool failure. - Reasoning trace: can you see what the agent did and why? - Quality: does it actually solve mixed-tool questions ("what's 7% of the population of Tokyo?")? - Engineering: clean separation between agent loop, tools, and prompts.

Outline of a strong submission: - Tool definitions (tools.py):

@tool
def web_search(query: str) -> str:
    """Search the web. Returns top-3 result snippets with URLs."""
    # Real implementation via Tavily / Brave / SerpAPI / Bing.

@tool
def calculator(expression: str) -> str:
    """Evaluate a math expression. Supports +, -, *, /, **, math.sqrt, math.pi.
    Returns the numeric result as a string."""
    # Safe-eval against a whitelist; not Python's eval.

- Tool descriptions are load-bearing: the model decides whether to call them based on these. - Agent loop (agent.py): - Hand-rolled (don't pull in LangGraph for a small take-home; demonstrates control). - Loop: send message + tool list → receive response → if tool_calls present, execute and append results → repeat. Max 10 iterations. - On tool error: append the error as tool output (let the model decide to retry or give up). - Stop on stop_reason == "end_turn". - CLI (main.py): - python main.py "question" → prints answer + reasoning trace. - Reasoning trace: each step's (tool_name, tool_input, tool_output). - Evals: - 10 mixed-tool questions: pure search ("who won the 2024 Booker"), pure arithmetic ("what's 17! mod 23"), mixed ("what's 7% of Tokyo's population"). Tag each with the expected tools. - Score: did it produce the right answer? Did it use sensible tools? - README: - Tool design rationale: schema, descriptions, what you punted (e.g., no Python sandbox). - Loop control: max iterations, termination, error handling. - Trace example. - Cost: "$X per question avg, dominated by N model calls in the loop". - With more time: tool chaining via planner, fallback to general knowledge, caching of search results.

Common follow-ups: - "How do you stop infinite loops?" - "What happens if a tool returns garbage?" - "How would you add a third tool — e.g., a Python interpreter?"

Traps: - eval() on the calculator expression. Code execution sandbox issue; even in a take-home, use a safe-eval. - No max iteration cap. Demo can hang. - Tool descriptions that are too terse. Model doesn't know when to use them. - No reasoning trace. Interviewer can't debug.

Related cross-cutting: Architecture choices, Production patterns Related module: learning/01_ai_engineering/17_agents_design/

Archetype 5: Text classification + LLM judge eval¶

Q: "Build a sentiment classifier for product reviews. You can use any approach (zero-shot LLM, fine-tuned model, classical ML). Include an evaluation."¶

Tags: senior · common · coding · source: 2025-2026 AI engineer take-homes; eval-focused archetype

Scoring criteria: - Approach justification: why this method given the dataset / constraints? - Modeling: implementation, hyperparameters, model choice. - Eval rigor: stratified split, baseline comparison, metric choice. - Tradeoff awareness: latency, cost, accuracy — which did you pick and why? - Code quality + writeup.

Outline of a strong submission: - Approach (pick one + justify): - Zero-shot LLM (gpt-4o-mini or claude-haiku): fastest to build, $$ per query, ~85-90% accuracy on standard sentiment. - Fine-tuned small model (distilbert-base-uncased): cheaper per query at scale, requires labeled data + training. - Classical (logistic regression on TF-IDF): cheapest, ~80-85% accuracy, instant inference. - Pick based on the data size and constraints (which the take-home will specify or you should ask about). Bonus: implement two and compare. - Pipeline (classify.py): - If LLM: prompt + structured output (positive/negative/neutral); few-shot. - If fine-tuned: HF transformers + Trainer; reasonable hyperparams; freeze layers if dataset is small. - If classical: sklearn pipeline; TF-IDF (1-2 grams) + LR. - Eval (eval.py): - Stratified 80/10/10 train/val/test. - Per-class precision/recall/F1. - Confusion matrix. - Comparison against a strong baseline (e.g., always-predict-majority-class). - For zero-shot LLM: no train split; just test on a held-out labeled set. - README: - Why this approach for this dataset. - Numbers: F1 per class, latency per inference, cost per 1000 classifications. - Failure analysis: 5-10 misclassified examples + your interpretation of why. - With more time: ensemble, active learning to expand labels, hard-example mining.

Common follow-ups: - "Why did you pick this approach over the others?" - "Walk me through 3 of your misclassifications." - "What if the label distribution shifts in production?"

Traps: - LLM zero-shot with no eval. "It probably works." - Reporting only accuracy when classes are imbalanced. - No baseline. The interviewer can't tell if 87% is good or bad. - 95%+ accuracy claimed with no train/test split discipline. Test leak suspect.

Related cross-cutting: Evaluation & quality, Cost & latency Related module: learning/01_ai_engineering/22_evals_production/, learning/00_ai_foundation/06_adaptation_compression/

Archetype 6: End-to-end document NLP pipeline¶

Q: "Given a folder of PDFs, build an end-to-end system that: (1) extracts all named entities, (2) builds a knowledge graph of relationships, (3) lets users query the graph with natural language."¶

Tags: staff · common · coding · source: 2025-2026 senior AI engineer take-homes; complex-pipeline archetype

Scoring criteria: - Scope management: did you slice the problem into something deliverable in the time given? - Pipeline design: where does each stage live, how does state flow, how do you handle partial failures? - LLM usage discipline: when does each stage call the LLM, what's the cost profile? - Evals: at least one quality check per stage. - Writeup: this is what differentiates strong from weak submissions on a complex take-home.

Outline of a strong submission: - Scope decision (be explicit in README): - Stage 1 (entity extraction): deliver fully. spaCy + LLM hybrid (spaCy for fast pass, LLM for low-confidence or domain-specific). - Stage 2 (relationship extraction): deliver, but maybe only for top-K entity pairs to manage cost. - Stage 3 (NL query): deliver a basic version (entity lookup + 1-hop relationship traversal). Punt complex multi-hop reasoning to "future work" — explain. - Stage 1: NER: - Pass 1: spaCy's en_core_web_trf for person/org/loc/misc. Cheap. - Pass 2: LLM extract for domain-specific entities (e.g., products, financial terms). Schema-constrained output. - Merge + dedup via canonical name + alias mapping. - Stage 2: Relationship extraction: - For each pair of entities co-occurring in the same paragraph: LLM judges if there's a relationship, what type, with a confidence. - Filter low-confidence (< 0.7). - Store in a graph DB (Neo4j or a simple networkx.MultiDiGraph for the take-home). - Stage 3: NL query: - LLM parses the query into: (entity_mention, relationship_filter, hop_count). - Entity resolution against graph nodes (string match + embedding fallback). - Traverse graph; format result as text. - Eval: - Stage 1: NER F1 against a hand-labeled 20-doc subset. - Stage 2: spot-check 30 extracted relationships; report precision. - Stage 3: 10 query → expected answer pairs. - README (this is the make-or-break artifact): - Architecture diagram (ASCII is fine). - Scope decisions with rationale. - Cost estimate per doc and per query. - Known failure modes (each stage). - Roadmap for production: persistent graph DB, batched LLM calls, embedding-based entity resolution, multi-hop query planning.

Common follow-ups: - "What's the highest-impact thing you'd do with another week?" - "Walk me through one query end-to-end." - "Where does cost concentrate?"

Traps: - Trying to fully solve all three stages → nothing works end-to-end. Better to ship 2.5 stages well. - LLM-call per entity pair → blows budget on large docs. Filter candidates. - No graph visualization or sample output. Hard to evaluate. - README is a paragraph. Complex pipelines need ~3 pages of writeup.

Related cross-cutting: Architecture choices, Cost & latency Related module: learning/01_ai_engineering/05_prompting_patterns/, learning/01_ai_engineering/08_rag_system_design/

Archetype 7: Eval framework¶

Q: "Build a small evaluation framework for an LLM-based product. We'll bring our own test cases."¶

Tags: staff · occasional · coding · source: 2025-2026 senior AI engineer take-homes; eval-focused archetype

Scoring criteria: - Framework design: extensibility, separation of concerns, plug-in metrics. - Metric implementation: at least 3 (LLM-judge, semantic similarity, regex/exact match). - Reporting: not just a number — distributions, per-category breakdowns, regression detection. - Determinism: same input → same output (set seeds, version-pin models). - Engineering: clean CLI, config-driven, easy to add new test cases.

Outline of a strong submission: - Core abstractions:

class TestCase:
    id: str
    input: dict  # task-specific
    expected: Optional[Any]
    tags: list[str]

class Metric(Protocol):
    name: str
    def score(self, actual, expected, test_case) -> float: ...

class EvalRunner:
    def __init__(self, system_under_test, metrics, dataset):
        ...
    def run(self) -> EvalReport: ...

- Metrics: - ExactMatch — string equality. - SemanticSimilarity — cosine of embeddings. - LLMJudge — model rates the output on a rubric. Prompt is part of the metric. - Regex — pass/fail on a regex over the output. - Faithfulness (RAG-specific) — LLM judges whether the output is supported by provided context. - Reporting (report.py): - Per-metric distribution: mean, p50, p95, min/max. - Per-tag breakdown: which test categories pass/fail at what rates. - Diff against last run (if a baseline is provided): which test cases regressed? - HTML or markdown output. - CLI (run.py): - python run.py --dataset tests.yaml --baseline last_run.json --output report.html - README: - How to add a new metric (concrete example). - How to add a test case. - Determinism guarantees: seeds, pinned model versions. - Cost: "LLM-judge metric is ~$X per test case; pure metrics are free".

Common follow-ups: - "How would you handle a metric that depends on multiple model calls?" - "How do you decide a regression is real vs noise?" - "Walk me through your LLM-judge prompt."

Traps: - Metrics tangled with the dataset format. Make them independent. - Single-number summary. Distributions matter more than means. - LLM-judge with no calibration. State that you'd add a small set of human-rated examples to validate.

Related cross-cutting: Evaluation & quality, Production patterns Related module: learning/01_ai_engineering/22_evals_production/

Discussing your take-home in the follow-up¶

Q: "Walk me through your README."¶

Tags: senior · very-common · scenario · source: 2025-2026 AI engineer take-home debriefs (universal)

Answer outline: - Open with the problem framing (one sentence) and the architecture (one sentence). Anchor. - Walk through the major design decisions in the order they appeared. For each: what you chose, what you considered, why you picked this. - Pause on the interesting decisions — chunking, retrieval, prompt design, evals. Skip the boring ones. - Volunteer what didn't work. "I first tried X, it gave Y problem, switched to Z." This is the strongest signal — you debugged in real time. - End with "what I'd do with more time". Be specific (one or two concrete next steps, not a wishlist). - Stay quantitative. "On 50 queries, faithfulness was 0.78." > "It worked pretty well."

Common follow-ups: - "Why did you pick X over Y at this decision point?" - "What's the weakest part of your submission?" - "If I gave you another two days, what's the highest-impact thing you'd do?"

Traps: - Defending every choice. Showing you can name the weakest part is a strength. - Walking through code line-by-line instead of decisions. The code is in the repo; the thinking is what they want. - Forgetting numbers. "I think it was good" loses to "F1 0.82 on the test split".

Related cross-cutting: none — universal interview skill Related module: learning/applied_ai_interview_focus.md

Q: "What's the weakest part of your submission?"¶

Tags: senior · common · scenario · source: 2025-2026 AI engineer take-home debriefs

Answer outline: - Pick a real weakness — small enough that it doesn't disqualify, real enough that it's not theatre. Examples: - "My eval set is small — 20 examples. With another day I'd grow it and add adversarial cases." - "I didn't handle non-English input; my chunker assumes whitespace tokenization." - "Tool timeouts are global; I'd make them per-tool with sensible defaults." - State why it's a weakness in production terms (what failure mode it exposes). - Then state how you'd fix it. Specific, not hand-wavy. "I'd add tag-stratified sampling to the eval set so we don't overfit metric to the easy bucket." - Optional: state what you considered and rejected. "I considered LangChain; rejected because debugging the loop matters here and the abstraction hides the steps."

Common follow-ups: - "What's the second-weakest part?" - "How would you prioritize fixes if you had a sprint?"

Traps: - Fake weakness ("I should add more tests" — meaningless without specifics). - Refusing to name one. Reads as defensive. - Naming something so big it disqualifies you (oversharing).

Related cross-cutting: none — universal Related module: learning/applied_ai_interview_focus.md

Q: "How would you make this production-ready?"¶

Tags: senior · very-common · scenario · source: 2025-2026 AI engineer take-home debriefs

Answer outline: - Group the work, don't list 50 items: - Reliability: retries, circuit breakers, graceful degradation on dependency failure, per-tenant rate limits. Mention timeouts on every external call. - Observability: structured logs, traces (OpenTelemetry), per-request cost tracking, eval pass-rate dashboard. See observability-tracing.md. - Quality: continuous eval suite (golden + adversarial), online judge sampling X% of traffic, regression alerts. - Cost: prompt caching, response caching, smaller model on the hot path with fallback to larger, batching where possible. - Security: auth, input validation, prompt injection defenses, PII redaction in logs. - Deployment: containerized, autoscaling, canary or shadow rollout for changes, versioned model + prompt pins. - Pick the 2-3 you'd prioritize first and say why (the ones with highest risk on the current shape of the system). - Avoid sounding like a checklist. Tie each to a specific thing your take-home would suffer without.

Common follow-ups: - "Which one would you do first?" - "How do you decide if a change to prompts goes live?"

Traps: - Generic 50-item list. Sounds rote. - Forgetting eval continuity. Production AI without continuous evals drifts.

Related cross-cutting: Production patterns Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/01_ai_engineering/04_resilient_agent_systems/