Practical Take-Home Prompts — Interview Questions¶
The "you have 2-6 hours, build a working slice and write it up" round. Different from ml-coding-rounds.md (from-scratch ML primitives) and classic-algo.md (DSA-flavored). This file is the realistic miniature project interview: small but end-to-end, requires real judgment on architecture, error handling, prompts, evals, and tradeoffs.
The senior tell is not the cleverness of the code. It's a short, honest README — what you built, what you punted, what you'd do next, what the test cases are, what the failure modes are. Treat the writeup as part of the deliverable; many teams weight it equally with the code.
The takehomes here are real archetypes from 2025-2026 AI engineer loops. Each one includes: prompt, scoring criteria, an outline of what a strong submission looks like, common traps, and how to talk about it in the followup interview ("walk me through your README").
Universal advice¶
What gets you a strong signal¶
- Working code with a one-command run:
pip install -r requirements.txt && python main.py. Or a Dockerfile. Anything more than two steps loses signal. - A short README (~1-2 pages) covering: problem framing, design decisions, what you punted and why, how to test, sample outputs, what you'd do with more time.
- Test cases that demonstrate the system handling: golden case, edge case, adversarial case. Even 3-5 tests is enough.
- One eval of any kind. Even a 10-example handmade golden set with a simple correctness function beats no eval.
- Latency / cost honesty. Add a paragraph: "this costs $X per call, p95 latency is Y, here's where time is spent". Demonstrates production thinking.
- Naming model versions explicitly.
gpt-4o-2024-08-06, notgpt-4o. Mention why. - Acknowledge what you didn't do: "no auth, no rate limiting, no DB — would add for prod". Better than leaving the interviewer to wonder.
What loses signal¶
- 800-line
main.pydoing everything. Even simple modularity (retrieval.py,llm.py,prompts.py,evals.py) signals seniority. - No prompts version-controlled. Prompts inline in code, no comments, no rationale.
- No mention of cost, latency, or eval. You shipped without thinking about production.
- Mocked the LLM entirely. Even one real call with a real provider beats a stubbed transformer.
- Untested error paths. Show what happens on retrieval miss, model timeout, malformed output.
- "I would have done X if I had more time" — but X is the heart of the problem. If asked to build a RAG pipeline, don't punt retrieval.
- Over-engineering. The 4-hour take-home with 12 files, a custom logging framework, and a config DSL signals poor judgment about scope.
How to talk about it in the follow-up¶
- Walk through the README, not the code. The interviewer wants to hear your thinking.
- Be specific about decisions: "I picked top-K=5 because at K=10 I saw context-stuffing hurt faithfulness on my evals."
- Volunteer the tradeoffs you considered and rejected: "I considered hybrid retrieval; for the corpus size it wasn't worth the complexity."
- Say what's broken. Volunteering known issues earns trust.
- Quantify when you can. "On 50 test queries, faithfulness was 0.78."
Archetype 1: RAG over a small corpus¶
Q: "Build a Q&A system over the included 50 PDFs. We'll ask questions; you return cited answers."¶
Tags: senior · very-common · coding · source: 2025-2026 AI engineer take-homes; standard RAG archetype
Scoring criteria the loop will use: - Ingestion: how documents become chunks. Do you choose chunk size sensibly? Strip headers/footers? Preserve doc/page metadata? - Retrieval: which embeddings, which DB, what top-K, any reranking? Why? - Generation: prompt design, citation format, refusal behavior when no good context. - Quality: are answers grounded? Cited? Does it refuse on out-of-corpus queries? - Evals: do you have any? Even a small handmade set with faithfulness scoring counts. - Engineering: code structure, README clarity, repeatability.
Outline of a strong submission:
- Stack: pgvector or Qdrant (or even FAISS if local-only), text-embedding-3-small, gpt-4o-mini for generation, judge-model gpt-4o for evals.
- Ingestion (ingest.py):
- PDF → text via pypdf or pdfplumber.
- Chunk to ~500-800 tokens with 50-100 token overlap. Why: most papers have ~300-700 token paragraphs; this keeps full paragraphs together while preserving cross-chunk context.
- Strip headers/footers via heuristic (lines appearing >X% of pages).
- Preserve (doc_id, page_num, chunk_idx) metadata on every chunk.
- Embed and upsert.
- Retrieval (retrieval.py):
- Top-K dense retrieval (K=5-10).
- Optional: cross-encoder reranker if quality demands; mention bge-reranker-base or cohere-rerank-3.
- Mention you considered BM25/hybrid; decided not worth complexity at this corpus size (or did include — explain why).
- Generation (answer.py):
- Prompt template includes: system rules ("answer only from context, cite source"), retrieved chunks with [doc_id p.N] markers, user question.
- Citation format: [doc.pdf p.3] inline. Parse and verify all citations refer to chunks actually in context.
- Refusal: if top-K retrieval scores are all below threshold, return "I don't have enough info in the corpus" with no citations.
- Evals (eval.py):
- 10-20 hand-crafted Q&A pairs with reference answers.
- Metrics: faithfulness (LLM-judge: does the answer follow from the context?), citation precision (does each cited chunk actually support its claim?), recall on the reference (similarity).
- Print a table: per-question score + aggregate.
- README:
- One paragraph each on the four design decisions (chunking, retrieval, prompting, eval).
- Sample question + cited answer.
- Cost: "$X per query at K=5, embedding cost $Y for full ingest".
- Latency: "p50 ~1.2s, p95 ~2.5s".
- Known limitations: handles only English text PDFs; tables/figures ignored.
- With more time: hybrid retrieval, query expansion, multi-hop, caching.
Common follow-ups in the interview: - "Why these chunk sizes?" - "Walk me through a failing query and how you'd fix it." - "What's your eval missing?" - "Suppose the corpus grows to 10M docs — what changes?"
Traps: - No metadata on chunks → citations are vague or fabricated. - No refusal behavior → model confabulates on out-of-corpus questions. - 5000-token chunks → poor retrieval precision, blown context budget. - Zero evals. Even 10 manual examples is fine; zero is a signal.
Related cross-cutting: Architecture choices, Evaluation & quality
Related module: learning/01_ai_engineering/08_rag_system_design/, learning/01_ai_engineering/14_retrieval_ranking/
Archetype 2: LLM + structured output¶
Q: "Build a service that takes free-text customer support tickets and returns structured JSON: {category, priority, suggested_response}."¶
Tags: senior · very-common · coding · source: 2025-2026 AI engineer take-homes; classic structured-output archetype
Scoring criteria: - Reliability of structured output: does it always return valid JSON in the declared schema? - Prompt design: clear instructions, few-shot examples, edge case handling. - Error handling: model returns malformed JSON, model hallucinates a new category, schema validation fails. - Eval: do you measure accuracy on category / priority? Have a golden set? - API design: clean endpoint, sensible status codes, batch endpoint.
Outline of a strong submission: - Schema (Pydantic):
class TicketAnalysis(BaseModel):
category: Literal["billing", "technical", "account", "general", "other"]
priority: Literal["low", "medium", "high", "urgent"]
suggested_response: str
confidence: float = Field(ge=0, le=1)
prompts.py):
- System: define categories, priority rubric, output schema with one-shot example.
- User: the ticket text.
- Reinforce: "Return only valid JSON conforming to the schema."
- Generation:
- Use the provider's structured-output mode (OpenAI's response_format={"type": "json_schema", ...}, Anthropic tools).
- Fallback: if structured mode unavailable, prompt with the schema and validate with Pydantic; retry on validation failure (1-2 retries with the error in the prompt).
- API (server.py):
- FastAPI or Flask. Endpoints: POST /analyze (single), POST /analyze/batch (list).
- Retries on transient model errors with exponential backoff.
- Cost / latency in response metadata (or only in logs).
- Eval:
- 30-50 labeled tickets. Compute per-class precision/recall on category; ordinal accuracy on priority (off-by-one ≤ exact, off-by-two penalized more).
- LLM-judge on suggested_response: helpful, accurate, polite, addresses the issue.
- README:
- Schema rationale: why these categories? Why this priority rubric?
- Prompt design notes: what made it stable?
- Failure modes: what kinds of tickets does it get wrong?
- Cost: "$X per ticket, batch endpoint amortizes".
Common follow-ups: - "What happens when the model returns invalid JSON?" - "How do you handle a new category that's not in the schema?" - "Walk me through your eval setup."
Traps: - No retry on schema validation failure. - Greedy prompt + no structured-output mode → frequent JSON breaks. - No eval on the suggested_response — that's the part users see. - One-class accuracy reported as the only metric. Need per-class.
Related cross-cutting: Production patterns, Evaluation & quality
Related module: learning/01_ai_engineering/05_prompting_patterns/, learning/01_ai_engineering/22_evals_production/
Archetype 3: Web crawler + LLM enrichment¶
Q: "Given a list of company URLs, crawl each homepage and use an LLM to extract: {company_name, industry, product_summary, employee_count_estimate}. Output a CSV."¶
Tags: senior · common · coding · source: 2025-2026 AI engineer take-homes; data-extraction archetype
Scoring criteria: - Crawling: respect for robots.txt, sensible concurrency, error handling on dead URLs. - Extraction: prompt design, schema, hallucination control (especially employee_count when not on page). - Concurrency: async / threaded fan-out, rate limiting, retry logic. - Output: clean CSV, including failed rows with reason. - Evals: did you spot-check 10 outputs? Any precision claim?
Outline of a strong submission:
- Crawler (crawl.py):
- aiohttp + asyncio for concurrent fetch (or httpx.AsyncClient).
- Respect robots.txt. Cap concurrency (~10-20 simultaneous). User-Agent set explicitly.
- Timeout per request (10-15s). Retry once on 5xx / connection error.
- Strip HTML via trafilatura or readability-lxml — main content only, not nav/footer junk.
- LLM extraction (extract.py):
- Pydantic schema:
class CompanyInfo(BaseModel):
company_name: str
industry: Literal[...]
product_summary: str # 1-2 sentences
employee_count_estimate: Optional[Literal["1-10", "11-50", "51-200", "201-1k", "1k+"]] = None
employee_count_source: Optional[str] = None # "page text" or "inferred from copy"
employee_count_estimate is Optional. If not on the page, return None — don't guess. Make this explicit in the prompt.
- Few-shot the prompt with one extraction example.
- Pipeline (main.py):
- Read URL list, fan out crawl + extract via asyncio.gather with semaphore.
- Write CSV per row as it completes. Failed rows go in with error_reason column populated.
- Evals:
- Spot-check 20 random rows; report your manual judgment as "extraction quality 17/20 correct, 2 partial, 1 wrong (URL was 404)".
- README:
- Crawl ethics: robots.txt, rate limiting, user-agent.
- Hallucination control: how you prevent the model from making up employee counts.
- Failure stats: of N URLs, M succeeded, K failed (broken out by reason).
- Cost: "$X total for N URLs, dominated by LLM calls".
- With more time: LinkedIn lookup for employee count, page-2 crawl for richer signal, ANN dedup of similar companies.
Common follow-ups: - "How do you stop the model from inventing employee counts?" - "What happens if a page is JavaScript-rendered?" - "How would you scale this to 1M URLs?"
Traps: - No robots.txt respect → instant red flag (this happens in interviews and matters). - Synchronous crawl → too slow even for the take-home dataset. - LLM is allowed to guess everything. No None fallbacks for ungrounded fields. - No CSV row for failed URLs. Disappears silently.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/01_ai_engineering/05_prompting_patterns/
Archetype 4: Mini-agent with tools¶
Q: "Build an agent that can answer questions requiring web search and arithmetic. Provide a CLI."¶
Tags: senior · very-common · coding · source: 2025-2026 AI engineer take-homes; agent archetype
Scoring criteria: - Tool design: clear schemas, good descriptions, useful return values. - Loop control: termination condition, max iterations, error handling on tool failure. - Reasoning trace: can you see what the agent did and why? - Quality: does it actually solve mixed-tool questions ("what's 7% of the population of Tokyo?")? - Engineering: clean separation between agent loop, tools, and prompts.
Outline of a strong submission:
- Tool definitions (tools.py):
@tool
def web_search(query: str) -> str:
"""Search the web. Returns top-3 result snippets with URLs."""
# Real implementation via Tavily / Brave / SerpAPI / Bing.
@tool
def calculator(expression: str) -> str:
"""Evaluate a math expression. Supports +, -, *, /, **, math.sqrt, math.pi.
Returns the numeric result as a string."""
# Safe-eval against a whitelist; not Python's eval.
agent.py):
- Hand-rolled (don't pull in LangGraph for a small take-home; demonstrates control).
- Loop: send message + tool list → receive response → if tool_calls present, execute and append results → repeat. Max 10 iterations.
- On tool error: append the error as tool output (let the model decide to retry or give up).
- Stop on stop_reason == "end_turn".
- CLI (main.py):
- python main.py "question" → prints answer + reasoning trace.
- Reasoning trace: each step's (tool_name, tool_input, tool_output).
- Evals:
- 10 mixed-tool questions: pure search ("who won the 2024 Booker"), pure arithmetic ("what's 17! mod 23"), mixed ("what's 7% of Tokyo's population"). Tag each with the expected tools.
- Score: did it produce the right answer? Did it use sensible tools?
- README:
- Tool design rationale: schema, descriptions, what you punted (e.g., no Python sandbox).
- Loop control: max iterations, termination, error handling.
- Trace example.
- Cost: "$X per question avg, dominated by N model calls in the loop".
- With more time: tool chaining via planner, fallback to general knowledge, caching of search results.
Common follow-ups: - "How do you stop infinite loops?" - "What happens if a tool returns garbage?" - "How would you add a third tool — e.g., a Python interpreter?"
Traps:
- eval() on the calculator expression. Code execution sandbox issue; even in a take-home, use a safe-eval.
- No max iteration cap. Demo can hang.
- Tool descriptions that are too terse. Model doesn't know when to use them.
- No reasoning trace. Interviewer can't debug.
Related cross-cutting: Architecture choices, Production patterns
Related module: learning/01_ai_engineering/17_agents_design/
Archetype 5: Text classification + LLM judge eval¶
Q: "Build a sentiment classifier for product reviews. You can use any approach (zero-shot LLM, fine-tuned model, classical ML). Include an evaluation."¶
Tags: senior · common · coding · source: 2025-2026 AI engineer take-homes; eval-focused archetype
Scoring criteria: - Approach justification: why this method given the dataset / constraints? - Modeling: implementation, hyperparameters, model choice. - Eval rigor: stratified split, baseline comparison, metric choice. - Tradeoff awareness: latency, cost, accuracy — which did you pick and why? - Code quality + writeup.
Outline of a strong submission:
- Approach (pick one + justify):
- Zero-shot LLM (gpt-4o-mini or claude-haiku): fastest to build, $$ per query, ~85-90% accuracy on standard sentiment.
- Fine-tuned small model (distilbert-base-uncased): cheaper per query at scale, requires labeled data + training.
- Classical (logistic regression on TF-IDF): cheapest, ~80-85% accuracy, instant inference.
- Pick based on the data size and constraints (which the take-home will specify or you should ask about). Bonus: implement two and compare.
- Pipeline (classify.py):
- If LLM: prompt + structured output (positive/negative/neutral); few-shot.
- If fine-tuned: HF transformers + Trainer; reasonable hyperparams; freeze layers if dataset is small.
- If classical: sklearn pipeline; TF-IDF (1-2 grams) + LR.
- Eval (eval.py):
- Stratified 80/10/10 train/val/test.
- Per-class precision/recall/F1.
- Confusion matrix.
- Comparison against a strong baseline (e.g., always-predict-majority-class).
- For zero-shot LLM: no train split; just test on a held-out labeled set.
- README:
- Why this approach for this dataset.
- Numbers: F1 per class, latency per inference, cost per 1000 classifications.
- Failure analysis: 5-10 misclassified examples + your interpretation of why.
- With more time: ensemble, active learning to expand labels, hard-example mining.
Common follow-ups: - "Why did you pick this approach over the others?" - "Walk me through 3 of your misclassifications." - "What if the label distribution shifts in production?"
Traps: - LLM zero-shot with no eval. "It probably works." - Reporting only accuracy when classes are imbalanced. - No baseline. The interviewer can't tell if 87% is good or bad. - 95%+ accuracy claimed with no train/test split discipline. Test leak suspect.
Related cross-cutting: Evaluation & quality, Cost & latency
Related module: learning/01_ai_engineering/22_evals_production/, learning/00_ai_foundation/06_adaptation_compression/
Archetype 6: End-to-end document NLP pipeline¶
Q: "Given a folder of PDFs, build an end-to-end system that: (1) extracts all named entities, (2) builds a knowledge graph of relationships, (3) lets users query the graph with natural language."¶
Tags: staff · common · coding · source: 2025-2026 senior AI engineer take-homes; complex-pipeline archetype
Scoring criteria: - Scope management: did you slice the problem into something deliverable in the time given? - Pipeline design: where does each stage live, how does state flow, how do you handle partial failures? - LLM usage discipline: when does each stage call the LLM, what's the cost profile? - Evals: at least one quality check per stage. - Writeup: this is what differentiates strong from weak submissions on a complex take-home.
Outline of a strong submission:
- Scope decision (be explicit in README):
- Stage 1 (entity extraction): deliver fully. spaCy + LLM hybrid (spaCy for fast pass, LLM for low-confidence or domain-specific).
- Stage 2 (relationship extraction): deliver, but maybe only for top-K entity pairs to manage cost.
- Stage 3 (NL query): deliver a basic version (entity lookup + 1-hop relationship traversal). Punt complex multi-hop reasoning to "future work" — explain.
- Stage 1: NER:
- Pass 1: spaCy's en_core_web_trf for person/org/loc/misc. Cheap.
- Pass 2: LLM extract for domain-specific entities (e.g., products, financial terms). Schema-constrained output.
- Merge + dedup via canonical name + alias mapping.
- Stage 2: Relationship extraction:
- For each pair of entities co-occurring in the same paragraph: LLM judges if there's a relationship, what type, with a confidence.
- Filter low-confidence (< 0.7).
- Store in a graph DB (Neo4j or a simple networkx.MultiDiGraph for the take-home).
- Stage 3: NL query:
- LLM parses the query into: (entity_mention, relationship_filter, hop_count).
- Entity resolution against graph nodes (string match + embedding fallback).
- Traverse graph; format result as text.
- Eval:
- Stage 1: NER F1 against a hand-labeled 20-doc subset.
- Stage 2: spot-check 30 extracted relationships; report precision.
- Stage 3: 10 query → expected answer pairs.
- README (this is the make-or-break artifact):
- Architecture diagram (ASCII is fine).
- Scope decisions with rationale.
- Cost estimate per doc and per query.
- Known failure modes (each stage).
- Roadmap for production: persistent graph DB, batched LLM calls, embedding-based entity resolution, multi-hop query planning.
Common follow-ups: - "What's the highest-impact thing you'd do with another week?" - "Walk me through one query end-to-end." - "Where does cost concentrate?"
Traps: - Trying to fully solve all three stages → nothing works end-to-end. Better to ship 2.5 stages well. - LLM-call per entity pair → blows budget on large docs. Filter candidates. - No graph visualization or sample output. Hard to evaluate. - README is a paragraph. Complex pipelines need ~3 pages of writeup.
Related cross-cutting: Architecture choices, Cost & latency
Related module: learning/01_ai_engineering/05_prompting_patterns/, learning/01_ai_engineering/08_rag_system_design/
Archetype 7: Eval framework¶
Q: "Build a small evaluation framework for an LLM-based product. We'll bring our own test cases."¶
Tags: staff · occasional · coding · source: 2025-2026 senior AI engineer take-homes; eval-focused archetype
Scoring criteria: - Framework design: extensibility, separation of concerns, plug-in metrics. - Metric implementation: at least 3 (LLM-judge, semantic similarity, regex/exact match). - Reporting: not just a number — distributions, per-category breakdowns, regression detection. - Determinism: same input → same output (set seeds, version-pin models). - Engineering: clean CLI, config-driven, easy to add new test cases.
Outline of a strong submission: - Core abstractions:
class TestCase:
id: str
input: dict # task-specific
expected: Optional[Any]
tags: list[str]
class Metric(Protocol):
name: str
def score(self, actual, expected, test_case) -> float: ...
class EvalRunner:
def __init__(self, system_under_test, metrics, dataset):
...
def run(self) -> EvalReport: ...
ExactMatch — string equality.
- SemanticSimilarity — cosine of embeddings.
- LLMJudge — model rates the output on a rubric. Prompt is part of the metric.
- Regex — pass/fail on a regex over the output.
- Faithfulness (RAG-specific) — LLM judges whether the output is supported by provided context.
- Reporting (report.py):
- Per-metric distribution: mean, p50, p95, min/max.
- Per-tag breakdown: which test categories pass/fail at what rates.
- Diff against last run (if a baseline is provided): which test cases regressed?
- HTML or markdown output.
- CLI (run.py):
- python run.py --dataset tests.yaml --baseline last_run.json --output report.html
- README:
- How to add a new metric (concrete example).
- How to add a test case.
- Determinism guarantees: seeds, pinned model versions.
- Cost: "LLM-judge metric is ~$X per test case; pure metrics are free".
Common follow-ups: - "How would you handle a metric that depends on multiple model calls?" - "How do you decide a regression is real vs noise?" - "Walk me through your LLM-judge prompt."
Traps: - Metrics tangled with the dataset format. Make them independent. - Single-number summary. Distributions matter more than means. - LLM-judge with no calibration. State that you'd add a small set of human-rated examples to validate.
Related cross-cutting: Evaluation & quality, Production patterns
Related module: learning/01_ai_engineering/22_evals_production/
Discussing your take-home in the follow-up¶
Q: "Walk me through your README."¶
Tags: senior · very-common · scenario · source: 2025-2026 AI engineer take-home debriefs (universal)
Answer outline: - Open with the problem framing (one sentence) and the architecture (one sentence). Anchor. - Walk through the major design decisions in the order they appeared. For each: what you chose, what you considered, why you picked this. - Pause on the interesting decisions — chunking, retrieval, prompt design, evals. Skip the boring ones. - Volunteer what didn't work. "I first tried X, it gave Y problem, switched to Z." This is the strongest signal — you debugged in real time. - End with "what I'd do with more time". Be specific (one or two concrete next steps, not a wishlist). - Stay quantitative. "On 50 queries, faithfulness was 0.78." > "It worked pretty well."
Common follow-ups: - "Why did you pick X over Y at this decision point?" - "What's the weakest part of your submission?" - "If I gave you another two days, what's the highest-impact thing you'd do?"
Traps: - Defending every choice. Showing you can name the weakest part is a strength. - Walking through code line-by-line instead of decisions. The code is in the repo; the thinking is what they want. - Forgetting numbers. "I think it was good" loses to "F1 0.82 on the test split".
Related cross-cutting: none — universal interview skill
Related module: learning/applied_ai_interview_focus.md
Q: "What's the weakest part of your submission?"¶
Tags: senior · common · scenario · source: 2025-2026 AI engineer take-home debriefs
Answer outline: - Pick a real weakness — small enough that it doesn't disqualify, real enough that it's not theatre. Examples: - "My eval set is small — 20 examples. With another day I'd grow it and add adversarial cases." - "I didn't handle non-English input; my chunker assumes whitespace tokenization." - "Tool timeouts are global; I'd make them per-tool with sensible defaults." - State why it's a weakness in production terms (what failure mode it exposes). - Then state how you'd fix it. Specific, not hand-wavy. "I'd add tag-stratified sampling to the eval set so we don't overfit metric to the easy bucket." - Optional: state what you considered and rejected. "I considered LangChain; rejected because debugging the loop matters here and the abstraction hides the steps."
Common follow-ups: - "What's the second-weakest part?" - "How would you prioritize fixes if you had a sprint?"
Traps: - Fake weakness ("I should add more tests" — meaningless without specifics). - Refusing to name one. Reads as defensive. - Naming something so big it disqualifies you (oversharing).
Related cross-cutting: none — universal
Related module: learning/applied_ai_interview_focus.md
Q: "How would you make this production-ready?"¶
Tags: senior · very-common · scenario · source: 2025-2026 AI engineer take-home debriefs
Answer outline:
- Group the work, don't list 50 items:
- Reliability: retries, circuit breakers, graceful degradation on dependency failure, per-tenant rate limits. Mention timeouts on every external call.
- Observability: structured logs, traces (OpenTelemetry), per-request cost tracking, eval pass-rate dashboard. See observability-tracing.md.
- Quality: continuous eval suite (golden + adversarial), online judge sampling X% of traffic, regression alerts.
- Cost: prompt caching, response caching, smaller model on the hot path with fallback to larger, batching where possible.
- Security: auth, input validation, prompt injection defenses, PII redaction in logs.
- Deployment: containerized, autoscaling, canary or shadow rollout for changes, versioned model + prompt pins.
- Pick the 2-3 you'd prioritize first and say why (the ones with highest risk on the current shape of the system).
- Avoid sounding like a checklist. Tie each to a specific thing your take-home would suffer without.
Common follow-ups: - "Which one would you do first?" - "How do you decide if a change to prompts goes live?"
Traps: - Generic 50-item list. Sounds rote. - Forgetting eval continuity. Production AI without continuous evals drifts.
Related cross-cutting: Production patterns
Related module: learning/02_ai_infrastructure/04_ml_platform_operations/, learning/01_ai_engineering/04_resilient_agent_systems/