08. AI Code Review: review behavior, not just syntax

~12 min read. A prompt diff without context is not a real review.

Builds on the ELI5 in 00-eli5.md. The crew, our reminder of shared responsibility, depends on review clarity more than cleverness.

1) Review in AI systems must ask bigger questions

Look. A normal code review checks correctness, readability, and maintainability. AI code review must do that too, but it cannot stop there. Behavior can change even when the syntax looks harmless. That is the catch.

A tiny prompt edit can change refusal style. A retrieval tweak can change evidence quality. A tool permission change can widen blast radius. A model switch can change latency and cost. The course can shift from a one-line diff. That is why AI review must look beyond code shape.
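One way to make those surfaces visible is to label a diff by the behavior surfaces it touches before anyone reads it line by line. The sketch below is a minimal illustration, not a standard: the glob patterns and the repo layout they imply are assumptions, so swap in whatever your own project uses.

```python
from fnmatch import fnmatch

# Hypothetical repo layout: these glob patterns are assumptions, not a convention.
SURFACE_PATTERNS = {
    "prompt": ["prompts/*", "*.prompt.txt"],
    "retrieval": ["retrieval/*", "*ranker*"],
    "tool": ["tools/*", "*permissions*"],
    "model": ["config/model*.yaml"],
}


def surfaces_touched(changed_paths):
    """Return the sorted behavior surfaces that a list of changed paths touches."""
    touched = set()
    for path in changed_paths:
        for surface, patterns in SURFACE_PATTERNS.items():
            if any(fnmatch(path, pattern) for pattern in patterns):
                touched.add(surface)
    return sorted(touched)


# A three-file diff: two files sit on behavior surfaces, one is ordinary code.
changed = ["prompts/support_system.txt", "src/utils/strings.py", "config/model.yaml"]
print(surfaces_touched(changed))  # ['model', 'prompt']
```

A label like this does not review anything by itself. It only tells the reviewer which bigger questions to ask.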

See. If a reviewer only checks syntax, they miss the real risk. The compass should guide questions, not just approvals. The weather check should appear before merge, not after an incident. The ship's log should explain the reason behind each important change. That is how the crew stays aligned.

Here is the picture first.

┌─────────────────────┐
│ Diff looks small    │
├─────────────────────┤
│ Prompt changed      │
│ Tool path changed   │
│ Retrieval changed   │
│ Model changed       │
└─────────────────────┘
          ▼
┌─────────────────────┐
│ Review asks         │
│ What behavior moved?│
│ Which evals cover it│
│ What is fallback?   │
└─────────────────────┘

Picture before rules. Small diff, big behavior change. Simple, no? That is the core principle.

2) What a strong AI review actually checks

So what to do? Review the changed behavior surface. Ask what moved in prompts, tools, retrieval, models, and policies. Ask what stayed protected. Ask what became weaker. Those questions matter more than ornamental comments.

A strong review usually asks these questions. What changed in the prompt or system instruction? What changed in retrieval sources, ranking, or filters? What changed in tool access, routing, or fallback logic? Which evals cover the new path? What is the fallback if the new path fails? Did latency or cost move? Did any new failure mode appear?

Look. A prompt diff alone is not reviewable. Reviewers need the surrounding purpose. What user task changed? Which examples improved? Which examples got riskier? Without that context, the diff is theatre. The crew cannot make a serious decision from fragments.

Good reviews also ask for evidence. Show the relevant eval slice. Show one or two traces if the workflow shape changed. Show the fallback behavior. Show the rollback plan when risk is high. That is the weather check in practical form. Yes? Evidence makes judgment calmer.
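To make "show the relevant eval slice" concrete, here is a minimal sketch that compares per-example results before and after a change. The function name, the result format (example id mapped to pass or fail), and the sample data are all made up for illustration; a real eval harness will have its own format.

```python
def slice_regressions(before, after):
    """Compare per-example pass/fail results for one eval slice.

    `before` and `after` map example id -> bool (True means the example passed).
    Returns (pass rate before, pass rate after, ids that passed before but fail now).
    """
    shared = sorted(set(before) & set(after))
    if not shared:
        raise ValueError("no shared examples to compare")
    rate_before = sum(before[i] for i in shared) / len(shared)
    rate_after = sum(after[i] for i in shared) / len(shared)
    regressed = [i for i in shared if before[i] and not after[i]]
    return rate_before, rate_after, regressed


# Made-up results for a hypothetical "refund questions" slice.
before = {"ex1": True, "ex2": True, "ex3": False, "ex4": True}
after = {"ex1": True, "ex2": False, "ex3": True, "ex4": True}

rate_before, rate_after, regressed = slice_regressions(before, after)
print(f"pass rate {rate_before:.2f} -> {rate_after:.2f}, regressed: {regressed}")
# pass rate 0.75 -> 0.75, regressed: ['ex2']
```

Notice that the pass rate did not move, yet one example got worse. An aggregate number alone would hide that, which is why the reviewer asks for the slice and not just the headline score.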

3) Review clarity beats review cleverness

One unhealthy pattern appears often. Authors drop a clever prompt rewrite with almost no explanation. Reviewers reply with vague praise. Merge happens because nobody wants to look slow. Later, the system behaves strangely. Now everyone acts surprised.

See the real issue. The change was not reviewable. Even a good reviewer cannot infer intent by magic. The ship's log starts during review, not after an outage. If the author does not explain intent, evidence, and fallback, the review is incomplete. That is not bureaucracy. That is kindness to the future.

So what should the author include? State the task being changed. State the expected behavior shift. State the evals used. State the fallback path. State the cost or latency impact. State the open risk if one still exists. The compass becomes visible when tradeoffs are written plainly.
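The author's list above can be written down as a small structured record, so that blanks are obvious before the review starts. This is only a sketch under assumptions: the class name, field names, and sample values are illustrative, not a required schema.

```python
from dataclasses import dataclass, fields


@dataclass
class ChangeSummary:
    """What the author states up front. Field names are illustrative."""

    task: str = ""
    expected_behavior_shift: str = ""
    evals_used: str = ""
    fallback_path: str = ""
    cost_latency_impact: str = ""
    open_risks: str = ""  # write "none known" explicitly instead of leaving it blank

    def missing_fields(self):
        """Return the names of the fields the author left empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# A half-finished summary with made-up content.
summary = ChangeSummary(
    task="Answer refund questions from the policy docs",
    expected_behavior_shift="Cites the policy section instead of paraphrasing it",
    evals_used="refund slice, 40 examples",
)
print(summary.missing_fields())
# ['fallback_path', 'cost_latency_impact', 'open_risks']
```

The empty fields are the conversation the reviewer would otherwise have to start by guessing.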

So what should the reviewer reward? Clarity. Repro steps. Honest risk notes. Small, legible diffs. Good naming. Useful comments. Not clever opacity. Not "trust me" energy. Simple, no?

4) Review should search for failure modes

Review is not only about what should happen. Review is also about what could fail next. That is where AI review becomes more mature.

Ask failure-mode questions directly. What if retrieval returns weak context? What if the tool times out? What if the model refuses too often? What if safety rules become softer? What if the answer becomes slower and more expensive? What if the fallback path never runs in practice? The weather check becomes sharper when these questions are normal.
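Two of those questions, the tool timeout and the fallback that never runs, can be checked directly by forcing the failure. The sketch below uses hypothetical stand-in functions in place of a real tool call, so treat the names and the fallback message as assumptions.

```python
def answer_with_fallback(question, search_tool, fallback_answer):
    """Try the tool; on timeout, return the fallback and record that it ran."""
    try:
        return {"answer": search_tool(question), "used_fallback": False}
    except TimeoutError:
        return {"answer": fallback_answer, "used_fallback": True}


def timing_out_tool(question):
    """Stand-in for a real tool call; always times out to force the failure path."""
    raise TimeoutError("simulated tool timeout")


def healthy_tool(question):
    """Stand-in for a tool call that succeeds."""
    return f"result for: {question}"


FALLBACK = "I could not look that up right now."

# Force the failure so the fallback path is exercised, not assumed.
failed = answer_with_fallback("where is my order?", timing_out_tool, FALLBACK)
assert failed == {"answer": FALLBACK, "used_fallback": True}

# The healthy path must not silently use the fallback.
ok = answer_with_fallback("where is my order?", healthy_tool, FALLBACK)
assert ok["used_fallback"] is False
```

A check like this is small, but it answers the reviewer's question with evidence: the fallback has run at least once, on purpose.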

Look at one practical review flow.

┌──────────────┐
│ Intent       │ what task changed?
├──────────────┤
│ Evidence     │ which evals and traces support it?
├──────────────┤
│ Fallback     │ what happens on failure?
├──────────────┤
│ Cost/Latency │ what moved and why?
├──────────────┤
│ Risks        │ what new failure modes exist?
└──────────────┘

This flow protects the course from accidental drift. It also protects the crew from silent surprises. And it keeps the ship's log useful for later decisions. Review becomes a teaching system, not just a gate. Yes? That is why healthy teams take it seriously.

5) Review culture shapes product quality

Weak review culture creates fast-looking teams. Then debt grows in corners. Then failures get explained away as AI weirdness. That story is lazy. Often the issue was weak review discipline.

Strong review culture feels different. People ask precise questions without ego. They request evidence without drama. They praise clarity, not mystique. They make risk review normal before merge. They leave a better ship's log for the next engineer. That helps the whole crew learn faster.

So what to do? Make review templates explicit. Require context for prompt, retrieval, tool, and model changes. Link the relevant eval results. Call out fallback behavior. Note cost and latency deltas. Reject changes that are impossible to reason about. That is not slowing down. That is how quality scales.
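An explicit template can also be checked mechanically. Here is a minimal sketch of such a gate, assuming the change description uses one "Name: text" line per section of the review flow (Intent, Evidence, Fallback, Cost/Latency, Risks). The line format, the function names, and the sample description are assumptions for illustration, not a known tool.

```python
REQUIRED_SECTIONS = ["Intent", "Evidence", "Fallback", "Cost/Latency", "Risks"]
BEHAVIOR_SURFACES = {"prompt", "retrieval", "tool", "model"}


def filled_sections(description):
    """Collect lowercased names of 'Name: text' lines whose text is non-empty."""
    found = set()
    for line in description.splitlines():
        name, sep, body = line.partition(":")
        if sep and body.strip():
            found.add(name.strip().lower())
    return found


def review_gate(touched_surfaces, description):
    """Return blocking problems; an empty list means the change is reviewable."""
    if not BEHAVIOR_SURFACES & set(touched_surfaces):
        return []  # ordinary code change: the normal review applies
    found = filled_sections(description)
    return [f"missing section: {s}" for s in REQUIRED_SECTIONS if s.lower() not in found]


# Made-up description for a prompt change: Fallback is blank, Cost/Latency is absent.
description = """\
Intent: tighten refusal wording for medical questions
Evidence: refusal slice results linked in the PR
Fallback:
Risks: may over-refuse on wellness questions
"""
print(review_gate(["prompt"], description))
# ['missing section: Fallback', 'missing section: Cost/Latency']
```

The gate only checks that the context exists. Whether the context is convincing is still the reviewer's call.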

Where this lives in the wild

Customer support copilot — senior AI engineer reviews prompt changes with citation evals and fallback screenshots.
Enterprise search assistant — platform engineer checks retrieval diffs, tenant filters, and trace evidence.
Coding agent for developers — staff engineer reviews tool permissions, pass@k evals, and rollback plan.
Sales-call summarizer — product engineer checks latency impact, judge evals, and empty-transcript fallback.
Legal drafting assistant — applied scientist reviews refusal behavior, citation quality, and cost deltas.

Pause and recall

Why is a prompt diff without context not truly reviewable?
Which review questions belong to AI systems but not to ordinary CRUD code?
Why should fallback paths appear in review discussion?
How does review culture reduce silent debt growth?

Interview Q&A

Q1. What extra questions should code review ask for AI systems?
A. Ask what changed in prompts, retrieval, tools, models, eval coverage, fallback, and cost or latency. These are the main behavior surfaces.
Common wrong answer to avoid: "Just run lint and check the diff carefully."

Q2. Why is a prompt diff alone not enough for review?
A. Because reviewers need task intent, examples, eval evidence, and failure expectations. Without context, the diff is not meaningfully inspectable.
Common wrong answer to avoid: "Prompting is subjective, so review is mostly taste."

Q3. What does good review culture reward?
A. Clarity, evidence, legibility, and honest tradeoff notes. It should not reward clever but opaque changes.
Common wrong answer to avoid: "The best reviewer is the one who notices the most syntax nits."

Q4. Why discuss cost and latency during review?
A. Because behavior quality is not the only production outcome that can change. A better answer that breaks latency or cost goals may still be the wrong change.
Common wrong answer to avoid: "Performance review can wait until after deployment."

Apply now (5 min)

Exercise: Take one recent AI change you know. Write the review summary you wish had existed. Include changed surface, eval evidence, fallback, cost effect, latency effect, and one new risk. Then write one sentence saying whether the diff is reviewable as-is.

Sketch from memory: Draw the review flow with Intent, Evidence, Fallback, Cost/Latency, and Risks. Add one note showing where the compass helps a reviewer decide. Add one note showing what should be written after approval.

Bridge. Review culture survives only when planning gives it time and respect. Next, see how sprint planning changes when work contains research, ambiguity, and quality tasks. → 09-sprint-planning-research.md