
09. Code review AI — reading diffs like a suspicious senior engineer

~13 min read. Review models are not writing code from scratch. They are judging change, context, and risk.

Built on the ELI5 in 00-eli5.md. The translator is no longer selling a new item. It is comparing the old and new bargain before the market vendor accepts it.


Review is about deltas, not whole files

Code review asks a different question from generation. Not, "Can you write code?" But, "What changed, and what risk did that change introduce?" That means diffs matter more than full files. Added lines. Removed guards. Changed defaults. New API calls. Moved permission checks. These are review signals.

A good review model first localizes the delta. Then it asks what that delta interacts with. Callers. State transitions. Auth rules. Error handling. Tests. The phrasebook here is the diff plus its nearby dependency context. Simple, no?

old code ──┐
           ├──→ diff focus ──→ retrieve nearby context ──→ review comment
new code ──┘

This is why naive whole-repo summarization is weak for review. You need sharp attention on changed behavior. Then selective expansion outward. Not the other way around.
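To make "localize the delta" concrete, here is a minimal sketch using Python's standard difflib. It assumes you already have the old and new file contents as strings. The function name changed_lines and the sample strings are made up for illustration; real review tools usually read diffs from the version control system instead.

```python
import difflib


def changed_lines(old_src: str, new_src: str) -> dict[str, list[str]]:
    """Return the lines removed from old_src and added in new_src."""
    old, new = old_src.splitlines(), new_src.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added, removed = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # "equal" spans are unchanged code and are skipped on purpose.
        if tag in ("replace", "delete"):
            removed.extend(old[i1:i2])
        if tag in ("replace", "insert"):
            added.extend(new[j1:j2])
    return {"added": added, "removed": removed}


OLD = "def offset(page, limit):\n    return (page - 1) * limit\n"
NEW = "def offset(page, limit):\n    return page * limit\n"

delta = changed_lines(OLD, NEW)
print(delta)
```

The unchanged def line drops out. Only the one changed return line remains on each side. That small delta is the starting point for selective context expansion.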

Review models are strongest on specific risk patterns

They often catch missing null checks, off-by-one bugs, unsafe string interpolation into SQL, removed auth checks, and swallowed errors or ignored return values. That is useful because these patterns are local to the diff and well represented in training data.
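Why does "local" matter? Because each of these patterns is visible in a single changed line. The toy scanner below is not how a review model works internally. It is a deliberately crude sketch, with hypothetical regexes and helper names (is_admin, has_permission, require_auth), that only shows how much signal sits in the added and removed lines themselves.

```python
import re

# Hypothetical heuristics, for illustration only. A real review model
# learns far richer signals than these regexes.
ADDED_RISKS = {
    "sql string interpolation": re.compile(r"execute\(\s*f['\"]"),
    "swallowed exception": re.compile(r"except\b.*:\s*pass\b"),
}
REMOVED_RISKS = {
    "removed auth check": re.compile(
        r"\b(is_admin|has_permission|require_auth)\b"
    ),
}


def flag_risks(added: list[str], removed: list[str]) -> list[tuple[str, str]]:
    """Return (risk name, offending line) pairs found in the delta."""
    findings = []
    for line in added:
        for name, pattern in ADDED_RISKS.items():
            if pattern.search(line):
                findings.append((name, line.strip()))
    for line in removed:
        for name, pattern in REMOVED_RISKS.items():
            if pattern.search(line):
                findings.append((name, line.strip()))
    return findings


added = ['cursor.execute(f"SELECT * FROM users WHERE id = {user_id}")']
removed = ['    if not user.has_permission("edit"):']
findings = flag_risks(added, removed)
for name, line in findings:
    print(f"{name}: {line}")
```

Note what the scanner cannot do. It sees nothing about architecture, business rules, or unchanged files.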

They are weaker on deep architectural mismatch. They are weaker on business-rule nuance. They are weaker when the real bug lives in an unchanged file. They are weaker when tests are weak and the diff looks harmless. So what to do? Use AI review as a fast first pass, not final authority. Yes?

A helpful practice is instructing the model to prioritize bug risk over style. Style comments are cheap and noisy. Risk comments are valuable. Senior engineers want signal. Not grammar policing. Look. Good review AI feels like a suspicious teammate, not a formatter.

Worked numerical example: pagination bug in a diff

Suppose the old code was:

def offset(page, limit):
    return (page - 1) * limit

Now the diff changes it to:

def offset(page, limit):
    return page * limit

At a glance, it still looks tidy. No syntax issue. No type issue. But test with numbers. If page = 1 and limit = 20, old offset is (1 - 1) * 20 = 0. That starts at row 1. New offset is 1 * 20 = 20. That starts at row 21. Wrong.

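Try page = 3 and limit = 20. Old offset is (3 - 1) * 20 = 40. New offset is 3 * 20 = 60. Again off by 20. The bug is systematic: new minus old is page * limit - (page - 1) * limit = limit, so every page is shifted forward by exactly one page of rows. If pages are one-indexed, the first page of results can never be returned. The review comment should say exactly that. The model does not need huge theory here. It needs careful delta reasoning.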
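The same arithmetic can be checked in a few lines. This sketch simply places the old and new formulas side by side under the hypothetical names old_offset and new_offset, then plugs in the two inputs from the example.

```python
def old_offset(page, limit):
    return (page - 1) * limit


def new_offset(page, limit):
    return page * limit


limit = 20
shifts = []
for page in (1, 3):
    old, new = old_offset(page, limit), new_offset(page, limit)
    shifts.append(new - old)
    print(f"page={page}: old={old}, new={new}, shift={new - old}")
```

Both pages show a shift of 20, which equals limit. That constant shift is the evidence a review comment should cite.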

See the pattern. One changed token. Big behavioral shift. That is why diff review is high leverage.

Strong review systems gather just enough extra context

A diff alone may show the pagination formula. Nearby context may show that page numbers are one-indexed in the API contract. Tests may show expected offsets. Docs may show UI numbering. A strong review assistant pulls these in before commenting. Not the whole repo. Just the supporting phrasebook around the change.

It should also ground comments in evidence. Point to the exact line. State the failing input. State the effect. Maybe suggest a fix. That is far better than vague notes like "pagination might be wrong." A good review comment is a mini receipt. Concrete. Checkable. Actionable.
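One way to enforce the "mini receipt" shape is to give the comment a fixed structure. The sketch below is an assumption, not a standard format. The class ReviewComment, its fields, and the path pagination.py are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReviewComment:
    file: str            # hypothetical path, for illustration only
    line: int            # exact line the comment points to
    failing_input: str   # concrete input that triggers the problem
    effect: str          # observable consequence of that input
    suggested_fix: str = ""  # optional

    def render(self) -> str:
        lines = [
            f"{self.file}:{self.line}",
            f"Failing input: {self.failing_input}",
            f"Effect: {self.effect}",
        ]
        if self.suggested_fix:
            lines.append(f"Suggested fix: {self.suggested_fix}")
        return "\n".join(lines)


comment = ReviewComment(
    file="pagination.py",
    line=2,
    failing_input="page=1, limit=20",
    effect="offset is 20 instead of 0, so the first 20 rows are skipped",
    suggested_fix="return (page - 1) * limit",
)
rendered = comment.render()
print(rendered)
```

A comment that cannot fill in the failing input and the effect is probably too vague to post.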

Another useful split is security versus correctness. Security review may prioritize taint flow, secrets, auth checks, and command execution. Correctness review may prioritize invariants, boundaries, and state transitions. You can ask the translator to review through one lens at a time. That often improves signal.
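The lens split can be sketched as one prompt per pass. The focus lists come straight from the paragraph above. The prompt wording and the names LENSES and build_review_prompt are assumptions, and no model is called here. The code only builds the text you would send.

```python
# Focus areas per lens, taken from the text. The wording of the prompt
# itself is illustrative, not a tested or recommended template.
LENSES = {
    "security": ["taint flow", "secrets", "auth checks", "command execution"],
    "correctness": ["invariants", "boundaries", "state transitions"],
}


def build_review_prompt(lens: str, diff_text: str) -> str:
    """Build a single-lens review prompt for the given diff."""
    focus = ", ".join(LENSES[lens])
    return (
        f"Review this diff for {lens} issues only.\n"
        f"Focus on: {focus}.\n"
        "Report bug risk, not style. For each finding, cite the exact "
        "line, a failing input, and the effect.\n\n"
        f"{diff_text}"
    )


DIFF = "-    return (page - 1) * limit\n+    return page * limit"
prompts = {lens: build_review_prompt(lens, DIFF) for lens in LENSES}
print(prompts["correctness"])
```

Running each lens as its own pass keeps the findings separable, so you can measure which lens produces useful comments.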

The limits are important

Review AI can miss bugs in unchanged code. It can miss hidden assumptions from product policy. It can over-comment on harmless patterns. It can hallucinate impossible failures if the local context is thin. So what to do? Pair it with tests. Pair it with human review. Measure precision, not just comment count. Simple, no?

The best mindset is modest. AI review is a scalable suspicion engine. It helps humans scan more surface area quickly. It does not replace system understanding. That honest framing keeps trust high.


Where this lives in the wild

  • GitHub Copilot code review — pull request author: gets automated comments on diffs before a human reviewer arrives.
  • CodeRabbit — startup engineering manager: uses AI review to flag likely correctness and security risks across many daily PRs.
  • Snyk Code — application security engineer: scans changed code for vulnerability patterns like injection and unsafe handling.
  • GitLab Duo code review — platform team lead: adds automated diff analysis inside merge request workflows.
  • Amazon CodeGuru Reviewer — Java maintainer: surfaces risky API usage and correctness issues from changed lines plus nearby context.

Pause and recall

  • Why is code review primarily a delta-understanding problem?
  • In the pagination example, why was the new formula wrong for page 1 and page 3?
  • Why should AI review prioritize risk patterns over style comments?
  • What kinds of bugs does AI review systematically miss?

Interview Q&A

Q: Why analyze diffs first instead of summarizing the full repository for code review? A: Because review value comes from understanding behavioral change, and the diff provides the highest-signal starting point before selective context expansion. Common wrong answer to avoid: "Because full-repo context is impossible for models."

Q: Why can AI review catch off-by-one or auth bugs surprisingly well? A: These bugs often manifest as local changed patterns with concrete invariants, which are exactly the kind of signals diff-focused models can reason about. Common wrong answer to avoid: "Because those bugs are trivial and never matter much."

Q: Why is precision more important than comment volume in review systems? A: Low-precision comments train engineers to ignore the tool, while a smaller set of concrete high-risk findings actually changes review outcomes. Common wrong answer to avoid: "More comments mean more coverage, so they are always better."

Q: Why should security and correctness review sometimes use different prompts or passes? A: They look for different invariants and evidence, and separating lenses can reduce vague comments and increase useful focus. Common wrong answer to avoid: "A single generic review prompt is always best because it sees everything at once."


Apply now (5 min)

Exercise. Take a tiny diff from your own code or invent one like the pagination change. Plug in page 1 and page 3 with limit 20, and write the exact review comment you would want to receive.

Sketch from memory. Draw old code and new code feeding into a diff box, then into a review comment box. Write one rule under it: changed behavior first, style later. See. That is review AI.


Bridge. Static review catches many local risks, but some failures appear only when the code actually runs. Next we use execution feedback as the strictest reviewer. → 10-execution-feedback-loops.md