04. Debate & Critique: disagreement as error detection

~10 min read. Two agents argue. A judge decides. The conflict is the feature.

Built on the ELI5 in 00-eli5.md. The org chart — who talks to whom — now allows disagreement by design. Not every department must agree. Sometimes the conflict reveals the truth.

1) The picture: two candidates, one judge

Look. Picture first. Then explanation. Two departments produce competing answers. Then the CEO picks the better one. Simple, no?

    ┌─────────────┐     ┌─────────────┐
    │ Candidate A │     │ Candidate B │
    │ (dept 1)    │     │ (dept 2)    │
    └──────┬──────┘     └──────┬──────┘
           │                   │
           └─────────┬─────────┘
                     ▼
             ┌──────────────┐
             │    Judge     │
             │  (the CEO)   │
             └───────┬──────┘
                     ▼
               final answer

See. This is not random arguing. This is structured comparison under one org chart. Candidate A works alone. Candidate B works alone. The judge compares outputs after both finish. The final answer is not a vote. It is a reasoned choice.

Independence matters. If both sides reuse the same search path, the benefit shrinks. If one side paraphrases the other, the debate is fake. Good prompts create separation on purpose. One side may search more broadly. The other may verify more strictly. One may argue yes. The other may try to break that claim.

Now what problem does this solve? A single agent can sound smooth while still being wrong. You hear one polished answer. You do not see the missed alternative. Debate makes alternatives visible. That visibility is the feature.
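A minimal Python sketch of the picture above, under stated assumptions. The names `Candidate`, `run_debate`, `broad_searcher`, `strict_verifier`, and `evidence_count_judge` are hypothetical, and the stub agents return canned answers instead of calling a model. The point is only the shape: each candidate sees the question and nothing else, and the judge runs once after both finish.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Candidate:
    name: str
    answer: str
    evidence: list[str]


# Hypothetical type aliases: an agent maps a question to a Candidate,
# a judge maps the question plus both candidates to (winner, reason).
Agent = Callable[[str], Candidate]
Judge = Callable[[str, Candidate, Candidate], tuple[Candidate, str]]


def run_debate(question: str, agent_a: Agent, agent_b: Agent,
               judge: Judge) -> tuple[Candidate, str]:
    # Independence: each agent receives only the question,
    # never the other agent's output.
    candidate_a = agent_a(question)
    candidate_b = agent_b(question)
    # The judge compares outputs after both candidates finish.
    return judge(question, candidate_a, candidate_b)


# Stub agents with canned answers (assumption: real ones would call a model).
def broad_searcher(question: str) -> Candidate:
    return Candidate("A", "yes", ["source 1", "source 2"])


def strict_verifier(question: str) -> Candidate:
    return Candidate("B", "unclear", ["source 3"])


# Toy judge: prefers the candidate that cites more evidence.
# This is a placeholder for a real rubric, not a recommended rule.
def evidence_count_judge(question: str, a: Candidate,
                         b: Candidate) -> tuple[Candidate, str]:
    winner = a if len(a.evidence) >= len(b.evidence) else b
    reason = f"{winner.name} cited {len(winner.evidence)} pieces of evidence"
    return winner, reason


winner, reason = run_debate("Is the claim true?", broad_searcher,
                            strict_verifier, evidence_count_judge)
print(winner.answer, "|", reason)
```

Counting evidence is a deliberately naive stand-in for judging. What matters is that the judge returns a reason along with the winner, so the choice stays inspectable.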

2) Why disagreement helps

See. Agreement and disagreement both carry signal. If two independent agents land on the same answer, confidence goes up. Not certainty. Confidence. That is convergent evidence.

If they disagree, that is useful too. The conflict says the task is harder than it looked. Maybe the sources differ. Maybe the wording is ambiguous. Maybe one reasoning step is weak. Without disagreement, false confidence hides nicely. With disagreement, hidden cracks appear.

Think of peer review in research. One reviewer may miss a gap. Two reviewers often expose more. Not because either is perfect, but because they ask different questions. One asks for missing evidence. The other asks whether it is current, or whether the comparison was fair. The gap becomes visible. Simple, no?

Now what is the problem? Many systems treat disagreement as a nuisance. Debate systems treat it as routing information. It tells the CEO where extra scrutiny belongs. Agreement means convergent evidence. Disagreement means targeted review. Both outcomes help.

A good rubric makes this stronger. Ask which answer used better evidence. Ask which answer handled counterarguments. Ask which answer is more current. Ask which answer admitted uncertainty honestly. Then the department with stronger reasoning wins.
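The routing idea can be sketched in a few lines of Python. This is an illustration under assumptions: the function names `normalize` and `route`, the action labels, and the exact-match comparison are all hypothetical. Real answers rarely match word for word, so a production system would need a softer notion of agreement.

```python
# Rubric questions taken from the text above.
RUBRIC = (
    "Which answer used better evidence?",
    "Which answer handled counterarguments?",
    "Which answer is more current?",
    "Which answer admitted uncertainty honestly?",
)


def normalize(answer: str) -> str:
    # Crude normalization: lowercase and collapse whitespace.
    return " ".join(answer.lower().split())


def route(answer_a: str, answer_b: str) -> dict:
    """Turn agreement or disagreement into a routing decision."""
    if normalize(answer_a) == normalize(answer_b):
        # Agreement: confidence goes up, but this is not certainty.
        return {"action": "accept", "signal": "convergent evidence",
                "questions": []}
    # Disagreement: send the pair to the judge with the rubric attached.
    return {"action": "review", "signal": "disagreement",
            "questions": list(RUBRIC)}


print(route("Yes", "  yes ")["action"])
print(route("Yes", "Maybe")["action"])
```

Both branches return something useful. That is the design choice: disagreement is not an error state, it is an input to the next step.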

3) The critique variant

Sometimes you do not want two full answers. You want one draft and one serious checker. That is the critique pattern. It is cheaper than full debate. It still catches many errors.

    ┌──────────┐     ┌──────────┐
    │ Drafter  │───→ │  Critic  │
    └──────────┘     └─────┬────┘
          ▲                │
          └────────────────┘
             revise loop

See the flow. The drafter makes version one. The critic points out flaws, weak evidence, bad logic, or policy misses. Then the drafter revises. This is still disagreement by design. Only the topology is lighter. Instead of two candidates and one judge, you get one maker and one checker.

Why use this? Many production tasks need flaw detection more than alternative generation. A contract draft needs this. A security review needs this. A code patch needs this. A support reply with compliance rules needs this. The first agent proposes. The second agent inspects. That inspection catches cheap mistakes early.

But bound it. Otherwise the loop grows noisy. Set a hard ceiling of two or three revision rounds. After that, stop. Ship, escalate, or ask a human. Simple, no?

Also keep the critic narrow. Check factual support. Check edge cases. Check contradictions. Check policy violations. Do not ask the critic to rewrite the whole piece. Then the role becomes fuzzy and expensive. Keep the org chart legible.
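A bounded drafter-critic loop might look like the Python sketch below. Everything named here is an assumption for illustration: `critique_loop`, the three callables, the `REQUIRED` phrases, and the toy drafter and critic that stand in for model calls. The part that mirrors the text is the hard ceiling on revisions and the explicit "ship" or "escalate" outcome.

```python
from typing import Callable

MAX_REVISIONS = 3  # hard ceiling: the text suggests two or three rounds


def critique_loop(task: str,
                  draft_fn: Callable[[str], str],
                  critic_fn: Callable[[str], list[str]],
                  revise_fn: Callable[[str, list[str]], str],
                  max_revisions: int = MAX_REVISIONS) -> tuple[str, str, int]:
    """Return (final draft, status, revisions used)."""
    draft = draft_fn(task)
    revisions = 0
    while True:
        issues = critic_fn(draft)
        if not issues:
            return draft, "ship", revisions
        if revisions == max_revisions:
            # Ceiling reached with open issues: stop and hand off.
            return draft, "escalate", revisions
        draft = revise_fn(draft, issues)
        revisions += 1


# Toy stand-ins (hypothetical): a narrow critic that only checks for
# required phrases, and a drafter that fixes one issue per round.
REQUIRED = ("refund window", "contact email")


def toy_draft(task: str) -> str:
    return f"Reply about {task}."


def toy_critic(draft: str) -> list[str]:
    return [f"missing: {item}" for item in REQUIRED if item not in draft]


def toy_revise(draft: str, issues: list[str]) -> str:
    fix = issues[0].split(": ", 1)[1]
    return f"{draft} Includes {fix}."


final, status, used = critique_loop("late delivery", toy_draft,
                                    toy_critic, toy_revise)
print(status, used)
```

Note that the critic returns a list of issues, not a rewritten draft. That keeps its role narrow, as the text recommends, and makes the stop condition trivial to check.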

4) Worked example: fact-checking with debate

Task: "Is India's UPI processing more than 10 billion transactions per month?"

See the setup. Agent A searches recent data and finds 12.02 billion transactions in December 2023. It cites official NPCI monthly data. Its answer is yes.

Agent B searches too. It finds 8.6 billion transactions from older reporting. It does not fully reject the newer claim. Instead, it flags a discrepancy across time. Its answer becomes: maybe yes now, but older monthly data was lower.

Now the judge steps in. Not with vibes. With a checklist.

Step 1: compare recency. Agent A uses December 2023 data. Agent B uses older monthly data. The more recent number gets more weight.
Step 2: compare source quality. Agent A uses official NPCI data. Agent B relies on older summaries or secondary reporting. Primary data is stronger.
Step 3: compare claim scope. The question asks whether UPI is processing more than 10 billion per month. A current monthly number answers that directly. An older monthly number gives history, not the best current answer.
Step 4: compare honesty. Agent A should state the month clearly. Agent B correctly notices the temporal mismatch. That criticism is useful. It improves the final explanation.

Final judgment: Yes. As of December 2023, India's UPI was processing more than 10 billion transactions per month. The strongest support is the 12.02 billion figure for that month, backed by NPCI data. The disagreement happened because the two agents used different time windows.

Simple, no? So what is the lesson? Debate did not only select a winner. It exposed the temporal gap. That makes the answer more trustworthy. The CEO should not merely pick. The CEO should explain why one answer beats the other.
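The judge's checklist can be written down as a small Python sketch. The figures 12.02 and 8.6 billion and the December 2023 date come from the example. Everything else is an assumption for illustration: the `Finding` class, the `judge` function, and the `recency_rank` field, which only encodes "A is newer than B" because the example does not say which month Agent B's figure covers.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    agent: str
    monthly_bn: float     # transactions per month, in billions
    period: str           # time window as described by the agent
    recency_rank: int     # assumption: higher means more recent
    primary_source: bool  # True for official data, False for secondary


def judge(a: Finding, b: Finding, threshold_bn: float) -> dict:
    # Steps 1 and 2: weigh recency first, then source quality.
    best = max((a, b), key=lambda f: (f.recency_rank, f.primary_source))
    other = b if best is a else a
    # Step 3: the question is about the current monthly rate, so the
    # winning figure is compared directly against the threshold.
    verdict = "yes" if best.monthly_bn > threshold_bn else "no"
    # Step 4: keep the losing side's point visible in the explanation.
    return {
        "verdict": verdict,
        "winner": best.agent,
        "reason": f"{best.monthly_bn} billion in {best.period}",
        "caveat": (f"Agent {other.agent} reported {other.monthly_bn} billion "
                   f"for {other.period}; the time windows differ."),
    }


agent_a = Finding("A", 12.02, "December 2023",
                  recency_rank=2, primary_source=True)
agent_b = Finding("B", 8.6, "an earlier month",
                  recency_rank=1, primary_source=False)

result = judge(agent_a, agent_b, threshold_bn=10.0)
print(result["verdict"], "|", result["reason"])
print(result["caveat"])
```

The `caveat` field is the important part. The judge does not merely pick; it records why the two answers differed, which is what makes the final answer auditable.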

5) When debate fails

Now be careful. Debate is not magic. Sometimes it adds cost without adding truth.

Weak judges create expensive noise. If the judge cannot distinguish evidence quality, debate is wasted tokens. You buy more text, not better decisions. A weak CEO wastes both departments.

Agents can also converge on the same wrong answer. Shared training data creates shared blind spots. If both sides lean on the same stale source, agreement is misleading. So independence must be real, not decorative.

Cost rises quickly. Two candidate generations plus one judge often means roughly 3× the tokens of a single-agent answer. Latency rises too. For low-stakes tasks, that trade-off is poor.

Vague rubrics break the whole setup. If nobody defines what "better" means, the judge drifts. One run rewards style. Another rewards confidence. Another rewards length. Reliability collapses.

So when does debate work best? When the stakes justify extra cost. When the judging rubric is clear. When disagreement is genuinely informative. When evidence quality can be compared. When humans can audit the reasoning if needed. Not every question needs a courtroom.
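The "roughly 3×" figure is simple arithmetic, sketched here in Python. The function names and the token counts are hypothetical; the only claim taken from the text is that a debate costs two candidate generations plus one judge call.

```python
def debate_tokens(candidate_tokens: int, judge_tokens: int) -> int:
    """Total tokens for two candidate generations plus one judge call."""
    return 2 * candidate_tokens + judge_tokens


def cost_multiplier(candidate_tokens: int, judge_tokens: int) -> float:
    """Debate cost relative to a single agent answering once."""
    return debate_tokens(candidate_tokens, judge_tokens) / candidate_tokens


# If the judge call costs about as much as one candidate, the multiplier is 3.
equal_judge = cost_multiplier(candidate_tokens=1000, judge_tokens=1000)

# Assumption: a judge that must read both candidate outputs before writing
# its own verdict can cost more than one candidate, pushing the total past 3x.
heavier_judge = cost_multiplier(candidate_tokens=1000, judge_tokens=2300)

print(equal_judge, heavier_judge)
```

This counts tokens only. Latency is a separate question: the two candidates can run in parallel, but the judge has to wait for both.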

Where this lives in the wild

Anthropic Constitutional AI (safety engineer): a model drafts a response, the draft is critiqued against written principles, and then it is revised.
Google Search quality systems (ranking engineer): multiple ranking models disagree, and unusual disagreement can trigger deeper review or human checks.
Legal document review platforms (attorney): two AI reviewers independently flag risks, and disagreements get escalated to a human lawyer.
Code review systems (software engineer): one agent writes code, another reviews for bugs and security, then both iterate before merge.
Financial analysis copilots (portfolio analyst): bull-case and bear-case agents argue before a recommendation reaches the decision layer.

Pause and recall

Why is disagreement useful even when it increases latency?
What does agreement mean in a well-designed debate topology?
Why is a drafter-critic loop cheaper than full debate?
What made Agent A stronger than Agent B in the UPI example?

Interview Q&A

Q: Why use debate instead of one stronger model with a longer prompt?
A: One long prompt still gives one visible chain of reasoning. Debate creates explicit alternatives and makes disagreement inspectable.
Common wrong answer to avoid: "Because two models are always smarter than one." The value comes from structured independence and good judging.

Q: Why use a judge after two candidates, not simple majority voting?
A: Quality is not the same as vote count. A judge can weigh recency, source quality, counterarguments, and uncertainty handling.
Common wrong answer to avoid: "Because majority voting needs three agents." The real issue is evaluation quality, not arithmetic.

Q: Why pick critique over full debate for many production workflows?
A: Critique is cheaper and often enough when the main need is flaw detection, not broad alternative generation.
Common wrong answer to avoid: "Because critique guarantees correctness." It only raises the chance of catching mistakes.

Q: Why does debate fail when the rubric is vague?
A: The judge has no stable basis for choosing. Then outputs drift by style, confidence, or noise instead of evidence.
Common wrong answer to avoid: "Because agents dislike ambiguity." The deeper issue is inconsistent evaluation criteria.

Apply now (5 min)

Exercise: Take one factual question from your domain. Ask Agent A to argue yes. Ask Agent B to argue no. Then write a three-line judging rubric using evidence quality, recency, and uncertainty.

Sketch from memory: Draw two candidate boxes and one judge box. Then redraw the cheaper drafter-critic loop with a maximum of three revisions. Finally, say where the org chart changes and where it stays the same.

Bridge. Three topologies so far: orchestrator-worker, pipeline, debate. But what happens when the company grows? One CEO cannot manage 20 departments directly. And sometimes departments need to talk without going through the CEO. Next: hierarchical and peer-to-peer patterns. → 05-hierarchical-peer.md