Skip to content

09. Loop-layer bugs — when the control flow itself is broken

~14 min read. The third suspect. Not the prompt. Not the tool. The way the agent decides "one more turn?" or "stop now."

Built on the ELI5 in 00-eli5.md. The suspects — prompt, tool, loop, memory, model — reach the third name. The lineup cleared prompt and tool. Now we test the loop. If the agent never stops, stops too early, or spins between two tools, the case file has a shape prose cannot describe. Look at the loop trace.


The healthy loop, before any pathology

   step 1        step 2        step 3        step 4
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│ think   │→ │ act     │→ │ observe │→ │ final   │
│ plan    │  │ tool A  │  │ result  │  │ answer  │
└────┬────┘  └─────────┘  └────┬────┘  └─────────┘
     │       stopping rule     │
     │    "have 3 sources?"    │
     └─────────────────────────┘

Three properties. Progress per step. A clear give-up rule (from module 16). A hard step cap. Plot steps vs cumulative useful work — up and right, flatlining at done. Now break it seven ways.


The seven pathological patterns

1. Runaway loop

Stopping condition never fires. Agent thinks it needs more info, forever.

step 1 → ... → step 99 → step 100 → BUDGET EXCEEDED
   ↑                                    │
   └── "I should gather more context" ──┘

Signature — step counter climbs to budget cap. No final answer span.

Elimination test (per lineup in chapter 06) — strip the loop. Run the agent's reasoning as a single call with all collected context. Confident answer? Loop's stop rule is broken, not the model.

Fix — explicit stop condition. Say "stop when 3 verified sources" or "stop when confidence > 0.8."

2. Premature stop

Agent halts after one or two turns. Answer half-baked.

Signature — total steps far below typical. Final answer hedges. Complaint slip says "incomplete."

Elimination test — re-run with "do not stop until at least 3 tool calls." Answer improves? Stop rule too aggressive.

Fix — raise confidence threshold. Require minimum tool calls.

3. Infinite retry

Tool errors. Agent retries. Errors again. Forever.

step 4: call tool X → error "rate limited"
step 5: call tool X → error "rate limited"
...     (300 identical retries)

Signature — same span name, same error, climbing count. No backoff.

Elimination test — single isolated tool call with same args. Errors? Tool is failing. Bug is the retry has no cap.

Fix — exponential backoff + retry cap (3 attempts). After cap, escalate.

4. Oscillation between two tools

Agent flips A, B, A, B. No progress.

step 5: search_papers("transformer")
step 6: observe — 50 results
step 7: summarize_paper(paper_1)
step 8: observe — "need more context"
step 9: search_papers("transformer")  ← back to search
step 10: summarize_paper(paper_1)     ← same paper

Signature — two span names alternating, same args repeating.

Root cause — conflicting tool descriptions. Or unclear shared state — agent does not see last turn's observation already answered the question.

Elimination test — read both descriptions side by side. Overlap? Run with only tool A. Converge? Then only B. Both work alone — the conflict is the bug.

Fix — rewrite descriptions to be mutually exclusive. Add a state tracker.

5. Stuck planning

Agent re-plans every turn. Never executes.

step 1: "Let me plan. X, Y, Z."
step 2: "Actually, re-plan. A, B."
step 3: "Wait, re-plan. X, C."

Signature — every span is reasoning or plan. No tool_call spans.

Elimination test — strip the planning instruction. Tools called now? Planning was over-triggered.

Fix — limit planning to step 1. Then "execute next step from plan, do not re-plan."

6. Hidden recursion

Agent A spawns sub-agent B with the same toolset. B spawns C. Tokens explode.

agent A (depth 0)
  └── agent B (depth 1)
        └── agent C (depth 2) → ... (no cap)

Signature — deeply nested trace tree. Tokens grow exponentially.

Elimination test — disable sub-agent calls. Solves task flat? Recursion is unnecessary.

Fix — depth cap. Three levels max. Each sub-agent smaller budget than parent.

7. Step-budget burn without progress

All 50 steps used. No errors. No oscillation. Yet no closer to goal.

Signature — diverse tool calls, no repetition, no errors. Work plot drifts sideways.

Hardest loop bug. Task is under-specified. Or the give-up rule from module 16 is absent — no way to say "I cannot, escalate."

Elimination test — examine final state. Close to a solvable path but tangenting? Or wrong track from step 1? First means stopping logic; second means task framing.

Fix — mid-loop checkpoint. Every 10 steps, ask "closer to goal? Yes/no/escalate."


Worked example — research-agent oscillation

Real case. Research assistant. Step cap 100. Hit step 50 without convergence. The case file showed:

step 14: search_papers("attention mechanism") → 47 results
step 15: summarize_paper(id=1234)             → summary
step 16: search_papers("attention mechanism") → 47 results (same)
step 17: summarize_paper(id=1234)             → same summary
... (steps 18-50 alternating)

Pattern 4 — oscillation. Same two tools, same args.

Elimination test from chapter 06 lineup. Strip the loop. Give the model all 47 results + summary in one prompt. "Enough to answer?" Model says yes. Loop is the suspect, confirmed.

Read tool descriptions. search_papers — "Use this to find relevant papers." summarize_paper — "Use this to read a paper in detail." Both say "use this." Neither says when to stop. System prompt says "be thorough." No stopping rule anywhere.

Confession. The give-up rule from module 16 was never written. No number — no "3 sources" or "80% confidence."

Fix. Add to system prompt: "Stop calling search_papers once you have 3 relevant papers. Stop calling summarize_paper once those 3 are summarized. Then write the final answer." Re-run. Converged at step 8.

Lock. Regression eval — exact failing query, assert total_steps < 15. Future versions must pass. The lineup cleared each suspect in order. The confession was loop, not prompt, not tool.


The loop trace shape diagnostic

Steps on x-axis, useful work on y-axis. Each bug has a shape.

RUNAWAY              PREMATURE STOP        INFINITE RETRY
work│                work│                 work│
    │ ───── flat        │_/── early stop      │_____ flat at 0
    └→ cap              └→ stops at 2         └→ cap hit, ↑↑↑ errors

OSCILLATION          STUCK PLANNING        HIDDEN RECURSION
work│                work│                 tokens│
    │_____ flat         │___ flat               │ ╱ exponential
    │ABABAB             │planplanplan           │╱
    └→                  └→ no tool spans        └→ depth grows

STEP-BUDGET BURN
work│
    │ ╱╲╱╲╱╲ wanders, no convergence
    └→

Glance, name the bug. Aggregate the shape across many traces in the crime statistics view — find the bug class hitting many users.


Loop-pathology patterns in production agents

  • AutoGPT — infamous for runaway loops at viral release. Users reported $100+ overnight bills, agent endlessly "re-evaluating its plan." Community fix — hard step cap + budget cap. The role this teaches: an explicit give-up rule is the difference between an autonomous agent and a runaway process.
  • BabyAGI — the original planning loop. Demos generated thousands of sub-tasks for one goal — classic stuck-planning. Later versions added max-task-list cap. The pathology was treating plan expansion as free; the lock was a max-tasks budget.
  • LangGraph cycle detection — the graph compiler refuses cycles unless the developer explicitly marks them with a recursion_limit. The role is to make pattern 1 (runaway) a build-time decision rather than a 3am incident.
  • OpenAI Agents SDK loop budgetmax_turns defaults to a small number (typically ~10) and raises when exceeded; the role is forcing the developer to choose a budget rather than inherit infinity.
  • Anthropic Claude max_tokens guardrail — the response budget is enforced at decode time; a runaway agent burns request count, not unbounded tokens per call. Cap composition: per-turn token cap times per-task turn cap equals worst-case spend.
  • ReAct framework loop bounds — the original ReAct paper's reference loop has an explicit max_steps; the role is preventing pattern 7 (step-budget burn) by giving the loop a clear deadline.
  • Claude Code — Anthropic's coding agent ships with explicit step caps and max-tokens-per-task. The design choice is anti-AutoGPT — fail loudly at a cap.
  • Cursor's stop-iteration heuristics — the agent watches for repeated tool calls with identical args and triggers an early stop, surfacing pattern 4 (oscillation) before the budget is exhausted.
  • Devin (Cognition, early demos) — public traces showed the agent stuck in re-planning, 30+ minutes "thinking" before action. Fix — tighter plan-then-execute separation.
  • GitHub Copilot Workspace — iteration limits per phase (spec, plan, implement). Hit the cap, escalate to user. The role is decomposing one budget into phase budgets so a stuck phase cannot burn the whole task.
  • MetaGPT escape-hatch design — each role has an explicit "I cannot proceed" output; the role is converting pattern 7 (silent budget burn) into a loud escalation.
  • CrewAI process typessequential vs hierarchical vs consensual each carry their own loop semantics; mismatching the process to the task is a documented oscillation cause when sub-crews re-enter the same step.
  • AWS Bedrock Agents step limitsmaxIterations caps the orchestrator's action loop; the role is preventing unbounded tool invocation when a planner cannot converge.
  • AgentOps runaway-detection — the observability product specifically alerts on rising step count without rising distinct work; the role is detecting pattern 7 from outside the loop because the loop itself cannot see its own drift.
  • AutoGen group chat — multi-agent loops require an explicit max_round; without it, agents politely echo each other. The role is teaching that loop safety scales with the number of participants.

Recall — name the loop pathology from its shape

  1. Time-vs-step plot of runaway vs oscillation — how do they differ?
  2. In the lineup elimination test, how do you isolate the loop from prompt and tool?
  3. Why is stuck planning different from runaway? What span types differ?
  4. What is the depth-cap principle for hidden recursion?

Interview Q&A

Q: Your agent hits step 50 of 50 every run. How do you debug? A: Open the case file. Plot steps vs distinct tool calls. Two tools alternating — oscillation, fix descriptions. Diverse tools but no progress — stop rule missing, add "stop when X." Retry errors dominate — add backoff. Run the elimination test from the lineup — strip the loop, run as one call. Converges? The loop is the confession.

Common wrong answer to avoid: "Increase the step cap to 100" — this hides the bug. The agent will burn 100 steps next time. The cap is a safety net, not a fix.

Q: Half-baked answer with hedging after 2 turns. Loop or prompt bug? A: Likely loop — premature stop. Confidence threshold too aggressive. Test by re-running with "do not stop until at least 3 tool calls." Answer improves? Stop logic is broken. The lineup told you which to test first.

Common wrong answer to avoid: "The model is too weak, upgrade" — a larger model with the same broken stop rule will also stop early. Test the loop layer first.

Q: Difference between runaway and step-budget burn without progress? A: Runaway has a missing stop rule — agent thinks it must keep going. Step-budget burn has a present stop rule but the agent never satisfies it because the task is under-specified. Runaway shows repetitive thinking; budget-burn shows diverse but unproductive actions. Different fixes — stop condition vs mid-loop checkpoint.

Common wrong answer to avoid: "Both are the same — just hit the cap" — the cap is the symptom. The confession differs.

Q: Sub-agent spawns sub-sub-agents and tokens explode. Prevention? A: Hidden recursion. Depth cap — 3 levels max. Each child gets a smaller budget than parent. Track depth as an evidence tag. Alert when depth > 2 in production. Elimination test — disable sub-agent calls. Flat agent solves task? Recursion is unnecessary.

Common wrong answer to avoid: "Just count tokens globally and cap" — a global cap stops runaway but does not tell you which level is to blame. Per-level depth cap gives the root cause.


Apply now (10 min)

Step 1 — model the exercise. Take the research-agent oscillation trace from earlier. Sketch its time-vs-step plot: a flat work line through step 50 with search_papers and summarize_paper alternating at every step. That shape is pattern 4 (oscillation). Now sketch the fix: the new stop rule says "stop calling search once you have 3 papers, stop calling summarize once those 3 are summarized," and the trace converges at step 8 with the work line stepping upward and flattening at done. Two plots, same agent, the lock drawn between them.

Step 2 — your turn. For an agent you have built or maintain, list every tool. For each pair, ask "can the agent confuse these two?" Any pair where the descriptions overlap is an oscillation risk worth rewriting. Then write a stop condition for the agent in one concrete sentence — name a number ("3 sources") and a confidence threshold ("or confidence > 0.8"). If you cannot write that sentence, the give-up rule does not exist yet, and pattern 1 or pattern 7 is one query away.

Step 3 — reproduce from memory. Without looking, draw the seven loop-bug shapes on the time-vs-step axis: runaway, premature stop, infinite retry, oscillation, stuck planning, hidden recursion, step-budget burn. Label each shape with the trace signature that names it and the one-line fix. Then write a sentence connecting this chapter to chapter 06's lineup: the loop is the third suspect because its bugs only become visible once the prompt and tool have alibis, and its shape can be diagnosed from the trace plot before any prose is read.


What you should remember

This chapter walked the third suspect in the debugging lineup — the control flow itself. A healthy loop has three properties: progress per step, a clear give-up rule, and a hard step cap. Strip any one of them and the seven pathologies surface in turn: runaway when the stop condition never fires, premature stop when it fires too soon, infinite retry when errors are immortal, oscillation when two tools have overlapping descriptions, stuck planning when re-planning is free, hidden recursion when sub-agents spawn sub-agents, and step-budget burn when the task itself is under-specified.

The signature of each pathology is a different shape on the steps-vs-useful-work plot. That diagnostic is the loop layer's equivalent of the spec-vs-prompt diff from the tool chapter — a single glance at the case file names the bug before you read a span. The fixes share a pattern: write the give-up rule as a number and a threshold, cap retries, separate planning from execution, and depth-cap recursion. None of these are clever; they are explicit budgets that turn an autonomous agent into a debuggable one.

Carry this diagnostic forward. When a complaint slip says "the agent ran for 30 minutes and gave up," do not raise the cap. Open the case file, plot steps versus distinct work, and let the shape pick the confession for you. Raising the cap is the most expensive non-fix in agent debugging because next week's bill will be larger and the bug will still be there.

Remember:

  • A healthy loop is progress per step plus a numeric give-up rule plus a hard cap; remove any one and a pathology appears.
  • Seven bug shapes correspond to seven plot signatures: runaway flat-to-cap, premature early stop, infinite retry at zero work, oscillation flat with alternating tools, stuck planning with no tool spans, hidden recursion with exponential tokens, step-budget burn with diverse but unproductive actions.
  • The elimination test for any loop bug is to strip the loop: run the model as a single call with all collected context. If it answers, the confession lives in the control flow.
  • Raising the step cap is a non-fix; it hides the suspect behind a bigger bill. The cap is a safety net, not a stop rule.
  • Mid-loop checkpoints ("closer to goal? yes / no / escalate") are how pattern 7 (silent budget burn) is converted into a loud escalation a human can answer.

Bridge. The lineup cleared the loop. The prompt was fine. The tool was fine. The control flow now stops on time and does not spin. But the agent might still make bad decisions — because of what it remembers. Stale state. Leaked context from another user. A retrieval that returned last week's data. The fourth suspect is memory. → 10-memory-bugs.md