Debug Looping Agent — Analysis

The bug in the original

def buggy_planner(state):
    if state.last_tool_result.startswith("ERROR"):
        return {"type": "tool", "tool": "search", "input": state.question}
    # BUG: the success path returns exactly the same action as the error path
    return {"type": "tool", "tool": "search", "input": state.question}

Both branches return the same action, so the if is dead code: whether or not the last call failed, the planner issues the same tool call. Paired with a tool that fails on every call, the loop would never end on its own; only the hard_stop parameter bounds it.

The trace shows the same action and the same error on every iteration. The hard stop does kick in, but only as a backstop: the agent never notices that it is stuck.
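The behaviour described above can be reproduced with a small harness. This is a minimal sketch, not the exercise's actual loop: the State class, run_loop, and always_failing_search are hypothetical names introduced for illustration, and only question, last_tool_result, and hard_stop come from the original.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    # Hypothetical state shape; only question and last_tool_result
    # are named in the original.
    question: str
    last_tool_result: str = ""
    trace: list = field(default_factory=list)


def buggy_planner(state):
    # Both branches return the same action, as in the original.
    if state.last_tool_result.startswith("ERROR"):
        return {"type": "tool", "tool": "search", "input": state.question}
    return {"type": "tool", "tool": "search", "input": state.question}


def always_failing_search(query):
    # Stand-in for the broken tool: every call fails.
    return "ERROR: upstream unavailable"


def run_loop(state, planner, tool, hard_stop=6):
    # The iteration cap is the only thing that ends this loop.
    for step in range(hard_stop):
        action = planner(state)
        state.last_tool_result = tool(action["input"])
        state.trace.append((step, action["tool"], state.last_tool_result))
    return state


final = run_loop(State(question="capital of France?"),
                 buggy_planner, always_failing_search)
for entry in final.trace:
    print(entry)
```

Every printed entry differs only in its step number, and nothing in the final state records that the run failed.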

The fix — three structural defences

1. Track consecutive failures. A counter on the state, incremented on tool error, reset on success. After N consecutive failures, the planner returns a fail action.

2. Distinguish "in progress" from "done". The planner checks the last result: success → finish; error → retry (until consecutive-failure limit); first iteration → call tool.

3. Explicit failure state. The agent state carries two fields, failed and failure_reason. On failure both are populated, so the caller can distinguish "didn't finish in time" from "tool kept failing".
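The three defences can be sketched together as follows. This is one possible arrangement under stated assumptions, not the exercise's reference solution: AgentState, record_result, run_agent, and the limit of 3 are hypothetical, and the original only names failed, failure_reason, and a limit N.

```python
from dataclasses import dataclass
from typing import Optional

MAX_CONSECUTIVE_FAILURES = 3  # assumed value; the original only calls it N


@dataclass
class AgentState:
    question: str
    last_tool_result: Optional[str] = None  # None means no tool call yet
    consecutive_failures: int = 0
    failed: bool = False
    failure_reason: Optional[str] = None
    answer: Optional[str] = None


def record_result(state, result):
    # Defence 1: count consecutive errors, reset on success.
    state.last_tool_result = result
    if result.startswith("ERROR"):
        state.consecutive_failures += 1
    else:
        state.consecutive_failures = 0


def planner(state):
    # Defence 2: first iteration, success, and error take different paths.
    search = {"type": "tool", "tool": "search", "input": state.question}
    if state.last_tool_result is None:
        return search  # first iteration: call the tool
    if not state.last_tool_result.startswith("ERROR"):
        return {"type": "finish", "answer": state.last_tool_result}
    if state.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        reason = f"search failed {state.consecutive_failures} times in a row"
        return {"type": "fail", "reason": reason}
    return search  # error, but still under the limit: retry


def run_agent(state, tool, max_iterations=6):
    for _ in range(max_iterations):
        action = planner(state)
        if action["type"] == "finish":
            state.answer = action["answer"]
            return state
        if action["type"] == "fail":
            # Defence 3: the failure is recorded on the state.
            state.failed, state.failure_reason = True, action["reason"]
            return state
        record_result(state, tool(action["input"]))
    state.failed, state.failure_reason = True, "max_iterations reached"
    return state


broken = run_agent(AgentState("capital of France?"), lambda q: "ERROR: down")
print(broken.failed, broken.failure_reason)

healthy = run_agent(AgentState("capital of France?"), lambda q: "Paris")
print(healthy.failed, healthy.answer)
```

With the always-failing tool, the agent makes three calls and then returns a fail action instead of running to the iteration cap. The two failure_reason values let the caller tell a broken tool apart from an exhausted budget.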

These three defences map to common production patterns:

  • Consecutive-failure counters are how production agents avoid retry storms on a broken downstream.
  • Explicit termination conditions (finish, fail, max_iterations) keep the loop bounded.
  • Structured state surfaces what happened, not just whether it happened.

Why the original would have failed in production

A real production agent loop without these defences:

  • Hammers the downstream tool on every iteration until the cap is reached. The original cap was 6: acceptable in development, too high for production, where six retries at, for example, 10 seconds per call would mean 60 seconds of wasted work and extra downstream load.
  • Returns the wrong final answer or no final answer; the caller doesn't know what happened.
  • Produces unreadable logs: every iteration looks identical.
  • Appears "stuck" to anyone monitoring it, with no clear failure signal.

The fix turns the agent into a competent failure detector. The test test_terminates_in_bounded_time checks that the loop does not hang; test_gives_up_after_consecutive_failures checks that the loop terminates once the consecutive-failure limit is reached.

What this exercise teaches

  • Agent loops need explicit termination conditions, not just iteration caps.
  • Retrying without a backoff or a limit is a retry storm.
  • Failed states should be queryable, not just "didn't finish".
  • The most common agent bug is "the planner doesn't notice it's stuck."
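The backoff mentioned above is not part of the original fix. As a rough illustration of what a capped exponential schedule could look like, here is a sketch in which backoff_delays and all of its parameters are hypothetical:

```python
import random


def backoff_delays(max_retries, base=0.5, cap=8.0, jitter=False, rng=None):
    """Seconds to wait before each retry, doubling each time up to a cap."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        if jitter:
            # Randomise within [0, delay] so many clients do not retry in step.
            delay = rng.uniform(0, delay)
        delays.append(delay)
    return delays


print(backoff_delays(5))  # [0.5, 1.0, 2.0, 4.0, 8.0]
```

The max_retries argument plays the role of the limit and the growing delay plays the role of the backoff; a retry loop would sleep for each delay in turn and give up when the list is exhausted.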

The debugging method demonstrated

When debugging a stuck agent:

  1. Read the trace. Same action over and over → planner not responding to state.
  2. Check the planner's input. Does it see the failure? If last_tool_result.startswith("ERROR") does not match the way the tool actually reports errors, the planner is blind to them.
  3. Check the planner's output. Does it differ across iterations? If not, the planner has no state-dependent branch.
  4. Add a counter. Force the planner to react to a count of failures, not just the last one.
  5. Add explicit success/failure states. "Did this finish or fail" should be a state attribute, not inferred.

This is the diagnostic loop for any agent that doesn't terminate.
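Step 1 can be partly automated. The helper below is a sketch that assumes the trace is a list of comparable entries such as (tool, input, result) tuples; longest_repeat_run and that trace shape are hypothetical, not taken from the exercise.

```python
def longest_repeat_run(trace):
    """Return (length, entry) for the longest run of identical consecutive entries."""
    best_len, best_entry = 0, None
    run_len, prev = 0, object()  # sentinel that equals nothing in the trace
    for entry in trace:
        run_len = run_len + 1 if entry == prev else 1
        prev = entry
        if run_len > best_len:
            best_len, best_entry = run_len, entry
    return best_len, best_entry


trace = [
    ("search", "capital of France?", "ERROR: timeout"),
    ("search", "capital of France?", "ERROR: timeout"),
    ("search", "capital of France?", "ERROR: timeout"),
]
length, entry = longest_repeat_run(trace)
if length >= 3:
    print(f"planner repeated {entry[0]!r} {length} times with the same result")
```

A long run of identical entries is the signature of a planner that is not reacting to state, which is the cue to move on to steps 2 and 3.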

Interview probes

  • "Walk me through debugging an agent that's stuck in a loop."
  • "What is the difference between max_iterations and consecutive_failures as termination conditions?"
  • "How would you structure the agent's state so failure is visible?"
  • "What kinds of failures should an agent retry, and what kinds should it surface immediately?"
  • "How would you add exponential backoff to this design?"