Debug Looping Agent: Analysis
The bug in the original
def buggy_planner(state):
    if state.last_tool_result.startswith("ERROR"):
        return {"type": "tool", "tool": "search", "input": state.question}
    return {"type": "tool", "tool": "search", "input": state.question}
Both branches return the same action. The if is dead code: whether or not the last call failed, the planner returns the same tool call. Combined with a flaky tool that always fails, the loop would run forever if the hard_stop parameter did not cap it.
The trace shows the same action and the same error on every iteration. The hard stop eventually ends the run, but only as a backstop; the agent never notices that it is stuck.
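The behaviour described above can be reproduced with a minimal sketch. The AgentState class, the flaky_search tool, and the run_buggy loop are assumptions made for illustration; only buggy_planner and the cap of 6 come from the original exercise.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    # Hypothetical state object: only the fields the planner reads, plus a trace.
    question: str
    last_tool_result: str = ""
    trace: list = field(default_factory=list)


def flaky_search(query: str) -> str:
    # Hypothetical stand-in for the flaky tool: it fails on every call.
    return "ERROR: upstream timeout"


def buggy_planner(state: AgentState) -> dict:
    # Same bug as in the original: both branches return the identical action.
    if state.last_tool_result.startswith("ERROR"):
        return {"type": "tool", "tool": "search", "input": state.question}
    return {"type": "tool", "tool": "search", "input": state.question}


def run_buggy(question: str, hard_stop: int = 6) -> AgentState:
    # The only thing that ends this loop is the hard_stop cap.
    state = AgentState(question=question)
    for _ in range(hard_stop):
        action = buggy_planner(state)
        state.last_tool_result = flaky_search(action["input"])
        state.trace.append((action["tool"], action["input"], state.last_tool_result))
    return state


state = run_buggy("example question")
distinct_steps = set(state.trace)  # collapses identical (tool, input, result) steps
```

Every entry in state.trace is the same tuple, so distinct_steps holds a single element: the signature of a planner that does not react to its own state.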
The fix: three structural defences
1. Track consecutive failures. A counter on the state, incremented on tool error, reset on success. After N consecutive failures, the planner returns a fail action.
2. Distinguish "in progress" from "done". The planner checks the last result: success → finish; error → retry (until consecutive-failure limit); first iteration → call tool.
3. Explicit failure state. The agent state carries two fields, failed and failure_reason. On failure, both are populated, so the caller can distinguish "didn't finish in time" from "tool kept failing".
These three defences map to common production patterns:
- Consecutive-failure counters are how production agents avoid retry storms on a broken downstream.
- Explicit termination conditions (finish, fail, max_iterations) keep the loop bounded.
- Structured state surfaces what happened, not just whether it happened.
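The three defences can be combined in one small sketch. Apart from the failed and failure_reason fields and the fail action, the names here are assumptions for illustration: the other AgentState fields, the limit MAX_CONSECUTIVE_FAILURES = 3, the run_agent helper, and the shape of the action dictionaries.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_CONSECUTIVE_FAILURES = 3  # assumed limit; the original does not state N


@dataclass
class AgentState:
    question: str
    last_tool_result: Optional[str] = None  # None means no tool call yet
    consecutive_failures: int = 0
    failed: bool = False
    failure_reason: Optional[str] = None
    answer: Optional[str] = None
    iterations: int = 0


def planner(state: AgentState) -> dict:
    # First iteration: nothing has been tried, so call the tool.
    if state.last_tool_result is None:
        return {"type": "tool", "tool": "search", "input": state.question}
    # Success: finish with the result.
    if not state.last_tool_result.startswith("ERROR"):
        return {"type": "finish", "answer": state.last_tool_result}
    # Error: give up once the consecutive-failure limit is reached.
    if state.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        reason = f"search failed {state.consecutive_failures} times in a row"
        return {"type": "fail", "reason": reason}
    # Error below the limit: retry.
    return {"type": "tool", "tool": "search", "input": state.question}


def run_agent(question: str, tool: Callable[[str], str],
              max_iterations: int = 6) -> AgentState:
    state = AgentState(question=question)
    for _ in range(max_iterations):
        state.iterations += 1
        action = planner(state)
        if action["type"] == "finish":
            state.answer = action["answer"]
            return state
        if action["type"] == "fail":
            state.failed = True
            state.failure_reason = action["reason"]
            return state
        result = tool(action["input"])
        state.last_tool_result = result
        # Increment on error, reset on success.
        if result.startswith("ERROR"):
            state.consecutive_failures += 1
        else:
            state.consecutive_failures = 0
    # Backstop: the cap was hit before the planner finished or failed.
    state.failed = True
    state.failure_reason = "max_iterations reached"
    return state
```

With a tool that always fails, this sketch stops on the fourth planner call with a failure_reason naming the tool, while a run that exhausts max_iterations first reports "max_iterations reached" instead. The caller can tell the two apart by reading the state.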
Why the original would have failed in production
A real production agent loop without these defences:
- Hammers the downstream tool indefinitely (until the iteration cap, which in the original was 6 — short for development, far too long for production where 6 retries could mean 60 seconds of wasted work and downstream load).
- Returns the wrong final answer or no final answer; the caller doesn't know what happened.
- Logs are unreadable: every iteration looks identical.
- The agent appears "stuck" to anyone monitoring — no clear failure signal.
The fix turns the agent into a competent failure detector. The test test_terminates_in_bounded_time checks that the loop doesn't hang; test_gives_up_after_consecutive_failures checks that repeated tool failures end the loop as soon as the consecutive-failure limit is reached.
What this exercise teaches
- Agent loops need explicit termination conditions, not just iteration caps.
- A retry loop without a backoff or a limit is a retry storm.
- Failed states should be queryable, not just "didn't finish".
- The most common agent bug is "the planner doesn't notice it's stuck."
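The point about backoff can be made concrete with a short sketch. The original exercise does not include backoff, so the function name backoff_delay, the base of 0.5 seconds, the cap of 8 seconds, and the optional jitter are all assumptions chosen for illustration.

```python
import random


def backoff_delay(consecutive_failures: int, base: float = 0.5,
                  cap: float = 8.0, jitter: bool = False) -> float:
    """Seconds to wait before the next retry (hypothetical helper)."""
    if consecutive_failures <= 0:
        return 0.0  # no failures yet, no wait
    # Double the delay with each consecutive failure, up to the cap.
    delay = min(cap, base * 2 ** (consecutive_failures - 1))
    if jitter:
        # Optional randomisation so many clients do not retry in lockstep.
        delay = random.uniform(0, delay)
    return delay


delays = [backoff_delay(n) for n in range(1, 7)]
```

A loop would sleep for backoff_delay(state.consecutive_failures) before each retry, which pairs naturally with a consecutive-failure limit: the limit bounds how many retries happen, and the backoff spaces them out.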
The debugging method demonstrated
When debugging a stuck agent:
- Read the trace. Same action over and over → planner not responding to state.
- Check the planner's input. Does it see the failure? If last_tool_result.startswith("ERROR") is the wrong check, the planner is blind.
- Check the planner's output. Does it differ across iterations? If not, the planner has no state-dependent branch.
- Add a counter. Force the planner to react to a count of failures, not just the last one.
- Add explicit success/failure states. "Did this finish or fail" should be a state attribute, not inferred.
This is the diagnostic loop for any agent that doesn't terminate.
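The first step, reading the trace for repetition, can be partly automated. The sketch below assumes the trace is a list of hashable steps such as (tool, input, result) tuples; the function names and the threshold of 3 are hypothetical.

```python
from itertools import groupby


def longest_repeated_run(trace):
    """Return (length, step) for the longest run of identical consecutive steps."""
    best_len, best_step = 0, None
    # groupby groups adjacent equal items, which is exactly a "run".
    for step, group in groupby(trace):
        run_len = sum(1 for _ in group)
        if run_len > best_len:
            best_len, best_step = run_len, step
    return best_len, best_step


def looks_stuck(trace, threshold=3):
    # Flag a trace when the same step repeats back to back `threshold` times.
    length, _ = longest_repeated_run(trace)
    return length >= threshold


stuck_trace = [("search", "q", "ERROR: timeout")] * 5
healthy_trace = [("search", "q", "ERROR: timeout"), ("search", "q", "Paris")]
```

Here looks_stuck(stuck_trace) is true and looks_stuck(healthy_trace) is false. A check like this only flags the symptom; the planner's input and output still need to be inspected to find the cause.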
Interview probes
- "Walk me through debugging an agent that's stuck in a loop."
- "What is the difference between max_iterations and consecutive_failures as termination conditions?"
- "How would you structure the agent's state so failure is visible?"
- "What kinds of failures should an agent retry, and what kinds should it surface immediately?"
- "How would you add exponential backoff to this design?"