Debug Looping Agent — Analysis¶

The bug in the original¶

def buggy_planner(state):
    if state.last_tool_result.startswith("ERROR"):
        return {"type": "tool", "tool": "search", "input": state.question}
    return {"type": "tool", "tool": "search", "input": state.question}

Both branches return the same action, so the if is dead code: whether or not the last call failed, the planner issues the same tool call. Paired with a tool that fails on every call, the loop would never end on its own; only the hard_stop parameter bounds it.

The trace shows the same action and the same error on every iteration. The hard stop does end the run, but only as a backstop; the agent itself never notices that it is stuck.
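To make the symptom concrete, here is a minimal sketch that reproduces it. Only buggy_planner and the hard_stop name come from the original; the State class, the always_failing_search tool, and the run loop are hypothetical stand-ins, since the exercise's real harness is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class State:
    # Hypothetical minimal state; the real exercise state may hold more.
    question: str
    last_tool_result: str = ""
    trace: list = field(default_factory=list)


def always_failing_search(query: str) -> str:
    # Hypothetical stand-in for the broken tool: every call fails.
    return "ERROR: upstream unavailable"


def buggy_planner(state: State) -> dict:
    # Same logic as the original: both branches return the same action.
    if state.last_tool_result.startswith("ERROR"):
        return {"type": "tool", "tool": "search", "input": state.question}
    return {"type": "tool", "tool": "search", "input": state.question}


def run(state: State, hard_stop: int = 6) -> State:
    # The only thing that ends this loop is the hard_stop backstop.
    for step in range(hard_stop):
        action = buggy_planner(state)
        state.last_tool_result = always_failing_search(action["input"])
        state.trace.append((step, action["tool"], state.last_tool_result))
    return state


state = run(State(question="example question"))
for entry in state.trace:
    print(entry)
```

Every printed entry differs only in its step number, which is the signature of a planner that ignores its own state.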

The fix — three structural defences¶

1. Track consecutive failures. A counter on the state, incremented on tool error, reset on success. After N consecutive failures, the planner returns a fail action.

2. Distinguish "in progress" from "done". The planner checks the last result: first iteration → call the tool; success → finish; error → retry until the consecutive-failure limit is reached.

3. Explicit failure state. The agent state carries two fields, failed and failure_reason. On failure, both are populated, so the caller can distinguish "didn't finish in time" from "tool kept failing".
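The three defences can be sketched together as follows. This is one possible shape, not the exercise's reference solution: only failed, failure_reason, and the action types come from the text above. The AgentState layout, the record_result and run_agent helpers, and the limit of 3 are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

MAX_CONSECUTIVE_FAILURES = 3  # assumed limit; the original does not state N


@dataclass
class AgentState:
    question: str
    last_tool_result: Optional[str] = None  # None means no tool call yet
    consecutive_failures: int = 0
    failed: bool = False
    failure_reason: Optional[str] = None


def record_result(state: AgentState, result: str) -> None:
    # Defence 1: count consecutive failures, reset on success.
    state.last_tool_result = result
    if result.startswith("ERROR"):
        state.consecutive_failures += 1
    else:
        state.consecutive_failures = 0


def planner(state: AgentState) -> dict:
    # Defence 2: each situation gets its own branch.
    if state.last_tool_result is None:
        return {"type": "tool", "tool": "search", "input": state.question}
    if not state.last_tool_result.startswith("ERROR"):
        return {"type": "finish", "answer": state.last_tool_result}
    if state.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        reason = f"tool failed {state.consecutive_failures} times in a row"
        return {"type": "fail", "reason": reason}
    return {"type": "tool", "tool": "search", "input": state.question}


def run_agent(question: str, tool: Callable[[str], str],
              max_iterations: int = 6) -> AgentState:
    state = AgentState(question=question)
    for _ in range(max_iterations):
        action = planner(state)
        if action["type"] == "finish":
            return state
        if action["type"] == "fail":
            # Defence 3: record why the run ended.
            state.failed = True
            state.failure_reason = action["reason"]
            return state
        record_result(state, tool(action["input"]))
    state.failed = True
    state.failure_reason = "max_iterations reached"
    return state


broken = run_agent("example question", lambda q: "ERROR: unavailable")
print(broken.failed, broken.failure_reason)
```

With a tool that always fails, the run ends after three calls and failure_reason names the cause, while a run that merely exhausts max_iterations is labelled differently.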

These three defences map to common production patterns:

Consecutive-failure counters are how production agents avoid retry storms on a broken downstream.
Explicit termination conditions (finish, fail, max_iterations) keep the loop bounded.
Structured state surfaces what happened, not just whether it happened.

Why the original would have failed in production¶

In production, an agent loop without these defences:

Hammers the downstream tool until the iteration cap is reached. The original cap of 6 is tolerable in development but too high for production, where six retries could mean 60 seconds of wasted work (at, say, 10 seconds per call) plus avoidable downstream load.
Returns the wrong final answer or no final answer, and the caller cannot tell what happened.
Produces unreadable logs, because every iteration looks identical.
Appears "stuck" to anyone monitoring it, with no clear failure signal.

The fix turns the agent into a competent failure detector. The test test_terminates_in_bounded_time checks that the loop does not hang; test_gives_up_after_consecutive_failures checks that a run of consecutive failures ends the loop with an explicit failure.
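The bodies of those two tests are not reproduced here, so the following is a hedged guess at what they might look like. The test names come from the text; the compact run_agent stand-in, the always_fails tool, the one-second bound, and the limit of 3 are assumptions.

```python
import time


def run_agent(tool, max_failures=3, max_iterations=6):
    """Compact, hypothetical stand-in for the fixed agent loop."""
    failures = 0
    for calls in range(1, max_iterations + 1):
        result = tool("query")
        if not result.startswith("ERROR"):
            return {"status": "finished", "calls": calls, "answer": result}
        failures += 1  # never reset here: a success returns immediately
        if failures >= max_failures:
            return {"status": "failed", "calls": calls,
                    "reason": "consecutive failures"}
    return {"status": "failed", "calls": max_iterations,
            "reason": "max_iterations"}


def always_fails(query):
    # Hypothetical tool that fails on every call.
    return "ERROR: unavailable"


def test_terminates_in_bounded_time():
    start = time.monotonic()
    run_agent(always_fails)
    # Assumed bound: generous for a loop that makes no real I/O.
    assert time.monotonic() - start < 1.0


def test_gives_up_after_consecutive_failures():
    outcome = run_agent(always_fails, max_failures=3, max_iterations=6)
    assert outcome["status"] == "failed"
    assert outcome["reason"] == "consecutive failures"
    assert outcome["calls"] == 3  # stopped before the iteration cap
```

The second test is the more informative one: it asserts not only that the loop ends but that it ends early and for the stated reason.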

What this exercise teaches¶

Agent loops need explicit termination conditions, not just iteration caps.
Retrying without a backoff or a limit produces a retry storm.
Failure should be a queryable state, not something inferred from "didn't finish".
The most common agent bug is "the planner doesn't notice it's stuck."
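As an illustration of the backoff point, here is a minimal sketch of a bounded retry helper with exponential delays. Nothing in it comes from the original exercise: call_with_backoff, its parameters, and the default delays are hypothetical, and the sleep function is injectable only so the behaviour can be checked without waiting.

```python
import time


def call_with_backoff(tool, query, max_attempts=4, base_delay=0.5,
                      max_delay=8.0, sleep=time.sleep):
    """Call tool(query), doubling the wait after each failure.

    Gives up after max_attempts and reports the last error instead of
    retrying forever. No jitter is applied in this sketch.
    """
    last_error = None
    for attempt in range(max_attempts):
        result = tool(query)
        if not result.startswith("ERROR"):
            return {"ok": True, "result": result, "attempts": attempt + 1}
        last_error = result
        if attempt < max_attempts - 1:
            # Delays: base, 2*base, 4*base, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
    return {"ok": False, "error": last_error, "attempts": max_attempts}


# Record the delays instead of sleeping, to inspect the schedule.
delays = []
outcome = call_with_backoff(lambda q: "ERROR: unavailable", "example",
                            sleep=delays.append)
print(outcome["ok"], outcome["attempts"], delays)
```

The limit is what makes this safe: the backoff spaces the calls out, but only max_attempts guarantees that the helper eventually surfaces the failure.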

The debugging method demonstrated¶

When debugging a stuck agent:

Read the trace. Same action over and over → planner not responding to state.
Check the planner's input. Does it see the failure? If last_tool_result.startswith("ERROR") does not match how the tool actually reports errors, the planner is blind to them.
Check the planner's output. Does it differ across iterations? If not, the planner has no state-dependent branch.
Add a counter. Force the planner to react to a count of failures, not just the last one.
Add explicit success/failure states. "Did this finish or fail" should be a state attribute, not inferred.

This is the diagnostic loop for any agent that doesn't terminate.
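The first and third checks can be partly automated. The sketch below assumes the trace is available as a list of action dicts shaped like the planner's return values; longest_repeat, looks_stuck, and the threshold of 3 are hypothetical names and values chosen for illustration.

```python
def longest_repeat(actions):
    """Length of the longest run of identical consecutive actions."""
    longest = 0
    current = 0
    previous = object()  # sentinel that equals no real action
    for action in actions:
        current = current + 1 if action == previous else 1
        longest = max(longest, current)
        previous = action
    return longest


def looks_stuck(actions, threshold=3):
    # Assumed heuristic: the same action `threshold` times in a row
    # suggests the planner is not responding to state.
    return longest_repeat(actions) >= threshold


search = {"type": "tool", "tool": "search", "input": "example question"}
finish = {"type": "finish", "answer": "done"}

stuck_trace = [search] * 6
healthy_trace = [search, finish]

print(looks_stuck(stuck_trace), looks_stuck(healthy_trace))
```

A check like this only flags the symptom; deciding whether the planner is blind to the failure or simply has no state-dependent branch still requires reading its input and output as described above.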

Interview probes¶

"Walk me through debugging an agent that's stuck in a loop."
"What is the difference between max_iterations and consecutive_failures as termination conditions?"
"How would you structure the agent's state so failure is visible?"
"What kinds of failures should an agent retry, and what kinds should it surface immediately?"
"How would you add exponential backoff to this design?"