Tool-Calling Agent — Analysis

The structure

question
   ↓ plan_fn(question, scratchpad) → AgentStep
   ↓ if step.kind == "finish": return answer
   ↓ if step.kind == "tool": execute tool, append result to scratchpad
   ↓ loop until finish or max_iterations

The agent is a loop. Each iteration: plan, act, observe, repeat. The plan function decides which tool to call (or to finish). The scratchpad accumulates observations across iterations.
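The loop described above can be sketched in a few lines. This is a minimal illustration, not the starter's actual code: the AgentStep fields, the run_agent name, the scratchpad line format, and the "stopped" return message are all assumptions made for the example, and error handling is left out to keep the control flow visible.

```python
from dataclasses import dataclass


@dataclass
class AgentStep:
    kind: str             # "tool" or "finish"
    tool: str = ""        # tool name when kind == "tool"
    tool_input: str = ""  # argument passed to the tool
    answer: str = ""      # final answer when kind == "finish"


def run_agent(question, plan_fn, tools, max_iterations=5):
    """Plan, act, observe, repeat, until finish or the iteration cap."""
    scratchpad = []
    for _ in range(max_iterations):
        step = plan_fn(question, scratchpad)                  # plan
        if step.kind == "finish":
            return step.answer
        result = tools[step.tool](step.tool_input)            # act
        scratchpad.append(f"{step.tool}({step.tool_input}) -> {result}")  # observe
    return "stopped: max_iterations reached"


def demo_plan(question, scratchpad):
    # Hypothetical deterministic plan: call one tool, then finish.
    if not scratchpad:
        return AgentStep(kind="tool", tool="upper", tool_input=question)
    return AgentStep(kind="finish", answer=scratchpad[-1])


print(run_agent("hi", demo_plan, {"upper": str.upper}))
```

The plan function sees the scratchpad on every call, which is how an observation from one iteration influences the next decision.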

Key design points

Max iterations cap. Without it, a plan function that never returns a finish step keeps the loop running forever. The hard cap is the structural defence; the test test_max_iterations_cap checks that the loop terminates.

Pluggable plan function. Tests pass deliberately broken plan functions (ones that always loop, or that return unknown tools) to exercise the agent's resilience. In production, plan_fn is replaced by an LLM call: format the question and scratchpad as a prompt, then parse the model's "next action" output.
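The cap and the pluggable plan function can be exercised together. The sketch below is written in the spirit of the test named above but is not that test: always_loop_plan, count_plan_calls, and the step shape are hypothetical stand-ins.

```python
from types import SimpleNamespace


def always_loop_plan(question, scratchpad):
    # Deliberately broken: asks for a tool on every call and never finishes.
    return SimpleNamespace(kind="tool", tool="echo", tool_input=question)


def finish_immediately_plan(question, scratchpad):
    # Well-behaved counterpart, for comparison.
    return SimpleNamespace(kind="finish", answer="done")


def count_plan_calls(plan_fn, max_iterations):
    """Stand-in for the agent loop: returns how many times plan_fn ran."""
    scratchpad = []
    calls = 0
    for _ in range(max_iterations):
        calls += 1
        step = plan_fn("any question", scratchpad)
        if step.kind == "finish":
            break
        scratchpad.append(step.tool_input)
    return calls


def check_cap_terminates():
    # The broken plan is cut off at exactly the cap.
    assert count_plan_calls(always_loop_plan, max_iterations=3) == 3
    # A plan that finishes stops the loop early.
    assert count_plan_calls(finish_immediately_plan, max_iterations=3) == 1


check_cap_terminates()
```

Because the plan function is just a callable, swapping a broken one in requires no mocking framework.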

Tool errors are captured, not raised. When a tool raises an exception, the agent records "ERROR: ..." in the scratchpad instead of propagating it; the loop continues and the next plan step can route around the failure.

Unknown tool calls don't crash. The plan function might hallucinate a tool name; the agent logs the failure and continues. This is essential for LLM-driven agents, which sometimes invent tool names.
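Both failure modes can be handled in a single dispatch helper. The sketch below assumes tools are plain callables in a dict and that an unknown tool name is reported with the same "ERROR: ..." prefix as a raised exception; the execute_tool name and the exact message wording are assumptions, not the starter's API.

```python
def execute_tool(tools, name, tool_input):
    """Run one tool call and always return a string for the scratchpad."""
    tool = tools.get(name)
    if tool is None:
        # Hallucinated or misspelled tool name: report it, do not crash.
        return f"ERROR: unknown tool '{name}'"
    try:
        return str(tool(tool_input))
    except Exception as exc:
        # Broad catch on purpose: any tool failure becomes an observation.
        return f"ERROR: {type(exc).__name__}: {exc}"


# Hypothetical tool set for illustration.
tools = {"reciprocal": lambda s: 1 / float(s)}

print(execute_tool(tools, "reciprocal", "4"))
print(execute_tool(tools, "reciprocal", "0"))    # captured ZeroDivisionError
print(execute_tool(tools, "web_search", "cats"))  # unknown tool
```

Returning the error as text means the plan function sees the failure on its next call and can choose a different tool or a different input.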

Audit trail (AgentTrace). Every step is logged: which tool, what input, what result. This is the structural defence for debugging: when the agent's final answer is wrong, the trace shows where the reasoning went wrong.
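One possible shape for such a trace is sketched below. Only the AgentTrace name comes from the text; the TraceEntry class, its fields, and the record and failures methods are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TraceEntry:
    iteration: int
    tool: str
    tool_input: str
    result: str


@dataclass
class AgentTrace:
    entries: list = field(default_factory=list)

    def record(self, tool, tool_input, result):
        # Iterations are numbered from 1 in the order they were recorded.
        self.entries.append(
            TraceEntry(len(self.entries) + 1, tool, tool_input, result)
        )

    def failures(self):
        # Assumes failed steps are stored with an "ERROR:" prefix.
        return [e for e in self.entries if e.result.startswith("ERROR:")]


trace = AgentTrace()
trace.record("calculator", "2+2", "4")
trace.record("web_search", "cats", "ERROR: unknown tool 'web_search'")

for entry in trace.failures():
    print(entry.iteration, entry.tool, entry.result)
```

Filtering the trace for failed steps is usually the first move when a final answer looks wrong.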

Why safe_eval not exec/eval

The starter uses safe_eval, an AST walker that allows only arithmetic operations. A naive eval(expression) would let the agent run arbitrary Python, for example __import__('os').system('rm -rf /'). The AST whitelist is the structural defence; the test test_unsupported_expression_raises checks that expressions outside the whitelist are rejected.

This is the simplest example of the tool-execution-sandbox principle: even a tiny calculator tool needs a sandbox if the input comes from an LLM.
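A whitelist-based evaluator along these lines might look like the following. It is a sketch of the technique, not the starter's safe_eval: the set of allowed operators and the choice of ValueError are assumptions. Exponentiation is left out here on purpose, since an expression like 9**9**9 can consume unbounded time and memory.

```python
import ast
import operator

# Only these operator node types are allowed; everything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expression):
    """Evaluate an arithmetic expression by walking its AST."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            # bool is a subclass of int, so exclude it explicitly.
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        elif isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        elif isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        # Calls, names, attributes, strings, etc. all land here.
        raise ValueError(f"unsupported expression: {type(node).__name__}")

    return walk(ast.parse(expression, mode="eval"))


print(safe_eval("2 + 3 * 4"))
try:
    safe_eval("__import__('os').system('echo unsafe')")
except ValueError as exc:
    print("rejected:", exc)
```

The dangerous expression is parsed but never executed: the walker meets a Call node, which is not on the whitelist, and raises before anything runs.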

The mock plan function

For testing, the plan is deterministic — pattern-match the question, decide which tool. In production, the plan function calls an LLM with:

  • The user question.
  • The scratchpad (previous tool calls and results).
  • A description of available tools (in the system prompt or as a tool-use schema).

Modern LLM APIs (OpenAI function calling, Anthropic tool use) format this as a structured request; the model returns either a tool call or a final answer. The agent's job is to interpret that response and loop.
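The three inputs listed above, and the interpret-then-loop step, can be sketched without committing to any provider. Everything here is hypothetical: the JSON reply shape ("action", "tool", "input", "answer"), the AgentStep fields, and call_model, which stands in for whatever client function sends a prompt and returns the model's text. Real provider APIs define their own request and response formats.

```python
import json
from dataclasses import dataclass


@dataclass
class AgentStep:
    kind: str             # "tool" or "finish"
    tool: str = ""
    tool_input: str = ""
    answer: str = ""


def build_prompt(question, scratchpad, tool_descriptions):
    tools_text = "\n".join(f"- {n}: {d}" for n, d in tool_descriptions.items())
    history = "\n".join(scratchpad) if scratchpad else "(none yet)"
    return (
        f"Available tools:\n{tools_text}\n\n"
        f"Question: {question}\n\n"
        f"Previous tool calls and results:\n{history}\n\n"
        'Reply with JSON: {"action": "tool", "tool": ..., "input": ...} '
        'or {"action": "finish", "answer": ...}'
    )


def parse_step(raw):
    """Turn the model's text into an AgentStep, or raise ValueError."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("model output is not a JSON object")
    action = data.get("action")
    if action == "finish":
        return AgentStep(kind="finish", answer=str(data.get("answer", "")))
    if action == "tool":
        return AgentStep(
            kind="tool",
            tool=str(data.get("tool", "")),
            tool_input=str(data.get("input", "")),
        )
    raise ValueError(f"unknown action: {action!r}")


def make_plan_fn(call_model, tool_descriptions):
    # call_model: hypothetical callable, prompt string in, reply string out.
    def plan_fn(question, scratchpad):
        prompt = build_prompt(question, scratchpad, tool_descriptions)
        return parse_step(call_model(prompt))
    return plan_fn
```

Keeping the parser separate from the model call lets the mock plan function and the LLM-backed one share the same agent loop and the same tests.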

What this implementation deliberately omits

  • Real LLM call. Plug in via plan_fn; modern providers offer tool-use modes that return structured "call this tool with these args" outputs.
  • Tool schemas. A real LLM-driven agent passes tool schemas (JSON Schema for arguments) so the model produces valid inputs.
  • Parallel tool calls. Some models (Claude 3.5+, GPT-4) can request multiple tool calls in one step; an agent can then execute the independent calls in parallel.
  • Tool result truncation. Large tool outputs need to be summarised or truncated before being added to the scratchpad.
  • Tool selection bias mitigation. As the number of tools grows, the model becomes more likely to pick the wrong one; tools should have unambiguous names and descriptions.
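Two of the omitted pieces are small enough to sketch. The helpers below are hypothetical additions, not part of the implementation described here: truncate_result caps what reaches the scratchpad, and run_parallel executes several requested calls on a thread pool. The 200-character default and the marker text are arbitrary choices for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def truncate_result(text, max_chars=200):
    """Keep the head of a long tool output and say how much was dropped."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return f"{text[:max_chars]}... [truncated {omitted} chars]"


def run_parallel(tools, calls):
    """Run (tool_name, tool_input) pairs concurrently.

    Results come back in the same order as the requested calls.
    Assumes every name exists in tools and that the tools are thread-safe.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        futures = [pool.submit(tools[name], arg) for name, arg in calls]
        return [future.result() for future in futures]


# Hypothetical tools for illustration.
tools = {"upper": str.upper, "length": lambda s: str(len(s))}

print(run_parallel(tools, [("upper", "ab"), ("length", "abc")]))
print(truncate_result("x" * 250))
```

Threads suit tools that wait on I/O such as HTTP requests; CPU-bound tools would need a different executor to benefit.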

Interview probes

  • "How do you prevent infinite loops in tool-calling agents?"
  • "Why is safe_eval necessary even for a simple calculator?"
  • "What goes in the agent's scratchpad, and how does it influence the next plan?"
  • "What happens when the LLM hallucinates a tool name?"
  • "How do you debug an agent that gives the wrong final answer?"
  • "How would you migrate this to a real LLM with function calling?"