Tool-Calling Agent: Analysis

The structure

question
   ↓ plan_fn(question, scratchpad) → AgentStep
   ↓ if step.kind == "finish": return answer
   ↓ if step.kind == "tool": execute tool, append result to scratchpad
   ↓ loop until finish or max_iterations

The agent is a loop. Each iteration: plan, act, observe, repeat. The plan function decides which tool to call (or to finish). The scratchpad accumulates observations across iterations.
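
The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the starter's actual code: the fields of AgentStep, the scratchpad line format, the run_agent name, and the string returned when the cap is hit are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentStep:
    kind: str             # "tool" or "finish"
    tool: str = ""        # tool name when kind == "tool" (assumed field)
    tool_input: str = ""  # argument for the tool (assumed field)
    answer: str = ""      # final answer when kind == "finish" (assumed field)


PlanFn = Callable[[str, List[str]], AgentStep]


def run_agent(question: str, plan_fn: PlanFn,
              tools: Dict[str, Callable[[str], str]],
              max_iterations: int = 5) -> str:
    scratchpad: List[str] = []
    for _ in range(max_iterations):
        step = plan_fn(question, scratchpad)          # plan
        if step.kind == "finish":
            return step.answer
        result = tools[step.tool](step.tool_input)    # act
        scratchpad.append(f"{step.tool}({step.tool_input}) -> {result}")  # observe
    # Assumed behaviour when the cap is reached; the real agent may differ.
    return "max iterations reached"


def demo_plan(question: str, scratchpad: List[str]) -> AgentStep:
    # Call the hypothetical "upper" tool once, then finish with its observation.
    if not scratchpad:
        return AgentStep(kind="tool", tool="upper", tool_input=question)
    return AgentStep(kind="finish", answer=scratchpad[-1])


answer = run_agent("hi", demo_plan, {"upper": str.upper})
```

Because the loop is a bounded for rather than a while True, a plan function that never returns "finish" still exits after max_iterations passes.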

Key design points

Max iterations cap. Without it, a broken plan function loops forever. The hard cap is a structural defence; the test test_max_iterations_cap verifies that the loop terminates.

Pluggable plan function. Tests pass deliberately broken plan functions (ones that always loop or return unknown tools) to exercise the agent's resilience. In production, plan_fn is replaced by an LLM call: format the question and scratchpad as a prompt, then parse the model's "next action" output.

Tool errors are captured, not raised. When a tool throws an exception, the agent records "ERROR: ..." in the scratchpad instead of propagating it; the agent continues and the next plan can route around the failure.

Unknown tool calls don't crash. The plan function might hallucinate a tool name; the agent logs the failure and continues. This is essential for LLM-driven agents, which sometimes invent tool names.
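
Both failure modes (a tool that raises, and a tool name that does not exist) can be handled in one place. The sketch below assumes a hypothetical execute_tool helper and a hypothetical divide tool; only the "ERROR: ..." prefix comes from the description above, and the exact message text is an assumption.

```python
from typing import Callable, Dict


def execute_tool(tools: Dict[str, Callable[[str], str]],
                 name: str, tool_input: str) -> str:
    """Run a tool and always return a string for the scratchpad."""
    tool = tools.get(name)
    if tool is None:
        # Hallucinated tool name: report it as an observation, do not raise.
        return f"ERROR: unknown tool '{name}'"
    try:
        return str(tool(tool_input))
    except Exception as exc:
        # Tool failure: capture it so the next plan can react to it.
        return f"ERROR: {type(exc).__name__}: {exc}"


def divide(expr: str) -> str:
    # Hypothetical tool that fails on inputs such as "1/0".
    a, b = expr.split("/")
    return str(int(a) / int(b))


tools = {"divide": divide}
ok = execute_tool(tools, "divide", "6/3")
failed = execute_tool(tools, "divide", "1/0")
unknown = execute_tool(tools, "search", "anything")
```

Returning the error as an ordinary observation means the plan function sees it on the next iteration and can pick a different tool or rephrase the input.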

Audit trail (AgentTrace). Every step is logged: which tool, what input, what result. This is the structural defence for debugging: when the agent's final answer is wrong, the trace shows where the reasoning went wrong.
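
A trace of this kind might look like the following. Only the name AgentTrace and the three logged items (tool, input, result) come from the text; the TraceEntry class, the iteration counter, and the record and failures methods are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TraceEntry:
    iteration: int    # 1-based step number (assumed field)
    tool: str         # which tool was called
    tool_input: str   # what input it received
    result: str       # what it returned, including "ERROR: ..." strings


@dataclass
class AgentTrace:
    entries: List[TraceEntry] = field(default_factory=list)

    def record(self, tool: str, tool_input: str, result: str) -> None:
        self.entries.append(
            TraceEntry(len(self.entries) + 1, tool, tool_input, result)
        )

    def failures(self) -> List[TraceEntry]:
        # Steps whose observation was an error: the first place to look
        # when the final answer is wrong.
        return [e for e in self.entries if e.result.startswith("ERROR:")]


trace = AgentTrace()
trace.record("calculator", "2 + 2", "4")
trace.record("search", "capital of France", "ERROR: unknown tool 'search'")
failed_steps = [e.iteration for e in trace.failures()]
```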

Why safe_eval not exec/eval

The starter uses safe_eval, an AST walker that allows only arithmetic operations. A naive eval(expression) would let the agent run arbitrary Python, such as __import__('os').system('rm -rf /'). The AST whitelist is the structural defence; the test test_unsupported_expression_raises verifies that dangerous expressions are rejected.

This is the simplest example of the tool-execution-sandbox principle: even a tiny calculator tool needs a sandbox if the input comes from an LLM.
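
An AST whitelist of this kind can be sketched as follows. This is not the starter's safe_eval: the set of permitted operators, the choice of ValueError, and the _eval_node helper are assumptions. Exponentiation is left out of this sketch on purpose, since inputs like 9**9**9 can consume unbounded time and memory.

```python
import ast
import operator

# Whitelisted operators; anything not listed here is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expression: str):
    """Evaluate an arithmetic expression without calling eval()."""
    # ast.parse raises SyntaxError on malformed input; callers may want
    # to catch that alongside ValueError.
    tree = ast.parse(expression, mode="eval")
    return _eval_node(tree.body)


def _eval_node(node):
    # Numeric literals only; type() is checked exactly so bools are rejected.
    if isinstance(node, ast.Constant) and type(node.value) in (int, float):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        left = _eval_node(node.left)
        right = _eval_node(node.right)
        return _BINARY_OPS[type(node.op)](left, right)
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    # Calls, names, attributes, subscripts and everything else land here.
    raise ValueError(f"unsupported expression: {type(node).__name__}")


value = safe_eval("2 + 3 * 4")
```

The walker never executes the parsed code. It only recognises node types it was told about, so a function call such as __import__('os') is rejected as an unsupported Call node before anything runs.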

The mock plan function

For testing, the plan function is deterministic: it pattern-matches the question and decides which tool to call. In production, the plan function calls an LLM with:

The user question.
The scratchpad (previous tool calls and results).
A description of available tools (in the system prompt or as a tool-use schema).

Modern LLM APIs (OpenAI function calling, Anthropic tool use) format this as a structured request; the model returns either a tool call or a final answer. The agent's job is to interpret that response and loop.
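
A deterministic mock plan of the kind used in tests could look like this. The regular expression, the "calculator" tool name, the fallback answer, and the AgentStep fields are all assumptions for the sake of the example; a production plan_fn would replace the body with an LLM call and parse its structured response into the same AgentStep shape.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class AgentStep:
    kind: str             # "tool" or "finish"
    tool: str = ""        # assumed field
    tool_input: str = ""  # assumed field
    answer: str = ""      # assumed field


# A digit followed by any run of digits, whitespace, and arithmetic symbols.
_ARITHMETIC = re.compile(r"\d[\d\s+\-*/().]*")


def mock_plan(question: str, scratchpad: List[str]) -> AgentStep:
    # Once there is an observation, finish with it.
    if scratchpad:
        return AgentStep(kind="finish", answer=scratchpad[-1])
    # Otherwise route arithmetic-looking questions to the calculator tool.
    match = _ARITHMETIC.search(question)
    if match:
        return AgentStep(kind="tool", tool="calculator",
                         tool_input=match.group().strip())
    return AgentStep(kind="finish", answer="I don't know.")


first = mock_plan("What is 2 + 3 * 4?", [])
second = mock_plan("What is 2 + 3 * 4?", ["14"])
```

Because the mock depends only on its two arguments, the same question and scratchpad always produce the same step, which keeps agent tests repeatable.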

What this implementation deliberately omits

Real LLM call. Plug in via plan_fn; modern providers offer tool-use modes that return structured "call this tool with these args" outputs.
Tool schemas. A real LLM-driven agent passes tool schemas (JSON Schema for arguments) so the model produces valid inputs.
Parallel tool calls. Some models (Claude 3.5+, GPT-4) can request multiple tool calls in one step; the agent can then execute them in parallel.
Tool result truncation. Large tool outputs need to be summarised or truncated before being added to the scratchpad.
Tool selection bias mitigation. When many tools exist, agents are more likely to pick the wrong one; tools should have unambiguous names and descriptions.
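
Two of the omissions above, result truncation and parallel tool calls, are small enough to sketch. The helper names, the 200-character default, and the truncation marker are assumptions, and the thread pool is just one possible way to run independent calls concurrently.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def truncate_result(text: str, max_chars: int = 200) -> str:
    """Cap a tool result before it is appended to the scratchpad."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    return f"{text[:max_chars]}... [truncated {omitted} chars]"


def run_parallel(tools: Dict[str, Callable[[str], str]],
                 calls: List[Tuple[str, str]]) -> List[str]:
    """Run (tool_name, tool_input) pairs concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=max(1, len(calls))) as pool:
        # pool.map yields results in the order of the input calls.
        return list(pool.map(lambda call: tools[call[0]](call[1]), calls))


short = truncate_result("a" * 250)
results = run_parallel(
    {"upper": str.upper, "lower": str.lower},
    [("upper", "ab"), ("lower", "CD")],
)
```

This sketch omits error capture for brevity; in a real agent each parallel call would go through the same error-capturing path as a single call, so one failing tool does not discard the others' results.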

Interview probes

"How do you prevent infinite loops in tool-calling agents?"
"Why is safe_eval necessary even for a simple calculator?"
"What goes in the agent's scratchpad, and how does it influence the next plan?"
"What happens when the LLM hallucinates a tool name?"
"How do you debug an agent that gives the wrong final answer?"
"How would you migrate this to a real LLM with function calling?"