Testing Tool Execution: Verifying Agent Tool Calls and Side Effects
Learn how to test AI agent tool execution with tool mocking, call verification, parameter assertions, and side effect tracking using pytest in Python.
Why Tool Testing Deserves Its Own Strategy
AI agents that call tools interact with the real world — databases, APIs, file systems, payment processors. A bug in tool execution can send wrong emails, delete wrong records, or charge wrong amounts. Unlike text generation errors that are merely embarrassing, tool execution errors have real consequences.
Testing tool execution means verifying three things: the agent calls the right tool, passes the correct parameters, and your code handles the tool's response (or failure) correctly.
Building Testable Tool Interfaces
Design tools with a clean interface that separates the tool definition from its implementation.
flowchart TD
USER(["User message"])
LLM["LLM call<br/>with tools schema"]
DECIDE{"Model wants<br/>to call a tool?"}
EXEC["Execute tool<br/>sandboxed runtime"]
RESULT["Append tool_result<br/>to messages"]
GUARD{"Output passes<br/>guardrails?"}
DONE(["Final reply"])
BLOCK(["Refuse and log"])
USER --> LLM --> DECIDE
DECIDE -->|Yes| EXEC --> RESULT --> LLM
DECIDE -->|No| GUARD
GUARD -->|Yes| DONE
GUARD -->|No| BLOCK
style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
style EXEC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
style DONE fill:#059669,stroke:#047857,color:#fff
style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
from typing import Protocol, Any
from dataclasses import dataclass, field
class ToolExecutor(Protocol):
def execute(self, name: str, arguments: dict) -> Any: ...
@dataclass
class MockToolExecutor:
"""Records tool calls and returns predetermined responses."""
responses: dict[str, Any] = field(default_factory=dict)
call_log: list[dict] = field(default_factory=list)
def execute(self, name: str, arguments: dict) -> Any:
self.call_log.append({"name": name, "arguments": arguments})
if name in self.responses:
response = self.responses[name]
if callable(response):
return response(arguments)
return response
raise ValueError(f"No mock response configured for tool: {name}")
Injecting the executor through the constructor makes it trivial to swap the real implementation for the mock in tests.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
Verifying Tool Selection
Test that the agent picks the correct tool for a given user request.
import pytest
from my_agent.core import Agent
from my_agent.tools import MockToolExecutor
@pytest.fixture
def mock_tools():
return MockToolExecutor(responses={
"search_orders": [{"id": 1, "status": "shipped"}],
"cancel_order": {"success": True},
"get_weather": {"temp": 72, "condition": "sunny"},
})
def test_order_query_uses_search_tool(mock_tools):
agent = Agent(tool_executor=mock_tools)
agent.run("Where is my order #12345?")
assert len(mock_tools.call_log) >= 1
tool_names = [c["name"] for c in mock_tools.call_log]
assert "search_orders" in tool_names
def test_weather_query_does_not_touch_orders(mock_tools):
agent = Agent(tool_executor=mock_tools)
agent.run("What is the weather in Chicago?")
tool_names = [c["name"] for c in mock_tools.call_log]
assert "search_orders" not in tool_names
assert "get_weather" in tool_names
Parameter Assertion Patterns
Verify that the agent extracts and passes correct parameters from the user's message.
def test_search_passes_correct_order_id(mock_tools):
agent = Agent(tool_executor=mock_tools)
agent.run("Check the status of order #98765")
search_calls = [c for c in mock_tools.call_log if c["name"] == "search_orders"]
assert len(search_calls) == 1
args = search_calls[0]["arguments"]
assert args["order_id"] == "98765" or args.get("query") == "98765"
def test_date_range_parsing(mock_tools):
agent = Agent(tool_executor=mock_tools)
agent.run("Show me all orders from last week")
search_calls = [c for c in mock_tools.call_log if c["name"] == "search_orders"]
args = search_calls[0]["arguments"]
assert "start_date" in args, "Agent should extract a date range"
assert "end_date" in args
Testing Side Effects Safely
For tools that modify state, use a spy pattern to verify the call would happen without actually executing it.
@dataclass
class SpyToolExecutor:
"""Like MockToolExecutor but also tracks which calls were 'destructive'."""
responses: dict[str, Any] = field(default_factory=dict)
call_log: list[dict] = field(default_factory=list)
destructive_tools: set = field(default_factory=lambda: {
"cancel_order", "delete_record", "send_email", "charge_payment"
})
def execute(self, name: str, arguments: dict) -> Any:
entry = {
"name": name,
"arguments": arguments,
"destructive": name in self.destructive_tools,
}
self.call_log.append(entry)
return self.responses.get(name, {"success": True})
@property
def destructive_calls(self) -> list[dict]:
return [c for c in self.call_log if c["destructive"]]
def test_cancellation_requires_confirmation(mock_tools):
"""Ensure destructive actions are not taken without confirmation."""
spy = SpyToolExecutor(responses={"cancel_order": {"success": True}})
agent = Agent(tool_executor=spy, require_confirmation=True)
result = agent.run("Cancel order #123")
# Agent should ask for confirmation, not immediately cancel
assert len(spy.destructive_calls) == 0
assert "confirm" in result.lower() or "sure" in result.lower()
Testing Tool Error Handling
Verify your agent handles tool failures gracefully.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
def test_agent_handles_tool_timeout(mock_tools):
mock_tools.responses["search_orders"] = TimeoutError("API timeout")
agent = Agent(tool_executor=mock_tools)
result = agent.run("Find my order #123")
assert "error" in result.lower() or "try again" in result.lower()
assert "traceback" not in result.lower() # No leaked internals
def test_agent_handles_tool_returning_empty(mock_tools):
mock_tools.responses["search_orders"] = []
agent = Agent(tool_executor=mock_tools)
result = agent.run("Find order #999999")
assert "not found" in result.lower() or "no results" in result.lower()
FAQ
How do I test tools that call external APIs?
Use the mock executor pattern shown above for unit tests. For integration tests, use a sandbox or staging environment of the external API. Many services (Stripe, Twilio) provide test modes specifically for this purpose.
Should I test tool execution order in multi-tool chains?
Yes, when order matters. For example, an agent should search before canceling. Assert on the order of entries in call_log. When order does not matter (parallel lookups), only verify that all expected tools were called.
How do I test tools that return large or complex payloads?
Create fixture files with realistic payloads and load them as mock responses. Test that your agent correctly extracts the relevant fields from complex nested structures rather than asserting on the entire payload.
#ToolExecution #AIAgents #Testing #Pytest #Mocking #Python #AgenticAI #LearnAI #AIEngineering
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.