AI Agent Testing Strategies: Unit, Integration, and End-to-End Approaches
A practical framework for testing AI agent systems including deterministic unit tests, integration tests with mock LLMs, and end-to-end evaluation with LLM-as-judge patterns.
The Testing Problem Is Different for Agents
Traditional software testing relies on deterministic behavior: given input X, expect output Y. AI agents introduce non-determinism at their core — the same input can produce different outputs, different tool call sequences, and different reasoning paths. This does not mean agents are untestable. It means we need a testing framework designed for probabilistic systems.
A practical agent testing strategy operates at three levels, each catching different categories of defects.
Level 1: Unit Tests (Deterministic)
Unit tests validate the deterministic components of your agent system — everything except the LLM calls themselves.
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
flowchart LR
PR(["PR opened"])
UNIT["Unit tests"]
EVAL["Eval harness<br/>PromptFoo or Braintrust"]
GOLD[("Golden set<br/>200 tagged cases")]
JUDGE["LLM as judge<br/>plus regex graders"]
SCORE["Aggregate score<br/>and per slice"]
GATE{"Score regress<br/>more than 2 percent?"}
BLOCK(["Block merge"])
MERGE(["Merge to main"])
PR --> UNIT --> EVAL --> GOLD --> JUDGE --> SCORE --> GATE
GATE -->|Yes| BLOCK
GATE -->|No| MERGE
style EVAL fill:#4f46e5,stroke:#4338ca,color:#fff
style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
style MERGE fill:#059669,stroke:#047857,color:#fff
What to Unit Test
- Tool functions: Each tool the agent can call should have standard unit tests with known inputs and expected outputs
- State management: State transitions, reducers, and serialization logic
- Input validation: Prompt template rendering, parameter parsing, and guardrail logic
- Output parsing: Extracting structured data from LLM responses
# Test a tool function deterministically
def test_calculate_shipping_cost():
result = calculate_shipping(weight_kg=2.5, destination="US", method="express")
assert result["cost"] == 24.99
assert result["estimated_days"] == 3
# Test output parsing
def test_parse_agent_action():
raw_response = "I'll look up the order. ACTION: get_order(order_id='ORD-123')"
action = parse_action(raw_response)
assert action.tool == "get_order"
assert action.params == {"order_id": "ORD-123"}
Mock LLM Responses
For unit testing agent control flow, replace the LLM with deterministic mock responses:
class MockLLM:
def __init__(self, responses: list[str]):
self.responses = iter(responses)
async def generate(self, prompt: str) -> str:
return next(self.responses)
# Test the agent's decision logic with predictable LLM outputs
async def test_agent_routes_to_billing():
mock = MockLLM(["The customer is asking about billing."])
agent = SupportAgent(llm=mock)
result = await agent.classify("Why was I charged twice?")
assert result.category == "billing"
Level 2: Integration Tests (Semi-Deterministic)
Integration tests verify that agent components work together correctly, including interactions with external tools and services.
What to Integration Test
- Tool orchestration: Does the agent call tools in a valid sequence?
- Error handling: Does the agent recover gracefully from tool failures?
- Guardrail enforcement: Do safety checks prevent unauthorized actions?
- State persistence: Does checkpointing and recovery work correctly?
Strategies for Reducing Non-Determinism
- Fixed seeds and low temperature: Set temperature to 0 and use fixed random seeds to increase reproducibility
- Assertion on patterns, not exact text: Check that the agent called the right tools with the right parameters, not that it phrased its reasoning identically
- Bounded retries: Allow tests to retry up to 3 times, passing if any attempt succeeds (for truly non-deterministic outputs)
Level 3: End-to-End Evaluation (Probabilistic)
E2E tests run the full agent pipeline with real LLM calls against a suite of test scenarios. These tests are evaluated probabilistically rather than with exact assertions.
LLM-as-Judge Pattern
Use a separate LLM to evaluate whether the agent's response meets quality criteria:
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
async def evaluate_response(scenario, agent_response):
eval_prompt = f"""
Scenario: {scenario.description}
Expected behavior: {scenario.expected_behavior}
Agent response: {agent_response}
Rate the agent's response on these criteria (1-5):
1. Correctness: Did it solve the problem?
2. Completeness: Did it address all aspects?
3. Safety: Did it stay within authorized boundaries?
4. Tone: Was the communication appropriate?
Return JSON: {{"correctness": N, "completeness": N, "safety": N, "tone": N}}
"""
return await eval_llm.generate(eval_prompt)
Test Scenario Design
Build a diverse evaluation dataset covering:
- Happy paths: Common requests the agent should handle well
- Edge cases: Unusual inputs, ambiguous requests, multi-step problems
- Adversarial inputs: Prompt injections, out-of-scope requests, attempts to bypass guardrails
- Regression cases: Specific failures from production that have been fixed
Setting Pass Thresholds
- Track aggregate scores across the full test suite, not individual scenarios
- Set minimum thresholds (e.g., average correctness above 4.0 out of 5.0)
- Monitor score trends over time to catch gradual degradation
CI/CD Integration
- Unit tests: Run on every commit. Fast, deterministic, no API costs.
- Integration tests: Run on pull requests. Moderate speed, minimal API costs with mock LLMs.
- E2E evaluation: Run nightly or on release candidates. Slow, involves real API costs.
The goal is not to make agent behavior perfectly deterministic — it is to build confidence that the agent handles the scenarios your users encounter, with quality that meets your standards.
Sources: DeepEval Testing Framework | LangSmith Evaluation | Braintrust AI Evaluation
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.