The Testing Problem Is Different for Agents

Traditional software testing relies on deterministic behavior: given input X, expect output Y. AI agents introduce non-determinism at their core — the same input can produce different outputs, different tool call sequences, and different reasoning paths. This does not mean agents are untestable. It means we need a testing framework designed for probabilistic systems.

A practical agent testing strategy operates at three levels, each catching different categories of defects.

Level 1: Unit Tests (Deterministic)

Unit tests validate the deterministic components of your agent system — everything except the LLM calls themselves.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live →

Try Live Demo →

flowchart LR
    PR(["PR opened"])
    UNIT["Unit tests"]
    EVAL["Eval harness<br/>PromptFoo or Braintrust"]
    GOLD[("Golden set<br/>200 tagged cases")]
    JUDGE["LLM as judge<br/>plus regex graders"]
    SCORE["Aggregate score<br/>and per slice"]
    GATE{"Score regress<br/>more than 2 percent?"}
    BLOCK(["Block merge"])
    MERGE(["Merge to main"])
    PR --> UNIT --> EVAL --> GOLD --> JUDGE --> SCORE --> GATE
    GATE -->|Yes| BLOCK
    GATE -->|No| MERGE
    style EVAL fill:#4f46e5,stroke:#4338ca,color:#fff
    style GATE fill:#f59e0b,stroke:#d97706,color:#1f2937
    style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
    style MERGE fill:#059669,stroke:#047857,color:#fff

What to Unit Test

Tool functions: Each tool the agent can call should have standard unit tests with known inputs and expected outputs
State management: State transitions, reducers, and serialization logic
Input validation: Prompt template rendering, parameter parsing, and guardrail logic
Output parsing: Extracting structured data from LLM responses

# Test a tool function deterministically
def test_calculate_shipping_cost():
    result = calculate_shipping(weight_kg=2.5, destination="US", method="express")
    assert result["cost"] == 24.99
    assert result["estimated_days"] == 3

# Test output parsing
def test_parse_agent_action():
    raw_response = "I'll look up the order. ACTION: get_order(order_id='ORD-123')"
    action = parse_action(raw_response)
    assert action.tool == "get_order"
    assert action.params == {"order_id": "ORD-123"}

Mock LLM Responses

For unit testing agent control flow, replace the LLM with deterministic mock responses:

class MockLLM:
    def __init__(self, responses: list[str]):
        self.responses = iter(responses)

    async def generate(self, prompt: str) -> str:
        return next(self.responses)

# Test the agent's decision logic with predictable LLM outputs
async def test_agent_routes_to_billing():
    mock = MockLLM(["The customer is asking about billing."])
    agent = SupportAgent(llm=mock)
    result = await agent.classify("Why was I charged twice?")
    assert result.category == "billing"

Level 2: Integration Tests (Semi-Deterministic)

Integration tests verify that agent components work together correctly, including interactions with external tools and services.

What to Integration Test

Tool orchestration: Does the agent call tools in a valid sequence?
Error handling: Does the agent recover gracefully from tool failures?
Guardrail enforcement: Do safety checks prevent unauthorized actions?
State persistence: Does checkpointing and recovery work correctly?

Strategies for Reducing Non-Determinism

Fixed seeds and low temperature: Set temperature to 0 and use fixed random seeds to increase reproducibility
Assertion on patterns, not exact text: Check that the agent called the right tools with the right parameters, not that it phrased its reasoning identically
Bounded retries: Allow tests to retry up to 3 times, passing if any attempt succeeds (for truly non-deterministic outputs)

Level 3: End-to-End Evaluation (Probabilistic)

E2E tests run the full agent pipeline with real LLM calls against a suite of test scenarios. These tests are evaluated probabilistically rather than with exact assertions.

LLM-as-Judge Pattern

Use a separate LLM to evaluate whether the agent's response meets quality criteria:

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Try Live Demo → Book 30-min Walkthrough See Pricing

async def evaluate_response(scenario, agent_response):
    eval_prompt = f"""
    Scenario: {scenario.description}
    Expected behavior: {scenario.expected_behavior}
    Agent response: {agent_response}

    Rate the agent's response on these criteria (1-5):
    1. Correctness: Did it solve the problem?
    2. Completeness: Did it address all aspects?
    3. Safety: Did it stay within authorized boundaries?
    4. Tone: Was the communication appropriate?

    Return JSON: {{"correctness": N, "completeness": N, "safety": N, "tone": N}}
    """
    return await eval_llm.generate(eval_prompt)

Test Scenario Design

Build a diverse evaluation dataset covering:

Happy paths: Common requests the agent should handle well
Edge cases: Unusual inputs, ambiguous requests, multi-step problems
Adversarial inputs: Prompt injections, out-of-scope requests, attempts to bypass guardrails
Regression cases: Specific failures from production that have been fixed

Setting Pass Thresholds

Track aggregate scores across the full test suite, not individual scenarios
Set minimum thresholds (e.g., average correctness above 4.0 out of 5.0)
Monitor score trends over time to catch gradual degradation

CI/CD Integration

Unit tests: Run on every commit. Fast, deterministic, no API costs.
Integration tests: Run on pull requests. Moderate speed, minimal API costs with mock LLMs.
E2E evaluation: Run nightly or on release candidates. Slow, involves real API costs.

The goal is not to make agent behavior perfectly deterministic — it is to build confidence that the agent handles the scenarios your users encounter, with quality that meets your standards.

Sources: DeepEval Testing Framework | LangSmith Evaluation | Braintrust AI Evaluation

AI Agent Testing Strategies: Unit, Integration, and End-to-End Approaches

The Testing Problem Is Different for Agents

Level 1: Unit Tests (Deterministic)

What to Unit Test

Mock LLM Responses

Level 2: Integration Tests (Semi-Deterministic)

What to Integration Test

Strategies for Reducing Non-Determinism

Level 3: End-to-End Evaluation (Probabilistic)

LLM-as-Judge Pattern

Test Scenario Design

Setting Pass Thresholds

CI/CD Integration

Try CallSphere AI Voice Agents

Related Articles You May Like

LangGraph Checkpointers in Production: Durable, Resumable Agents with Eval Replay

LLM-as-Judge: Why Pairwise Evaluation Beats Reference-Based Scoring for Agents

Building Your First Agent with the OpenAI Agents SDK in 2026: A Hands-On Walkthrough

LangGraph State-Machine Architecture: A Principal-Engineer Deep Dive (2026)

The Agent Evaluation Stack in 2026: From Trace to Eval Score

Multi-Agent Handoffs with the OpenAI Agents SDK: The Pattern That Actually Scales (2026)