Learn Agentic AI

Tool Usage Accuracy: Evaluating Whether Agents Call the Right Tools with Right Parameters

Learn how to measure and improve AI agent tool usage accuracy by logging tool calls, validating parameters, building accuracy benchmarks, and diagnosing common failure patterns.

Why Tool Usage Accuracy Is Critical

An AI agent's power comes from the tools it can call — APIs, databases, calculators, search engines. But a tool called incorrectly is worse than no tool call at all. A wrong API parameter can book the wrong flight, charge the wrong amount, or delete the wrong record. Tool usage accuracy measures whether your agent selects the correct tool for a given intent and passes the correct parameters every time.

This metric splits into three sub-dimensions: tool selection accuracy (did it pick the right tool?), parameter accuracy (did it pass the right values?), and sequencing accuracy (did it call tools in the right order for multi-step operations?).

Logging Tool Calls for Evaluation

The foundation of tool accuracy measurement is a detailed log of every tool call the agent makes.

flowchart TD
    USER(["User message"])
    LLM["LLM call<br/>with tools schema"]
    DECIDE{"Model wants<br/>to call a tool?"}
    EXEC["Execute tool<br/>sandboxed runtime"]
    RESULT["Append tool_result<br/>to messages"]
    GUARD{"Output passes<br/>guardrails?"}
    DONE(["Final reply"])
    BLOCK(["Refuse and log"])
    USER --> LLM --> DECIDE
    DECIDE -->|Yes| EXEC --> RESULT --> LLM
    DECIDE -->|No| GUARD
    GUARD -->|Yes| DONE
    GUARD -->|No| BLOCK
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style EXEC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DONE fill:#059669,stroke:#047857,color:#fff
    style BLOCK fill:#dc2626,stroke:#b91c1c,color:#fff
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional

@dataclass
class ToolCallLog:
    call_id: str
    tool_name: str
    parameters: dict[str, Any]
    result: Any = None
    error: Optional[str] = None
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    latency_ms: Optional[int] = None

@dataclass
class ConversationToolTrace:
    conversation_id: str
    calls: list[ToolCallLog] = field(default_factory=list)

    def add_call(self, call: ToolCallLog):
        self.calls.append(call)

    def tool_sequence(self) -> list[str]:
        return [c.tool_name for c in self.calls]

    def to_dict(self) -> dict:
        return {
            "conversation_id": self.conversation_id,
            "calls": [
                {
                    "call_id": c.call_id,
                    "tool_name": c.tool_name,
                    "parameters": c.parameters,
                    "error": c.error,
                }
                for c in self.calls
            ],
        }

Wrap your tool execution layer so every call is automatically captured. Never rely on the agent to self-report which tools it called — instrument the execution layer directly.
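That instrumentation can be a thin wrapper around each tool function. The sketch below is illustrative (the wrapper name is not part of any library API), and it restates compact stand-ins for the `ToolCallLog` and `ConversationToolTrace` classes above so it runs on its own:

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Compact stand-ins for the ToolCallLog / ConversationToolTrace classes above.
@dataclass
class ToolCallLog:
    call_id: str
    tool_name: str
    parameters: dict[str, Any]
    result: Any = None
    error: Optional[str] = None
    latency_ms: Optional[int] = None

@dataclass
class ConversationToolTrace:
    conversation_id: str
    calls: list[ToolCallLog] = field(default_factory=list)

def instrumented_call(
    trace: ConversationToolTrace,
    tool_fn: Callable[..., Any],
    tool_name: str,
    parameters: dict[str, Any],
) -> Any:
    """Execute a tool and capture the call in the trace, even on failure."""
    log = ToolCallLog(
        call_id=str(uuid.uuid4()),
        tool_name=tool_name,
        parameters=parameters,
    )
    start = time.perf_counter()
    try:
        log.result = tool_fn(**parameters)
        return log.result
    except Exception as exc:
        log.error = str(exc)
        raise
    finally:
        # The finally block guarantees the call is logged whether it
        # succeeded or raised — the agent cannot under-report failures.
        log.latency_ms = int((time.perf_counter() - start) * 1000)
        trace.calls.append(log)
```

Because logging happens in `finally`, failed calls are captured with their error message rather than silently dropped.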


Measuring Tool Selection Accuracy

Given a user intent, did the agent pick the correct tool? This requires a ground truth mapping from intents to expected tools.

@dataclass
class ToolAccuracyEval:
    expected_tool: str
    expected_params: dict[str, Any]
    param_match_mode: str = "exact"  # exact, subset, fuzzy

def score_tool_selection(
    actual_calls: list[ToolCallLog],
    expected: list[ToolAccuracyEval],
) -> dict:
    if not expected:
        return {
            "selection_accuracy": 1.0 if not actual_calls else 0.0,
            "spurious_calls": len(actual_calls),
        }

    matched = 0
    for i, exp in enumerate(expected):
        if i < len(actual_calls):
            if actual_calls[i].tool_name == exp.expected_tool:
                matched += 1

    return {
        "selection_accuracy": matched / len(expected),
        "expected_count": len(expected),
        "actual_count": len(actual_calls),
        "spurious_calls": max(0, len(actual_calls) - len(expected)),
        "missed_calls": max(0, len(expected) - len(actual_calls)),
    }

Spurious calls — tools the agent called that it should not have — are just as important as missed calls. An agent that calls a payment API unnecessarily is a liability.
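One way to make spurious and missed calls explicit is to treat tool selection as a precision/recall problem over tool names, ignoring order. This is a complementary view, not a replacement for the positional scorer above; the tool names are illustrative:

```python
from collections import Counter

def tool_precision_recall(actual: list[str], expected: list[str]) -> dict:
    """Precision and recall over tool names, ignoring order.
    Spurious calls lower precision; missed calls lower recall."""
    act, exp = Counter(actual), Counter(expected)
    overlap = sum((act & exp).values())  # per-tool min of the two counts
    return {
        "precision": overlap / sum(act.values()) if act else 1.0,
        "recall": overlap / sum(exp.values()) if exp else 1.0,
    }
```

An agent that calls `charge_card` twice when one charge was expected keeps perfect recall but loses precision, which surfaces exactly the "unnecessary payment call" liability described above.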

Parameter Validation Scoring

Selecting the right tool is necessary but not sufficient. The parameters must also be correct.

def score_parameters(
    actual: dict[str, Any],
    expected: dict[str, Any],
    mode: str = "exact",
) -> dict:
    if mode == "exact":
        return _exact_match(actual, expected)
    elif mode == "subset":
        return _subset_match(actual, expected)
    elif mode == "fuzzy":
        return _fuzzy_match(actual, expected)
    raise ValueError(f"Unknown mode: {mode}")

def _exact_match(actual: dict, expected: dict) -> dict:
    correct = 0
    total = len(expected)
    errors = []

    for key, exp_value in expected.items():
        act_value = actual.get(key)
        if act_value == exp_value:
            correct += 1
        else:
            errors.append({
                "param": key,
                "expected": exp_value,
                "actual": act_value,
            })

    extra_params = set(actual.keys()) - set(expected.keys())

    return {
        "param_accuracy": correct / total if total > 0 else 1.0,
        "correct": correct,
        "total": total,
        "errors": errors,
        "extra_params": list(extra_params),
    }

def _subset_match(actual: dict, expected: dict) -> dict:
    correct = sum(
        1 for k, v in expected.items()
        if actual.get(k) == v
    )
    return {
        "param_accuracy": correct / len(expected) if expected else 1.0,
        "correct": correct,
        "total": len(expected),
    }

def _fuzzy_match(actual: dict, expected: dict) -> dict:
    correct = 0
    for key, exp_value in expected.items():
        act_value = actual.get(key)
        if act_value == exp_value:
            correct += 1
        elif (
            isinstance(exp_value, str)
            and isinstance(act_value, str)
            and exp_value.lower().strip() == act_value.lower().strip()
        ):
            correct += 1
    return {
        "param_accuracy": correct / len(expected) if expected else 1.0,
        "correct": correct,
        "total": len(expected),
    }

Use exact match for IDs, amounts, and dates. Use fuzzy match for names and free-text fields where minor differences are acceptable. Always log the specific parameter errors — they reveal systematic patterns like date format confusion or unit mismatches.
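When the error log does reveal a systematic pattern such as date-format confusion, one fix is to normalize values before exact comparison. A minimal sketch, assuming a handful of common formats (extend the list for your domain):

```python
from datetime import datetime

def normalize_date(value: str) -> str:
    """Canonicalize common date formats to ISO 8601 before comparison.
    The format list is illustrative; order matters for ambiguous inputs."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%B %d, %Y"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value  # leave unrecognized values untouched
```

With this in place, `"March 5, 2024"` and `"2024-03-05"` compare equal under exact match instead of being logged as a parameter error.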

Sequence Accuracy for Multi-Step Operations

Some tasks require tools to be called in a specific order: checking availability before booking, or looking up a customer before modifying their account.

def score_sequence(
    actual_sequence: list[str],
    expected_sequence: list[str],
) -> dict:
    if not expected_sequence:
        return {"sequence_accuracy": 1.0}

    # Longest common subsequence approach
    m, n = len(actual_sequence), len(expected_sequence)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if actual_sequence[i-1] == expected_sequence[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])

    lcs_length = dp[m][n]
    return {
        "sequence_accuracy": lcs_length / len(expected_sequence),
        "lcs_length": lcs_length,
        "expected_length": len(expected_sequence),
        "actual_length": len(actual_sequence),
    }

The longest common subsequence (LCS) approach is forgiving of extra calls the agent inserts (like a redundant lookup) while still penalizing wrong ordering and missing steps.
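To see that forgiveness in action, compare a trace with a redundant lookup against one with a swapped order. The LCS helper is restated compactly here so the snippet runs on its own; the tool names are illustrative:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    """Classic O(len(a) * len(b)) longest-common-subsequence length."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

expected = ["check_availability", "book_slot", "send_confirmation"]

# Extra redundant lookup: the full expected subsequence is still present,
# so sequence accuracy stays at 1.0.
extra = ["check_availability", "lookup_customer", "book_slot", "send_confirmation"]

# Swapped order: booking before checking availability drops one step
# from the common subsequence, so accuracy falls to 2/3.
swapped = ["book_slot", "check_availability", "send_confirmation"]
```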


Putting It All Together

Combine selection, parameter, and sequence scores into a single tool usage report.

def full_tool_accuracy_report(
    trace: ConversationToolTrace,
    expected_evals: list[ToolAccuracyEval],
) -> dict:
    selection = score_tool_selection(trace.calls, expected_evals)
    param_scores = []
    for i, exp in enumerate(expected_evals):
        if i < len(trace.calls):
            ps = score_parameters(
                trace.calls[i].parameters,
                exp.expected_params,
                exp.param_match_mode,
            )
            param_scores.append(ps["param_accuracy"])
    sequence = score_sequence(
        trace.tool_sequence(),
        [e.expected_tool for e in expected_evals],
    )
    avg_param = (
        sum(param_scores) / len(param_scores)
        if param_scores else 0.0
    )
    return {
        "selection": selection,
        "avg_param_accuracy": round(avg_param, 3),
        "sequence": sequence,
        "composite_score": round(
            selection["selection_accuracy"] * 0.4
            + avg_param * 0.4
            + sequence["sequence_accuracy"] * 0.2,
            3,
        ),
    }
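
As a sanity check on the 0.4 / 0.4 / 0.2 weighting, an agent that selects every tool correctly and keeps perfect ordering but gets only half its parameters right still lands well below the accuracy targets discussed in the FAQ below:

```python
def composite(selection: float, params: float, sequence: float) -> float:
    """Same 0.4 / 0.4 / 0.2 weighting used in full_tool_accuracy_report."""
    return round(selection * 0.4 + params * 0.4 + sequence * 0.2, 3)

# Perfect selection and ordering, half the parameters wrong:
# 1.0 * 0.4 + 0.5 * 0.4 + 1.0 * 0.2 = 0.8
```

Tune the weights to your risk profile; for state-modifying tools, weighting parameter accuracy more heavily is a reasonable choice.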

FAQ

How do I build ground truth for tool call evaluation?

Start with your most common user intents. For each intent, manually define the expected tool calls and parameters. Use production conversation logs as your source — sample 50 conversations per task type and annotate the correct tool sequence. Automate what you can with deterministic rules, and use human annotators for ambiguous cases.
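In practice the annotated ground truth can live in a simple table mapping each intent to its expected calls. A sketch with hypothetical tool and parameter names:

```python
# Illustrative ground-truth table: intent -> ordered expected tool calls.
# Tool names, parameters, and values are hypothetical examples.
GROUND_TRUTH: dict[str, list[dict]] = {
    "book_appointment": [
        {"tool": "check_availability", "params": {"date": "2024-06-01"}},
        {"tool": "book_slot", "params": {"date": "2024-06-01", "time": "10:00"}},
    ],
    "cancel_appointment": [
        {"tool": "lookup_booking", "params": {"booking_id": "B123"}},
        {"tool": "cancel_booking", "params": {"booking_id": "B123"}},
    ],
}
```

Keeping the table as plain data makes it easy to review with human annotators and to load into the scorers above.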

What is an acceptable tool selection accuracy?

For production agents handling real transactions, target 95 percent or higher tool selection accuracy. Anything below 90 percent means roughly one in ten user requests triggers the wrong action. For read-only tools like search or lookup, 85 percent is workable. For tools that modify state — payments, bookings, deletions — you need near-perfect accuracy.

How do I handle cases where multiple tool sequences are valid?

Define a set of acceptable sequences rather than a single expected sequence. Score against the best-matching sequence from the set. Alternatively, define ordering constraints (A must come before B) rather than a full sequence, and verify that all constraints are satisfied.
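The ordering-constraint variant is simple to implement: each constraint says tool A must appear before tool B. A minimal sketch (constraint semantics here compare first occurrences, and a missing tool fails the constraint):

```python
def constraints_satisfied(
    sequence: list[str],
    constraints: list[tuple[str, str]],
) -> bool:
    """Return True if, for every (a, b) constraint, the first call to a
    precedes the first call to b. Missing tools fail the constraint."""
    for a, b in constraints:
        if a not in sequence or b not in sequence:
            return False
        if sequence.index(a) > sequence.index(b):
            return False
    return True
```

This avoids having to enumerate every valid full sequence when only a few pairwise orderings actually matter.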


#ToolUse #AgentEvaluation #FunctionCalling #Python #Benchmarking #AgenticAI #LearnAI #AIEngineering
