Agent Evaluation Beyond Accuracy: Trajectory, Tool-Use, and Cost Metrics
Accuracy alone misses what is actually wrong with your agent. Here are the metrics teams use in 2026 to evaluate agentic systems before and after deployment.
Why Single-Number Accuracy Fails Agents
A non-agentic LLM is essentially a single-output function: input goes in, output comes out, you grade it. Agents are paths through state space. Two agents can produce identical correct answers via wildly different trajectories — one cheap and reliable, one a 47-step disaster that happened to get there. Single-number accuracy hides this completely.
In 2026 the eval stacks that work measure four things at once: outcome, trajectory, tool use, and cost. This piece walks through what each one measures, the open-source frameworks that implement them, and the dashboards that actually get watched.
The Four-Dimensional Eval
flowchart TB
Run[Agent Run] --> Out[Outcome: did it succeed?]
Run --> Tr[Trajectory: was the path good?]
Run --> Tu[Tool Use: were calls correct?]
Run --> C[Cost: was it efficient?]
Out --> Score[Composite Score]
Tr --> Score
Tu --> Score
C --> Score
Score --> Gate[Release Gate]
Outcome
Did the final state of the world (database row, email sent, code change) match the goal? This is the only fully objective metric. For deterministic tasks (SWE-bench, AppWorld, Tau-Bench) it is exact match or unit-test pass. For open-ended tasks, you need a stronger LLM judge with a rubric.
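For the deterministic case, the check can be a few lines: compare the post-run state against a fixture. A minimal sketch in Python, assuming a hypothetical final-state dict captured after the run (field names are illustrative):

```python
# Minimal outcome check for a deterministic task (fixture and field names are hypothetical).
def outcome_score(final_state: dict, expected_state: dict) -> float:
    """1.0 if every field the task cares about matches the goal state, else 0.0."""
    return float(all(final_state.get(k) == v for k, v in expected_state.items()))

# Example: an appointment-scheduling task graded on database state.
expected = {"patient_id": "P-1042", "slot": "2026-03-02T09:30", "status": "booked"}
observed = {"patient_id": "P-1042", "slot": "2026-03-02T09:30", "status": "booked", "turns": 7}
assert outcome_score(observed, expected) == 1.0
```

For unit-test-graded tasks (SWE-bench style), the same function collapses to the pass/fail bit of the test runner.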
Trajectory Score
Was each step a reasonable continuation of the previous step? In 2026 the standard is Anthropic's trajectory rubric: an LLM judge scores each (state, action) pair on a 1-5 scale and the trajectory score is the geometric mean. Geometric mean punishes any single bad step, which is what you want — one obviously wrong step should not be averaged out by 19 fine ones.
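A minimal sketch of that aggregation, assuming per-step judge scores are already collected as floats on the 1-5 scale (the function name is illustrative, not Anthropic's implementation):

```python
import math

def trajectory_score(step_scores: list[float]) -> float:
    """Geometric mean of per-step judge scores (1-5 scale, so log() is always defined)."""
    assert step_scores, "empty trajectory"
    return math.exp(sum(math.log(s) for s in step_scores) / len(step_scores))

# 19 good steps and one obviously wrong one:
scores = [5.0] * 19 + [1.0]
print(sum(scores) / len(scores))     # arithmetic mean: 4.8  -- the bad step washes out
print(trajectory_score(scores))      # geometric mean: ~4.61 -- the bad step still shows
```

Normalizing judge scores to a 0-1 range before taking the geometric mean makes the penalty for a near-zero step far sharper, if that is the behavior you want.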
Tool-Use Correctness
Three sub-metrics that most teams now track:
- Selection accuracy — did the agent pick the right tool?
- Argument correctness — were the arguments syntactically and semantically right?
- Repetition rate — fraction of calls that duplicate a previous call's effect
Berkeley Function Calling Leaderboard V3 and Tau-Bench measure these directly with held-out test sets. For your own agent, you instrument every tool call and pipe the results to your eval harness.
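One way to compute the three sub-metrics from an instrumented call log; the record shape here is an assumption, not a standard schema, and argument correctness is reduced to exact match for brevity:

```python
import json

def tool_use_metrics(calls: list[dict], expected: list[dict]) -> dict:
    """calls / expected: [{"tool": str, "args": dict}, ...], aligned step by step.
    Hypothetical schema -- adapt to whatever your tracing layer emits."""
    n = len(calls)
    selection_hits = sum(c["tool"] == e["tool"] for c, e in zip(calls, expected))
    argument_hits = sum(c["tool"] == e["tool"] and c["args"] == e["args"]
                        for c, e in zip(calls, expected))
    # A call is a repetition if an identical (tool, args) pair was already issued.
    seen, repeated = set(), 0
    for c in calls:
        key = (c["tool"], json.dumps(c["args"], sort_keys=True))
        repeated += key in seen
        seen.add(key)
    return {
        "selection_accuracy": selection_hits / n if n else 0.0,
        "argument_correctness": argument_hits / n if n else 0.0,
        "repetition_rate": repeated / n if n else 0.0,
    }
```

Semantic argument correctness (an equivalent but differently formatted date, say) needs either normalization before the comparison or an LLM judge on the mismatches.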
Cost
Per-task dollar cost, p50/p95 latency, and token consumption. This is the metric most teams forget until the bill arrives. By 2026 the better eval frameworks (Braintrust, LangSmith, Inspect AI, Arize Phoenix) emit cost as a first-class signal alongside outcome.
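A minimal aggregation over per-task run records; the field names and the flat per-token price are placeholders for whatever your billing and tracing actually report:

```python
import statistics

def cost_metrics(runs: list[dict], usd_per_1k_tokens: float = 0.002) -> dict:
    """runs: [{"latency_s": float, "tokens": int}, ...] -- hypothetical record shape."""
    assert runs, "no runs to aggregate"
    latencies = sorted(r["latency_s"] for r in runs)
    total_tokens = sum(r["tokens"] for r in runs)
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[p95_index],
        "tokens_per_task": total_tokens / len(runs),
        "usd_per_task": total_tokens / len(runs) / 1000 * usd_per_1k_tokens,
    }
```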
The Per-Step vs Per-Trajectory Question
A common mistake: scoring trajectories only at the end. If an agent run is 100 steps, grading only the final answer means a 99-step disaster gets the same trajectory score as a clean 99-step path that happens to reach the same result. Score per-step. Aggregate at the trajectory level. Your dashboard should show both.
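A toy illustration of the difference, with made-up judge scores for two four-step runs that end on the same answer:

```python
# Two runs that finish identically but take very different paths (scores are illustrative).
messy = [2.0, 1.0, 2.0, 5.0]   # bad path that stumbles into the right final step
clean = [5.0, 5.0, 5.0, 5.0]   # good path to the same final step

def final_only(steps):          # grading only the end of the run
    return steps[-1]

def per_step(steps):            # per-step grading, aggregated (plain mean here for brevity)
    return sum(steps) / len(steps)

print(final_only(messy), final_only(clean))   # 5.0 5.0 -- indistinguishable
print(per_step(messy), per_step(clean))       # 2.5 5.0 -- the disaster shows up
```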
Eval Pipeline That Ships
sequenceDiagram
participant CI as CI Pipeline
participant H as Eval Harness
participant A as Agent
participant J as LLM Judge
participant D as Dashboard
CI->>H: trigger on PR
H->>A: run task suite
A->>H: trajectory + tool log
H->>J: rubric-based grading
J->>H: scores
H->>D: emit metrics
D->>CI: pass/fail gate
Three rules that make this stick in practice:
- Determinism where you can get it: pin model versions, seed where supported, snapshot tool fixtures
- Stratified test suites: split into unit, integration, regression, and adversarial — different gates for each
- Cost in the gate: a PR that doubles cost should fail the gate even if outcome is unchanged
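A sketch of the gate itself, assuming composite metric dicts for the PR run and the main-branch baseline; the thresholds are illustrative, not recommendations:

```python
def release_gate(pr: dict, baseline: dict) -> tuple[bool, list[str]]:
    """pr / baseline: {"outcome": float, "trajectory": float, "tool_use": float, "usd_per_task": float}.
    Fails the PR on a quality regression or a cost blow-up, even if outcome is unchanged."""
    failures = []
    for metric in ("outcome", "trajectory", "tool_use"):
        if pr[metric] < baseline[metric] - 0.02:              # small tolerance for judge noise
            failures.append(f"{metric} regressed: {pr[metric]:.2f} < {baseline[metric]:.2f}")
    if pr["usd_per_task"] > 1.25 * baseline["usd_per_task"]:  # cost lives in the gate too
        failures.append("cost per task grew by more than 25%")
    return (not failures, failures)
```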
The Open-Source Stack in 2026
- Inspect AI (UK AI Safety Institute) — sophisticated, frontier-grade rubric eval
- Braintrust and LangSmith — managed eval + tracing
- Phoenix (Arize) — open-source tracing with eval support
- Promptfoo — lightweight, CI-friendly
- DeepEval — Python-first, RAG-and-agent focused
What This Looks Like for a Voice Agent
CallSphere's healthcare voice agent runs through this stack on every model bump. The fixed-set tasks include "schedule appointment for new patient", "verify insurance for known patient", "handle reschedule with no available slot." Outcome is database state. Trajectory is judged. Tool-use accuracy is measured against ground-truth tool sequences. Cost includes both LLM and ASR/TTS minutes. A regression in any of the four dimensions blocks release.
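In practice each task in that fixed set is a fixture bundling all four dimensions. The shape below is hypothetical, not CallSphere's actual schema, but it shows what a single entry has to carry:

```python
# Hypothetical fixture for one task in a voice-agent eval suite (names illustrative).
RESCHEDULE_NO_SLOT = {
    "task_id": "reschedule-no-available-slot",
    "goal": "handle reschedule with no available slot",
    "expected_db_state": {"appointment_status": "unchanged", "callback_requested": True},
    "ground_truth_tools": ["lookup_patient", "list_open_slots", "offer_callback"],
    "budgets": {"max_usd": 0.40, "max_latency_s": 8.0, "max_asr_tts_minutes": 4.0},
}
```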
Sources
- Tau-Bench paper — https://arxiv.org/abs/2406.12045
- Berkeley Function Calling Leaderboard — https://gorilla.cs.berkeley.edu/leaderboard.html
- Inspect AI — https://inspect.ai-safety-institute.org.uk
- LangSmith eval docs — https://docs.smith.langchain.com/evaluation
- Anthropic trajectory rubric — https://www.anthropic.com/research