
Multi-Agent System Evaluation: Measuring Coordination Quality and Handoff Success

Learn how to evaluate multi-agent AI systems by measuring handoff accuracy, information retention across agents, routing correctness, and end-to-end coordination quality.

The Unique Challenge of Multi-Agent Evaluation

Evaluating a single agent is hard enough. Evaluating a system of agents that coordinate, delegate, and hand off to each other introduces entirely new failure modes. The individual agents might each perform well in isolation, yet the system fails because information gets lost during handoffs, the wrong agent receives a task, or two agents give contradictory answers to the same user.

Multi-agent evaluation requires metrics that span agent boundaries: handoff accuracy, information retention, routing correctness, and end-to-end coherence. You cannot get these by evaluating each agent independently.

Modeling Multi-Agent Conversations

Start by structuring how you represent multi-agent interactions for evaluation.

flowchart TD
    INPUT(["Task input"])
    SUPER["Supervisor agent<br/>plans and monitors"]
    W1["Worker 1<br/>research"]
    W2["Worker 2<br/>code"]
    W3["Worker 3<br/>writing"]
    CRITIC{"Output meets<br/>rubric?"}
    REWORK["Rework or<br/>retry path"]
    SHARED[("Shared scratchpad<br/>and memory")]
    OUT(["Final result"])
    INPUT --> SUPER
    SUPER --> W1 --> CRITIC
    SUPER --> W2 --> CRITIC
    SUPER --> W3 --> CRITIC
    W1 --> SHARED
    W2 --> SHARED
    W3 --> SHARED
    SHARED --> SUPER
    CRITIC -->|Pass| OUT
    CRITIC -->|Fail| REWORK --> SUPER
    style SUPER fill:#4f46e5,stroke:#4338ca,color:#fff
    style CRITIC fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OUT fill:#059669,stroke:#047857,color:#fff
    style SHARED fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b

from dataclasses import dataclass, field
from enum import Enum

class HandoffReason(Enum):
    CAPABILITY_MATCH = "capability_match"
    ESCALATION = "escalation"
    SPECIALIZATION = "specialization"
    FALLBACK = "fallback"

@dataclass
class AgentTurn:
    agent_id: str
    agent_role: str
    message: str
    tool_calls: list[dict] = field(default_factory=list)
    turn_index: int = 0

@dataclass
class Handoff:
    from_agent: str
    to_agent: str
    reason: HandoffReason
    context_passed: dict = field(default_factory=dict)
    turn_index: int = 0

@dataclass
class MultiAgentTrace:
    conversation_id: str
    turns: list[AgentTurn] = field(default_factory=list)
    handoffs: list[Handoff] = field(default_factory=list)
    user_messages: list[dict] = field(default_factory=list)

    def agents_involved(self) -> list[str]:
        seen = []
        for turn in self.turns:
            if turn.agent_id not in seen:
                seen.append(turn.agent_id)
        return seen

    def handoff_count(self) -> int:
        return len(self.handoffs)

    def turns_per_agent(self) -> dict[str, int]:
        counts = {}
        for turn in self.turns:
            counts[turn.agent_id] = (
                counts.get(turn.agent_id, 0) + 1
            )
        return counts

This trace captures the full conversation timeline: which agent spoke when, every handoff event, and the context that was passed during each transition.
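To make the shape concrete, here is a small self-contained usage sketch. The class bodies are repeated in compressed form so the snippet runs standalone, and the conversation data is purely illustrative:

```python
from dataclasses import dataclass, field
from enum import Enum

# Compressed repeats of the definitions above so this snippet runs standalone.
class HandoffReason(Enum):
    SPECIALIZATION = "specialization"

@dataclass
class AgentTurn:
    agent_id: str
    agent_role: str
    message: str
    turn_index: int = 0

@dataclass
class Handoff:
    from_agent: str
    to_agent: str
    reason: HandoffReason
    context_passed: dict = field(default_factory=dict)
    turn_index: int = 0

@dataclass
class MultiAgentTrace:
    conversation_id: str
    turns: list = field(default_factory=list)
    handoffs: list = field(default_factory=list)

    def agents_involved(self) -> list:
        seen = []
        for turn in self.turns:
            if turn.agent_id not in seen:
                seen.append(turn.agent_id)
        return seen

    def turns_per_agent(self) -> dict:
        counts = {}
        for turn in self.turns:
            counts[turn.agent_id] = counts.get(turn.agent_id, 0) + 1
        return counts

# A tiny two-agent conversation: triage hands off to billing.
trace = MultiAgentTrace(conversation_id="conv-001")
trace.turns.append(AgentTurn("triage", "router", "Routing you to billing.", 0))
trace.handoffs.append(Handoff("triage", "billing", HandoffReason.SPECIALIZATION,
                              context_passed={"account_id": "A-42"}, turn_index=0))
trace.turns.append(AgentTurn("billing", "specialist", "I see account A-42.", 1))

print(trace.agents_involved())   # ['triage', 'billing']
print(trace.turns_per_agent())   # {'triage': 1, 'billing': 1}
```

Downstream metrics read the `turns` and `handoffs` lists directly, so the trace doubles as the input format for every scorer in this article.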

Measuring Handoff Accuracy

A handoff is accurate when the right agent receives the task for the right reason with the right context.

@dataclass
class HandoffExpectation:
    expected_target: str
    expected_reason: HandoffReason
    required_context_keys: list[str] = field(
        default_factory=list
    )

def score_handoff_accuracy(
    actual_handoffs: list[Handoff],
    expected: list[HandoffExpectation],
) -> dict:
    if not expected:
        # No handoffs expected: perfect score only if none occurred.
        # Use the same keys as the main branch so downstream consumers
        # (such as coordination_score) can read the result uniformly.
        return {
            "target_accuracy": 1.0 if not actual_handoffs else 0.0,
            "context_completeness": 1.0 if not actual_handoffs else 0.0,
            "unexpected_handoffs": len(actual_handoffs),
        }

    results = []
    for i, exp in enumerate(expected):
        if i >= len(actual_handoffs):
            results.append({
                "index": i,
                "target_correct": False,
                "reason_correct": False,
                "context_complete": False,
                "status": "missing",
            })
            continue

        actual = actual_handoffs[i]
        target_ok = actual.to_agent == exp.expected_target
        reason_ok = actual.reason == exp.expected_reason
        context_ok = all(
            key in actual.context_passed
            for key in exp.required_context_keys
        )

        results.append({
            "index": i,
            "target_correct": target_ok,
            "reason_correct": reason_ok,
            "context_complete": context_ok,
            "actual_target": actual.to_agent,
            "expected_target": exp.expected_target,
        })

    target_accuracy = sum(
        1 for r in results if r["target_correct"]
    ) / len(results)
    context_completeness = sum(
        1 for r in results if r.get("context_complete", False)
    ) / len(results)

    return {
        "target_accuracy": round(target_accuracy, 3),
        "context_completeness": round(context_completeness, 3),
        "handoff_details": results,
        "unexpected_handoffs": max(
            0, len(actual_handoffs) - len(expected)
        ),
    }

Context completeness is the most frequently overlooked metric. An agent might route to the correct specialist, but if it drops the customer's account number during the handoff, the specialist has to ask for it again — creating a frustrating user experience.
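The dropped-account-number failure is easy to demonstrate in isolation. Here is a minimal standalone sketch of the context-completeness check, using plain dicts in place of the `Handoff` dataclass so it runs on its own (the field names mirror the ones above):

```python
def context_completeness(handoffs: list, required_keys_per_handoff: list) -> float:
    """Fraction of handoffs that carried every required context key."""
    if not handoffs:
        return 1.0
    complete = 0
    for handoff, required in zip(handoffs, required_keys_per_handoff):
        if all(key in handoff["context_passed"] for key in required):
            complete += 1
    return round(complete / len(handoffs), 3)

handoffs = [
    {"to_agent": "billing", "context_passed": {"account_id": "A-42"}},
    {"to_agent": "tech", "context_passed": {}},  # account number dropped here
]
required = [["account_id"], ["account_id"]]
print(context_completeness(handoffs, required))  # 0.5
```

Both handoffs reached a plausible target, yet half of them forced the receiving agent to re-collect information the user already provided, which is exactly what this metric surfaces.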

Information Retention Across Handoffs

Measure whether information mentioned before a handoff is available and used after it.

async def score_information_retention(
    llm_client,
    pre_handoff_messages: list[str],
    post_handoff_messages: list[str],
    key_facts: list[str],
) -> dict:
    facts_text = "\n".join(
        f"- {fact}" for fact in key_facts
    )
    post_text = "\n".join(post_handoff_messages[:5])

    prompt = f"""Evaluate whether key information from before
the agent handoff is retained and used after the handoff.

## Key Facts (established before handoff)
{facts_text}

## Post-Handoff Agent Messages
{post_text}

For each fact, determine:
- "retained": the agent demonstrates awareness of this fact
- "lost": the agent ignores or re-asks for this information
- "contradicted": the agent states something conflicting

Return JSON:
{{
  "facts": [
    {{"fact": "...", "status": "retained|lost|contradicted"}}
  ]
}}"""

    response = await llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    import json
    result = json.loads(response.choices[0].message.content)

    statuses = [f["status"] for f in result["facts"]]
    retained = statuses.count("retained")
    return {
        "retention_rate": round(
            retained / len(statuses), 3
        ) if statuses else 1.0,
        "retained": retained,
        "lost": statuses.count("lost"),
        "contradicted": statuses.count("contradicted"),
        "details": result["facts"],
    }

A contradicted fact is worse than a lost one. If the first agent says "Your appointment is on Tuesday" and the second agent says "Your appointment is on Thursday," the user loses trust in the entire system.
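The LLM judge above is the robust option, but a cheap deterministic fallback can catch obvious losses in CI without an API call. The sketch below is a crude lexical check of my own devising, not part of any library: a fact counts as retained if any of its longer tokens appears in the post-handoff text. It will miss paraphrases and cannot detect contradictions, so treat it as a smoke test only:

```python
def quick_retention_check(key_facts: list, post_handoff_messages: list) -> dict:
    """Lexical fallback: a fact is 'retained' if any of its tokens of
    4+ characters appears in the post-handoff messages. A cheap CI
    smoke test, not a replacement for the LLM judge."""
    post_text = " ".join(post_handoff_messages).lower()
    details = []
    for fact in key_facts:
        tokens = [t for t in fact.lower().split() if len(t) >= 4]
        hit = any(t in post_text for t in tokens)
        details.append({"fact": fact, "status": "retained" if hit else "lost"})
    rate = (
        sum(1 for d in details if d["status"] == "retained") / len(details)
        if details else 1.0
    )
    return {"retention_rate": round(rate, 3), "details": details}

report = quick_retention_check(
    ["appointment on Tuesday", "account A-42"],
    ["Your appointment on Tuesday is confirmed."],
)
print(report["retention_rate"])  # 0.5
```

Running both checks and alerting when they disagree sharply is one way to catch judge drift over time.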

Routing Correctness Evaluation

In systems with a triage or router agent, measure whether user intents get sent to the right specialist.

@dataclass
class RoutingTestCase:
    user_input: str
    correct_agent: str
    acceptable_agents: list[str] = field(
        default_factory=list
    )

def score_routing(
    test_cases: list[RoutingTestCase],
    actual_routes: list[str],
) -> dict:
    exact_matches = 0
    acceptable_matches = 0

    for case, actual in zip(test_cases, actual_routes):
        if actual == case.correct_agent:
            exact_matches += 1
            acceptable_matches += 1
        elif actual in case.acceptable_agents:
            acceptable_matches += 1

    n = len(test_cases)
    return {
        "exact_routing_accuracy": round(
            exact_matches / n, 3
        ) if n else 0.0,
        "acceptable_routing_accuracy": round(
            acceptable_matches / n, 3
        ) if n else 0.0,
        "total_cases": n,
        "misrouted": n - acceptable_matches,
    }

The distinction between exact and acceptable routing matters. If a billing question goes to the general support agent instead of the billing specialist, that is suboptimal but acceptable. If it goes to the technical troubleshooting agent, that is a misroute.
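That billing example can be expressed directly as a test case. The snippet below repeats `RoutingTestCase` and `score_routing` in compressed form so it runs standalone; the agent names are illustrative:

```python
from dataclasses import dataclass, field

# Compressed repeats of RoutingTestCase / score_routing from above.
@dataclass
class RoutingTestCase:
    user_input: str
    correct_agent: str
    acceptable_agents: list = field(default_factory=list)

def score_routing(test_cases: list, actual_routes: list) -> dict:
    exact = acceptable = 0
    for case, actual in zip(test_cases, actual_routes):
        if actual == case.correct_agent:
            exact += 1
            acceptable += 1
        elif actual in case.acceptable_agents:
            acceptable += 1
    n = len(test_cases)
    return {
        "exact_routing_accuracy": round(exact / n, 3) if n else 0.0,
        "acceptable_routing_accuracy": round(acceptable / n, 3) if n else 0.0,
        "misrouted": n - acceptable,
    }

cases = [
    RoutingTestCase("Why was I charged twice?", "billing", ["general_support"]),
    RoutingTestCase("The app crashes on login", "tech_support"),
]
# Billing question went to general support (acceptable but not exact);
# the crash report was misrouted to billing.
report = score_routing(cases, ["general_support", "billing"])
print(report)  # exact 0.0, acceptable 0.5, misrouted 1
```

Tracking the gap between exact and acceptable accuracy over time tells you whether the router is drifting toward "safe" generalist routes instead of learning the specialists.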


End-to-End Coordination Score

Combine all multi-agent metrics into a single coordination quality score.

def coordination_score(
    handoff_report: dict,
    retention_report: dict,
    routing_report: dict,
) -> dict:
    handoff_score = handoff_report.get(
        "target_accuracy", 0
    ) * 0.5 + handoff_report.get(
        "context_completeness", 0
    ) * 0.5

    retention_score = retention_report.get(
        "retention_rate", 0
    )
    # Penalize contradictions heavily
    contradictions = retention_report.get("contradicted", 0)
    retention_score = max(
        0, retention_score - contradictions * 0.2
    )

    routing_score = routing_report.get(
        "acceptable_routing_accuracy", 0
    )

    composite = (
        handoff_score * 0.3
        + retention_score * 0.4
        + routing_score * 0.3
    )

    return {
        "handoff_quality": round(handoff_score, 3),
        "information_retention": round(retention_score, 3),
        "routing_quality": round(routing_score, 3),
        "composite_coordination": round(composite, 3),
    }

Information retention gets the highest weight because it has the strongest correlation with user satisfaction. Users can tolerate a brief misroute that gets corrected. They cannot tolerate repeating themselves after every handoff.
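A worked example makes the weighting concrete. The snippet repeats `coordination_score` in compressed form so it runs standalone; the report values are illustrative, not real measurements:

```python
# Compressed repeat of coordination_score from above.
def coordination_score(handoff_report: dict, retention_report: dict,
                       routing_report: dict) -> dict:
    handoff = (handoff_report.get("target_accuracy", 0) * 0.5
               + handoff_report.get("context_completeness", 0) * 0.5)
    retention = retention_report.get("retention_rate", 0)
    # Penalize contradictions heavily.
    retention = max(0, retention - retention_report.get("contradicted", 0) * 0.2)
    routing = routing_report.get("acceptable_routing_accuracy", 0)
    composite = handoff * 0.3 + retention * 0.4 + routing * 0.3
    return {
        "handoff_quality": round(handoff, 3),
        "information_retention": round(retention, 3),
        "routing_quality": round(routing, 3),
        "composite_coordination": round(composite, 3),
    }

# Illustrative inputs: perfect targeting, some dropped context,
# one contradiction, near-perfect routing.
report = coordination_score(
    {"target_accuracy": 1.0, "context_completeness": 0.8},
    {"retention_rate": 0.9, "contradicted": 1},
    {"acceptable_routing_accuracy": 0.95},
)
print(report["composite_coordination"])  # 0.835
```

Note how the single contradiction drags retention from 0.9 to 0.7 and costs the composite 0.08 on its own, which is the point of the heavy penalty.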

FAQ

How do I test handoffs when agents are developed by different teams?

Define a handoff contract — a schema that specifies exactly what context fields must be passed during each type of handoff. Each team tests that their agent produces the correct output contract and correctly consumes the input contract. Then run end-to-end integration tests that verify the contracts work together. This is analogous to API contract testing in microservices.
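One lightweight way to encode such a contract is a per-handoff-type schema that both teams validate against in their own test suites. This is a sketch; the handoff type and field names are illustrative, and in practice you would likely reach for a schema library such as Pydantic or JSON Schema:

```python
# Each handoff type declares the context fields it must carry.
HANDOFF_CONTRACTS = {
    "triage_to_billing": {"required": ["account_id", "issue_summary"]},
    "billing_to_human": {"required": ["account_id", "issue_summary",
                                      "attempted_fixes"]},
}

def validate_handoff(handoff_type: str, context: dict) -> list:
    """Return the missing required fields (empty list means the contract holds)."""
    contract = HANDOFF_CONTRACTS.get(handoff_type)
    if contract is None:
        raise ValueError(f"unknown handoff type: {handoff_type}")
    return [f for f in contract["required"] if f not in context]

missing = validate_handoff("triage_to_billing", {"account_id": "A-42"})
print(missing)  # ['issue_summary']
```

The producing team asserts `validate_handoff(...) == []` on its outputs, the consuming team runs the same check on its inputs, and integration tests verify both sides agree on the shared schema.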

What is a good routing accuracy target for a triage agent?

Target 90 percent or higher acceptable routing accuracy. Below 85 percent, users will notice frequent misroutes. For systems with only two or three specialist agents, you should aim for 95 percent because the routing task is simpler. As the number of specialists grows, acceptable accuracy naturally drops — consider hierarchical routing (triage to category, then category to specialist) to maintain high accuracy.

How do I handle circular handoffs where agents keep passing the user back and forth?

Detect circular handoffs by tracking the agent sequence. If the same pair of agents hand off to each other more than once in a conversation, flag it as a coordination failure. Set a maximum handoff count per conversation (typically three to five) and escalate to a human when the limit is reached. Log circular patterns to identify systemic gaps in agent capabilities.
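The detection step can be sketched as a small pass over the handoff sequence. This simplified version flags any repeated directed (from, to) pair, which catches the A-to-B-to-A-to-B ping-pong described above while leaving a single delegate-and-return untouched:

```python
def detect_circular_handoffs(handoff_pairs: list, max_handoffs: int = 5) -> dict:
    """Flag ping-pong patterns: the same directed (from_agent, to_agent)
    pair occurring more than once, or total handoffs over the limit.
    `handoff_pairs` is a list of (from, to) tuples in conversation order."""
    seen = {}
    repeated = []
    for pair in handoff_pairs:
        seen[pair] = seen.get(pair, 0) + 1
        if seen[pair] == 2:  # record each repeated pair once
            repeated.append(pair)
    return {
        "circular": bool(repeated),
        "repeated_pairs": repeated,
        "over_limit": len(handoff_pairs) > max_handoffs,
    }

flags = detect_circular_handoffs([
    ("triage", "billing"), ("billing", "triage"), ("triage", "billing"),
])
print(flags["circular"], flags["repeated_pairs"])
```

When either `circular` or `over_limit` fires in production, route the conversation to a human and log the pattern for capability-gap analysis.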

