Learn Agentic AI

Agent Effectiveness Metrics: Resolution Rate, Containment, and First-Contact Resolution

Learn how to define, calculate, and benchmark the core effectiveness metrics for AI agents including resolution rate, containment rate, first-contact resolution, and strategies for systematic improvement.

The Metrics That Actually Matter

Deploying an AI agent is the easy part. Knowing whether it works well is hard. Teams that track vanity metrics like total conversations or average response time miss the real picture. The three metrics that define agent effectiveness are resolution rate, containment rate, and first-contact resolution.

These metrics answer the questions that stakeholders actually care about: Does the agent solve problems? Does it prevent escalations? Does it solve problems on the first try?

Metric Definitions

Understanding what each metric measures and how it differs from the others is essential before writing any calculation code.

flowchart LR
    INPUT(["User intent"])
    PARSE["Parse plus<br/>classify"]
    PLAN["Plan and tool<br/>selection"]
    AGENT["Agent loop<br/>LLM plus tools"]
    GUARD{"Guardrails<br/>and policy"}
    EXEC["Execute and<br/>verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome plus<br/>next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff

Resolution Rate measures the percentage of conversations where the user's issue was actually solved. A conversation is resolved if the user confirms the solution worked or if the agent successfully completed the requested action.


Containment Rate measures the percentage of conversations handled entirely by the agent without human escalation. A contained conversation may or may not be resolved — the user might give up and leave, which counts as contained but unresolved.

First-Contact Resolution (FCR) measures the percentage of issues resolved in a single conversation, without the user needing to come back and ask again about the same problem.

from dataclasses import dataclass
from enum import Enum

class ConversationOutcome(Enum):
    RESOLVED = "resolved"
    UNRESOLVED = "unresolved"
    ESCALATED = "escalated"
    ABANDONED = "abandoned"

@dataclass
class ConversationRecord:
    conversation_id: str
    user_id: str
    outcome: ConversationOutcome
    escalated_to_human: bool
    topic: str
    message_count: int
    duration_seconds: float
    followup_conversation_id: str | None = None

Calculating the Core Metrics

With structured conversation records, the calculations themselves are straightforward. The challenge is getting accurate outcome labels, not doing the math.

class EffectivenessCalculator:
    def __init__(self, records: list[ConversationRecord]):
        self.records = records

    def resolution_rate(self) -> float:
        if not self.records:
            return 0.0
        resolved = sum(
            1 for r in self.records
            if r.outcome == ConversationOutcome.RESOLVED
        )
        return resolved / len(self.records) * 100

    def containment_rate(self) -> float:
        if not self.records:
            return 0.0
        contained = sum(
            1 for r in self.records
            if not r.escalated_to_human
        )
        return contained / len(self.records) * 100

    def first_contact_resolution(self) -> float:
        if not self.records:
            return 0.0
        resolved_no_followup = sum(
            1 for r in self.records
            if r.outcome == ConversationOutcome.RESOLVED
            and r.followup_conversation_id is None
        )
        total_resolved = sum(
            1 for r in self.records
            if r.outcome == ConversationOutcome.RESOLVED
        )
        if total_resolved == 0:
            return 0.0
        return resolved_no_followup / total_resolved * 100

    def summary(self) -> dict:
        return {
            "total_conversations": len(self.records),
            "resolution_rate": round(self.resolution_rate(), 1),
            "containment_rate": round(self.containment_rate(), 1),
            "first_contact_resolution": round(
                self.first_contact_resolution(), 1
            ),
        }
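Before wiring the calculator into a pipeline, it helps to sanity-check the arithmetic by hand. A minimal worked example on four hypothetical conversations, using plain dicts rather than ConversationRecord to keep it self-contained:

```python
# Toy sanity check of the three formulas (hypothetical data).
records = [
    {"outcome": "resolved",  "escalated": False, "followup": None},
    {"outcome": "resolved",  "escalated": False, "followup": "c-7"},
    {"outcome": "escalated", "escalated": True,  "followup": None},
    {"outcome": "abandoned", "escalated": False, "followup": None},
]

total = len(records)
resolved = [r for r in records if r["outcome"] == "resolved"]

# 2 of 4 conversations resolved
resolution_rate = len(resolved) / total * 100                      # 50.0
# 3 of 4 conversations never reached a human
containment_rate = sum(not r["escalated"] for r in records) / total * 100  # 75.0
# 1 of the 2 resolved conversations had no follow-up
first_contact_resolution = (
    sum(r["followup"] is None for r in resolved) / len(resolved) * 100     # 50.0
)
```

Note that the third conversation counts against containment but the fourth does not: an abandoned conversation is contained yet unresolved, which is exactly the divergence between the two metrics.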

Outcome Labeling

The hardest part of effectiveness measurement is determining the conversation outcome. There are three approaches: explicit user feedback, implicit signal detection, and LLM-based classification.


from openai import OpenAI
import json

client = OpenAI()

def classify_outcome(messages: list[dict]) -> ConversationOutcome:
    formatted = "\n".join(
        f"{m['role']}: {m['content']}" for m in messages
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify this support conversation outcome. "
                "Return JSON: {\"outcome\": \"resolved\" | "
                "\"unresolved\" | \"escalated\" | \"abandoned\"}\n"
                "resolved = user's issue was solved\n"
                "unresolved = conversation ended without solving the issue\n"
                "escalated = transferred to a human agent\n"
                "abandoned = user stopped responding"
            )},
            {"role": "user", "content": formatted},
        ],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    try:
        return ConversationOutcome(result["outcome"])
    except (KeyError, ValueError):
        # Default to unresolved when the label is missing or unexpected --
        # better to undercount resolutions than overcount them
        return ConversationOutcome.UNRESOLVED

Benchmarking and Improvement

Industry benchmarks give you a target to aim for. For customer support agents, a resolution rate above 70% is good and above 85% is excellent. Containment rates above 80% are typical for well-tuned agents. An FCR above 65% is good, and above 80% indicates the agent resolves issues completely on the first pass.

BENCHMARKS = {
    "resolution_rate": {"poor": 50, "good": 70, "excellent": 85},
    "containment_rate": {"poor": 60, "good": 80, "excellent": 90},
    "first_contact_resolution": {"poor": 50, "good": 65, "excellent": 80},
}

def benchmark_report(metrics: dict) -> list[dict]:
    report = []
    for metric, value in metrics.items():
        if metric in BENCHMARKS:
            thresholds = BENCHMARKS[metric]
            if value >= thresholds["excellent"]:
                rating = "excellent"
            elif value >= thresholds["good"]:
                rating = "good"
            else:
                rating = "needs improvement"
            report.append({
                "metric": metric,
                "value": value,
                "rating": rating,
                "target": thresholds["excellent"],
                "gap": round(thresholds["excellent"] - value, 1),
            })
    return report

from collections import defaultdict

def topic_breakdown(records: list[ConversationRecord]) -> dict:
    topics: dict[str, list[ConversationRecord]] = defaultdict(list)
    for r in records:
        topics[r.topic].append(r)
    return {
        topic: EffectivenessCalculator(topic_records).summary()
        for topic, topic_records in topics.items()
    }
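The point of the per-topic breakdown is to find where the aggregate hides a weak spot: an 80% overall resolution rate can mask a topic the agent fails on half the time. A self-contained toy illustration of the grouping (plain dicts, illustrative fields):

```python
from collections import defaultdict

# Three hypothetical conversations across two topics.
convs = [
    {"topic": "billing",  "resolved": True},
    {"topic": "billing",  "resolved": False},
    {"topic": "shipping", "resolved": True},
]

by_topic: dict[str, list[bool]] = defaultdict(list)
for c in convs:
    by_topic[c["topic"]].append(c["resolved"])

# Per-topic resolution rate in percent
rates = {topic: sum(v) / len(v) * 100 for topic, v in by_topic.items()}
```

Here the overall rate is 66.7%, but billing sits at 50%: that topic, not the agent as a whole, is where improvement effort should go.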

FAQ

How do I handle conversations where the user never confirms resolution?

Use a combination of implicit signals and LLM classification. Implicit signals include the user saying "thanks" or "that worked," closing the chat window after receiving an answer, or not returning with the same issue within a defined window (e.g., 48 hours). LLM-based classification can catch subtler positive signals. Default to "unresolved" when uncertain — it is better to undercount resolutions than overcount them.
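The "not returning with the same issue within a defined window" signal requires linking follow-up conversations. A hedged sketch that links by (user, topic) within a 48-hour window; the conversation fields here are illustrative, not the dataclass from earlier:

```python
from datetime import datetime, timedelta

def link_followups(convs: list[dict], window_hours: int = 48) -> list[dict]:
    """Set each conversation's followup_id to the next conversation from the
    same user on the same topic within the window, or None."""
    convs = sorted(convs, key=lambda c: c["started_at"])
    window = timedelta(hours=window_hours)
    for i, conv in enumerate(convs):
        conv["followup_id"] = None
        for later in convs[i + 1:]:
            if later["started_at"] - conv["started_at"] > window:
                break  # conversations are sorted, so nothing later qualifies
            if later["user_id"] == conv["user_id"] and later["topic"] == conv["topic"]:
                conv["followup_id"] = later["id"]
                break
    return convs
```

Topic matching by exact string is the weak link here; in practice an embedding similarity or the LLM classifier handles users who describe the same problem in different words.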

What is the relationship between containment rate and resolution rate?

They measure different things and can diverge significantly. A high containment rate with a low resolution rate means the agent keeps conversations but fails to solve problems — users give up rather than escalate. The ideal is high containment and high resolution together. If you must prioritize, resolution rate is more important because an unresolved contained conversation is a frustrated user.

How often should I recalculate these metrics?

Calculate daily aggregates and expose rolling 7-day and 30-day averages. Daily numbers are noisy, especially at lower volumes. The 7-day rolling average smooths out day-of-week effects while still showing trends. Set up alerts when the 7-day average drops more than 5 percentage points from its 30-day baseline.
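The rolling average and the alert threshold from the paragraph above can be sketched in a few lines; the window sizes and the 5-point drop follow the text, while the data and function names are hypothetical:

```python
def rolling_mean(values: list[float], window: int = 7) -> list[float]:
    """Trailing rolling mean; early points average over whatever is available."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def should_alert(weekly_avg: float, baseline_avg: float, drop_pts: float = 5.0) -> bool:
    """Alert when the 7-day average drops more than drop_pts percentage
    points below the 30-day baseline."""
    return baseline_avg - weekly_avg > drop_pts
```

In production the same comparison typically lives in a dashboard or monitoring rule rather than application code, but the arithmetic is identical.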


#Metrics #ResolutionRate #KPIs #Analytics #AIAgents #AgenticAI #LearnAI #AIEngineering
