Measuring AI Support Agent Performance: CSAT, Resolution Rate, and Containment
Learn how to measure AI support agent effectiveness using CSAT scores, resolution rates, containment rates, and improvement loops that drive measurable gains in customer satisfaction.
You Cannot Improve What You Do Not Measure
Deploying an AI support agent without measuring its performance is like launching a product without analytics. You have no idea if customers are getting help, if the agent is making things worse, or where to invest improvement effort. The three metrics that matter most are containment rate (did the AI resolve it without a human?), resolution rate (was the issue actually solved?), and CSAT (was the customer satisfied?).
Defining the Core KPIs
Each metric captures a different dimension of performance. Together, they give a complete picture.
flowchart LR
USER(["Customer"])
CHANNEL{"Channel"}
CHAT["Chat agent"]
VOICE["Voice agent"]
EMAIL["Email agent"]
TRIAGE["Triage and<br/>intent detection"]
KB[("Knowledge base<br/>RAG")]
CRM[("CRM context")]
AUTORES{"Auto resolvable?"}
RESOLVE(["Resolved with<br/>cited answer"])
HUMAN(["Tier 2 agent"])
USER --> CHANNEL --> CHAT --> TRIAGE
CHANNEL --> VOICE --> TRIAGE
CHANNEL --> EMAIL --> TRIAGE
TRIAGE --> KB
TRIAGE --> CRM
TRIAGE --> AUTORES
AUTORES -->|Yes| RESOLVE
AUTORES -->|No| HUMAN
style TRIAGE fill:#4f46e5,stroke:#4338ca,color:#fff
style AUTORES fill:#f59e0b,stroke:#d97706,color:#1f2937
style RESOLVE fill:#059669,stroke:#047857,color:#fff
style HUMAN fill:#0ea5e9,stroke:#0369a1,color:#fff
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional
from enum import Enum
class ConversationOutcome(Enum):
RESOLVED_BY_AI = "resolved_by_ai"
ESCALATED_TO_HUMAN = "escalated_to_human"
ABANDONED = "abandoned"
UNRESOLVED = "unresolved"
@dataclass
class ConversationRecord:
id: str
started_at: datetime
ended_at: Optional[datetime]
outcome: ConversationOutcome
turn_count: int
csat_score: Optional[int] # 1-5
resolution_confirmed: bool
escalated: bool
intent: str
first_response_ms: int
total_duration_seconds: int
@dataclass
class SupportMetrics:
total_conversations: int = 0
resolved_by_ai: int = 0
escalated: int = 0
abandoned: int = 0
total_csat: float = 0.0
csat_responses: int = 0
confirmed_resolutions: int = 0
total_first_response_ms: int = 0
total_turns: int = 0
@property
def containment_rate(self) -> float:
if self.total_conversations == 0:
return 0.0
return self.resolved_by_ai / self.total_conversations
@property
def escalation_rate(self) -> float:
if self.total_conversations == 0:
return 0.0
return self.escalated / self.total_conversations
@property
def abandonment_rate(self) -> float:
if self.total_conversations == 0:
return 0.0
return self.abandoned / self.total_conversations
@property
def avg_csat(self) -> float:
if self.csat_responses == 0:
return 0.0
return self.total_csat / self.csat_responses
@property
def resolution_rate(self) -> float:
resolved_total = self.resolved_by_ai + self.escalated
if resolved_total == 0:
return 0.0
return self.confirmed_resolutions / resolved_total
@property
def avg_first_response_ms(self) -> float:
if self.total_conversations == 0:
return 0.0
return self.total_first_response_ms / self.total_conversations
@property
def avg_turns(self) -> float:
if self.total_conversations == 0:
return 0.0
return self.total_turns / self.total_conversations
Collecting Metrics
The metrics collector processes every completed conversation and updates aggregate numbers. It also breaks down metrics by intent category so you can see which topics the agent handles well and which need improvement.
from collections import defaultdict
class MetricsCollector:
def __init__(self):
self.overall = SupportMetrics()
self.by_intent: dict[str, SupportMetrics] = defaultdict(
SupportMetrics
)
def record(self, conversation: ConversationRecord):
for metrics in [self.overall, self.by_intent[conversation.intent]]:
metrics.total_conversations += 1
metrics.total_turns += conversation.turn_count
metrics.total_first_response_ms += conversation.first_response_ms
if conversation.outcome == ConversationOutcome.RESOLVED_BY_AI:
metrics.resolved_by_ai += 1
elif conversation.outcome == ConversationOutcome.ESCALATED_TO_HUMAN:
metrics.escalated += 1
elif conversation.outcome == ConversationOutcome.ABANDONED:
metrics.abandoned += 1
if conversation.resolution_confirmed:
metrics.confirmed_resolutions += 1
if conversation.csat_score is not None:
metrics.total_csat += conversation.csat_score
metrics.csat_responses += 1
def generate_report(self) -> dict:
report = {
"overall": {
"containment_rate": f"{self.overall.containment_rate:.1%}",
"escalation_rate": f"{self.overall.escalation_rate:.1%}",
"abandonment_rate": f"{self.overall.abandonment_rate:.1%}",
"avg_csat": f"{self.overall.avg_csat:.2f}/5.0",
"resolution_rate": f"{self.overall.resolution_rate:.1%}",
"avg_first_response": f"{self.overall.avg_first_response_ms:.0f}ms",
"avg_turns": f"{self.overall.avg_turns:.1f}",
},
"by_intent": {},
}
for intent, metrics in self.by_intent.items():
report["by_intent"][intent] = {
"containment_rate": f"{metrics.containment_rate:.1%}",
"avg_csat": f"{metrics.avg_csat:.2f}/5.0",
"volume": metrics.total_conversations,
}
return report
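To make the rate definitions concrete, here is a back-of-the-envelope version of what the collector computes, using a toy outcome log (the data is illustrative, not from a real deployment):

```python
from collections import Counter

# Toy outcome log: a compact stand-in for full ConversationRecord objects.
outcomes = ["resolved_by_ai", "resolved_by_ai", "escalated", "abandoned",
            "resolved_by_ai", "escalated"]
csat = [5, 4, 3, 4]  # only a subset of customers answer the survey

counts = Counter(outcomes)
total = len(outcomes)
containment = counts["resolved_by_ai"] / total  # 3 of 6 -> 50.0%
escalation = counts["escalated"] / total        # 2 of 6 -> 33.3%
avg_csat = sum(csat) / len(csat)                # 16 / 4 -> 4.00

print(f"containment {containment:.1%}, escalation {escalation:.1%}, "
      f"CSAT {avg_csat:.2f}")
```

Note that CSAT is averaged over survey responses, not over all conversations; non-responders simply drop out of the denominator, exactly as in `SupportMetrics.avg_csat`.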
Benchmarking Against Targets
Set target thresholds for each metric and track whether the agent is meeting them. This makes it easy to identify areas that need attention.
@dataclass
class PerformanceTarget:
metric: str
target: float
warning: float
critical: float
DEFAULT_TARGETS = [
PerformanceTarget("containment_rate", 0.70, 0.60, 0.50),
PerformanceTarget("avg_csat", 4.0, 3.5, 3.0),
PerformanceTarget("resolution_rate", 0.85, 0.75, 0.65),
PerformanceTarget("abandonment_rate", 0.10, 0.15, 0.20),
PerformanceTarget("avg_first_response_ms", 2000, 4000, 6000),
]
class PerformanceMonitor:
def __init__(self, targets: Optional[list[PerformanceTarget]] = None):
self.targets = targets or DEFAULT_TARGETS
def evaluate(self, metrics: SupportMetrics) -> list[dict]:
results = []
metric_values = {
"containment_rate": metrics.containment_rate,
"avg_csat": metrics.avg_csat,
"resolution_rate": metrics.resolution_rate,
"abandonment_rate": metrics.abandonment_rate,
"avg_first_response_ms": metrics.avg_first_response_ms,
}
for target in self.targets:
value = metric_values.get(target.metric, 0)
# For abandonment_rate and response time, lower is better
if target.metric in ("abandonment_rate", "avg_first_response_ms"):
if value <= target.target:
status = "healthy"
elif value <= target.warning:
status = "warning"
else:
status = "critical"
else:
if value >= target.target:
status = "healthy"
elif value >= target.warning:
status = "warning"
else:
status = "critical"
results.append({
"metric": target.metric,
"value": value,
"target": target.target,
"status": status,
})
return results
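The two-direction threshold logic is the subtle part, so here is a condensed, standalone sketch of it (the sample values are hypothetical, chosen to hit each band):

```python
def status(value, target, warning, lower_is_better=False):
    """Map a metric value to healthy/warning/critical.
    lower_is_better flips the comparisons for metrics like
    abandonment rate and response time."""
    if lower_is_better:
        if value <= target:
            return "healthy"
        return "warning" if value <= warning else "critical"
    if value >= target:
        return "healthy"
    return "warning" if value >= warning else "critical"

# Containment of 72% against a 70%/60% target band -> healthy
print(status(0.72, 0.70, 0.60))
# Abandonment of 17% against a 10%/15% band (lower is better) -> critical
print(status(0.17, 0.10, 0.15, lower_is_better=True))
# CSAT of 3.7 against a 4.0/3.5 band -> warning
print(status(3.7, 4.0, 3.5))
```

Keeping the direction flag explicit avoids the classic bug of silently treating a "lower is better" metric with "higher is better" comparisons.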
Improvement Loops
Metrics without action are just dashboards. Build automated improvement loops that identify the weakest areas and generate actionable recommendations.
class ImprovementEngine:
def __init__(self, collector: MetricsCollector):
self.collector = collector
def identify_weakest_intents(self, top_n: int = 3) -> list[dict]:
intents = []
for intent, metrics in self.collector.by_intent.items():
if metrics.total_conversations < 10:
continue
intents.append({
"intent": intent,
"containment": metrics.containment_rate,
"csat": metrics.avg_csat,
"volume": metrics.total_conversations,
"score": (
metrics.containment_rate * 0.4
+ (metrics.avg_csat / 5) * 0.4
+ (1 - metrics.abandonment_rate) * 0.2
),
})
intents.sort(key=lambda i: i["score"])
return intents[:top_n]
def recommend_actions(self) -> list[str]:
actions = []
weak = self.identify_weakest_intents()
for item in weak:
if item["containment"] < 0.5:
actions.append(
f"Intent '{item['intent']}': Low containment "
f"({item['containment']:.0%}). Review knowledge "
f"base coverage and add missing articles."
)
if item["csat"] < 3.5:
actions.append(
f"Intent '{item['intent']}': Low CSAT "
f"({item['csat']:.1f}). Review conversation "
f"transcripts for tone and accuracy issues."
)
return actions
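The weighted score is what drives prioritization, so it is worth seeing in isolation. A minimal sketch with two hypothetical intents (the weights match `identify_weakest_intents`: 40% containment, 40% normalized CSAT, 20% non-abandonment):

```python
def intent_health_score(containment, avg_csat, abandonment_rate):
    # Same weighting as identify_weakest_intents: 40% containment,
    # 40% CSAT normalized to [0, 1], 20% non-abandonment.
    return (containment * 0.4
            + (avg_csat / 5) * 0.4
            + (1 - abandonment_rate) * 0.2)

# Hypothetical intents; the lowest score surfaces first for review.
intents = {
    "billing": intent_health_score(0.45, 3.2, 0.20),   # ~0.60
    "shipping": intent_health_score(0.80, 4.5, 0.05),  # ~0.87
}
weakest = min(intents, key=intents.get)
print(weakest)  # billing
```

Because the score blends three signals, an intent with decent containment but poor CSAT still ranks low, which is exactly the "closing conversations without helping" failure mode you want surfaced.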
FAQ
What is a good containment rate for an AI support agent?
Industry benchmarks for AI support containment range from 60% to 80%. Start with a 65% target and improve from there. Below 50% means the AI is essentially a receptionist, not a resolver. Above 80% is excellent but verify with CSAT — high containment with low satisfaction means the agent is closing conversations without actually helping.

How do I collect CSAT from AI-handled conversations?
Send a one-question survey at the end of resolved conversations: "How satisfied were you with the support you received? (1-5)". Keep it simple — multi-question surveys get low response rates. You will typically see 15-25% response rates, which is enough for statistically meaningful analysis once you have a few hundred conversations.
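To sanity-check whether your sample is big enough, you can compute a rough 95% margin of error for the mean CSAT. This is a normal-approximation sketch with an illustrative response distribution, not real survey data:

```python
import math

def csat_margin_of_error(scores, z=1.96):
    """Approximate 95% confidence-interval half-width for mean CSAT.
    Normal approximation; reasonable once you have a few hundred
    responses."""
    n = len(scores)
    mean = sum(scores) / n
    # Sample variance (n - 1 denominator).
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return z * math.sqrt(var / n)

# Illustrative: 300 responses skewed toward the top of the 1-5 scale.
scores = [5] * 150 + [4] * 90 + [3] * 40 + [2] * 15 + [1] * 5
moe = csat_margin_of_error(scores)
print(f"mean {sum(scores) / len(scores):.2f} +/- {moe:.2f}")
```

With ~300 responses the margin lands around a tenth of a point, which is tight enough to trust a 0.2-point movement in the dashboard but not a 0.05-point one.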
How often should I review support agent metrics?
Review the dashboard daily for anomalies (sudden drops in containment or CSAT), weekly for trend analysis, and monthly for strategic improvements. The daily check catches outages and misconfigurations. The weekly review identifies gradual degradation. The monthly review drives knowledge base expansion and model improvements.
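The daily anomaly check can be automated with a simple z-score test against recent history. A minimal sketch (the daily containment figures are hypothetical):

```python
from statistics import mean, stdev

def is_anomalous(today, history, z_threshold=2.0):
    """Flag today's rate if it sits more than z_threshold standard
    deviations below the recent daily average."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today < mu
    return (mu - today) / sigma > z_threshold

history = [0.71, 0.69, 0.72, 0.70, 0.68, 0.71, 0.70]  # last 7 days
print(is_anomalous(0.52, history))  # sudden drop -> True
print(is_anomalous(0.69, history))  # normal variation -> False
```

A one-sided check (only drops alarm) keeps the pager quiet when containment improves; sudden *rises* are better examined in the weekly trend review, since they can also signal premature conversation closure.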
#SupportMetrics #CSAT #PerformanceMeasurement #AIAnalytics #CustomerSupport #AgenticAI #LearnAI #AIEngineering
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.