
Measuring Agent User Experience: CSAT, SUS, and Custom UX Metrics for AI Products

Build a comprehensive UX measurement framework for AI agents using CSAT surveys, System Usability Scale, custom behavioral metrics, A/B testing strategies, and analytics pipelines.

You Cannot Improve What You Cannot Measure

Building a great AI agent UX requires continuous measurement. Intuition and user complaints are not enough — you need quantitative metrics that track experience quality over time, surface regressions quickly, and provide actionable data for improvement.

AI agents present unique measurement challenges. Traditional web analytics (page views, click-through rates) do not capture conversational quality. You need a layered approach combining survey-based metrics, behavioral signals, and AI-specific quality indicators.

CSAT: Customer Satisfaction Score

CSAT is the most straightforward UX metric. Ask users to rate their experience on a 1-5 scale at the end of an interaction:

from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class SurveyTrigger(Enum):
    TASK_COMPLETED = "task_completed"
    HUMAN_ESCALATION = "human_escalation"
    SESSION_END = "session_end"
    ERROR_RECOVERY = "error_recovery"

@dataclass
class CSATSurvey:
    conversation_id: str
    trigger: SurveyTrigger
    rating: int | None           # 1-5
    comment: str | None
    timestamp: datetime
    task_type: str
    turns_in_conversation: int

class CSATCollector:
    """Collect and analyze CSAT scores for agent interactions."""

    SURVEY_MESSAGES = {
        SurveyTrigger.TASK_COMPLETED: (
            "I'm glad I could help! On a scale of 1-5, "
            "how would you rate your experience today?"
        ),
        SurveyTrigger.HUMAN_ESCALATION: (
            "Before I transfer you, could you rate your experience "
            "with me so far? (1-5, 5 being excellent)"
        ),
        SurveyTrigger.ERROR_RECOVERY: (
            "I know we hit a bump earlier. Now that it's resolved, "
            "how would you rate the overall experience? (1-5)"
        ),
        SurveyTrigger.SESSION_END: (
            "Before you go, how would you rate your experience "
            "today? (1-5, 5 being excellent)"
        ),
    }

    def should_survey(
        self,
        conversation_id: str,
        trigger: SurveyTrigger,
        recent_survey_count: int,
    ) -> bool:
        """Avoid survey fatigue — limit frequency."""
        if recent_survey_count >= 1:
            return False  # Max one survey per session
        if trigger == SurveyTrigger.SESSION_END:
            return True
        if trigger == SurveyTrigger.TASK_COMPLETED:
            return True
        return False

    def calculate_csat_score(self, surveys: list[CSATSurvey]) -> dict:
        """Calculate CSAT percentage (% of 4 and 5 ratings)."""
        rated = [s for s in surveys if s.rating is not None]
        if not rated:
            return {"score": None, "sample_size": 0}

        satisfied = sum(1 for s in rated if s.rating >= 4)
        return {
            "score": round((satisfied / len(rated)) * 100, 1),
            "sample_size": len(rated),
            "average_rating": round(
                sum(s.rating for s in rated) / len(rated), 2
            ),
        }

Target a CSAT score of 80% or higher. Below 70% indicates a systemic UX problem.
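To make the flow concrete, here is a quick usage sketch against the classes above (the ratings are made-up sample data):

collector = CSATCollector()

surveys = [
    CSATSurvey(
        conversation_id=f"conv-{i}",
        trigger=SurveyTrigger.TASK_COMPLETED,
        rating=r,
        comment=None,
        timestamp=datetime.now(),
        task_type="order_lookup",
        turns_in_conversation=6,
    )
    for i, r in enumerate([5, 4, 4, 3, 5, 2, 4])
]

print(collector.calculate_csat_score(surveys))
# {'score': 71.4, 'sample_size': 7, 'average_rating': 3.86}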


System Usability Scale (SUS)

SUS is a standardized 10-question survey that produces a score from 0-100. It is ideal for periodic deep-dive assessments of your agent's usability:

SUS_QUESTIONS = [
    "I think I would like to use this AI assistant frequently.",
    "I found the AI assistant unnecessarily complex.",
    "I thought the AI assistant was easy to use.",
    "I think I would need technical support to use this assistant.",
    "I found the various capabilities were well integrated.",
    "I thought there was too much inconsistency in this assistant.",
    "I imagine most people would learn to use this assistant quickly.",
    "I found the assistant very cumbersome to use.",
    "I felt very confident using the assistant.",
    "I needed to learn a lot before I could use this assistant.",
]

# Questions alternate between positive and negative framing
POSITIVE_QUESTIONS = {0, 2, 4, 6, 8}  # 0-indexed

def calculate_sus_score(responses: list[int]) -> float:
    """
    Calculate SUS score from 10 responses (each 1-5).
    Score ranges from 0 to 100. Above 68 is above average.
    Above 80 is excellent.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 responses")

    adjusted = []
    for i, response in enumerate(responses):
        if i in POSITIVE_QUESTIONS:
            adjusted.append(response - 1)      # Positive: score - 1
        else:
            adjusted.append(5 - response)      # Negative: 5 - score

    return sum(adjusted) * 2.5

def interpret_sus_score(score: float) -> str:
    if score >= 80.3:
        return "Excellent (Grade A)"
    elif score >= 68:
        return "Good (Grade C) — above average"
    elif score >= 51:
        return "OK (Grade D) — below average, needs improvement"
    else:
        return "Poor (Grade F) — significant usability issues"

Custom Behavioral Metrics for AI Agents

Survey metrics capture stated satisfaction. Behavioral metrics capture actual usage patterns:

@dataclass
class ConversationMetrics:
    conversation_id: str
    started_at: datetime
    ended_at: datetime
    total_turns: int
    user_turns: int
    agent_turns: int
    task_completed: bool
    escalated_to_human: bool
    errors_encountered: int
    errors_recovered: int
    clarification_questions_asked: int
    follow_up_prompts_clicked: int
    user_rephrased_count: int     # Times user had to rephrase
    time_to_first_value: float    # Seconds to first useful response
    idle_gaps: list[float]        # Seconds between user messages

def calculate_behavioral_health(
    metrics: list[ConversationMetrics],
) -> dict:
    """Calculate aggregate behavioral health indicators."""

    total = len(metrics)
    if total == 0:
        return {}

    task_completion_rate = (
        sum(1 for m in metrics if m.task_completed) / total * 100
    )

    escalation_rate = (
        sum(1 for m in metrics if m.escalated_to_human) / total * 100
    )

    avg_turns_to_completion = (
        sum(m.total_turns for m in metrics if m.task_completed)
        / max(sum(1 for m in metrics if m.task_completed), 1)
    )

    avg_rephrase_rate = (
        sum(m.user_rephrased_count for m in metrics) / total
    )

    avg_time_to_value = (
        sum(m.time_to_first_value for m in metrics) / total
    )

    error_recovery_rate = (
        sum(m.errors_recovered for m in metrics)
        / max(sum(m.errors_encountered for m in metrics), 1)
        * 100
    )

    return {
        "task_completion_rate": round(task_completion_rate, 1),
        "escalation_rate": round(escalation_rate, 1),
        "avg_turns_to_completion": round(avg_turns_to_completion, 1),
        "avg_rephrase_rate": round(avg_rephrase_rate, 2),
        "avg_time_to_value_seconds": round(avg_time_to_value, 1),
        "error_recovery_rate": round(error_recovery_rate, 1),
    }

Key thresholds to watch: task completion rate below 70% means the agent is failing its core job. Rephrase rate above 1.5 per conversation means the agent is not understanding users. Time to first value above 30 seconds means the onboarding or first response is too slow.
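If you want those watch thresholds enforced in code, a minimal sketch over the output of calculate_behavioral_health might look like this (the threshold table and helper name are my own; the paging thresholds in the alerting section below are intentionally looser):

# Watch thresholds from the paragraph above. The looser paging
# thresholds live in generate_daily_alerts further down.
WATCH_THRESHOLDS = {
    "task_completion_rate": ("min", 70.0),
    "avg_rephrase_rate": ("max", 1.5),
    "avg_time_to_value_seconds": ("max", 30.0),
}

def behavioral_warnings(health: dict) -> list[str]:
    """Flag any health metric that crosses its watch threshold."""
    warnings = []
    for metric, (direction, limit) in WATCH_THRESHOLDS.items():
        value = health.get(metric)
        if value is None:
            continue
        if direction == "min" and value < limit:
            warnings.append(f"{metric} at {value} (watch: >= {limit})")
        elif direction == "max" and value > limit:
            warnings.append(f"{metric} at {value} (watch: <= {limit})")
    return warnings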

A/B Testing UX Changes

Test UX changes rigorously before rolling them out:

import hashlib
from dataclasses import dataclass

@dataclass
class ABTestConfig:
    test_id: str
    variants: dict[str, dict]   # variant_name -> config
    traffic_split: dict[str, float]  # variant_name -> percentage (0-1)
    primary_metric: str
    minimum_sample_size: int

def assign_variant(user_id: str, test: ABTestConfig) -> str:
    """Deterministically assign a user to a test variant."""
    hash_input = f"{test.test_id}:{user_id}"
    hash_value = int(hashlib.sha256(hash_input.encode()).hexdigest(), 16)
    bucket = (hash_value % 1000) / 1000.0

    cumulative = 0.0
    for variant, split in test.traffic_split.items():
        cumulative += split
        if bucket < cumulative:
            return variant

    return list(test.traffic_split.keys())[-1]

# Example: Testing a new greeting format
greeting_test = ABTestConfig(
    test_id="greeting_v2_2026_03",
    variants={
        "control": {
            "greeting_style": "list_capabilities",
            "max_greeting_length": 200,
        },
        "treatment": {
            "greeting_style": "single_question",
            "max_greeting_length": 50,
        },
    },
    traffic_split={"control": 0.5, "treatment": 0.5},
    primary_metric="task_completion_rate",
    minimum_sample_size=500,
)
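Assignment is only half the job; you also need to know when a difference between variants is real. One common choice for rate metrics like task completion is a two-proportion z-test. A minimal standard-library sketch (the helper and the counts are illustrative, not part of the config above):

import math

def two_proportion_z_test(
    successes_a: int, n_a: int,
    successes_b: int, n_b: int,
) -> float:
    """Return the two-sided p-value for a difference in proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Illustrative counts: completed tasks out of sessions per variant.
p_value = two_proportion_z_test(312, 500, 348, 500)
print(f"p = {p_value:.4f}")  # Treat < 0.05 as significant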

Building an Analytics Dashboard

Aggregate all metrics into a single view that surfaces problems early:


@dataclass
class AgentHealthDashboard:
    """Daily snapshot of agent UX health."""
    date: str
    csat_score: float
    task_completion_rate: float
    avg_turns_to_completion: float
    escalation_rate: float
    error_rate: float
    error_recovery_rate: float
    avg_time_to_value: float
    avg_rephrase_rate: float
    active_ab_tests: list[str]
    alerts: list[str]

def generate_daily_alerts(dashboard: AgentHealthDashboard) -> list[str]:
    """Generate alerts when metrics cross thresholds."""
    alerts = []

    if dashboard.csat_score < 70:
        alerts.append(
            f"CSAT dropped to {dashboard.csat_score}% — "
            "investigate recent changes"
        )

    if dashboard.task_completion_rate < 65:
        alerts.append(
            f"Task completion at {dashboard.task_completion_rate}% — "
            "check for broken flows"
        )

    if dashboard.escalation_rate > 30:
        alerts.append(
            f"Escalation rate at {dashboard.escalation_rate}% — "
            "agent may be failing common intents"
        )

    if dashboard.avg_rephrase_rate > 2.0:
        alerts.append(
            f"Users rephrasing {dashboard.avg_rephrase_rate}x on average — "
            "NLU needs tuning"
        )

    if dashboard.avg_time_to_value > 45:
        alerts.append(
            f"Time to value at {dashboard.avg_time_to_value}s — "
            "first response too slow"
        )

    return alerts

Wire these alerts into your team's notification system (Slack, PagerDuty) so regressions are caught the same day they happen.
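As one possible wiring, here is a minimal sketch that posts the alerts to a Slack incoming webhook. The SLACK_WEBHOOK_URL environment variable is an assumption about your setup; the "text" payload is Slack's standard incoming-webhook format:

import json
import os
import urllib.request

def post_alerts_to_slack(alerts: list[str]) -> None:
    """Post dashboard alerts to a Slack incoming webhook."""
    if not alerts:
        return
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]  # assumed to be configured
    payload = {
        "text": "Agent UX alerts:\n" + "\n".join(f"- {a}" for a in alerts)
    }
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

# post_alerts_to_slack(generate_daily_alerts(dashboard))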

FAQ

How often should I collect CSAT surveys without causing survey fatigue?

Limit surveys to one per user session and no more than once per week for the same user. Rotate between end-of-task surveys and periodic in-depth surveys (like SUS). A 10-15% survey response rate is normal for in-product surveys — do not try to survey everyone. If your response rate drops below 5%, your survey prompt is too intrusive or too frequent.
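One simple way to enforce the once-per-week rule is a per-user cooldown check; a sketch, where last_surveyed_at comes from whatever store tracks your users:

from datetime import datetime, timedelta

SURVEY_COOLDOWN = timedelta(days=7)

def user_is_surveyable(
    last_surveyed_at: datetime | None,
    now: datetime | None = None,
) -> bool:
    """True if the user has not been surveyed within the cooldown window."""
    if last_surveyed_at is None:
        return True
    now = now or datetime.now()
    return now - last_surveyed_at >= SURVEY_COOLDOWN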

What is the most important single metric for agent UX?

Task completion rate. If users cannot complete the task they came for, no amount of personality, formatting, or speed matters. Track it by task type (order lookup, returns, FAQ) so you can identify which specific flows are broken. A high overall completion rate can mask a 20% completion rate on a specific task that affects thousands of users.
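Segmenting completion by task type is a one-pass aggregation. A sketch, assuming each conversation record carries task_type and task_completed fields (ConversationMetrics above has task_completed but not task_type, so treat that field as an assumption about your schema):

from collections import defaultdict

def completion_rate_by_task(records: list[dict]) -> dict[str, float]:
    """Task completion rate per task_type, as a percentage."""
    totals: dict[str, int] = defaultdict(int)
    completed: dict[str, int] = defaultdict(int)
    for r in records:
        totals[r["task_type"]] += 1
        if r["task_completed"]:
            completed[r["task_type"]] += 1
    return {
        task: round(completed[task] / totals[task] * 100, 1)
        for task in totals
    }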

How do I isolate whether a UX change or a model change caused a metric shift?

Never ship a UX change and a model change simultaneously. If your A/B test changes the greeting format at the same time you update the underlying model, you cannot attribute the metric movement. Use staged rollouts: ship the model change first, let metrics stabilize for a week, then launch the UX A/B test. If you must do both, use a 2x2 factorial design (old model + old UX, old model + new UX, new model + old UX, new model + new UX), but this requires 4x the sample size, as sketched below.
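If you do run the factorial design, it maps directly onto the ABTestConfig from earlier; the variant names and configs here are illustrative:

factorial_test = ABTestConfig(
    test_id="model_x_greeting_2026_04",
    variants={
        "old_model_old_ux": {"model": "v1", "greeting_style": "list_capabilities"},
        "old_model_new_ux": {"model": "v1", "greeting_style": "single_question"},
        "new_model_old_ux": {"model": "v2", "greeting_style": "list_capabilities"},
        "new_model_new_ux": {"model": "v2", "greeting_style": "single_question"},
    },
    traffic_split={
        "old_model_old_ux": 0.25,
        "old_model_new_ux": 0.25,
        "new_model_old_ux": 0.25,
        "new_model_new_ux": 0.25,
    },
    primary_metric="task_completion_rate",
    minimum_sample_size=2000,  # roughly 4x the single-factor test
)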


#UXMetrics #CSAT #Analytics #AIAgents #ABTesting #AgenticAI #LearnAI #AIEngineering

