
AI Voice Agent Analytics: The KPIs That Actually Matter

The 15 KPIs that matter for AI voice agent operations — from answer rate and FCR to cost per successful resolution.

If you are not measuring these, you are guessing

Voice agent dashboards tend to show whatever was easiest to build — total calls, total minutes, maybe sentiment. None of those tell you whether the agent is good at its job. This post lays out the 15 KPIs that actually matter for operating an AI voice agent and shows how to compute each one against a standard call log schema.

Every metric answers a question:
  • Did callers reach us?
  • Did the agent solve their problem?
  • How much did it cost?
  • Did anything go wrong?

Architecture overview

┌────────────────────┐
│ Voice agent runtime│
└─────────┬──────────┘
          │ call events
          ▼
┌────────────────────┐
│ calls table (OLTP) │
└─────────┬──────────┘
          │ CDC / copy
          ▼
┌────────────────────┐
│ analytics store    │
│ (ClickHouse / BQ)  │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ dashboards + alerts│
└────────────────────┘

Prerequisites

  • A calls table with at minimum: call_id, started_at, ended_at, duration_sec, status, outcome, escalated, followup_required, language, cost_cents, agent_version.
  • A call_turns table with transcripts.
  • A call_events table (or enum column) with outcomes like resolved, escalated, abandoned.

The 15 KPIs

1. Answer rate

Percentage of inbound attempts that the agent actually picked up.

SELECT
  COUNT(*) FILTER (WHERE status = 'answered') * 1.0 / COUNT(*) AS answer_rate
FROM calls
WHERE started_at >= now() - interval '7 days';

2. Time to first word

How long from ring to the first syllable of the agent's greeting.
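
A sketch of the median, assuming call_turns stores a per-turn timestamp and a role column (neither is listed in the prerequisites, so rename to match your schema):

SELECT
  percentile_cont(0.5) WITHIN GROUP (
    ORDER BY extract(epoch FROM t.first_agent_turn_at - c.started_at)
  ) AS ttfw_p50_sec
FROM calls c
JOIN (
  SELECT call_id, MIN(created_at) AS first_agent_turn_at  -- created_at and role are assumed columns
  FROM call_turns
  WHERE role = 'agent'
  GROUP BY call_id
) t USING (call_id)
WHERE c.started_at >= now() - interval '7 days';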

3. Average handle time (AHT)
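
Mean talk time over answered calls, using the duration_sec and status columns from the prerequisites:

SELECT
  AVG(duration_sec) AS aht_sec
FROM calls
WHERE status = 'answered'
  AND started_at >= now() - interval '7 days';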

4. First-contact resolution (FCR)

SELECT
  COUNT(*) FILTER (WHERE outcome = 'resolved' AND NOT followup_required) * 1.0 / COUNT(*) AS fcr
FROM calls;

5. Escalation rate

6. Containment rate

Inverse of escalation — the percentage of calls fully handled by the agent.
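
Both fall out of the escalated boolean; containment is simply its complement:

SELECT
  AVG(escalated::int)     AS escalation_rate,
  1 - AVG(escalated::int) AS containment_rate
FROM calls
WHERE status = 'answered'
  AND started_at >= now() - interval '7 days';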

7. Abandon rate
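
A sketch, assuming abandoned calls are written to the outcome column as listed in the prerequisites:

SELECT
  COUNT(*) FILTER (WHERE outcome = 'abandoned') * 1.0 / COUNT(*) AS abandon_rate
FROM calls
WHERE started_at >= now() - interval '7 days';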

8. Booking rate (for scheduling verticals)
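
A sketch for scheduling verticals; 'booked' is an assumed outcome value, so substitute whatever your booking tool writes:

SELECT
  COUNT(*) FILTER (WHERE outcome = 'booked') * 1.0   -- 'booked' is an assumed outcome value
    / NULLIF(COUNT(*) FILTER (WHERE status = 'answered'), 0) AS booking_rate
FROM calls
WHERE started_at >= now() - interval '7 days';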

9. Sentiment score

Aggregate from the post-call pipeline.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

10. Cost per successful resolution

SELECT
  SUM(cost_cents) / NULLIF(SUM(CASE WHEN outcome = 'resolved' THEN 1 ELSE 0 END), 0) AS cpsr
FROM calls;

11. STT word error rate (WER)

Sample 1% of calls, have humans transcribe them, and compare against the STT transcript.

12. Tool call success rate
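
The prerequisites do not define tool-call logging, so this sketch assumes call_events also records one row per tool invocation with an event_type, a success flag, and a timestamp:

SELECT
  AVG(success::int) AS tool_call_success_rate   -- event_type and success are assumed columns
FROM call_events
WHERE event_type = 'tool_call'
  AND occurred_at >= now() - interval '7 days';  -- assumed timestamp column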

13. Hallucination flag rate

From the post-call QA pipeline.

14. CSAT (when available)

15. Latency p95
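
A sketch, assuming call_turns carries the agent's per-turn response latency in milliseconds (a column the prerequisites do not define):

SELECT
  percentile_cont(0.95) WITHIN GROUP (ORDER BY response_latency_ms) AS latency_p95_ms
FROM call_turns
WHERE created_at >= now() - interval '7 days';  -- response_latency_ms and created_at are assumed columns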

Step-by-step walkthrough

1. Standardize the call log schema

CREATE TABLE calls (
  call_id TEXT PRIMARY KEY,
  started_at TIMESTAMPTZ NOT NULL,
  ended_at TIMESTAMPTZ,
  duration_sec INT,
  status TEXT NOT NULL,
  outcome TEXT,
  escalated BOOLEAN DEFAULT FALSE,
  followup_required BOOLEAN DEFAULT FALSE,
  language TEXT,
  cost_cents INT,
  agent_version TEXT
);

2. Compute metrics in batches

Run a 5-minute rollup job for dashboards and an hourly rollup for historical trends.
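
A minimal sketch of the 5-minute rollup that feeds the kpi_rollup table read by the admin UI below; every column beyond period_start and period_end is an assumption:

INSERT INTO kpi_rollup (period_start, period_end, answer_rate, escalation_rate, fcr, aht_sec)
SELECT
  date_trunc('minute', now()) - interval '5 minutes',
  date_trunc('minute', now()),
  COUNT(*) FILTER (WHERE status = 'answered') * 1.0 / NULLIF(COUNT(*), 0),
  AVG(escalated::int),
  COUNT(*) FILTER (WHERE outcome = 'resolved' AND NOT followup_required) * 1.0 / NULLIF(COUNT(*), 0),
  AVG(duration_sec) FILTER (WHERE status = 'answered')
FROM calls
WHERE started_at >= now() - interval '5 minutes';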

3. Set SLOs and alert on p95
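
A sketch of the check an alerting job can poll, assuming the rollup also stores a latency_p95_ms column and a 1.5-second SLO:

SELECT period_start, latency_p95_ms
FROM kpi_rollup
WHERE period_end >= now() - interval '15 minutes'
  AND latency_p95_ms > 1500;  -- assumed column and threshold; any rows returned should page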

4. Expose the metrics in an admin UI

// Return every rollup row in the requested window.
// A date range usually spans many rollup rows, so use manyOrNone (zero or more) rather than oneOrNone.
async function fetchKpis(from: string, to: string) {
  return await db.manyOrNone(
    "SELECT * FROM kpi_rollup WHERE period_start >= $1 AND period_end <= $2",
    [from, to],
  );
}

5. Build an evaluation harness

Take real calls, mask PII, and replay them against a staging agent to compare FCR and AHT across prompt versions.
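
With agent_version recorded on every call, the comparison is a plain GROUP BY over the same schema:

SELECT
  agent_version,
  COUNT(*) FILTER (WHERE outcome = 'resolved' AND NOT followup_required) * 1.0 / COUNT(*) AS fcr,
  AVG(duration_sec) AS aht_sec
FROM calls
GROUP BY agent_version
ORDER BY agent_version;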

Production considerations

  • Sampling: WER and hallucination checks need human labelers; sample, do not inspect all.
  • Cost attribution: Realtime API + TTS + Twilio + STT all contribute; track separately.
  • Version pinning: record which agent version handled each call for A/B comparisons.
  • PII in dashboards: mask caller IDs and names at the dashboard layer.
  • Retention: raw transcripts are sensitive; delete or tokenize after 30-90 days depending on vertical.

CallSphere's real implementation

CallSphere runs a GPT-4o-mini post-call analytics pipeline that writes sentiment, intent, lead score, satisfaction, and escalation flags into per-vertical Postgres databases. Those columns feed the 15 KPIs above in an admin dashboard every customer gets access to. The live voice plane runs the OpenAI Realtime API (gpt-4o-realtime-preview-2025-06-03) at 24kHz PCM16 with server VAD.

Across 14 healthcare tools, 10 real estate agents, 4 salon agents, 7 after-hours escalation tools, 10-plus RAG IT helpdesk tools, and the 5-specialist ElevenLabs sales pod, KPIs are computed identically so customers can compare performance across verticals. The OpenAI Agents SDK orchestrates handoffs. CallSphere supports 57+ languages with sub-second end-to-end latency.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Common pitfalls

  • Averaging everything: p95 is what customers feel.
  • Counting minutes, not outcomes: minutes do not pay the bills, resolutions do.
  • Ignoring hallucination rate: it is the single biggest trust killer.
  • Skipping version tags: you cannot prove a prompt improvement without them.
  • Dashboards nobody looks at: build alerts before dashboards.

FAQ

What is a good FCR for an AI voice agent?

60-80% for well-scoped verticals, lower for open-ended support.

How do I measure CSAT without a post-call survey?

Use the GPT-4o-mini satisfaction score on the transcript as a proxy, validated by periodic real surveys.

What is a reasonable answer-rate target?

95% for always-on agents; the unanswered remainder is usually config errors or carrier outages.

How do I avoid biasing the post-call LLM scorer?

Run it blind to agent version and spot-check with humans.

Can I compare my agent to humans directly?

Only against matched caller intents and with the same KPI definitions.

Next steps

Want a dashboard wired to real voice-agent KPIs? Book a demo, read the technology page, or see pricing.

#CallSphere #Analytics #KPIs #VoiceAI #Observability #Metrics #AIVoiceAgents

