
AI Voice Agent Analytics: The KPIs That Actually Matter

The 15 KPIs that matter for AI voice agent operations — from answer rate and FCR to cost per successful resolution.

If you are not measuring these, you are guessing

Voice agent dashboards tend to show whatever was easiest to build — total calls, total minutes, maybe sentiment. None of those tell you whether the agent is good at its job. This post lays out the 15 KPIs that actually matter for operating an AI voice agent and shows how to compute each one against a standard call log schema.

Every metric answers a question:
  • Did callers reach us?
  • Did the agent solve their problem?
  • How much did it cost?
  • Did anything go wrong?

Architecture overview

┌────────────────────┐
│ Voice agent runtime│
└─────────┬──────────┘
          │ call events
          ▼
┌────────────────────┐
│ calls table (OLTP) │
└─────────┬──────────┘
          │ CDC / copy
          ▼
┌────────────────────┐
│ analytics store    │
│ (ClickHouse / BQ)  │
└─────────┬──────────┘
          │
          ▼
┌────────────────────┐
│ dashboards + alerts│
└────────────────────┘

Prerequisites

  • A calls table with at minimum: call_id, started_at, ended_at, duration_sec, status, outcome, escalated, followup_required, language, cost_cents, agent_version.
  • A call_turns table with transcripts.
  • A call_events table (or enum column) with outcomes like resolved, escalated, abandoned.

The 15 KPIs

1. Answer rate

Percentage of inbound attempts that the agent actually picked up.

SELECT
  COUNT(*) FILTER (WHERE status = 'answered') * 1.0 / COUNT(*) AS answer_rate
FROM calls
WHERE started_at >= now() - interval '7 days';

2. Time to first word

How long from ring to the first syllable of the agent's greeting.
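
A sketch of the median, assuming call_turns stores a per-turn timestamp and a role column (neither is listed in the prerequisites, so rename to match your schema):

SELECT
  percentile_cont(0.5) WITHIN GROUP (
    ORDER BY extract(epoch FROM t.first_agent_turn_at - c.started_at)
  ) AS ttfw_p50_sec
FROM calls c
JOIN (
  SELECT call_id, MIN(created_at) AS first_agent_turn_at  -- created_at and role are assumed columns
  FROM call_turns
  WHERE role = 'agent'
  GROUP BY call_id
) t USING (call_id)
WHERE c.started_at >= now() - interval '7 days';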

3. Average handle time (AHT)
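
Mean talk time over answered calls, using the duration_sec and status columns from the prerequisites:

SELECT
  AVG(duration_sec) AS aht_sec
FROM calls
WHERE status = 'answered'
  AND started_at >= now() - interval '7 days';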

4. First-contact resolution (FCR)

SELECT
  COUNT(*) FILTER (WHERE outcome = 'resolved' AND NOT followup_required) * 1.0 / COUNT(*) AS fcr
FROM calls;

5. Escalation rate

6. Containment rate

Inverse of escalation — the percentage of calls fully handled by the agent.
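
Both fall out of the escalated boolean; containment is simply its complement:

SELECT
  AVG(escalated::int)     AS escalation_rate,
  1 - AVG(escalated::int) AS containment_rate
FROM calls
WHERE status = 'answered'
  AND started_at >= now() - interval '7 days';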

7. Abandon rate
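
A sketch, assuming abandoned calls are written to the outcome column as listed in the prerequisites:

SELECT
  COUNT(*) FILTER (WHERE outcome = 'abandoned') * 1.0 / COUNT(*) AS abandon_rate
FROM calls
WHERE started_at >= now() - interval '7 days';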

8. Booking rate (for scheduling verticals)
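
A sketch for scheduling verticals; 'booked' is an assumed outcome value, so substitute whatever your booking tool writes:

SELECT
  COUNT(*) FILTER (WHERE outcome = 'booked') * 1.0   -- 'booked' is an assumed outcome value
    / NULLIF(COUNT(*) FILTER (WHERE status = 'answered'), 0) AS booking_rate
FROM calls
WHERE started_at >= now() - interval '7 days';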

9. Sentiment score

Aggregate from the post-call pipeline.

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

10. Cost per successful resolution

SELECT
  SUM(cost_cents) / NULLIF(SUM(CASE WHEN outcome = 'resolved' THEN 1 ELSE 0 END), 0) AS cpsr
FROM calls;

11. STT word error rate (WER)

Sample 1% of calls, have humans transcribe them, and compare against the STT transcript.

12. Tool call success rate
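
The prerequisites do not define tool-call logging, so this sketch assumes call_events also records one row per tool invocation with an event_type, a success flag, and a timestamp:

SELECT
  AVG(success::int) AS tool_call_success_rate   -- event_type and success are assumed columns
FROM call_events
WHERE event_type = 'tool_call'
  AND occurred_at >= now() - interval '7 days';  -- assumed timestamp column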

13. Hallucination flag rate

From the post-call QA pipeline.

14. CSAT (when available)

15. Latency p95
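
A sketch, assuming call_turns carries the agent's per-turn response latency in milliseconds (a column the prerequisites do not define):

SELECT
  percentile_cont(0.95) WITHIN GROUP (ORDER BY response_latency_ms) AS latency_p95_ms
FROM call_turns
WHERE created_at >= now() - interval '7 days';  -- response_latency_ms and created_at are assumed columns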

Step-by-step walkthrough

1. Standardize the call log schema

CREATE TABLE calls (
  call_id TEXT PRIMARY KEY,
  started_at TIMESTAMPTZ NOT NULL,
  ended_at TIMESTAMPTZ,
  duration_sec INT,
  status TEXT NOT NULL,
  outcome TEXT,
  escalated BOOLEAN DEFAULT FALSE,
  followup_required BOOLEAN DEFAULT FALSE,
  language TEXT,
  cost_cents INT,
  agent_version TEXT
);

2. Compute metrics in batches

Run a 5-minute rollup job for dashboards and an hourly rollup for historical trends.
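
A minimal sketch of the 5-minute rollup that feeds the kpi_rollup table read by the admin UI below; every column beyond period_start and period_end is an assumption:

INSERT INTO kpi_rollup (period_start, period_end, answer_rate, escalation_rate, fcr, aht_sec)
SELECT
  date_trunc('minute', now()) - interval '5 minutes',
  date_trunc('minute', now()),
  COUNT(*) FILTER (WHERE status = 'answered') * 1.0 / NULLIF(COUNT(*), 0),
  AVG(escalated::int),
  COUNT(*) FILTER (WHERE outcome = 'resolved' AND NOT followup_required) * 1.0 / NULLIF(COUNT(*), 0),
  AVG(duration_sec) FILTER (WHERE status = 'answered')
FROM calls
WHERE started_at >= now() - interval '5 minutes';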

3. Set SLOs and alert on p95
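
A sketch of the check an alerting job can poll, assuming the rollup also stores a latency_p95_ms column and a 1.5-second SLO:

SELECT period_start, latency_p95_ms
FROM kpi_rollup
WHERE period_end >= now() - interval '15 minutes'
  AND latency_p95_ms > 1500;  -- assumed column and threshold; any rows returned should page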

4. Expose the metrics in an admin UI

// Return every rollup row in the requested window.
// A date range usually spans many rollup rows, so use manyOrNone (zero or more) rather than oneOrNone.
async function fetchKpis(from: string, to: string) {
  return await db.manyOrNone(
    "SELECT * FROM kpi_rollup WHERE period_start >= $1 AND period_end <= $2",
    [from, to],
  );
}

5. Build an evaluation harness

Take real calls, mask PII, and replay them against a staging agent to compare FCR and AHT across prompt versions.
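
With agent_version recorded on every call, the comparison is a plain GROUP BY over the same schema:

SELECT
  agent_version,
  COUNT(*) FILTER (WHERE outcome = 'resolved' AND NOT followup_required) * 1.0 / COUNT(*) AS fcr,
  AVG(duration_sec) AS aht_sec
FROM calls
GROUP BY agent_version
ORDER BY agent_version;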

Production considerations

  • Sampling: WER and hallucination checks need human labelers; sample, do not inspect all.
  • Cost attribution: Realtime API + TTS + Twilio + STT all contribute; track separately.
  • Version pinning: record which agent version handled each call for A/B comparisons.
  • PII in dashboards: mask caller IDs and names at the dashboard layer.
  • Retention: raw transcripts are sensitive; delete or tokenize after 30-90 days depending on vertical.

CallSphere's real implementation

CallSphere runs a GPT-4o-mini post-call analytics pipeline that writes sentiment, intent, lead score, satisfaction, and escalation flags into per-vertical Postgres databases. Those columns feed the 15 KPIs above in an admin dashboard every customer gets access to. The live voice plane runs the OpenAI Realtime API (gpt-4o-realtime-preview-2025-06-03) at 24kHz PCM16 with server VAD.

Across 14 healthcare tools, 10 real estate agents, 4 salon agents, 7 after-hours escalation tools, 10-plus RAG IT helpdesk tools, and the 5-specialist ElevenLabs sales pod, KPIs are computed identically so customers can compare performance across verticals. The OpenAI Agents SDK orchestrates handoffs. CallSphere supports 57+ languages with sub-second end-to-end latency.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Common pitfalls

  • Averaging everything: p95 is what customers feel.
  • Counting minutes, not outcomes: minutes do not pay the bills, resolutions do.
  • Ignoring hallucination rate: it is the single biggest trust killer.
  • Skipping version tags: you cannot prove a prompt improvement without them.
  • Dashboards nobody looks at: build alerts before dashboards.

FAQ

What is a good FCR for an AI voice agent?

60-80% for well-scoped verticals, lower for open-ended support.

How do I measure CSAT without a post-call survey?

Use the GPT-4o-mini satisfaction score on the transcript as a proxy, validated by periodic real surveys.

What is a reasonable answer-rate target?

95% for always-on agents; the unanswered remainder is usually config errors or carrier outages.

How do I avoid biasing the post-call LLM scorer?

Run it blind to agent version and spot-check with humans.

Can I compare my agent to humans directly?

Only against matched caller intents and with the same KPI definitions.

Next steps

Want a dashboard wired to real voice-agent KPIs? Book a demo, read the technology page, or see pricing.

#CallSphere #Analytics #KPIs #VoiceAI #Observability #Metrics #AIVoiceAgents

