
Latency Benchmarking for AI Agents: Measuring Time-to-First-Token and Total Response Time

A hands-on guide to measuring AI agent latency at every stage of the pipeline, from time-to-first-token through tool execution to total response time, with percentile reporting and SLA compliance tracking.

Why Latency Matters More Than You Think

Users tolerate a slow webpage for a few seconds, but they abandon a slow conversational agent in moments. Research consistently shows that perceived agent intelligence drops as response time grows: the same answer feels smarter delivered in 800 milliseconds than in 8 seconds. For AI agents, latency is not just a performance metric. It directly shapes perceived quality.

Agent latency is also more complex than web latency. A single response might involve an LLM call, two tool executions, another LLM call to synthesize results, and a final formatting step. You need to measure each segment independently to know where your time goes.

Defining Measurement Points

Agent latency has multiple stages. Instrument each one separately. At the model-serving layer, time-to-first-token is dominated by the prefill phase, while the decode loop paces the rest of the stream; on top of that sit your own preprocessing, tool-execution, and postprocessing stages.

flowchart LR
    REQ(["Request"])
    BATCH["Continuous batching<br/>vLLM scheduler"]
    PREF{"Prefill or<br/>decode?"}
    PRE["Prefill phase<br/>parallel attention"]
    DEC["Decode phase<br/>token by token"]
    KV[("Paged KV cache")]
    SAMP["Sampling<br/>top-p, temp"]
    STREAM["Stream tokens<br/>to client"]
    REQ --> BATCH --> PREF
    PREF -->|First token| PRE --> KV
    PREF -->|Next token| DEC
    KV --> DEC --> SAMP --> STREAM
    SAMP -->|EOS| DONE(["Response complete"])
    style BATCH fill:#4f46e5,stroke:#4338ca,color:#fff
    style KV fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style STREAM fill:#0ea5e9,stroke:#0369a1,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff

The tracer below records one measurement per stage of your own pipeline:

import time
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum

class LatencyStage(Enum):
    PREPROCESSING = "preprocessing"
    LLM_FIRST_TOKEN = "llm_first_token"
    LLM_COMPLETE = "llm_complete"
    TOOL_EXECUTION = "tool_execution"
    POSTPROCESSING = "postprocessing"
    TOTAL = "total"

@dataclass
class LatencyMeasurement:
    stage: LatencyStage
    duration_ms: float
    metadata: dict = field(default_factory=dict)

class LatencyTracer:
    def __init__(self):
        self.measurements: list[LatencyMeasurement] = []
        self._timers: dict[str, float] = {}
        self._total_start: Optional[float] = None

    def start_total(self):
        self._total_start = time.perf_counter()

    def start(self, stage: LatencyStage):
        self._timers[stage.value] = time.perf_counter()

    def stop(
        self, stage: LatencyStage, metadata: Optional[dict] = None
    ):
        if stage.value not in self._timers:
            return
        elapsed = (
            time.perf_counter() - self._timers[stage.value]
        ) * 1000
        self.measurements.append(LatencyMeasurement(
            stage=stage,
            duration_ms=round(elapsed, 2),
            metadata=metadata or {},
        ))
        del self._timers[stage.value]

    def stop_total(self):
        if self._total_start is None:
            return
        elapsed = (
            time.perf_counter() - self._total_start
        ) * 1000
        self.measurements.append(LatencyMeasurement(
            stage=LatencyStage.TOTAL,
            duration_ms=round(elapsed, 2),
        ))

    def summary(self) -> dict[str, float]:
        return {
            m.stage.value: m.duration_ms
            for m in self.measurements
        }

Use time.perf_counter() rather than time.time() — it provides monotonic, high-resolution timing that is not affected by system clock adjustments.
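
Wiring the tracer into an agent turn is straightforward. The sketch below is illustrative: build_prompt, run_llm, and run_tool are hypothetical stand-ins for your own prompt rendering, model call, and tool call.

tracer = LatencyTracer()

async def traced_agent_turn(user_message: str) -> dict:
    tracer.start_total()

    tracer.start(LatencyStage.PREPROCESSING)
    prompt = build_prompt(user_message)  # hypothetical template rendering
    tracer.stop(LatencyStage.PREPROCESSING)

    tracer.start(LatencyStage.LLM_COMPLETE)
    plan = await run_llm(prompt)  # hypothetical model call
    tracer.stop(LatencyStage.LLM_COMPLETE, metadata={"model": "gpt-4o"})

    tracer.start(LatencyStage.TOOL_EXECUTION)
    tool_result = await run_tool(plan)  # hypothetical tool call
    tracer.stop(LatencyStage.TOOL_EXECUTION)

    tracer.stop_total()
    return tracer.summary()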


Measuring Time-to-First-Token

Time-to-first-token (TTFT) is the most important latency metric for user experience. It determines how long the user stares at a blank screen before seeing any response.

import asyncio

async def measure_ttft(
    llm_client,
    messages: list[dict],
    model: str = "gpt-4o",
) -> dict:
    start = time.perf_counter()
    first_token_time = None
    full_response = []

    stream = await llm_client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )

    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            full_response.append(
                chunk.choices[0].delta.content
            )

    end = time.perf_counter()
    ttft = (
        (first_token_time - start) * 1000
        if first_token_time
        else None
    )

    return {
        "ttft_ms": round(ttft, 2) if ttft is not None else None,
        "total_ms": round((end - start) * 1000, 2),
        # Chunk count approximates token count: streamed deltas usually carry one token each.
        "token_count": len(full_response),
        "tokens_per_second": round(
            len(full_response) / (end - start), 1
        ) if end > start else 0,
    }

TTFT under 500 milliseconds feels instant to users. Between 500ms and 1500ms is noticeable but acceptable. Above 2 seconds, you need a loading indicator or progressive streaming to maintain engagement.
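
A minimal driver for the function above, assuming the official openai Python package (its AsyncOpenAI client reads OPENAI_API_KEY from the environment):

from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI()
    result = await measure_ttft(
        client,
        messages=[{"role": "user", "content": "Summarize our refund policy in two sentences."}],
    )
    print(result)

asyncio.run(main())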

Percentile-Based Reporting

Averages hide the worst experiences. Report latency using percentiles.

import statistics
from typing import Sequence

def latency_percentiles(
    measurements_ms: Sequence[float],
) -> dict:
    if not measurements_ms:
        return {}

    sorted_ms = sorted(measurements_ms)
    n = len(sorted_ms)

    def percentile(p: float) -> float:
        # Index-based percentile: scale p into the sorted list, clamp to the last element.
        idx = int(p / 100 * n)
        idx = min(idx, n - 1)
        return round(sorted_ms[idx], 2)

    return {
        "count": n,
        "p50": percentile(50),
        "p75": percentile(75),
        "p90": percentile(90),
        "p95": percentile(95),
        "p99": percentile(99),
        "mean": round(statistics.mean(sorted_ms), 2),
        "stdev": round(
            statistics.stdev(sorted_ms), 2
        ) if n > 1 else 0.0,
        "min": round(sorted_ms[0], 2),
        "max": round(sorted_ms[-1], 2),
    }

Focus your SLA on p95 or p99, not the mean. If your p50 is 400ms but your p99 is 12 seconds, one in a hundred users is having a terrible experience and your average hides it completely.
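
To see why, feed the helper a synthetic distribution with one bad tail request; the values below are made up.

# Illustrative only: 99 fast requests and a single 12-second outlier.
samples = [400.0] * 99 + [12_000.0]
print(latency_percentiles(samples))
# p50 stays at 400 ms and the mean lands around 516 ms, but p99 and max expose the 12 s request.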

SLA Compliance Tracking

Define latency SLAs per operation type and track compliance rates.

@dataclass
class LatencySLA:
    operation: str
    target_ms: float
    percentile: float  # e.g., 95.0 for p95

class SLATracker:
    def __init__(self):
        self.slas: list[LatencySLA] = []
        self.measurements: dict[str, list[float]] = {}

    def register_sla(self, sla: LatencySLA):
        self.slas.append(sla)
        self.measurements.setdefault(sla.operation, [])

    def record(self, operation: str, latency_ms: float):
        if operation in self.measurements:
            self.measurements[operation].append(latency_ms)

    def compliance_report(self) -> list[dict]:
        report = []
        for sla in self.slas:
            data = self.measurements.get(sla.operation, [])
            if not data:
                report.append({
                    "operation": sla.operation,
                    "status": "no_data",
                })
                continue

            percs = latency_percentiles(data)
            p_key = f"p{int(sla.percentile)}"
            actual = percs.get(p_key, 0)
            compliant = actual <= sla.target_ms

            report.append({
                "operation": sla.operation,
                "sla_target_ms": sla.target_ms,
                "sla_percentile": sla.percentile,
                "actual_ms": actual,
                "compliant": compliant,
                "margin_ms": round(sla.target_ms - actual, 2),
                "sample_count": len(data),
            })
        return report

When compliance margin turns negative, you know exactly which operation is breaching its SLA and by how much. This drives targeted optimization rather than guessing.
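
A short usage sketch makes the report concrete; the operation names, targets, and latency values below are illustrative.

tracker = SLATracker()
tracker.register_sla(LatencySLA(operation="chat_response", target_ms=800.0, percentile=95.0))
tracker.register_sla(LatencySLA(operation="tool_call", target_ms=300.0, percentile=99.0))

# Record production measurements as they arrive.
for latency in (420.0, 510.0, 760.0, 1240.0):
    tracker.record("chat_response", latency)

for row in tracker.compliance_report():
    print(row)
# chat_response breaches its p95 target (negative margin); tool_call reports no_data.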


Common Latency Optimization Strategies

Once you know where your time goes, apply targeted fixes. Preprocessing overhead can often be reduced by caching prompt templates. Tool execution latency drops with parallel tool calls when tools are independent. LLM latency improves with shorter prompts, smaller models for simple tasks, or prompt caching features offered by providers.
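
For example, when two tool calls do not depend on each other's output, running them concurrently reduces the tool-execution stage to the slower of the two rather than their sum. A minimal sketch, assuming hypothetical async tools fetch_weather and lookup_booking:

import asyncio

async def run_independent_tools(city: str, booking_id: str) -> dict:
    # Both coroutines run concurrently; total latency is roughly the max of the two, not the sum.
    weather, booking = await asyncio.gather(
        fetch_weather(city),         # hypothetical async tool
        lookup_booking(booking_id),  # hypothetical async tool
    )
    return {"weather": weather, "booking": booking}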

FAQ

What is a reasonable TTFT target for a production AI agent?

For chat-based agents, target a TTFT under 800 milliseconds at p95. For voice agents, you need under 500 milliseconds to feel conversational. If your agent uses tool calls before responding, consider sending a "thinking" indicator while tools execute, then stream the final answer. Users tolerate delays better when they see progress.

Should I measure latency in my evaluation pipeline or in production?

Both, but they measure different things. Evaluation pipeline latency tells you how fast the model and tools can run under controlled conditions. Production latency includes network hops, load balancer overhead, queue wait times, and contention from concurrent requests. Your evaluation pipeline sets a floor, and production metrics tell you how far above that floor you actually are.

How do I handle latency spikes from upstream LLM providers?

Implement circuit breakers with fallback models. If your primary model's latency exceeds a threshold for three consecutive requests, route to a faster fallback model. Track provider latency separately from your own processing time so you can distinguish between problems you can fix and problems you need to route around.
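
A minimal sketch of that pattern is below; the threshold, trip count, and model names are illustrative, and a production version would also need a cool-down period before retrying the primary model.

class LatencyCircuitBreaker:
    def __init__(self, threshold_ms: float = 3000.0, trip_after: int = 3):
        self.threshold_ms = threshold_ms
        self.trip_after = trip_after      # consecutive slow requests before tripping
        self._consecutive_slow = 0
        self.open = False                 # when open, route to the fallback model

    def record(self, latency_ms: float) -> None:
        if latency_ms > self.threshold_ms:
            self._consecutive_slow += 1
            self.open = self._consecutive_slow >= self.trip_after
        else:
            self._consecutive_slow = 0
            self.open = False

    def pick_model(self, primary: str = "gpt-4o", fallback: str = "gpt-4o-mini") -> str:
        return fallback if self.open else primary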


#Latency #Performance #Benchmarking #SLA #Python #AgenticAI #LearnAI #AIEngineering
