
Latency Benchmarking for AI Agents: Measuring Time-to-First-Token and Total Response Time

A hands-on guide to measuring AI agent latency at every stage of the pipeline, from time-to-first-token through tool execution to total response time, with percentile reporting and SLA compliance tracking.

Why Latency Matters More Than You Think

Users tolerate a slow webpage for a few seconds, but they abandon a slow conversational agent in moments. Research consistently shows that perceived agent intelligence drops as response time grows: the same answer feels smarter delivered in 800 milliseconds than in 8 seconds. For AI agents, latency is not just a performance metric. It directly shapes perceived quality.

Agent latency is also more complex than web latency. A single response might involve an LLM call, two tool executions, another LLM call to synthesize results, and a final formatting step. You need to measure each segment independently to know where your time goes.

Defining Measurement Points

Agent latency has multiple stages. Instrument each one separately. At the model-serving layer, time-to-first-token is dominated by the prefill phase, while the decode loop paces the rest of the stream; on top of that sit your own preprocessing, tool-execution, and postprocessing stages.

flowchart LR
    REQ(["Request"])
    BATCH["Continuous batching<br/>vLLM scheduler"]
    PREF{"Prefill or<br/>decode?"}
    PRE["Prefill phase<br/>parallel attention"]
    DEC["Decode phase<br/>token by token"]
    KV[("Paged KV cache")]
    SAMP["Sampling<br/>top-p, temp"]
    STREAM["Stream tokens<br/>to client"]
    REQ --> BATCH --> PREF
    PREF -->|First token| PRE --> KV
    PREF -->|Next token| DEC
    KV --> DEC --> SAMP --> STREAM
    SAMP -->|EOS| DONE(["Response complete"])
    style BATCH fill:#4f46e5,stroke:#4338ca,color:#fff
    style KV fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style STREAM fill:#0ea5e9,stroke:#0369a1,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff

The tracer below records one measurement per stage of your own pipeline:

import time
from dataclasses import dataclass, field
from typing import Optional
from enum import Enum

class LatencyStage(Enum):
    PREPROCESSING = "preprocessing"
    LLM_FIRST_TOKEN = "llm_first_token"
    LLM_COMPLETE = "llm_complete"
    TOOL_EXECUTION = "tool_execution"
    POSTPROCESSING = "postprocessing"
    TOTAL = "total"

@dataclass
class LatencyMeasurement:
    stage: LatencyStage
    duration_ms: float
    metadata: dict = field(default_factory=dict)

class LatencyTracer:
    def __init__(self):
        self.measurements: list[LatencyMeasurement] = []
        self._timers: dict[str, float] = {}
        self._total_start: Optional[float] = None

    def start_total(self):
        self._total_start = time.perf_counter()

    def start(self, stage: LatencyStage):
        self._timers[stage.value] = time.perf_counter()

    def stop(
        self, stage: LatencyStage, metadata: Optional[dict] = None
    ):
        if stage.value not in self._timers:
            return
        elapsed = (
            time.perf_counter() - self._timers[stage.value]
        ) * 1000
        self.measurements.append(LatencyMeasurement(
            stage=stage,
            duration_ms=round(elapsed, 2),
            metadata=metadata or {},
        ))
        del self._timers[stage.value]

    def stop_total(self):
        if self._total_start is None:
            return
        elapsed = (
            time.perf_counter() - self._total_start
        ) * 1000
        self.measurements.append(LatencyMeasurement(
            stage=LatencyStage.TOTAL,
            duration_ms=round(elapsed, 2),
        ))

    def summary(self) -> dict[str, float]:
        return {
            m.stage.value: m.duration_ms
            for m in self.measurements
        }

Use time.perf_counter() rather than time.time() — it provides monotonic, high-resolution timing that is not affected by system clock adjustments.
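
Wiring the tracer into an agent turn is straightforward. The sketch below is illustrative: build_prompt, run_llm, and run_tool are hypothetical stand-ins for your own prompt rendering, model call, and tool call.

tracer = LatencyTracer()

async def traced_agent_turn(user_message: str) -> dict:
    tracer.start_total()

    tracer.start(LatencyStage.PREPROCESSING)
    prompt = build_prompt(user_message)  # hypothetical template rendering
    tracer.stop(LatencyStage.PREPROCESSING)

    tracer.start(LatencyStage.LLM_COMPLETE)
    plan = await run_llm(prompt)  # hypothetical model call
    tracer.stop(LatencyStage.LLM_COMPLETE, metadata={"model": "gpt-4o"})

    tracer.start(LatencyStage.TOOL_EXECUTION)
    tool_result = await run_tool(plan)  # hypothetical tool call
    tracer.stop(LatencyStage.TOOL_EXECUTION)

    tracer.stop_total()
    return tracer.summary()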


Measuring Time-to-First-Token

Time-to-first-token (TTFT) is the most important latency metric for user experience. It determines how long the user stares at a blank screen before seeing any response.

import asyncio

async def measure_ttft(
    llm_client,
    messages: list[dict],
    model: str = "gpt-4o",
) -> dict:
    start = time.perf_counter()
    first_token_time = None
    full_response = []

    stream = await llm_client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )

    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_time is None:
                first_token_time = time.perf_counter()
            full_response.append(
                chunk.choices[0].delta.content
            )

    end = time.perf_counter()
    ttft = (
        (first_token_time - start) * 1000
        if first_token_time
        else None
    )

    return {
        "ttft_ms": round(ttft, 2) if ttft is not None else None,
        "total_ms": round((end - start) * 1000, 2),
        # Chunk count approximates token count: streamed deltas usually carry one token each.
        "token_count": len(full_response),
        "tokens_per_second": round(
            len(full_response) / (end - start), 1
        ) if end > start else 0,
    }

TTFT under 500 milliseconds feels instant to users. Between 500ms and 1500ms is noticeable but acceptable. Above 2 seconds, you need a loading indicator or progressive streaming to maintain engagement.
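
A minimal driver for the function above, assuming the official openai Python package (its AsyncOpenAI client reads OPENAI_API_KEY from the environment):

from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI()
    result = await measure_ttft(
        client,
        messages=[{"role": "user", "content": "Summarize our refund policy in two sentences."}],
    )
    print(result)

asyncio.run(main())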

Percentile-Based Reporting

Averages hide the worst experiences. Report latency using percentiles.

import statistics
from typing import Sequence

def latency_percentiles(
    measurements_ms: Sequence[float],
) -> dict:
    if not measurements_ms:
        return {}

    sorted_ms = sorted(measurements_ms)
    n = len(sorted_ms)

    def percentile(p: float) -> float:
        # Index-based percentile: scale p into the sorted list, clamp to the last element.
        idx = int(p / 100 * n)
        idx = min(idx, n - 1)
        return round(sorted_ms[idx], 2)

    return {
        "count": n,
        "p50": percentile(50),
        "p75": percentile(75),
        "p90": percentile(90),
        "p95": percentile(95),
        "p99": percentile(99),
        "mean": round(statistics.mean(sorted_ms), 2),
        "stdev": round(
            statistics.stdev(sorted_ms), 2
        ) if n > 1 else 0.0,
        "min": round(sorted_ms[0], 2),
        "max": round(sorted_ms[-1], 2),
    }

Focus your SLA on p95 or p99, not the mean. If your p50 is 400ms but your p99 is 12 seconds, one in a hundred users is having a terrible experience and your average hides it completely.
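
To see why, feed the helper a synthetic distribution with one bad tail request; the values below are made up.

# Illustrative only: 99 fast requests and a single 12-second outlier.
samples = [400.0] * 99 + [12_000.0]
print(latency_percentiles(samples))
# p50 stays at 400 ms and the mean lands around 516 ms, but p99 and max expose the 12 s request.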

SLA Compliance Tracking

Define latency SLAs per operation type and track compliance rates.

@dataclass
class LatencySLA:
    operation: str
    target_ms: float
    percentile: float  # e.g., 95.0 for p95

class SLATracker:
    def __init__(self):
        self.slas: list[LatencySLA] = []
        self.measurements: dict[str, list[float]] = {}

    def register_sla(self, sla: LatencySLA):
        self.slas.append(sla)
        self.measurements.setdefault(sla.operation, [])

    def record(self, operation: str, latency_ms: float):
        if operation in self.measurements:
            self.measurements[operation].append(latency_ms)

    def compliance_report(self) -> list[dict]:
        report = []
        for sla in self.slas:
            data = self.measurements.get(sla.operation, [])
            if not data:
                report.append({
                    "operation": sla.operation,
                    "status": "no_data",
                })
                continue

            percs = latency_percentiles(data)
            p_key = f"p{int(sla.percentile)}"
            actual = percs.get(p_key, 0)
            compliant = actual <= sla.target_ms

            report.append({
                "operation": sla.operation,
                "sla_target_ms": sla.target_ms,
                "sla_percentile": sla.percentile,
                "actual_ms": actual,
                "compliant": compliant,
                "margin_ms": round(sla.target_ms - actual, 2),
                "sample_count": len(data),
            })
        return report

When compliance margin turns negative, you know exactly which operation is breaching its SLA and by how much. This drives targeted optimization rather than guessing.
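
A short usage sketch makes the report concrete; the operation names, targets, and latency values below are illustrative.

tracker = SLATracker()
tracker.register_sla(LatencySLA(operation="chat_response", target_ms=800.0, percentile=95.0))
tracker.register_sla(LatencySLA(operation="tool_call", target_ms=300.0, percentile=99.0))

# Record production measurements as they arrive.
for latency in (420.0, 510.0, 760.0, 1240.0):
    tracker.record("chat_response", latency)

for row in tracker.compliance_report():
    print(row)
# chat_response breaches its p95 target (negative margin); tool_call reports no_data.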


Common Latency Optimization Strategies

Once you know where your time goes, apply targeted fixes. Preprocessing overhead can often be reduced by caching prompt templates. Tool execution latency drops with parallel tool calls when tools are independent. LLM latency improves with shorter prompts, smaller models for simple tasks, or prompt caching features offered by providers.
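
For example, when two tool calls do not depend on each other's output, running them concurrently reduces the tool-execution stage to the slower of the two rather than their sum. A minimal sketch, assuming hypothetical async tools fetch_weather and lookup_booking:

import asyncio

async def run_independent_tools(city: str, booking_id: str) -> dict:
    # Both coroutines run concurrently; total latency is roughly the max of the two, not the sum.
    weather, booking = await asyncio.gather(
        fetch_weather(city),         # hypothetical async tool
        lookup_booking(booking_id),  # hypothetical async tool
    )
    return {"weather": weather, "booking": booking}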

FAQ

What is a reasonable TTFT target for a production AI agent?

For chat-based agents, target a TTFT under 800 milliseconds at p95. For voice agents, you need under 500 milliseconds to feel conversational. If your agent uses tool calls before responding, consider sending a "thinking" indicator while tools execute, then stream the final answer. Users tolerate delays better when they see progress.

Should I measure latency in my evaluation pipeline or in production?

Both, but they measure different things. Evaluation pipeline latency tells you how fast the model and tools can run under controlled conditions. Production latency includes network hops, load balancer overhead, queue wait times, and contention from concurrent requests. Your evaluation pipeline sets a floor, and production metrics tell you how far above that floor you actually are.

How do I handle latency spikes from upstream LLM providers?

Implement circuit breakers with fallback models. If your primary model's latency exceeds a threshold for three consecutive requests, route to a faster fallback model. Track provider latency separately from your own processing time so you can distinguish between problems you can fix and problems you need to route around.
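
A minimal sketch of that pattern is below; the threshold, trip count, and model names are illustrative, and a production version would also need a cool-down period before retrying the primary model.

class LatencyCircuitBreaker:
    def __init__(self, threshold_ms: float = 3000.0, trip_after: int = 3):
        self.threshold_ms = threshold_ms
        self.trip_after = trip_after      # consecutive slow requests before tripping
        self._consecutive_slow = 0
        self.open = False                 # when open, route to the fallback model

    def record(self, latency_ms: float) -> None:
        if latency_ms > self.threshold_ms:
            self._consecutive_slow += 1
            self.open = self._consecutive_slow >= self.trip_after
        else:
            self._consecutive_slow = 0
            self.open = False

    def pick_model(self, primary: str = "gpt-4o", fallback: str = "gpt-4o-mini") -> str:
        return fallback if self.open else primary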


#Latency #Performance #Benchmarking #SLA #Python #AgenticAI #LearnAI #AIEngineering
