
Voice AI Latency: Why Sub-Second Response Time Matters (And How to Hit It)

A technical breakdown of voice AI latency budgets — STT, LLM, TTS, network — and how to hit sub-second end-to-end response times.

The conversational cliff

Humans expect a reply within roughly 500-700ms in natural conversation. Push past one second and callers feel like they are talking to a computer. Push past two seconds and they start talking over the agent and abandoning the call. Latency is not a nice-to-have in voice AI; it is the single biggest determinant of whether the conversation feels real.

This post walks through the full latency budget for a modern voice agent and the techniques that get you reliably under one second.

total = network + vad + stt + llm_first_token + llm_reasoning + tts_first_frame + playback

Architecture overview

caller                                           time budget
  │
  ├─► network_in          ─────►  40ms
  ├─► VAD decision        ─────► 150ms
  ├─► STT partial         ─────► 150ms (overlaps VAD)
  ├─► LLM first token     ─────► 250ms
  ├─► LLM finish          ─────► 150ms (streams during TTS)
  ├─► TTS first audio     ─────► 120ms
  ├─► network_out         ─────►  40ms
  └─► speaker             ─────►
                             ─────────
                   total  →   ~750ms
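
As a sanity check, the serial path sums to the same total in a few lines of Python. This assumes one reading of the diagram: STT partials overlap VAD but the STT tail still serializes, while the LLM's remaining tokens stream during TTS and never block first playback.

BUDGET_MS = {
    "network_in": 40,
    "vad": 150,
    "stt_tail": 150,         # partials overlap VAD; only the tail serializes
    "llm_first_token": 250,
    # llm_finish streams during TTS, so it stays off the serial path
    "tts_first_frame": 120,
    "network_out": 40,
}
assert sum(BUDGET_MS.values()) == 750   # ~750ms to first audible audio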

Prerequisites

  • A working voice agent pipeline.
  • An OpenTelemetry tracing backend (Honeycomb, Tempo, Jaeger).
  • The ability to measure wall-clock times at every boundary.

Step-by-step walkthrough

1. Instrument every stage with spans

from opentelemetry import trace

tracer = trace.get_tracer("voice-agent")

async def handle_turn(audio_in):
    with tracer.start_as_current_span("turn"):
        with tracer.start_as_current_span("vad"):
            ...  # VAD end-of-turn decision
        with tracer.start_as_current_span("stt"):
            ...  # STT partials -> final transcript
        with tracer.start_as_current_span("llm_first_token"):
            ...  # time to first LLM token
        with tracer.start_as_current_span("tts_first_frame"):
            ...  # time to first synthesized audio frame

2. Stream everything

Never wait for a stage to finish before starting the next. STT should emit partials, the LLM should stream tokens, TTS should stream audio frames. The end-of-turn signal is the only blocking event.

flowchart LR
    REQ(["Request"])
    BATCH["Continuous batching<br/>vLLM scheduler"]
    PREF{"Prefill or<br/>decode?"}
    PRE["Prefill phase<br/>parallel attention"]
    DEC["Decode phase<br/>token by token"]
    KV[("Paged KV cache")]
    SAMP["Sampling<br/>top-p, temp"]
    STREAM["Stream tokens<br/>to client"]
    REQ --> BATCH --> PREF
    PREF -->|First token| PRE --> KV
    PREF -->|Next token| DEC
    KV --> DEC --> SAMP --> STREAM
    SAMP -->|EOS| DONE(["Response complete"])
    style BATCH fill:#4f46e5,stroke:#4338ca,color:#fff
    style KV fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style STREAM fill:#0ea5e9,stroke:#0369a1,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
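
The diagram above shows streaming inside the LLM server. At the pipeline level, the same idea is an asyncio overlap: consume LLM tokens as they arrive, flush to TTS at phrase boundaries, and play frames immediately. A minimal sketch, assuming placeholder coroutines stream_llm_tokens, synthesize_stream, and play_frame (none of these are a real SDK):

import asyncio

async def speak_response(prompt, play_frame):
    tokens: asyncio.Queue = asyncio.Queue()

    async def produce():
        async for token in stream_llm_tokens(prompt):   # LLM streams tokens
            await tokens.put(token)
        await tokens.put(None)                          # end-of-stream sentinel

    async def consume():
        buf = ""
        while (token := await tokens.get()) is not None:
            buf += token
            if buf.endswith((".", "!", "?", ",")):      # flush at phrase breaks
                async for frame in synthesize_stream(buf):  # TTS streams frames
                    await play_frame(frame)             # play immediately
                buf = ""
        if buf:                                         # flush any trailing text
            async for frame in synthesize_stream(buf):
                await play_frame(frame)

    await asyncio.gather(produce(), consume())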

3. Collapse the pipeline

The OpenAI Realtime API removes three network hops by doing STT, LLM, and TTS in one WebSocket. That alone saves 200-400ms versus a DIY stack of Whisper + GPT + ElevenLabs as separate HTTP calls.

ws.send(JSON.stringify({
  type: "session.update",
  session: {
    // Server-side VAD: 400ms of silence ends the caller's turn.
    turn_detection: { type: "server_vad", silence_duration_ms: 400 },
    input_audio_format: "pcm16",
    output_audio_format: "pcm16",
  },
}));

4. Prewarm everything

At call setup, open the Realtime WebSocket before the caller says "hello". The TLS handshake and model load dominate first-turn latency otherwise.

# Registry of pre-warmed sessions keyed by call SID.
sessions: dict = {}

async def on_incoming_ring(call_sid: str):
    session = await open_realtime_session()  # TLS + handshake now, not mid-call
    sessions[call_sid] = session

5. Keep tool calls off the hot path when possible

If a tool call takes >300ms, the agent should speak a filler ("let me pull that up") and stream it while the tool runs. The Realtime API makes this easy with response.create plus an instructions override.
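
A minimal sketch of that pattern over the Realtime WebSocket, assuming ws is an open connection and run_tool is your tool coroutine (both placeholders). response.create with an instructions override is a documented Realtime API event, but verify the exact payload shape against your API version:

import asyncio
import json

SLOW_TOOL_MS = 300

async def call_tool_with_filler(ws, run_tool, args: dict):
    task = asyncio.create_task(run_tool(**args))
    done, _ = await asyncio.wait({task}, timeout=SLOW_TOOL_MS / 1000)
    if not done:  # tool is slow: speak a filler while it keeps running
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"instructions": "Briefly tell the caller you are pulling that up."},
        }))
    return await task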

6. Measure p50, p95, and p99 separately

Average latency hides the calls that feel broken. Track percentiles per stage and alert on p95.
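
A stdlib-only sketch of percentile tracking (a production setup would use histogram metrics in your tracing backend instead):

from statistics import quantiles

def latency_percentiles(samples_ms: list[float]) -> dict[str, float]:
    cuts = quantiles(samples_ms, n=100)        # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

turn_latencies = [610, 640, 655, 700, 720, 735, 760, 790, 980, 1420]
print(latency_percentiles(turn_latencies))     # alert when p95 exceeds 1400ms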

Production considerations

  • Geography: keep the edge, the model, and the carrier in the same region. Cross-region adds 60-150ms.
  • Cold starts: if you run on serverless, warm pools are mandatory.
  • Network path: use private connectivity to your carrier if they offer it.
  • GC pauses: Node and Python both have them; profile under load.
  • Audio codec conversion: each resample costs 5-15ms. Do it once per direction (see the sketch after this list).
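
For the codec point above, a minimal single-pass resampler, assuming mono PCM16 input and linear interpolation (good enough for speech; a production stack might use a polyphase filter):

import numpy as np

def resample_pcm16(pcm: bytes, src_hz: int = 8000, dst_hz: int = 24000) -> bytes:
    # Resample once at ingress, then keep one format through the pipeline.
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32)
    if samples.size == 0:
        return b""
    n_out = int(samples.size * dst_hz / src_hz)
    x_src = np.arange(samples.size) / src_hz   # source timestamps in seconds
    x_dst = np.arange(n_out) / dst_hz          # target timestamps in seconds
    out = np.interp(x_dst, x_src, samples)     # linear interpolation
    return np.clip(out, -32768, 32767).astype(np.int16).tobytes()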

CallSphere's real implementation

CallSphere targets and maintains sub-one-second end-to-end response time across every production vertical. The voice plane runs on the OpenAI Realtime API with gpt-4o-realtime-preview-2025-06-03, PCM16 at 24kHz, and server VAD — a single WebSocket per call, pre-warmed at ring, terminated at a FastAPI edge co-located with Twilio's media region.

The multi-agent topologies — 14 tools for healthcare, 10 for real estate, 4 for salon, 7 for after-hours escalation, 10 plus RAG for IT helpdesk, and the 5-specialist ElevenLabs sales pod — are all orchestrated through the OpenAI Agents SDK. Handoffs between agents reuse the same session so there is no TLS renegotiation mid-call, and post-call analytics from a GPT-4o-mini pipeline run asynchronously so they never contend with the hot audio path. CallSphere supports 57+ languages within the same latency budget.

Common pitfalls

  • Buffering audio for "smoothing": it adds latency for negligible quality gain.
  • Running STT in a separate HTTP request: you lose streaming.
  • Serial tool calls: parallelize them when the arguments are independent.
  • Logging in the hot path: async log emit, never block.
  • Ignoring p99: the 5% of calls that feel broken are a 5% churn signal.

FAQ

What is a realistic target?

Under 1 second at p50, under 1.4 seconds at p95.

Does the LLM model size matter?

Yes, but less than you think. The Realtime API's gpt-4o variant is already tuned for low first-token latency.

How much does TLS handshake cost?

40-120ms the first time, free on reuse.

Is WebRTC faster than Twilio Media Streams?

Marginally, because WebRTC uses UDP. Twilio over WebSocket is still plenty fast for production.

Can I reduce latency by running a local model?

Only if your local model beats the Realtime API end-to-end, which is rarely true today.

Next steps

Want to measure latency on your current stack? Book a demo to see how CallSphere hits sub-second on live traffic, read the technology page, or compare pricing.

#CallSphere #Latency #VoiceAI #Performance #OpenAIRealtime #Observability #AIVoiceAgents
