Model Latency Profiles by Provider: TTFT, TPS, and p99 in 2026
Headline tokens-per-second numbers hide what matters. Here are the 2026 latency profiles by provider (TTFT, TPS, and p99) you need for production planning.
What Latency Numbers Actually Matter
Three numbers per LLM provider:
- TTFT (Time to First Token): how long until generation starts
- TPS (Tokens Per Second): throughput once generation starts
- p99 latency: tail latency under load
Headline benchmarks usually publish only TPS at low concurrency. Production planning needs all three at realistic load.
The Three Metrics
```mermaid
flowchart LR
    Req[Request] --> TTFT[TTFT: ms to first token]
    TTFT --> Gen[Generation at TPS]
    Gen --> Done[Done]
    Spike[Tail load] --> P99[p99: how slow does it get under stress]
```
For UX, TTFT often matters more than TPS. A user who sees their first word in 200ms and the rest streamed will feel served; a user who waits 2 seconds for nothing then gets the full reply will not.
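The simplest way to get honest numbers is to time a streaming response yourself. A minimal harness, assuming a token iterator (here a simulated stream; in practice, the chunks from your provider's streaming API):

```python
import time

def measure_stream(token_iter):
    """Consume a streaming token iterator; return (ttft_s, tps).

    TTFT = wall-clock delay until the first token arrives.
    TPS  = tokens per second over the generation phase
           (first token to last token).
    """
    start = time.monotonic()
    first = None
    count = 0
    for _ in token_iter:
        if first is None:
            first = time.monotonic()
        count += 1
    end = time.monotonic()
    if first is None:
        return float("inf"), 0.0
    ttft = first - start
    gen_time = end - first
    tps = (count - 1) / gen_time if gen_time > 0 else float("inf")
    return ttft, tps

def fake_stream(n_tokens=20, ttft_s=0.05, gap_s=0.01):
    """Simulated provider stream for exercising the harness."""
    time.sleep(ttft_s)
    yield "tok"
    for _ in range(n_tokens - 1):
        time.sleep(gap_s)
        yield "tok"
```

Point `measure_stream` at your real streaming call and log both numbers per request; averaging only TPS hides exactly the tail behavior this article is about.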
April 2026 Approximate Numbers
For typical mid-tier models under moderate load:
| Provider | TTFT (ms) | TPS | p99 latency for 500-token response |
|---|---|---|---|
| OpenAI GPT-5 | 200-400 | 60-100 | 8-12s |
| OpenAI GPT-5-mini | 150-300 | 100-150 | 5-8s |
| Anthropic Sonnet 4.6 | 200-500 | 50-100 | 7-12s |
| Anthropic Haiku 4.5 | 100-250 | 100-180 | 4-7s |
| Gemini 2.5 Flash | 100-250 | 100-200 | 4-7s |
| Open-weights via Together | 150-400 | 80-150 | 5-10s |
These shift with load and region. Run your own benchmarks.
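When you do benchmark, report tail latency from raw samples rather than averages. A nearest-rank percentile is enough; the simulated latencies below are illustrative, not measurements:

```python
import math
import random

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at
    least fraction p of all samples are <= it."""
    s = sorted(samples)
    k = max(0, math.ceil(p * len(s)) - 1)
    return s[k]

# Simulated latencies (ms) for 500-token responses: mostly ~6s,
# with a heavy tail under load.
random.seed(0)
latencies = [random.gauss(6000, 800) for _ in range(950)]
latencies += [random.uniform(9000, 14000) for _ in range(50)]

print(f"mean = {sum(latencies) / len(latencies):.0f} ms")
print(f"p50  = {percentile(latencies, 0.50):.0f} ms")
print(f"p99  = {percentile(latencies, 0.99):.0f} ms")
```

With a tail like this, the mean sits near p50 while p99 lands far above both, which is why the table's p99 column looks so different from "average latency" marketing numbers.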
What Affects TTFT
- Region routing
- Cold-start vs warm
- Prompt length (prefill is part of TTFT)
- Cache hit (cached prefix has lower TTFT)
- Model size
What Affects TPS
- Model size
- Inference hardware
- Batch composition (your request shares with others)
- Speculative decoding (improves TPS substantially when enabled)
What Affects p99
- Provider's load shedding policies
- Your account's rate limit headroom
- Time of day (peak vs off-peak)
- Specific feature flags (reasoning mode is much slower)
Optimizing for TTFT
Since TTFT often dominates UX:
- Region-pin requests
- Use caching aggressively
- Pre-warm connections
- Pick models with lower TTFT for latency-sensitive paths
Optimizing for TPS
When generating long responses:
- Pick models with native fast generation (smaller, optimized)
- Stream output to UI
- Truncate output length where possible
- Use speculative decoding if available
Optimizing for p99
The hardest. Approaches:
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
- Reserved capacity / committed throughput tiers
- Cross-provider failover for critical paths
- Backoff and retry with intelligent budget
- Dedicated capacity for premium customers
For p99 to be reliable, you typically need to pay for it (reserved capacity).
Latency Across Modalities
Voice has tight budgets:
- Realtime API TTFB: 200-400ms typical
- TTS streaming: 30-100ms first audio
- Together: 300-500ms perceived latency
Voice latency engineering is its own discipline.
What's Hidden
- Token counting differs by provider (your bill and the response length matter)
- Streaming behaviors differ (chunk size, frequency)
- Reasoning models show progress slowly
- Some providers add invisible "preamble" thinking
Practical Budget Setting
For a chat UI:
- TTFT < 500ms (perceived snappy)
- Total response < 5s for typical
- p99 < 10s
For a voice agent:
- TTFT < 300ms
- TTS streaming starts < 200ms
- Total interaction loop < 1s
Sources
- "Artificial Analysis" — https://artificialanalysis.ai
- OpenAI status / SLO docs — https://status.openai.com
- Anthropic SLO docs — https://docs.anthropic.com
- "LLM latency benchmarking" various — https://lmsys.org
- "Voice latency engineering" Daily.co — https://www.daily.co/blog
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.