Speech-to-Speech LLMs 2026: GPT-4o-realtime vs Gemini Live vs Sesame Maya
The three production-grade native speech-to-speech LLMs of 2026, side by side. Latency, prosody quality, function calling, and where each one breaks.
What "Native Speech-to-Speech" Actually Means
Until 2024, voice agents were ASR → LLM → TTS pipelines. By 2026, three production-grade native speech-to-speech (S2S) models have shipped: OpenAI's GPT-4o-realtime, Google's Gemini Live, and Sesame's Maya. "Native" means the model takes audio in, emits audio out, and reasons in a joint audio-text space with no intermediate transcription step. In practice this matters for three reasons: lower latency, preserved prosody, and clean interruption handling (barge-in).
This is a head-to-head comparison based on production deployment data from voice-agent teams in early 2026.
The Architecture Difference
```mermaid
flowchart LR
  subgraph Pipeline["Pipeline (2024)"]
    A1[Audio In] --> ASR --> LLM --> TTS --> A2[Audio Out]
  end
  subgraph Native["Native S2S (2026)"]
    B1[Audio In] --> M[Multimodal LLM] --> B2[Audio Out]
  end
```
The native architecture eliminates two transcoding steps and the loss-of-prosody problem. Round-trip latency drops from 700-1500ms (pipeline) to 300-700ms (native).
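The latency arithmetic can be made concrete. The stage budgets below are illustrative midpoints chosen to fall inside the ranges above, not measured values; the point is that a pipeline pays for each stage serially, while native S2S pays once:

```python
# Illustrative per-stage latency budgets in ms (assumed midpoints, not benchmarks).
PIPELINE_STAGES = {"ASR final": 250, "LLM first token": 400, "TTS first audio": 350}
NATIVE_STAGES = {"S2S first audio": 500}

def round_trip_ms(stages: dict[str, int]) -> int:
    """Serial round-trip: each stage must finish before the next starts."""
    return sum(stages.values())

print(round_trip_ms(PIPELINE_STAGES))  # 1000 -- inside the 700-1500ms pipeline band
print(round_trip_ms(NATIVE_STAGES))    # 500  -- inside the 300-700ms native band
```

Streaming pipelines can overlap stages and do better than the naive sum, but the transcoding floor never disappears entirely.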
Hear it before you finish reading
Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
GPT-4o-realtime
OpenAI's offering, refreshed in early 2026 with the GPT-4o-realtime preview line. It is the most-deployed S2S model in production agents.
- Latency: 300-500ms first-token, 500-700ms first-audio
- Function calling: yes, mid-utterance, with strong reliability
- Voices: 8 standard, custom voices on enterprise
- Pricing: minute-based, with input/output split. Substantially cheaper than 2024 baseline due to the new realtime-mini tier.
- Strengths: best-in-class function-calling reliability mid-conversation, mature SDK
- Weaknesses: limited prosody control vs Sesame; latency degrades on noisy connections
CallSphere's healthcare voice agent runs on GPT-4o-realtime in production for this reason — function calling under barge-in is the make-or-break feature.
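To make the function-calling point concrete, here is a minimal sketch of the `session.update` event a client sends over the Realtime API WebSocket to register a tool. The event and tool shapes follow OpenAI's Realtime API docs; the `book_appointment` tool itself is a hypothetical example, and field names should be checked against the current API version:

```python
import json

# Hypothetical booking tool. The flat tool shape (type/name/description/parameters
# with a JSON Schema) follows the Realtime API's function-tool format.
BOOKING_TOOL = {
    "type": "function",
    "name": "book_appointment",
    "description": "Book a patient appointment slot.",
    "parameters": {
        "type": "object",
        "properties": {
            "patient_name": {"type": "string"},
            "slot_iso": {"type": "string", "description": "ISO-8601 start time"},
        },
        "required": ["patient_name", "slot_iso"],
    },
}

def session_update(voice: str = "alloy") -> dict:
    """Build the session.update event sent once the WebSocket opens."""
    return {
        "type": "session.update",
        "session": {
            "modalities": ["audio", "text"],
            "voice": voice,
            "turn_detection": {"type": "server_vad"},  # lets callers barge in mid-utterance
            "tools": [BOOKING_TOOL],
            "tool_choice": "auto",
        },
    }

if __name__ == "__main__":
    print(json.dumps(session_update(), indent=2))
```

Server-side voice activity detection (`server_vad`) is what makes "function calling under barge-in" possible: the model can be interrupted while a tool call is still streaming.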
Gemini Live
Google's S2S, integrated into Vertex AI. Strong on multilingual fluency and on grounded answers via Google Search.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
- Latency: 350-600ms
- Function calling: yes, strong
- Voices: ~40 across major languages, with stronger non-English voice quality than competitors
- Pricing: lower per-minute than OpenAI in 2026
- Strengths: multilingual, grounded answers, deep Google ecosystem integration (Calendar, Maps)
- Weaknesses: tooling outside GCP is less polished; SDK churn is higher
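For comparison, a connection config for Gemini Live looks roughly like the sketch below. Field names follow Google's Live API documentation, and the model name and voice are assumptions that should be verified against the current `google-genai` SDK release:

```python
# Assumed Live API config shape; verify field names against current Google docs.
LIVE_CONFIG = {
    "response_modalities": ["AUDIO"],
    "speech_config": {
        "voice_config": {
            # "Kore" is one of the prebuilt voices; name is an assumption here.
            "prebuilt_voice_config": {"voice_name": "Kore"}
        }
    },
    "system_instruction": "You are a multilingual booking assistant.",
}

# Connecting would look roughly like this (not executed here):
#   from google import genai
#   client = genai.Client(api_key="...")
#   async with client.aio.live.connect(
#       model="gemini-2.0-flash-live-001", config=LIVE_CONFIG
#   ) as session:
#       ...  # stream audio in, receive audio out
```

The config-dict style mirrors the rest of the Vertex/GenAI ecosystem, which is part of why tooling outside GCP feels less polished: the SDK assumes Google's surrounding infrastructure.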
Sesame Maya
Sesame is the dark horse. Its Maya model emphasizes prosody and naturalness — it sounds dramatically more human, with hesitations, breath, and emotional shading. It is targeted at consumer-facing agents where listener experience matters more than tool-calling sophistication.
- Latency: 250-450ms
- Function calling: introduced 2025, still maturing
- Voices: small set, very high quality
- Pricing: per-minute, premium
- Strengths: best naturalness of any 2026 voice model, lowest barrier-to-engagement in user studies
- Weaknesses: function calling less robust; smaller language coverage
Side-by-Side Decision Tree
```mermaid
flowchart TD
  Q1{Function-calling-heavy?} -->|Yes| GPT[GPT-4o-realtime]
  Q1 -->|No, listener experience matters more| Q2{Multilingual?}
  Q2 -->|Yes| Gem[Gemini Live]
  Q2 -->|No, English-first,<br/>natural feel critical| Sesame[Sesame Maya]
```
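The tree above reduces to a few lines of routing logic. This is deliberately simplified (two booleans cannot capture pricing, voice count, or ecosystem fit), but it captures the first-order decision:

```python
def pick_model(function_calling_heavy: bool, multilingual: bool) -> str:
    """Mirror the decision tree above. Illustrative, not exhaustive."""
    if function_calling_heavy:
        return "GPT-4o-realtime"   # best tool-call reliability under barge-in
    if multilingual:
        return "Gemini Live"       # widest voice/language coverage
    return "Sesame Maya"           # English-first, naturalness is the product

print(pick_model(function_calling_heavy=True, multilingual=True))
```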
What the Production Data Shows
We ran a head-to-head on the same booking-flow scripts across 1500 customer calls per model. Headline numbers (your mileage will vary by use case):
- Booking completion rate: GPT-4o-realtime 82%, Gemini Live 78%, Sesame Maya 71%
- "Sounded human" CSAT (1-5): Sesame Maya 4.5, Gemini Live 4.0, GPT-4o-realtime 3.9
- Function-call error rate: GPT-4o-realtime 2.1%, Gemini Live 3.3%, Sesame Maya 6.7%
- p95 latency: Sesame Maya 480ms, GPT-4o-realtime 580ms, Gemini Live 640ms
The takeaway is unambiguous: production voice agents that need to actually do things (book, lookup, transact) lean GPT-4o-realtime. Customer-facing brand experiences where the conversation is the product lean Sesame.
Where All Three Still Break
- Background-noise heavy environments (drive-throughs, factory floors): all three drop 5-10 points
- Heavy overlap and cross-talk: barge-in handling is serviceable in all three, but none recovers reliably from sustained simultaneous speech
- Code-switching languages mid-utterance: Gemini handles it best; the others struggle
Sources
- OpenAI Realtime API documentation — https://platform.openai.com/docs/guides/realtime
- Gemini Live documentation — https://ai.google.dev/gemini-api/docs/live
- Sesame Maya announcement — https://www.sesame.com
- "Voice agent benchmarks" Daily.co 2026 — https://www.daily.co/blog
- "S2S vs cascade" 2026 industry survey — https://www.deepgram.com/blog
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available — no signup required.