AI Voice Agents

Twilio <Gather> + AI Speech Recognition: Multi-Provider Models (2026)

<Gather> now picks Google V2, Deepgram Nova-2, or Twilio's own model per language. We benchmark the four modes, show hint tuning, and explain when to ditch <Gather> for <Stream>.

TL;DR: <Gather input="speech"> is fine for short-utterance IVR. For multi-turn conversational AI, use <Stream> + a Realtime API. The 2026 multi-provider Gather (Google V2, Deepgram Nova-2, Twilio) closes the gap on accuracy but not on latency.

Background

<Gather> is Twilio's utterance-based speech-to-text verb. You play a prompt, Twilio captures the caller's speech, and it POSTs the transcript to your webhook. In 2025–2026 Twilio added:

  • Multi-provider mode — Google V2 (Chirp), Deepgram Nova-2, Twilio's own.
  • Customer-picks vs Twilio-picks model routing.
  • Experimental models experimental_conversations and experimental_utterances.
  • Enhanced phone_call model for long-form telephony audio.
  • 119 languages and dialects supported.
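The model-routing choice in that list reduces to a small lookup. A minimal sketch: the speechModel identifiers follow Twilio's documented values (verify against the current TwiML reference), while the per-language picks are our own benchmark-driven assumptions.

```javascript
// Per-language speechModel routing sketch. Identifiers follow Twilio's
// documented speechModel values; the per-language choices are assumptions.
const MODEL_BY_LANGUAGE = {
  "en-US": "deepgram_nova-2",    // English-centric, lowest latency in our tests
  "es-MX": "googlev2_telephony", // Chirp stronger on Spanish
  "hi-IN": "googlev2_telephony", // Chirp stronger on Hindi
};

function pickSpeechModel(language) {
  // "default" means Twilio picks the model itself.
  return MODEL_BY_LANGUAGE[language] || "default";
}
```

Unknown locales fall through to "default", i.e. the Twilio-picks routing mode from the list above.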

Architecture / config

flowchart LR
  CALL[Caller speaks] --> GATHER[&lt;Gather input=&quot;speech&quot;&gt;]
  GATHER --> ROUTE{speechModel}
  ROUTE -->|google_v2| GV2[Google Chirp V2]
  ROUTE -->|deepgram_nova-2| DG[Deepgram Nova-2]
  ROUTE -->|phone_call| TW[Twilio phone_call]
  ROUTE -->|default| AUTO[Twilio picks]
  GV2 --> WH[Your webhook /handle-speech]
  DG --> WH
  TW --> WH
  AUTO --> WH

CallSphere implementation

CallSphere uses <Gather> only for the first prompt ("Press 1 for English, 2 for Spanish, or just say it") and then hands the call to a bidirectional <Stream> so the OpenAI Realtime agent can run free. Twilio fronts every product — Healthcare FastAPI :8084, Sales (5 concurrent outbound), After-hours (voice + SMS race in 120 s), all four GTM tools.
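The Gather-to-Stream handoff described above amounts to returning <Connect><Stream> TwiML once the first utterance has picked a language. A sketch, assuming a placeholder WebSocket URL rather than CallSphere's real media endpoint:

```javascript
// Hand the call from <Gather> to a bidirectional Media Stream.
// The wss URL is illustrative, not a real CallSphere endpoint.
function handoffToStream(langCode) {
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://example.com/media">
      <Parameter name="lang" value="${langCode}" />
    </Stream>
  </Connect>
</Response>`;
}
```

The <Parameter> child forwards the detected language to the media server so the Realtime agent starts in the right locale from its first token.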

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →

Footprint: 37 agents · 90+ tools · 115+ DB tables · 6 verticals · HIPAA + SOC 2 · $149 / $499 / $1499 · 14-day trial · 22% affiliate.

Build steps with code

<Response>
  <Gather input="speech"
          speechModel="deepgram_nova-2"
          speechTimeout="auto"
          language="en-US"
          hints="bookings, refill, prescription, doctor, urgent"
          action="/voice/handle-speech"
          method="POST">
    <Say voice="Polly.Joanna-Neural">How can I help you today?</Say>
  </Gather>
  <Redirect>/voice/no-input</Redirect>
</Response>
// /voice/handle-speech: Express webhook for the <Gather> result
import express from "express";
const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded

app.post("/voice/handle-speech", async (req, res) => {
  const text = (req.body.SpeechResult || "").toLowerCase();
  const conf = parseFloat(req.body.Confidence || "0");
  if (conf < 0.55) return res.type("text/xml").send(reprompt()); // low confidence: ask again
  const intent = await classify(text);          // tiny LLM call
  return res.type("text/xml").send(routeTo(intent));
});
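The handler above leans on reprompt() and routeTo(). Minimal sketches of both follow; the routes and prompt wording are our own placeholders, not a fixed API.

```javascript
// Re-ask the caller when confidence is too low.
function reprompt() {
  return `<Response>
  <Gather input="speech" speechTimeout="auto" action="/voice/handle-speech" method="POST">
    <Say>Sorry, I didn't catch that. Could you say it again?</Say>
  </Gather>
  <Redirect>/voice/no-input</Redirect>
</Response>`;
}

// Map a classified intent to the next TwiML endpoint.
function routeTo(intent) {
  const paths = { refill: "/voice/refill", urgent: "/voice/escalate" }; // illustrative routes
  return `<Response><Redirect>${paths[intent] || "/voice/fallback"}</Redirect></Response>`;
}
```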

Hint tuning: list 10–20 domain words. Hints lift confidence on niche terms (e.g., hydroxyzine, Aetna) by 8–15%.
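Building the hints attribute from a domain vocabulary can be a one-liner. The 20-term cap below is our own conservative assumption; provider-side hint limits vary, so check the current docs for your chosen model.

```javascript
// Build the comma-separated hints attribute from a domain word list.
// The cap of 20 terms is an assumption, not a documented Twilio limit.
function buildHints(terms, max = 20) {
  return terms.slice(0, max).map(t => t.trim().toLowerCase()).join(", ");
}

console.log(buildHints(["Hydroxyzine", "Aetna", "refill"]));
// hydroxyzine, aetna, refill
```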

Pitfalls

  • Expecting barge-in with speechTimeout="auto" — barge-in only interrupts <Say> and <Play>, not the window where Twilio is listening. Design prompts accordingly.
  • Picking the wrong model for the language — Deepgram Nova-2 is English-centric; Google Chirp wins on Spanish, Hindi, Arabic.
  • No fallback — always <Redirect> to a no-input handler with a re-prompt counter capped at 2.
  • Trusting Confidence blindly — anything under 0.55 is a coin flip. Re-prompt or escalate.
  • Using <Gather> for full conversation — every turn costs an HTTP round trip. Switch to <Stream>.
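The "no fallback" pitfall above calls for a re-prompt counter capped at 2. A minimal sketch, assuming an in-memory map keyed by CallSid and a placeholder escalation number; production would persist attempts in session storage.

```javascript
// No-input handler with a re-prompt counter capped at 2.
// In-memory map is a sketch; use session storage in production.
const attempts = new Map();

function handleNoInput(callSid) {
  const n = (attempts.get(callSid) || 0) + 1;
  attempts.set(callSid, n);
  if (n > 2) {
    // Third strike: stop re-prompting, escalate to a human (placeholder number).
    return `<Response><Dial>+15550100</Dial></Response>`;
  }
  return `<Response><Redirect>/voice/prompt</Redirect></Response>`;
}
```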

FAQ

Q: <Gather> vs <Stream> cost? <Gather> includes STT in the per-minute speech surcharge (~$0.02/min). <Stream> is just media transport; you bring your own STT.

Q: Best model for English healthcare? Deepgram Nova-2 with hints= for med names. We see roughly 92% accuracy (about 8% WER) on real calls.

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Q: Can I use partial results? Yes — set partialResultCallback. Useful for streaming intent classification before the user stops speaking.
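Streaming intent classification off partial results can be as simple as a keyword pass over each interim transcript. A sketch: the UnstableSpeechResult field name follows Twilio's partial-result callback docs, but verify it against your account's actual webhook payload.

```javascript
// Early intent guess from a partial transcript. Field name
// UnstableSpeechResult is assumed from Twilio's partialResultCallback docs.
const EARLY_INTENTS = [
  { re: /refill|prescription/, intent: "pharmacy" },
  { re: /urgent|emergency/,    intent: "escalate" },
];

function earlyIntent(partialText) {
  const t = (partialText || "").toLowerCase();
  const hit = EARLY_INTENTS.find(e => e.re.test(t));
  return hit ? hit.intent : null;
}

console.log(earlyIntent("I need a refill")); // pharmacy
```

A match lets you pre-warm the downstream route (or the tiny LLM call) before the final SpeechResult lands.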

Q: DTMF + speech? input="dtmf speech" accepts both. Set numDigits and finishOnKey carefully.
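A dual-mode prompt for that FAQ answer looks like the sketch below; the menu path and prompt wording are illustrative. With numDigits="1" the gather completes on the first keypress, and the empty finishOnKey keeps "#" from being swallowed as a terminator.

```javascript
// DTMF + speech in one <Gather>: first digit or first utterance wins.
function dualInputTwiml() {
  return `<Response>
  <Gather input="dtmf speech" numDigits="1" finishOnKey="" speechTimeout="auto"
          action="/voice/menu" method="POST">
    <Say>Press 1 for English, 2 for Spanish, or just say it.</Say>
  </Gather>
  <Redirect>/voice/no-input</Redirect>
</Response>`;
}
```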

Q: When to switch to <Stream>? Once you need true barge-in, more than one turn, or sub-300 ms responses, switch.


How this plays out in production

One layer below what this post covers, the practical question every team hits is multi-turn handoffs between specialist agents without losing slot state, sentiment, or escalation context. Treat this as a voice-first system from the first prompt: the agent's persona, its tool surface, and its escalation rules all flow from that single decision. Teams that ship fast tend to instrument the loop end-to-end before they tune any single component, because the bottleneck is rarely where intuition puts it.

Voice agent architecture, end to end

A production-grade voice stack at CallSphere stitches Twilio Programmable Voice (PSTN ingress, TwiML, bidirectional Media Streams) to a realtime reasoning layer — typically OpenAI Realtime or ElevenLabs Conversational AI — with sub-second response as a hard SLO. Anything north of one second of perceived silence and callers either repeat themselves or hang up; that single number drives the whole architecture. Server-side VAD with proper barge-in support is non-negotiable, otherwise the agent talks over the caller and the conversation collapses. Streaming TTS with phoneme-aligned interruption keeps the cadence natural even when the user changes their mind mid-sentence.

Post-call, every transcript runs through a structured pipeline: sentiment, intent classification, lead score, escalation flag, and normalized slot extraction (name, callback number, reason, urgency). For healthcare workloads, the BAA-covered storage path, audit logs, encryption at rest, and PHI-safe transcript redaction are wired in from day one, not bolted on at compliance review. The end state is a system where every call produces a row of structured data, not just a recording.

FAQ, continued

Q: What is the fastest path to a voice agent like the one this post describes? Treat the architecture here as a starting point and instrument it before you tune it. The metrics that matter most early on are end-to-end latency (target < 1 s for voice, < 3 s for chat), barge-in correctness, tool-call success rate, and post-conversation lead score distribution. Optimize whatever the data flags as the bottleneck, not whatever feels slowest in your head.

Q: What are the gotchas around voice agent deployments at scale? The two failure modes that bite hardest are silent context loss across multi-turn handoffs and tool calls that succeed in dev but get rate-limited in production. Both are solvable with a proper agent backplane that pins state to a session ID, retries with backoff, and writes every tool invocation to an audit log you can replay.

Q: What does the CallSphere outbound sales calling product do that a regular dialer does not? It uses the ElevenLabs "Sarah" voice, runs up to 5 concurrent outbound calls per operator, and ships with a browser-based dialer that transfers warm calls back to a human in one click. Dispositions, transcripts, and lead scores write back to the CRM automatically.

See it live

Book a 30-minute working session at [calendly.com/sagar-callsphere/new-meeting](https://calendly.com/sagar-callsphere/new-meeting) and bring a real call flow — we will walk it through the live outbound sales dialer at [sales.callsphere.tech](https://sales.callsphere.tech) and show you exactly where the production wiring sits.

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available, no signup required.