Twilio <Gather> + AI Speech Recognition: Multi-Provider Models (2026)
<Gather> now picks Google V2, Deepgram Nova-2, or Twilio's own model per language. We benchmark the four modes, show hint tuning, and explain when to ditch <Gather> for <Stream>.
TL;DR —
<Gather input="speech"> is fine for short-utterance IVR. For multi-turn conversational AI, use <Stream> + a Realtime API. The 2026 multi-provider Gather (Google V2, Deepgram Nova-2, Twilio) closes the gap on accuracy but not on latency.
Background
<Gather> is Twilio's utterance-based speech-to-text verb. You play a prompt, Twilio captures the caller's speech, and it posts the transcript to your webhook. In 2025–2026 Twilio added:
- Multi-provider mode — Google V2 (Chirp), Deepgram Nova-2, Twilio's own.
- Customer-picks vs Twilio-picks model routing.
- Experimental models experimental_conversations and experimental_utterances.
- Enhanced phone_call model for long-form telephony audio.
- 119 languages and dialects supported.
Architecture / config
flowchart LR
CALL[Caller speaks] --> GATHER[<Gather input="speech">]
GATHER --> ROUTE{speechModel}
ROUTE -->|google_v2| GV2[Google Chirp V2]
ROUTE -->|deepgram_nova-2| DG[Deepgram Nova-2]
ROUTE -->|phone_call| TW[Twilio phone_call]
ROUTE -->|default| AUTO[Twilio picks]
GV2 --> WH[Your webhook /handle-speech]
DG --> WH
TW --> WH
AUTO --> WH
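When you pick the model yourself, the routing in the diagram can be driven from a small lookup in your own code. The following is a minimal sketch under stated assumptions: the model identifiers are the ones used in this article's diagram and should be verified against Twilio's current speechModel values, and pickSpeechModel, gatherTwiml, and escapeXml are hypothetical helper names.

```javascript
// Hypothetical lookup: language tag -> speechModel value.
// Identifiers mirror the diagram above; verify them against Twilio's docs.
const MODEL_BY_LANGUAGE = new Map([
  ["en-US", "deepgram_nova-2"], // English-centric per this article
  ["es-US", "google_v2"],
  ["hi-IN", "google_v2"],
  ["ar-SA", "google_v2"],
]);

// Fall back to "default" (Twilio picks) for languages not listed above.
function pickSpeechModel(language) {
  return MODEL_BY_LANGUAGE.get(language) || "default";
}

// Escape text placed inside XML attributes or element bodies.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a <Gather> response for one prompt in one language.
function gatherTwiml(language, prompt) {
  const model = pickSpeechModel(language);
  return [
    "<Response>",
    `  <Gather input="speech" speechModel="${escapeXml(model)}"`,
    `          language="${escapeXml(language)}" speechTimeout="auto"`,
    '          action="/voice/handle-speech" method="POST">',
    `    <Say>${escapeXml(prompt)}</Say>`,
    "  </Gather>",
    "  <Redirect>/voice/no-input</Redirect>",
    "</Response>",
  ].join("\n");
}
```

Keeping the table in one place makes it easy to change a language's model after comparing transcripts, without touching the TwiML template.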
CallSphere implementation
CallSphere uses <Gather> only for the first prompt ("Press 1 for English, 2 for Spanish, or just say it") and then hands the call to a bidirectional <Stream> so the OpenAI Realtime agent can run free. Twilio fronts every product — Healthcare FastAPI :8084, Sales (5 concurrent outbound), After-hours (voice + SMS race in 120 s), all four GTM tools.
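The hand-off after the first prompt can be sketched as a webhook response that uses <Connect><Stream>, the TwiML form for bidirectional media streams. This is an illustration, not CallSphere's actual code: the WebSocket URL, the streamHandoffTwiml name, and the "language" custom parameter are all assumptions.

```javascript
// Escape text placed inside XML attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Return TwiML that connects the call to a bidirectional media stream.
// wsUrl is a hypothetical wss:// endpoint where the realtime agent listens.
// The <Parameter> passes the chosen language to that endpoint.
function streamHandoffTwiml(wsUrl, language) {
  return [
    "<Response>",
    "  <Connect>",
    `    <Stream url="${escapeXml(wsUrl)}">`,
    `      <Parameter name="language" value="${escapeXml(language)}" />`,
    "    </Stream>",
    "  </Connect>",
    "</Response>",
  ].join("\n");
}
```

The <Gather> action handler would return this TwiML once the language is known, so the rest of the call never goes back through per-turn webhooks.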
Footprint: 37 agents · 90+ tools · 115+ DB tables · 6 verticals · HIPAA + SOC 2 · $149 / $499 / $1499 · 14-day trial · 22% affiliate.
Build steps with code
<Response>
  <Gather input="speech"
          speechModel="deepgram_nova-2"
          speechTimeout="auto"
          language="en-US"
          hints="bookings, refill, prescription, doctor, urgent"
          action="/voice/handle-speech"
          method="POST">
    <Say voice="Polly.Joanna-Neural">How can I help you today?</Say>
  </Gather>
  <Redirect>/voice/no-input</Redirect>
</Response>
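// /voice/handle-speech
const express = require("express");
const app = express();
app.use(express.urlencoded({ extended: false })); // Twilio posts form-encoded bodies

// reprompt(), classify(), and routeTo() are app-specific helpers defined elsewhere.
app.post("/voice/handle-speech", async (req, res) => {
  const text = (req.body.SpeechResult || "").toLowerCase();
  const conf = parseFloat(req.body.Confidence || "0");
  // Low-confidence transcripts get a re-prompt instead of a guess.
  if (conf < 0.55) return res.type("text/xml").send(reprompt());
  const intent = await classify(text); // tiny LLM call
  return res.type("text/xml").send(routeTo(intent));
});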
Hint tuning: list 10–20 domain words in the hints attribute. This lifts confidence on niche terms (e.g., hydroxyzine, Aetna) by 8–15 %.
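A hint list is easier to maintain as an array than as a hand-edited string. Here is a minimal sketch of a helper that builds the comma-separated hints value; buildHints and its default cap of 20 are assumptions that follow the guidance above, not Twilio limits.

```javascript
// Build the hints attribute from a list of domain terms:
// trim, drop empties and case-insensitive duplicates, cap the list size.
function buildHints(terms, maxTerms = 20) {
  const seen = new Set();
  const kept = [];
  for (const raw of terms) {
    if (kept.length >= maxTerms) break;
    const term = String(raw).trim();
    const key = term.toLowerCase();
    if (!term || seen.has(key)) continue;
    seen.add(key);
    kept.push(term);
  }
  return kept.join(", ");
}

const hints = buildHints(["hydroxyzine", "Aetna", "refill", "Refill ", "", "prescription"]);
console.log(hints); // hydroxyzine, Aetna, refill, prescription
```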
Pitfalls
- speechTimeout=auto plus barge-in: barge-in only works on <Say> and <Play>, not while Twilio is "listening", so design prompts accordingly.
- Picking the wrong model for the language: Deepgram Nova-2 is English-centric; Google Chirp wins on Spanish, Hindi, Arabic.
- No fallback: always <Redirect> to a no-input handler with a re-prompt counter capped at 2.
- Trusting Confidence blindly: anything under 0.55 is a coin flip. Re-prompt or escalate.
- Using <Gather> for full conversation: every turn costs an HTTP round trip. Switch to <Stream>.
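The capped re-prompt counter from the fallback pitfall can be carried in the redirect URL itself, so no session store is needed. This is a minimal sketch; the attempt query parameter, the /voice/escalate route, and the prompt wording are hypothetical.

```javascript
const MAX_REPROMPTS = 2; // cap suggested in the pitfalls list

// attempt = number of re-prompts already played (parsed from ?attempt=N).
// Anything that is not a non-negative integer is treated as 0.
function noInputTwiml(attempt) {
  const n = Number.isInteger(attempt) && attempt >= 0 ? attempt : 0;
  if (n >= MAX_REPROMPTS) {
    // Out of retries: escalate. /voice/escalate is a hypothetical route.
    return [
      "<Response>",
      "  <Say>Let me connect you with someone who can help.</Say>",
      "  <Redirect>/voice/escalate</Redirect>",
      "</Response>",
    ].join("\n");
  }
  const next = n + 1;
  return [
    "<Response>",
    '  <Gather input="speech" speechTimeout="auto"',
    '          action="/voice/handle-speech" method="POST">',
    "    <Say>Sorry, I didn't catch that. How can I help you today?</Say>",
    "  </Gather>",
    // If the caller stays silent again, come back here with the counter bumped.
    `  <Redirect>/voice/no-input?attempt=${next}</Redirect>`,
    "</Response>",
  ].join("\n");
}
```

In an Express route for /voice/no-input this would be called as noInputTwiml(parseInt(req.query.attempt, 10)); a missing parameter parses to NaN and is treated as the first attempt.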
FAQ
Q: <Gather> vs <Stream> cost?
<Gather> includes STT in the per-minute speech surcharge (~$0.02/min). <Stream> is just media transport; you bring your own STT.
Q: Best model for English healthcare?
Deepgram Nova-2 with hints= for med names. We see ~92 % word accuracy (roughly 8 % WER) on real calls.
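For reference, word error rate (WER) is the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words; word accuracy is commonly reported as 1 minus WER. Below is a minimal sketch of that calculation for scoring your own call samples. The tokenization (lowercase, split on whitespace) is a simplifying assumption, and the sample transcripts are made up.

```javascript
// Simplified tokenizer: lowercase and split on whitespace.
function tokenize(text) {
  return text.toLowerCase().split(/\s+/).filter(Boolean);
}

// WER = (substitutions + deletions + insertions) / reference word count.
function wordErrorRate(reference, hypothesis) {
  const ref = tokenize(reference);
  const hyp = tokenize(hypothesis);
  if (ref.length === 0) throw new Error("reference transcript is empty");
  // prev holds the previous row of the edit-distance table.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1, // deletion
        curr[j - 1] + 1, // insertion
        prev[j - 1] + cost // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One substitution plus one insertion against six reference words.
const wer = wordErrorRate(
  "i need a refill for hydroxyzine",
  "i need a refill for hydroxy scene"
);
console.log(wer.toFixed(3)); // 0.333
```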
Q: Can I use partial results?
Yes — set partialResultCallback. Useful for streaming intent classification before the user stops speaking.
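Below is a minimal sketch of consuming partial results. It assumes Twilio posts form fields named CallSid, SequenceNumber, StableSpeechResult, and UnstableSpeechResult to the partialResultCallback URL, and that the stable and unstable parts can be joined into one running transcript; check the <Gather> docs for the exact payload. Each callback is a separate HTTP request, so the sketch keeps only the highest sequence number per call. PartialTracker and earlyIntent are hypothetical names, and the keyword match stands in for a real classifier.

```javascript
class PartialTracker {
  constructor() {
    this.latest = new Map(); // CallSid -> { seq, text }
  }

  // body: parsed form fields from one partialResultCallback request.
  // Returns the freshest partial transcript known for that call.
  update(body) {
    const seq = parseInt(body.SequenceNumber, 10) || 0;
    const text = [body.StableSpeechResult, body.UnstableSpeechResult]
      .filter(Boolean)
      .join(" ")
      .trim();
    const current = this.latest.get(body.CallSid);
    // Ignore callbacks that arrive after a newer one was already processed.
    if (!current || seq >= current.seq) {
      this.latest.set(body.CallSid, { seq, text });
    }
    return this.latest.get(body.CallSid).text;
  }

  // Call when the final SpeechResult arrives to free memory.
  clear(callSid) {
    this.latest.delete(callSid);
  }
}

// Stand-in for a real classifier: keyword spotting on the partial text.
function earlyIntent(text) {
  const t = text.toLowerCase();
  if (t.includes("refill") || t.includes("prescription")) return "refill";
  if (t.includes("urgent")) return "urgent";
  return null;
}
```

An early intent is only a head start, for example to pre-fetch a patient record; the final SpeechResult posted to the action URL should still decide the route.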
Q: DTMF + speech?
input="dtmf speech" accepts both. Set numDigits and finishOnKey carefully.
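With both inputs enabled, the action webhook has to check which one arrived. Here is a minimal sketch for a two-language menu like the "Press 1 for English, 2 for Spanish" prompt described earlier, assuming Twilio sends Digits for keypad input and SpeechResult for speech. parseLanguageChoice and the keyword patterns are hypothetical.

```javascript
// Map either a key press or a spoken word to a language tag.
// Returns null when the input matches neither option.
function parseLanguageChoice(body) {
  const digits = (body.Digits || "").trim();
  if (digits === "1") return "en-US";
  if (digits === "2") return "es-US";

  const speech = (body.SpeechResult || "").toLowerCase();
  // \u00e9 is e-acute and \u00f1 is n-tilde, written as escapes to stay encoding-safe.
  if (/\b(english|ingl[e\u00e9]s)\b/.test(speech)) return "en-US";
  if (/\b(spanish|espa[n\u00f1]ol)\b/.test(speech)) return "es-US";
  return null;
}
```

A null result is the cue to re-prompt rather than guess a language.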
Q: When to switch to <Stream>?
Once you need true barge-in, > 1 turn, or sub-300 ms response — switch.