Bridging Twilio Voice and OpenAI Realtime: The Production Pattern in 2026
How Twilio Elastic SIP Trunking and OpenAI's Realtime SIP connector now bridge directly, what the call flow looks like, and the latency budget that actually works.
The cleanest production pattern for AI phone agents in 2026 is no longer "WebSocket proxy in the middle." Twilio Elastic SIP Trunking now hands a call directly to OpenAI's Realtime SIP endpoint, and your server steps in only to accept the session and handle tool calls.
Background: what changed and why now
flowchart TD
Out[Outbound campaign] --> Twilio[Twilio Voice API]
Twilio --> STIR[STIR/SHAKEN attestation]
STIR --> Carrier[Originating carrier]
Carrier --> Term[Terminating carrier]
Term --> Recipient[Recipient phone]
Recipient --> Webhook[/voice webhook/]
Webhook --> Agent[AI sales agent]
Through most of 2024 and 2025, the canonical pattern for an AI phone agent on Twilio was a Node or Python WebSocket server sitting between Twilio Media Streams and OpenAI's Realtime API. The server transcoded mu-law 8 kHz audio into 24 kHz PCM16, forwarded it to OpenAI over WebSocket, transcoded the response back, and pushed it to Twilio. It worked, but every team hit the same problems: WebSocket reconnect storms during deploys, audio drift on long calls, and a fragile interruption model that lost the last 200 to 600 ms of speech when the user barged in.
In late 2025 OpenAI shipped a SIP connector for the Realtime API, so the Realtime endpoint now speaks SIP natively. Twilio Elastic SIP Trunking can point an origination URI directly at sip:<project-id>@sip.api.openai.com;transport=tls, where <project-id> is your OpenAI project ID. The audio path stops bouncing through your server. Your server only handles a webhook ("realtime.call.incoming"), accepts the session with a voice and an instructions block, and opens a thin WebSocket for tool calls.
This is the production pattern most serious teams are migrating toward in 2026.
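As a rough illustration of the accept step described above, the sketch below builds the HTTP request a server might send after receiving the realtime.call.incoming webhook. The base URL, the `data.call_id` field, the body shape, and the WebSocket query parameter are assumptions drawn from a reading of OpenAI's Realtime SIP documentation, not from this article; verify them against the current API reference before relying on them.

```python
import json
from urllib.parse import quote, urlencode

# Assumed endpoint shapes; confirm against OpenAI's current Realtime SIP docs.
API_BASE = "https://api.openai.com/v1/realtime/calls"
WS_BASE = "wss://api.openai.com/v1/realtime"


def build_accept_request(event: dict, model: str, voice: str, instructions: str) -> tuple[str, str]:
    """Return (url, json_body) for accepting an incoming Realtime SIP call."""
    if event.get("type") != "realtime.call.incoming":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    call_id = event["data"]["call_id"]  # assumed webhook payload field
    url = f"{API_BASE}/{quote(call_id, safe='')}/accept"
    body = {
        "type": "realtime",
        "model": model,
        "instructions": instructions,
        "audio": {"output": {"voice": voice}},  # assumed session shape
    }
    return url, json.dumps(body)


def tool_socket_url(call_id: str) -> str:
    """Return the WebSocket URL used for the tool plane of an accepted call (assumed form)."""
    return f"{WS_BASE}?{urlencode({'call_id': call_id})}"
```

The function only builds the request, so it can be unit-tested without network access; the actual POST would carry an `Authorization: Bearer` header with your API key.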
How VoIP and SIP work for this use case
The call lifecycle splits cleanly into three legs:
- PSTN to Twilio. Caller dials a number. The originating carrier sends a SIP INVITE through Tier-1 interconnects to Twilio's edge. Twilio's SBC accepts the INVITE, sends a 100 Trying, then a 180 Ringing, then a 200 OK once the destination is ready. Voice media flows over RTP, typically G.711 mu-law at 8 kHz.
- Twilio to OpenAI. Twilio's Elastic SIP Trunk has an origination URI pointing at OpenAI's SIP edge. Twilio sends a fresh INVITE over TLS to sip.api.openai.com. OpenAI's edge accepts, opens an SRTP media stream, and starts decoding the inbound G.711 to 24 kHz PCM for the Realtime model.
- Tool plane. OpenAI fires a webhook to your server. Your server posts to /accept with the model, voice, and instructions, then opens a WebSocket to receive function-call events and to push tool results.
The key insight: voice audio never touches your application server in the steady state. Your server is on the control plane only.
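To make the control-plane role concrete, here is a minimal sketch of the tool-plane logic: it takes one decoded server event and returns the client events to send back over the WebSocket. The `TOOLS` registry and the `lookup_slot` tool are hypothetical. The sketch also assumes the `response.function_call_arguments.done` event carries `name`, `call_id`, and a JSON-string `arguments` field; if `name` is absent in your API version, it would need to be read from the matching output item instead.

```python
import json
from typing import Callable

# Hypothetical tool registry; real tools would hit a calendar, CRM, and so on.
TOOLS: dict[str, Callable[..., dict]] = {
    "lookup_slot": lambda date: {"date": date, "available": True},
}


def handle_event(event: dict) -> list[dict]:
    """Map one Realtime server event to the client events to send back (possibly none)."""
    if event.get("type") != "response.function_call_arguments.done":
        return []
    try:
        args = json.loads(event.get("arguments") or "{}")
        result = TOOLS[event["name"]](**args)
    except Exception as exc:
        # Report the failure to the model rather than leaving the caller in silence.
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return [
        {
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        },
        # Ask the model to continue speaking now that the tool result is available.
        {"type": "response.create"},
    ]
```

Keeping the handler a pure function of the event means the WebSocket loop stays a thin shell: receive, decode, call `handle_event`, send each returned event.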
CallSphere implementation
CallSphere runs Twilio across all six verticals: Healthcare AI, Real Estate AI, Sales Calling AI, Salon AI, IT Helpdesk AI, and After-Hours AI. The Healthcare receptionist uses a FastAPI service on port 8084 to bridge to OpenAI Realtime; Sales Calling AI runs five concurrent outbound calls per tenant on Twilio Programmable Voice; After-Hours AI fires a simultaneous Twilio call plus SMS per on-call contact with a 120-second timeout before falling through to the next contact.
The platform ships 37 agents across 90+ tools and 115+ database tables, with HIPAA and SOC 2 controls in place. Pricing is $149, $499, and $1,499 for 1, 3, and 10 numbers respectively, with a 14-day trial and a 22% recurring affiliate program. The relevant change for 2026: new Healthcare deployments default to the SIP-direct pattern; older deployments keep the WebSocket-proxy pattern until their next migration window.
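The after-hours escalation described above (call plus SMS in parallel, 120-second timeout, then the next contact) can be sketched with asyncio. This is an illustration of the pattern only, not CallSphere's code: `place_call` and `send_sms` are hypothetical coroutines that are assumed to resolve to True once the contact acknowledges.

```python
import asyncio
from typing import Awaitable, Callable, Optional, Sequence

# Hypothetical notifier: contacts one person and resolves to True on acknowledgement.
Notifier = Callable[[str], Awaitable[bool]]


async def escalate(
    contacts: Sequence[str],
    place_call: Notifier,
    send_sms: Notifier,
    timeout_s: float = 120.0,
) -> Optional[str]:
    """Try each contact in order; return the first who acknowledges, else None."""
    for contact in contacts:
        # Fire the call and the SMS at the same time for this contact.
        tasks = [
            asyncio.ensure_future(place_call(contact)),
            asyncio.ensure_future(send_sms(contact)),
        ]
        try:
            for finished in asyncio.as_completed(tasks, timeout=timeout_s):
                if await finished:
                    return contact
        except asyncio.TimeoutError:
            pass  # nobody acknowledged in time; fall through to the next contact
        finally:
            for task in tasks:
                task.cancel()  # no-op for tasks that already finished
    return None
```

Using `as_completed` rather than waiting for both channels lets an early acknowledgement on either channel end the escalation immediately.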
Build and integration steps
- Buy a Twilio phone number and provision an Elastic SIP Trunk in the Twilio console.
- Set the trunk's origination URI to sip:<project-id>@sip.api.openai.com;transport=tls, substituting your OpenAI project ID.
- In OpenAI's dashboard, register a webhook endpoint for the realtime.call.incoming event, signed with HMAC.
- Build a small webhook handler that verifies the signature and POSTs to OpenAI's /accept endpoint with model, voice, and instructions.
- Open a WebSocket from your handler to OpenAI to receive response.function_call_arguments.done events for tool calls.
- Implement your tool functions (calendar lookup, CRM write, hand-off to human) behind that WebSocket.
- Add observability: log call SID, OpenAI session ID, tool calls, and disconnect reason; alert on accept-rate drop.
- Add a fallback TwiML bin so that if the Realtime accept fails, Twilio plays a polite voicemail capture instead of a dead line.
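The signature check in the webhook step can be sketched with the standard library. The article says only that the webhook is "signed with HMAC". The header names (`webhook-id`, `webhook-timestamp`, `webhook-signature`), the `id.timestamp.body` signing string, and the `v1,<base64>` format below follow the Standard Webhooks convention and are assumptions here. Check OpenAI's webhook documentation for the exact scheme and for how the signing secret is encoded; an official SDK helper, if available, is the safer choice.

```python
import base64
import hashlib
import hmac
import time
from typing import Mapping, Optional


def sign(secret: bytes, msg_id: str, timestamp: str, body: bytes) -> str:
    """Compute a 'v1,<base64 HMAC-SHA256>' signature over 'id.timestamp.body' (assumed scheme)."""
    signed = f"{msg_id}.{timestamp}.".encode() + body
    digest = hmac.new(secret, signed, hashlib.sha256).digest()
    return "v1," + base64.b64encode(digest).decode()


def verify(
    secret: bytes,
    headers: Mapping[str, str],
    body: bytes,
    tolerance_s: int = 300,
    now: Optional[float] = None,
) -> bool:
    """Return True only if a signature matches and the timestamp is recent."""
    try:
        msg_id = headers["webhook-id"]
        timestamp = headers["webhook-timestamp"]
        candidates = headers["webhook-signature"].split()
        age = abs((time.time() if now is None else now) - int(timestamp))
    except (KeyError, ValueError):
        return False
    if age > tolerance_s:
        return False  # reject stale or replayed deliveries
    expected = sign(secret, msg_id, timestamp, body).encode()
    # Constant-time comparison against each space-separated candidate signature.
    return any(hmac.compare_digest(expected, c.encode()) for c in candidates)
```

Verification has to run on the raw request bytes, before any JSON parsing, because re-serializing the body can change whitespace and break the HMAC.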
Code or config snippet
<!-- Twilio TwiML fallback when OpenAI accept fails -->
<Response>
  <Say voice="Polly.Joanna-Neural">
    Our assistant is briefly unavailable. Please leave a message after the beep.
  </Say>
  <!-- Record up to 120 s, stopping after 5 s of silence; Twilio requests `action` when recording ends. -->
  <Record
    timeout="5"
    maxLength="120"
    transcribe="true"
    transcribeCallback="https://api.callsphere.ai/voicemail/transcribe"
    action="https://api.callsphere.ai/voicemail/done"/>
  <!-- Reached only if no audio was recorded. -->
  <Say>We did not capture a message. Goodbye.</Say>
</Response>
FAQ
Do I still need a WebSocket server with the SIP-direct pattern? Yes, but only for tool calls and observability. Audio bypasses your server entirely.
What happens if my tool server is down when a call comes in? The call still completes; the model just answers without tools. CallSphere recommends a circuit breaker that switches the model to a "safe-mode" instructions block when tool servers are degraded.
Is this cheaper than the proxy pattern? Usually yes, because you stop paying for media-plane bandwidth and CPU on your application servers.
Does interruption handling improve? Materially, yes. OpenAI handles barge-in inside its own audio pipeline rather than over a round-trip through your proxy.
Can I still record the call? Yes. Enable recording on the Twilio trunk side; the SIP-direct path does not break Twilio call recordings.
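The "safe-mode" circuit breaker mentioned in the FAQ could look roughly like the sketch below. The thresholds, the class name, and both instruction strings are illustrative assumptions, not CallSphere's configuration. The idea is to pick the instructions block at accept time based on recent tool-server health.

```python
import time
from typing import Callable, Optional

# Placeholder instruction blocks; real prompts would be much longer.
NORMAL = "You can book appointments using the available tools."
SAFE_MODE = "Tools are unavailable. Take a message and promise a callback."


class ToolCircuitBreaker:
    """Trip after consecutive tool failures; recover after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()

    def is_degraded(self) -> bool:
        if self._opened_at is None:
            return False
        if self._clock() - self._opened_at >= self.cooldown_s:
            # Cooldown elapsed: allow tools again, but one more failure re-trips.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
            return False
        return True

    def instructions(self) -> str:
        """Choose the instructions block to send when accepting a new call."""
        return SAFE_MODE if self.is_degraded() else NORMAL
```

Injecting the clock keeps the breaker testable without sleeping; in production the default monotonic clock avoids surprises from wall-clock adjustments.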
Sources
- Twilio: Connect the OpenAI Realtime SIP Connector with Twilio Elastic SIP Trunking
- Twilio: Core Latency in AI Voice Agents
- Twilio: Build Conversational AI Apps with Twilio and the OpenAI Realtime API