Telephony Integration for Voice Agents: Connecting to Phone Systems
Connect your AI voice agents to real phone systems using SIP, Twilio, and WebSocket transport with the OpenAI Realtime API for inbound and outbound call handling.
Bridging AI Voice Agents and the Phone Network
A voice agent running in a browser demo is impressive. A voice agent that answers your business phone line is useful. The gap between those two is telephony integration — connecting your AI agent to the Public Switched Telephone Network (PSTN) so real callers on real phones can interact with it.
This post covers three integration patterns: direct SIP trunking, Twilio as a telephony middleware, and raw WebSocket transport for custom deployments.
Telephony Architecture Patterns
Pattern 1: Twilio Media Streams + OpenAI Realtime API
This is the most accessible approach. Twilio handles all telephony complexity (phone numbers, call routing, PSTN connectivity) and forwards raw audio to your server via WebSocket Media Streams.
┌──────────┐    PSTN     ┌──────────┐   Media Stream   ┌──────────────┐
│  Caller  │────────────►│  Twilio  │◄────────────────►│ Your Server  │
│ (Phone)  │             │          │   (WebSocket)    │  (FastAPI)   │
└──────────┘             └──────────┘                  └──────┬───────┘
                                                              │
                                                       ┌──────▼───────┐
                                                       │    OpenAI    │
                                                       │ Realtime API │
                                                       └──────────────┘
Pattern 2: Direct SIP Trunk
For high-volume call centers, you connect your SIP-capable server directly to a SIP trunk provider. This eliminates the Twilio middleman but requires you to handle SIP signaling, codec negotiation, and RTP media streams yourself.
Pattern 3: WebRTC Gateway
For browser-based or mobile app callers, you use a WebRTC gateway that bridges browser audio to your voice agent pipeline. This is the approach used in web-based customer portals.
Implementation: Twilio Media Streams
Step 1: Twilio Configuration
First, configure a Twilio phone number to forward calls to your server via TwiML.
# twilio_config.py
import os

from twilio.rest import Client

client = Client(
    os.environ["TWILIO_ACCOUNT_SID"],
    os.environ["TWILIO_AUTH_TOKEN"],
)

def configure_phone_number(phone_sid: str, webhook_url: str):
    """Point a Twilio phone number at our voice webhook."""
    client.incoming_phone_numbers(phone_sid).update(
        voice_url=f"{webhook_url}/twilio/voice",
        voice_method="POST",
    )
Step 2: TwiML Voice Webhook
When Twilio receives a call, it hits your webhook. You respond with TwiML that opens a Media Stream WebSocket back to your server.
# main.py
from fastapi import FastAPI, Request
from fastapi.responses import Response
from xml.sax.saxutils import quoteattr

app = FastAPI()

@app.post("/twilio/voice")
async def twilio_voice_webhook(request: Request):
    """Twilio calls this when a new inbound call arrives."""
    form = await request.form()
    caller = form.get("From", "unknown")
    call_sid = form.get("CallSid", "")
    # quoteattr() wraps and escapes the values so a stray & or " in the
    # caller ID cannot break the XML.
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="alice">Please hold while we connect you to our assistant.</Say>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="caller" value={quoteattr(caller)} />
            <Parameter name="call_sid" value={quoteattr(call_sid)} />
        </Stream>
    </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")
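Because the TwiML is assembled with an f-string, it is worth verifying locally that the output parses as well-formed XML before Twilio ever sees it. A minimal sketch using only the standard library (the hostname and caller value are placeholders):

```python
# Hypothetical local check, not part of the webhook: parse the generated
# TwiML and confirm the Stream URL and parameters survive intact.
import xml.etree.ElementTree as ET

twiml = """<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="alice">Please hold while we connect you to our assistant.</Say>
    <Connect>
        <Stream url="wss://example.com/twilio/media-stream">
            <Parameter name="caller" value="+15551234567" />
        </Stream>
    </Connect>
</Response>"""

root = ET.fromstring(twiml)
stream = root.find("./Connect/Stream")
print(stream.get("url"))       # wss://example.com/twilio/media-stream
print(stream[0].get("name"))   # caller
```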
Step 3: Media Stream WebSocket Handler
This is the core: a WebSocket endpoint that receives Twilio's audio stream, forwards it to OpenAI's Realtime API, and sends the response audio back to Twilio.
# media_stream.py
import asyncio
import json
import os

import websockets
from fastapi import WebSocket, WebSocketDisconnect

OPENAI_REALTIME_URL = "wss://api.openai.com/v1/realtime"
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

SYSTEM_INSTRUCTIONS = """You are a helpful customer support agent for Acme Corp.
You are speaking with a customer on the phone. Keep responses concise and natural.
When you need to look up information, tell the customer you are checking.
If you cannot help, offer to transfer them to a human agent."""


async def handle_twilio_media_stream(websocket: WebSocket):
    """Bridge between Twilio Media Stream and OpenAI Realtime API."""
    await websocket.accept()
    stream_sid = None
    caller = "unknown"

    # Connect to OpenAI Realtime API
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(
        f"{OPENAI_REALTIME_URL}?model=gpt-4o-realtime-preview",
        additional_headers=headers,  # named extra_headers in older websockets releases
    ) as openai_ws:
        # Configure the OpenAI session. Twilio streams 8 kHz G.711 u-law,
        # which the Realtime API accepts directly, so no transcoding is needed.
        session_config = {
            "type": "session.update",
            "session": {
                "instructions": SYSTEM_INSTRUCTIONS,
                "voice": "alloy",  # must be a Realtime API voice; TTS-only voices like "nova" are rejected
                "input_audio_format": "g711_ulaw",
                "output_audio_format": "g711_ulaw",
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 700,
                },
            },
        }
        await openai_ws.send(json.dumps(session_config))

        async def twilio_to_openai():
            """Forward Twilio audio to OpenAI."""
            nonlocal stream_sid, caller
            try:
                while True:
                    message = await websocket.receive_text()
                    data = json.loads(message)
                    if data["event"] == "start":
                        stream_sid = data["start"]["streamSid"]
                        params = data["start"].get("customParameters", {})
                        caller = params.get("caller", "unknown")
                    elif data["event"] == "media":
                        # Twilio's payload is already base64 u-law; pass it through.
                        audio_event = {
                            "type": "input_audio_buffer.append",
                            "audio": data["media"]["payload"],
                        }
                        await openai_ws.send(json.dumps(audio_event))
                    elif data["event"] == "stop":
                        break
            except WebSocketDisconnect:
                pass

        async def openai_to_twilio():
            """Forward OpenAI audio back to Twilio."""
            try:
                async for message in openai_ws:
                    data = json.loads(message)
                    if data["type"] == "response.audio.delta":
                        twilio_message = {
                            "event": "media",
                            "streamSid": stream_sid,
                            "media": {"payload": data["delta"]},
                        }
                        await websocket.send_json(twilio_message)
                    elif data["type"] == "response.audio.done":
                        # Mark end of response for logging
                        pass
                    elif data["type"] == "input_audio_buffer.speech_started":
                        # Caller barge-in: cancel the in-flight response and
                        # discard audio already buffered on the Twilio side.
                        await openai_ws.send(json.dumps({"type": "response.cancel"}))
                        await websocket.send_json({
                            "event": "clear",
                            "streamSid": stream_sid,
                        })
            except Exception:
                pass

        await asyncio.gather(twilio_to_openai(), openai_to_twilio())
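The per-frame translation inside `twilio_to_openai` is simple enough to check in isolation: Twilio's `media` event carries base64-encoded u-law audio in `media.payload`, and the Realtime API expects the same base64 string in an `input_audio_buffer.append` event, so the payload passes through unchanged. A small sketch (the helper name and sample frame are made up for illustration):

```python
import base64
import json

def media_to_append_event(twilio_message: str) -> str:
    """Translate one Twilio 'media' frame into an OpenAI append event.

    Both sides use base64-encoded G.711 u-law, so no re-encoding happens.
    """
    frame = json.loads(twilio_message)
    assert frame["event"] == "media"
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": frame["media"]["payload"],
    })

# A fabricated frame, shaped like Twilio's documented Media Stream schema.
sample = json.dumps({
    "event": "media",
    "streamSid": "MZ0000000000000000000000000000000000",
    "media": {"payload": base64.b64encode(b"\xff\xff\xff\xff").decode()},
})
event = json.loads(media_to_append_event(sample))
print(event["type"])  # input_audio_buffer.append
```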
Step 4: Register the WebSocket Route
# In main.py, add the media stream route
from fastapi import WebSocket

from media_stream import handle_twilio_media_stream

@app.websocket("/twilio/media-stream")
async def twilio_media_stream(websocket: WebSocket):
    await handle_twilio_media_stream(websocket)
Outbound Calls
Voice agents can also initiate calls — for appointment reminders, follow-ups, or proactive support.
# outbound.py
import os

from fastapi import Request
from fastapi.responses import Response
from twilio.rest import Client

from main import app  # reuse the FastAPI app defined in main.py

client = Client(
    os.environ["TWILIO_ACCOUNT_SID"],
    os.environ["TWILIO_AUTH_TOKEN"],
)

def initiate_outbound_call(
    to_number: str,
    from_number: str,
    webhook_base_url: str,
    purpose: str = "follow_up",
):
    """Initiate an outbound call that connects to our AI agent."""
    twiml_url = f"{webhook_base_url}/twilio/outbound-voice?purpose={purpose}"
    call = client.calls.create(
        to=to_number,
        from_=from_number,
        url=twiml_url,
        method="POST",
        status_callback=f"{webhook_base_url}/twilio/call-status",
        status_callback_event=["initiated", "ringing", "answered", "completed"],
    )
    return call.sid

@app.post("/twilio/outbound-voice")
async def outbound_voice_webhook(request: Request):
    """Handle the outbound call connection."""
    purpose = request.query_params.get("purpose", "general")
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="direction" value="outbound" />
            <Parameter name="purpose" value="{purpose}" />
        </Stream>
    </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")
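The `status_callback` URL registered above receives form posts with a `CallStatus` field after each lifecycle event. The decision logic for a handler can be kept as a pure function; this sketch shows one illustrative retry policy (the helper name, the retry rule, and the action strings are our own, not a Twilio convention):

```python
# Twilio posts CallStatus values such as "queued", "ringing", "in-progress",
# "completed", "busy", "no-answer", "failed", and "canceled".
RETRYABLE = {"busy", "no-answer", "failed"}
TERMINAL = {"completed", "canceled"} | RETRYABLE

def next_action(call_status: str, attempts: int, max_attempts: int = 3) -> str:
    """Decide what to do after a status callback (illustrative policy)."""
    if call_status not in TERMINAL:
        return "wait"   # call still in progress; nothing to do yet
    if call_status in RETRYABLE and attempts < max_attempts:
        return "retry"  # e.g. reschedule the outbound call for later
    return "done"

print(next_action("ringing", 1))    # wait
print(next_action("busy", 1))       # retry
print(next_action("busy", 3))       # done
print(next_action("completed", 1))  # done
```

Keeping the policy separate from the FastAPI route makes it trivial to unit-test without mocking HTTP.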
DTMF Tone Handling
Some callers prefer pressing buttons. You can handle DTMF input alongside voice by gathering digits before connecting the Media Stream.
@app.post("/twilio/voice-with-dtmf")
async def voice_with_dtmf(request: Request):
    """Offer a DTMF menu before connecting to the AI agent."""
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Gather numDigits="1" action="/twilio/dtmf-handler" method="POST" timeout="5">
        <Say voice="alice">
            Press 1 for billing, 2 for refunds, or stay on the line
            to speak with our AI assistant.
        </Say>
    </Gather>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="department" value="triage" />
        </Stream>
    </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")

@app.post("/twilio/dtmf-handler")
async def dtmf_handler(request: Request):
    """Route based on DTMF digit pressed."""
    form = await request.form()
    digit = form.get("Digits", "")
    department_map = {"1": "billing", "2": "refunds"}
    department = department_map.get(digit, "triage")
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="alice">Connecting you now.</Say>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="department" value="{department}" />
        </Stream>
    </Connect>
</Response>"""
    return Response(content=twiml, media_type="application/xml")
SIP Integration Overview
For direct SIP integration without Twilio, you need a SIP stack. The open-source PJSIP library (via its pjsua2 Python bindings) handles SIP signaling, while you manage the RTP audio stream yourself.
# sip_overview.py (conceptual — requires pjsua2)
"""
SIP integration requires three components:
1. SIP User Agent — registers with your SIP provider and handles
INVITE/BYE/CANCEL signaling
2. RTP Media Handler — receives and sends audio packets using
the negotiated codec (typically G.711 u-law or a-law)
3. Audio Bridge — converts between RTP packets and the PCM16
format expected by OpenAI's Realtime API
The flow:
SIP INVITE → Accept call → Negotiate codec → Open RTP stream
→ Forward RTP audio to OpenAI → Receive response audio
→ Send as RTP back to caller → BYE to end call
Key considerations:
- Codec negotiation: Prefer G.711 u-law for compatibility
- NAT traversal: Use STUN/TURN if your server is behind NAT
- Registration refresh: SIP registrations expire; re-register periodically
- Call recording: Tap the RTP stream for compliance recording
"""
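The "Audio Bridge" step above is mostly a G.711 transcode. The u-law-to-PCM16 expansion can be written in a few lines of pure Python; this is a per-sample sketch of the standard G.711 expansion (production code would use a vectorized or C implementation for real-time throughput):

```python
def ulaw_byte_to_pcm16(u: int) -> int:
    """Expand one G.711 u-law byte to a signed 16-bit PCM sample."""
    u = ~u & 0xFF                     # u-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def ulaw_to_pcm16(data: bytes) -> bytes:
    """Decode a u-law byte string to little-endian PCM16."""
    out = bytearray()
    for b in data:
        out += ulaw_byte_to_pcm16(b).to_bytes(2, "little", signed=True)
    return bytes(out)

print(ulaw_byte_to_pcm16(0xFF))  # 0 (silence)
print(ulaw_byte_to_pcm16(0x00))  # -32124 (most negative u-law value)
```

The inverse (PCM16 to u-law compression) follows the same table-driven structure and is needed on the return path to the caller.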
Production Checklist
When deploying telephony-connected voice agents:
- Phone number management: Use a pool of numbers for outbound calls to avoid spam flagging
- Call recording consent: Announce recording at the start of each call where legally required
- Failover: If the AI pipeline is down, fall back to a traditional IVR or voicemail
- Cost monitoring: Track per-minute costs across Twilio, OpenAI Realtime API, and compute
- Concurrent call limits: Size your WebSocket server for your peak concurrent call volume
- Audio quality logging: Log audio quality metrics (jitter, packet loss) for debugging
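Cost monitoring from the checklist above can start as simple arithmetic per call. All rates in this sketch are illustrative placeholders (both Twilio and OpenAI price per minute and change their pricing); substitute your current negotiated rates:

```python
def estimate_call_cost(minutes: float,
                       telephony_per_min: float = 0.0085,
                       realtime_per_min: float = 0.30,
                       compute_per_min: float = 0.001) -> float:
    """Rough per-call cost estimate. All default rates are placeholders --
    look up current Twilio and OpenAI Realtime pricing before relying on this."""
    return round(minutes * (telephony_per_min + realtime_per_min + compute_per_min), 4)

print(estimate_call_cost(5.0))  # estimated cost of a 5-minute call at placeholder rates
```

Logging this per completed call (keyed by CallSid) gives you a daily spend dashboard with almost no extra infrastructure.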