Learn Agentic AI

Telephony Integration for Voice Agents: Connecting to Phone Systems

Connect your AI voice agents to real phone systems using SIP, Twilio, and WebSocket transport with the OpenAI Realtime API for inbound and outbound call handling.

Bridging AI Voice Agents and the Phone Network

A voice agent running in a browser demo is impressive. A voice agent that answers your business phone line is useful. The gap between those two is telephony integration — connecting your AI agent to the Public Switched Telephone Network (PSTN) so real callers on real phones can interact with it.

This post covers three integration patterns: Twilio Media Streams as telephony middleware, direct SIP trunking, and a WebRTC gateway for browser and mobile app callers.

Telephony Architecture Patterns

Pattern 1: Twilio Media Streams + OpenAI Realtime API

This is the most accessible approach. Twilio handles all telephony complexity (phone numbers, call routing, PSTN connectivity) and forwards raw audio to your server via WebSocket Media Streams.

┌──────────┐    PSTN     ┌──────────┐   Media Stream    ┌──────────────┐
│  Caller  │────────────►│  Twilio  │◄─────────────────►│  Your Server │
│ (Phone)  │             │          │    (WebSocket)    │  (FastAPI)   │
└──────────┘             └──────────┘                   └──────┬───────┘
                                                               │
                                                        ┌──────▼───────┐
                                                        │ OpenAI       │
                                                        │ Realtime API │
                                                        └──────────────┘

Pattern 2: Direct SIP Trunk

For high-volume call centers, you connect your SIP-capable server directly to a SIP trunk provider. This eliminates the Twilio middleman but requires you to handle SIP signaling, codec negotiation, and RTP media streams yourself.


Pattern 3: WebRTC Gateway

For browser-based or mobile app callers, you use a WebRTC gateway that bridges browser audio to your voice agent pipeline. This is the approach used in web-based customer portals.

Implementation: Twilio Media Streams

Step 1: Twilio Configuration

First, configure a Twilio phone number to forward calls to your server via TwiML.

# twilio_config.py
from twilio.rest import Client
import os

client = Client(
    os.environ["TWILIO_ACCOUNT_SID"],
    os.environ["TWILIO_AUTH_TOKEN"],
)

def configure_phone_number(phone_sid: str, webhook_url: str):
    """Point a Twilio phone number at our voice webhook."""
    client.incoming_phone_numbers(phone_sid).update(
        voice_url=f"{webhook_url}/twilio/voice",
        voice_method="POST",
    )

Step 2: TwiML Voice Webhook

When Twilio receives a call, it hits your webhook. You respond with TwiML that opens a Media Stream WebSocket back to your server.

# main.py
from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/twilio/voice")
async def twilio_voice_webhook(request: Request):
    """Twilio calls this when a new inbound call arrives."""
    form = await request.form()
    caller = form.get("From", "unknown")
    call_sid = form.get("CallSid", "")

    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="alice">Please hold while we connect you to our assistant.</Say>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="caller" value="{caller}" />
            <Parameter name="call_sid" value="{call_sid}" />
        </Stream>
    </Connect>
</Response>"""

    return Response(content=twiml, media_type="application/xml")
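One caveat with the f-string approach: the caller's number is interpolated into the XML verbatim, so a From value containing quotes or angle brackets would produce malformed TwiML. A small stdlib helper can escape every attribute (a sketch; the `stream_twiml` name is invented here):

```python
# twiml_utils.py — escape caller-supplied values before embedding in TwiML
from xml.sax.saxutils import quoteattr

def stream_twiml(host: str, params: dict[str, str]) -> str:
    """Build a <Connect><Stream> response; quoteattr() quotes and
    escapes each attribute value so caller input cannot break the XML."""
    param_lines = "".join(
        f"            <Parameter name={quoteattr(k)} value={quoteattr(v)} />\n"
        for k, v in params.items()
    )
    url = quoteattr(f"wss://{host}/twilio/media-stream")
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Response>\n"
        "    <Connect>\n"
        f"        <Stream url={url}>\n"
        f"{param_lines}"
        "        </Stream>\n"
        "    </Connect>\n"
        "</Response>"
    )
```

The webhook would then return `stream_twiml(request.headers["host"], {"caller": caller, "call_sid": call_sid})` as the response body.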

Step 3: Media Stream WebSocket Handler

This is the core: a WebSocket endpoint that receives Twilio's audio stream, forwards it to OpenAI's Realtime API, and sends the response audio back to Twilio.


# media_stream.py
import asyncio
import json
import base64
import websockets
from fastapi import WebSocket, WebSocketDisconnect
import os

OPENAI_REALTIME_URL = "wss://api.openai.com/v1/realtime"
OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]

SYSTEM_INSTRUCTIONS = """You are a helpful customer support agent for Acme Corp.
You are speaking with a customer on the phone. Keep responses concise and natural.
When you need to look up information, tell the customer you are checking.
If you cannot help, offer to transfer them to a human agent."""

async def handle_twilio_media_stream(websocket: WebSocket):
    """Bridge between Twilio Media Stream and OpenAI Realtime API."""
    await websocket.accept()

    stream_sid = None
    caller = "unknown"

    # Connect to OpenAI Realtime API
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1",
    }

    async with websockets.connect(
        f"{OPENAI_REALTIME_URL}?model=gpt-4o-realtime-preview",
        additional_headers=headers,
    ) as openai_ws:

        # Configure the OpenAI session
        session_config = {
            "type": "session.update",
            "session": {
                "instructions": SYSTEM_INSTRUCTIONS,
                "voice": "nova",
                "input_audio_format": "g711_ulaw",
                "output_audio_format": "g711_ulaw",
                "input_audio_transcription": {"model": "whisper-1"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 700,
                },
            },
        }
        await openai_ws.send(json.dumps(session_config))

        async def twilio_to_openai():
            """Forward Twilio audio to OpenAI."""
            nonlocal stream_sid, caller
            try:
                while True:
                    message = await websocket.receive_text()
                    data = json.loads(message)

                    if data["event"] == "start":
                        stream_sid = data["start"]["streamSid"]
                        params = data["start"].get("customParameters", {})
                        caller = params.get("caller", "unknown")

                    elif data["event"] == "media":
                        audio_payload = data["media"]["payload"]
                        audio_event = {
                            "type": "input_audio_buffer.append",
                            "audio": audio_payload,
                        }
                        await openai_ws.send(json.dumps(audio_event))

                    elif data["event"] == "stop":
                        break
            except WebSocketDisconnect:
                pass

        async def openai_to_twilio():
            """Forward OpenAI audio back to Twilio."""
            try:
                async for message in openai_ws:
                    data = json.loads(message)

                    if data["type"] == "response.audio.delta":
                        audio_delta = data["delta"]
                        twilio_message = {
                            "event": "media",
                            "streamSid": stream_sid,
                            "media": {"payload": audio_delta},
                        }
                        await websocket.send_json(twilio_message)

                    elif data["type"] == "response.audio.done":
                        # Mark end of response for logging
                        pass

                    elif data["type"] == "input_audio_buffer.speech_started":
                        # User started speaking — clear any pending audio
                        clear_msg = {
                            "event": "clear",
                            "streamSid": stream_sid,
                        }
                        await websocket.send_json(clear_msg)
            except Exception:
                pass

        await asyncio.gather(twilio_to_openai(), openai_to_twilio())
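One thing the bridge above never learns is when Twilio has actually finished playing the audio it buffered. Media Streams supports "mark" messages for this: queue a mark after the last media frame, and Twilio echoes a matching mark event once playback reaches it. A minimal builder (the helper name is ours; the message shape is Twilio's):

```python
import json

def mark_message(stream_sid: str, name: str) -> str:
    """A 'mark' queued after media frames; Twilio echoes it back as a
    {"event": "mark", ...} message once the audio before it has played."""
    return json.dumps(
        {"event": "mark", "streamSid": stream_sid, "mark": {"name": name}}
    )
```

In `openai_to_twilio`, you would send this on `response.audio.done` and treat the echoed mark, not the done event, as the real end-of-playback signal.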

Step 4: Register the WebSocket Route

# In main.py, add the media stream route
from media_stream import handle_twilio_media_stream

@app.websocket("/twilio/media-stream")
async def twilio_media_stream(websocket: WebSocket):
    await handle_twilio_media_stream(websocket)

Outbound Calls

Voice agents can also initiate calls — for appointment reminders, follow-ups, or proactive support.

# outbound.py
import os

from fastapi import Request
from fastapi.responses import Response
from twilio.rest import Client

from main import app  # reuse the FastAPI app defined in main.py

client = Client(
    os.environ["TWILIO_ACCOUNT_SID"],
    os.environ["TWILIO_AUTH_TOKEN"],
)

def initiate_outbound_call(
    to_number: str,
    from_number: str,
    webhook_base_url: str,
    purpose: str = "follow_up",
):
    """Initiate an outbound call that connects to our AI agent."""
    twiml_url = f"{webhook_base_url}/twilio/outbound-voice?purpose={purpose}"

    call = client.calls.create(
        to=to_number,
        from_=from_number,
        url=twiml_url,
        method="POST",
        status_callback=f"{webhook_base_url}/twilio/call-status",
        status_callback_event=["initiated", "ringing", "answered", "completed"],
    )
    return call.sid

@app.post("/twilio/outbound-voice")
async def outbound_voice_webhook(request: Request):
    """Handle the outbound call connection."""
    params = request.query_params
    purpose = params.get("purpose", "general")

    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="direction" value="outbound" />
            <Parameter name="purpose" value="{purpose}" />
        </Stream>
    </Connect>
</Response>"""

    return Response(content=twiml, media_type="application/xml")
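The status callback registered in `initiate_outbound_call` is where retry logic naturally lives. A sketch of one possible policy (the CallStatus values are Twilio's; the backoff schedule is an illustrative assumption):

```python
from datetime import datetime, timedelta, timezone

# Terminal statuses that mean the call never reached a person
RETRYABLE = {"busy", "no-answer", "failed"}

def next_attempt(call_status: str, attempt: int, max_attempts: int = 3):
    """Return when to retry an outbound call, or None to give up.
    `attempt` is the 1-based number of attempts already made; the
    15-minute doubling backoff is an example choice, not a rule."""
    if call_status not in RETRYABLE or attempt >= max_attempts:
        return None
    delay = timedelta(minutes=15 * (2 ** (attempt - 1)))
    return datetime.now(timezone.utc) + delay
```

The `/twilio/call-status` handler would read `CallStatus` from the posted form, call this, and schedule `initiate_outbound_call` again if a retry time comes back.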

DTMF Tone Handling

Some callers prefer pressing buttons. You can handle DTMF input alongside voice by gathering digits before connecting the Media Stream.

@app.post("/twilio/voice-with-dtmf")
async def voice_with_dtmf(request: Request):
    """Offer a DTMF menu before connecting to the AI agent."""
    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Gather numDigits="1" action="/twilio/dtmf-handler" method="POST" timeout="5">
        <Say voice="alice">
            Press 1 for billing, 2 for refunds, or stay on the line
            to speak with our AI assistant.
        </Say>
    </Gather>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="department" value="triage" />
        </Stream>
    </Connect>
</Response>"""

    return Response(content=twiml, media_type="application/xml")

@app.post("/twilio/dtmf-handler")
async def dtmf_handler(request: Request):
    """Route based on DTMF digit pressed."""
    form = await request.form()
    digit = form.get("Digits", "")

    department_map = {"1": "billing", "2": "refunds"}
    department = department_map.get(digit, "triage")

    twiml = f"""<?xml version="1.0" encoding="UTF-8"?>
<Response>
    <Say voice="alice">Connecting you now.</Say>
    <Connect>
        <Stream url="wss://{request.headers['host']}/twilio/media-stream">
            <Parameter name="department" value="{department}" />
        </Stream>
    </Connect>
</Response>"""

    return Response(content=twiml, media_type="application/xml")
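The department parameter set here reaches the media stream handler in the start event's customParameters, so it can be folded into the session instructions before the `session.update` is sent. One illustrative mapping (names and wording are placeholders):

```python
def instructions_for(department: str) -> str:
    """Augment the base prompt with context from the 'department'
    custom parameter on the <Stream> tag (illustrative mapping)."""
    base = "You are a helpful phone support agent. Keep replies brief."
    extras = {
        "billing": "The caller pressed 1 and has a billing question.",
        "refunds": "The caller pressed 2 and wants a refund processed.",
        "triage": "Ask what the caller needs and route accordingly.",
    }
    # Unknown departments fall back to the triage behavior
    return f"{base}\n{extras.get(department, extras['triage'])}"
```

The handler would call this when it reads `customParameters` from the start event, then send the result in the session config instead of a static prompt.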

SIP Integration Overview

For direct SIP integration without Twilio, you need a SIP stack. The open-source library pjsip (via pjsua2 Python bindings) handles SIP signaling, while you manage the RTP audio stream yourself.

# sip_overview.py (conceptual — requires pjsua2)
"""
SIP integration requires three components:

1. SIP User Agent — registers with your SIP provider and handles
   INVITE/BYE/CANCEL signaling
2. RTP Media Handler — receives and sends audio packets using
   the negotiated codec (typically G.711 u-law or a-law)
3. Audio Bridge — converts between RTP packets and the PCM16
   format expected by OpenAI's Realtime API

The flow:
  SIP INVITE → Accept call → Negotiate codec → Open RTP stream
  → Forward RTP audio to OpenAI → Receive response audio
  → Send as RTP back to caller → BYE to end call

Key considerations:
- Codec negotiation: Prefer G.711 u-law for compatibility
- NAT traversal: Use STUN/TURN if your server is behind NAT
- Registration refresh: SIP registrations expire; re-register periodically
- Call recording: Tap the RTP stream for compliance recording
"""

Production Checklist

When deploying telephony-connected voice agents:

  1. Phone number management: Use a pool of numbers for outbound calls to avoid spam flagging
  2. Call recording consent: Announce recording at the start of each call where legally required
  3. Failover: If the AI pipeline is down, fall back to a traditional IVR or voicemail
  4. Cost monitoring: Track per-minute costs across Twilio, OpenAI Realtime API, and compute
  5. Concurrent call limits: Size your WebSocket server for your peak concurrent call volume
  6. Audio quality logging: Log audio quality metrics (jitter, packet loss) for debugging
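Item 5 can be enforced with a plain asyncio semaphore wrapped around the call handler; the limit of 50 below is an illustrative figure, not a recommendation:

```python
import asyncio

MAX_CONCURRENT_CALLS = 50  # illustrative; size this from load testing

call_slots = asyncio.Semaphore(MAX_CONCURRENT_CALLS)

async def run_with_slot(handler, *args):
    """Run a call handler only if a slot is free; reject immediately
    otherwise so Twilio-side failover (e.g. voicemail) can take over."""
    if call_slots.locked():
        raise RuntimeError("at capacity")
    async with call_slots:
        return await handler(*args)
```

In the WebSocket route from Step 4, this would wrap the bridge: `await run_with_slot(handle_twilio_media_stream, websocket)`.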
