Learn Agentic AI

Twilio Voice Integration for AI Agents: Building Phone-Based AI Assistants

Learn how to connect AI agents to the phone network using Twilio Voice, TwiML, and Media Streams. Covers bi-directional audio, real-time speech processing, and production deployment patterns.

Why Twilio for AI Voice Agents

Twilio is the most widely adopted cloud communications platform, providing programmable access to the global phone network. For AI agent developers, Twilio offers the critical bridge between an intelligent language model and a traditional phone call — letting your agent answer calls, speak to callers, and process their responses in real time.

The key components you will work with are Twilio Voice (call control), TwiML (telephony markup), and Media Streams (raw audio access). Together, these let you build AI assistants that hold natural, low-latency conversations over an ordinary phone call.

Setting Up the Twilio Environment

Start by installing the Twilio Python helper library (pip install twilio) and configuring your account credentials:

import os
from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse, Connect

account_sid = os.environ["TWILIO_ACCOUNT_SID"]
auth_token = os.environ["TWILIO_AUTH_TOKEN"]
client = Client(account_sid, auth_token)

# Use the first number on the account (buy one in the Twilio Console if none exist)
phone_number = client.incoming_phone_numbers.list(limit=1)[0]
print(f"Using number: {phone_number.phone_number}")

You need a publicly accessible webhook URL that Twilio will call when a phone call arrives. In production, this is your server's domain. During development, tools like ngrok create a tunnel to your local machine.


Handling Incoming Calls with TwiML

When someone calls your Twilio number, Twilio sends an HTTP request to your webhook. You respond with TwiML — an XML dialect that controls call behavior:

from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/incoming-call")
async def handle_incoming_call(request: Request):
    """Webhook that Twilio hits when a call arrives."""
    # Parsing form fields requires the python-multipart package
    form_data = await request.form()
    caller = form_data.get("From", "Unknown")
    print(f"Incoming call from {caller}")

    response = VoiceResponse()
    response.say(
        "Hello! You have reached our AI assistant. "
        "How can I help you today?",
        voice="Polly.Joanna",
        language="en-US",
    )

    # Gather speech input from the caller
    gather = response.gather(
        input="speech",
        action="/process-speech",
        speech_timeout="auto",
        language="en-US",
    )
    gather.say("I am listening.")

    # Fallback if no input detected
    response.say("I did not hear anything. Goodbye.")
    response.hangup()

    return Response(
        content=str(response),
        media_type="application/xml",
    )

The <Gather> verb captures the caller's speech, transcribes it on Twilio's side, and posts the text (the SpeechResult form parameter) to your /process-speech endpoint, where you can feed it to an AI model.
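As a sketch of the /process-speech side, here is the TwiML reply built by hand with only the standard library (the VoiceResponse helper shown earlier produces the same XML). The SpeechResult field name follows Twilio's documented Gather parameters; build_speech_reply and the canned answer are illustrative placeholders for a real model call. In the FastAPI app you would read SpeechResult from the posted form and return this string with media_type "application/xml":

```python
import xml.etree.ElementTree as ET

def build_speech_reply(speech_result: str) -> str:
    """Build the TwiML reply for the /process-speech action URL.

    Twilio posts the transcribed text as the SpeechResult form field;
    the echoed answer below stands in for a real LLM call.
    """
    answer = f"You said: {speech_result}"  # placeholder for an AI response
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", voice="Polly.Joanna")
    say.text = answer
    # Gather again so the conversation continues for another turn
    gather = ET.SubElement(
        response, "Gather",
        input="speech", action="/process-speech", speechTimeout="auto",
    )
    ET.SubElement(gather, "Say").text = "Anything else?"
    return ET.tostring(response, encoding="unicode")

print(build_speech_reply("what are your opening hours"))
```

Looping back into another Gather is what keeps the call conversational; without it, Twilio falls through to whatever TwiML follows and the call ends.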

Bi-Directional Audio with Media Streams

For real-time AI interaction — where the agent can interrupt, respond with low latency, and process audio continuously — you need Twilio Media Streams. This opens a WebSocket connection that streams raw audio in both directions:

import json
import base64
import asyncio
import websockets

@app.post("/media-stream-call")
async def media_stream_call(request: Request):
    """Route the call into a WebSocket media stream."""
    response = VoiceResponse()
    connect = Connect()
    stream = connect.stream(url="wss://yourdomain.com/media-socket")
    stream.parameter(name="caller_id", value="agent-001")
    response.append(connect)
    return Response(content=str(response), media_type="application/xml")

async def handle_media_socket(websocket):
    """Process bi-directional audio over WebSocket."""
    stream_sid = None

    async for message in websocket:
        data = json.loads(message)
        event_type = data.get("event")

        if event_type == "start":
            stream_sid = data["start"]["streamSid"]
            print(f"Stream started: {stream_sid}")

        elif event_type == "media":
            # Incoming audio from the caller (mulaw, 8 kHz); each frame is
            # ~20 ms, so in practice you buffer chunks and run STT on
            # complete speech segments rather than per frame
            audio_payload = base64.b64decode(data["media"]["payload"])
            transcript = await transcribe_audio_chunk(audio_payload)

            if transcript:
                # Generate AI response and convert to audio
                ai_response = await get_ai_response(transcript)
                # audio_bytes must already be 8 kHz mulaw to match the stream
                audio_bytes = await text_to_speech(ai_response)

                # Send audio back to the caller
                media_message = {
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {
                        "payload": base64.b64encode(audio_bytes).decode()
                    },
                }
                await websocket.send(json.dumps(media_message))

        elif event_type == "stop":
            print("Stream ended")
            break

# Serve handle_media_socket with websockets.serve(...) alongside the FastAPI
# app; transcribe_audio_chunk, get_ai_response, and text_to_speech are
# placeholders for your STT, LLM, and TTS integrations.

Media Streams deliver audio as base64-encoded mulaw at 8000 Hz. Your pipeline must decode this, run speech-to-text, generate the AI response, synthesize speech, re-encode to mulaw, and send it back — all within a few hundred milliseconds for natural conversation flow.
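Text-to-speech engines usually emit 16-bit PCM, so the last hop before sending audio back is a PCM-to-mulaw encode (the stdlib audioop module used to do this but was removed in Python 3.13). A minimal pure-Python G.711 u-law encoder, assuming the PCM is already resampled to 8 kHz mono, looks like this:

```python
def linear_to_mulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 u-law byte."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit above bit 7
    exponent, mask = 7, 0x4000
    while exponent > 0 and not magnitude & mask:
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # u-law bytes are stored inverted
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def pcm16_to_mulaw(pcm: bytes) -> bytes:
    """Encode little-endian 16-bit mono PCM to a mulaw byte string."""
    samples = (int.from_bytes(pcm[i:i + 2], "little", signed=True)
               for i in range(0, len(pcm), 2))
    return bytes(linear_to_mulaw(s) for s in samples)
```

A per-sample Python loop is fine for 8 kHz telephone audio; for higher throughput you would vectorize with numpy or use your audio library's built-in encoder.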

Configuring the Webhook on Your Twilio Number

Once your server is running, point your Twilio phone number at it:

phone_number = client.incoming_phone_numbers.list(limit=1)[0]
client.incoming_phone_numbers(phone_number.sid).update(
    voice_url="https://yourdomain.com/incoming-call",
    voice_method="POST",
    status_callback="https://yourdomain.com/call-status",
    status_callback_method="POST",
)
print(f"Webhook configured for {phone_number.phone_number}")

The status_callback URL receives events like call initiation, ringing, answered, and completed — useful for logging and analytics.
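For logging, the status handler can reduce the posted form fields to one line per event. The field names CallSid, CallStatus, and CallDuration follow Twilio's documented callback parameters; handle_status_event itself is a hypothetical helper that receives the already-parsed form as a dict:

```python
def handle_status_event(form: dict) -> str:
    """Summarize one Twilio status callback for logging.

    CallStatus cycles through values like queued, ringing, in-progress,
    and completed as the call progresses.
    """
    call_sid = form.get("CallSid", "unknown")
    status = form.get("CallStatus", "unknown")
    line = f"{call_sid}: {status}"
    if status == "completed":
        # CallDuration (seconds) is only present on completed calls
        line += f" ({form.get('CallDuration', '0')}s)"
    return line

print(handle_status_event(
    {"CallSid": "CA123", "CallStatus": "completed", "CallDuration": "42"}
))
```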


Production Considerations

For production deployments, implement these patterns: use connection pooling for your WebSocket server, handle stream reconnections gracefully, implement silence detection to avoid sending empty audio to your STT engine, and add timeout handling for calls where the caller goes silent for too long. Monitor your Twilio usage closely — Media Streams are billed per minute of active streaming.
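Silence detection can be as simple as a byte-band check on the raw mulaw frames before they reach your STT engine. In u-law, samples near zero encode to bytes near 0xFF (positive) or 0x7F (negative), so a frame made up mostly of those bytes is effectively silent. This is a rough sketch, not a production VAD; the band width and threshold are tuning assumptions:

```python
def is_silence(mulaw_chunk: bytes, threshold: float = 0.9) -> bool:
    """Crude silence heuristic for 8-bit mulaw audio frames.

    Counts bytes in the near-zero u-law bands; if their share exceeds
    threshold, treat the frame as silence and skip the STT call.
    """
    near_zero = sum(1 for b in mulaw_chunk if b >= 0xFC or 0x7C <= b <= 0x7F)
    return near_zero / max(len(mulaw_chunk), 1) >= threshold
```

For anything beyond prototyping, an energy-based or model-based voice activity detector will be far more robust against line noise.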

FAQ

What audio format does Twilio Media Streams use?

Twilio Media Streams deliver audio as base64-encoded mulaw (G.711 u-law) at 8000 Hz, mono channel. When sending audio back, you must encode in the same format. Most speech-to-text engines accept mulaw directly, but text-to-speech output often needs conversion from PCM or MP3 to mulaw before sending.

Can I use Twilio Gather instead of Media Streams for simpler use cases?

Yes. The <Gather> TwiML verb with input="speech" handles speech recognition on Twilio's side and delivers transcribed text to your webhook. This is simpler to implement but adds latency (typically 1-3 seconds) and does not support real-time interruption. Use Gather for simple menu navigation and Media Streams for conversational AI.

How do I handle concurrent calls on the same Twilio number?

Twilio automatically handles concurrent calls — each call gets its own webhook request and its own WebSocket stream. Your server must be stateless per-call (use the CallSid as a session key) and handle concurrency. In production, run multiple server instances behind a load balancer and use Redis or similar for shared call state.
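A minimal sketch of per-CallSid session state follows. The CallSessions class is hypothetical; a plain dict works on a single instance, and in production you would back it with Redis under the same keys so all instances share it:

```python
class CallSessions:
    """In-memory per-call state, keyed by Twilio's CallSid."""

    def __init__(self) -> None:
        self._sessions: dict = {}

    def get(self, call_sid: str) -> dict:
        # Create the session on first access for this call
        return self._sessions.setdefault(call_sid, {"history": []})

    def end(self, call_sid: str) -> None:
        # Drop state when the call completes to avoid unbounded growth
        self._sessions.pop(call_sid, None)

sessions = CallSessions()
sessions.get("CA111")["history"].append("hello")
sessions.get("CA222")["history"].append("hi there")  # isolated from CA111
```

Clearing the session from your status-callback handler on the completed event keeps memory bounded across long-running deployments.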


#Twilio #VoiceAI #Telephony #MediaStreams #Python #AIAgents #AgenticAI #LearnAI #AIEngineering

