Learn Agentic AI

Twilio Voice Integration for AI Agents: Building Phone-Based AI Assistants

Learn how to connect AI agents to the phone network using Twilio Voice, TwiML, and Media Streams. Covers bi-directional audio, real-time speech processing, and production deployment patterns.

Why Twilio for AI Voice Agents

Twilio is the most widely adopted cloud communications platform, providing programmable access to the global phone network. For AI agent developers, Twilio offers the critical bridge between an intelligent language model and a traditional phone call — letting your agent answer calls, speak to callers, and process their responses in real time.

The key components you will work with are Twilio Voice (call control), TwiML (telephony markup), and Media Streams (raw audio access). Together, these let you build AI assistants that hold natural, low-latency conversations over an ordinary phone call.

Setting Up the Twilio Environment

Start by installing the Twilio Python helper library (pip install twilio) and configuring your account credentials:

import os
from twilio.rest import Client
from twilio.twiml.voice_response import VoiceResponse, Connect

account_sid = os.environ["TWILIO_ACCOUNT_SID"]
auth_token = os.environ["TWILIO_AUTH_TOKEN"]
client = Client(account_sid, auth_token)

# Use the first number on the account (buy one in the Twilio Console if none exist)
phone_number = client.incoming_phone_numbers.list(limit=1)[0]
print(f"Using number: {phone_number.phone_number}")

You need a publicly accessible webhook URL that Twilio will call when a phone call arrives. In production, this is your server's domain. During development, tools like ngrok create a tunnel to your local machine.


Handling Incoming Calls with TwiML

When someone calls your Twilio number, Twilio sends an HTTP request to your webhook. You respond with TwiML — an XML dialect that controls call behavior:

from fastapi import FastAPI, Request
from fastapi.responses import Response

app = FastAPI()

@app.post("/incoming-call")
async def handle_incoming_call(request: Request):
    """Webhook that Twilio hits when a call arrives."""
    # Parsing form fields requires the python-multipart package
    form_data = await request.form()
    caller = form_data.get("From", "Unknown")
    print(f"Incoming call from {caller}")

    response = VoiceResponse()
    response.say(
        "Hello! You have reached our AI assistant. "
        "How can I help you today?",
        voice="Polly.Joanna",
        language="en-US",
    )

    # Gather speech input from the caller
    gather = response.gather(
        input="speech",
        action="/process-speech",
        speech_timeout="auto",
        language="en-US",
    )
    gather.say("I am listening.")

    # Fallback if no input detected
    response.say("I did not hear anything. Goodbye.")
    response.hangup()

    return Response(
        content=str(response),
        media_type="application/xml",
    )

The <Gather> verb captures the caller's speech, transcribes it on Twilio's side, and posts the text (the SpeechResult form parameter) to your /process-speech endpoint, where you can feed it to an AI model.
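As a sketch of the /process-speech side, here is the TwiML reply built by hand with only the standard library (the VoiceResponse helper shown earlier produces the same XML). The SpeechResult field name follows Twilio's documented Gather parameters; build_speech_reply and the canned answer are illustrative placeholders for a real model call. In the FastAPI app you would read SpeechResult from the posted form and return this string with media_type "application/xml":

```python
import xml.etree.ElementTree as ET

def build_speech_reply(speech_result: str) -> str:
    """Build the TwiML reply for the /process-speech action URL.

    Twilio posts the transcribed text as the SpeechResult form field;
    the echoed answer below stands in for a real LLM call.
    """
    answer = f"You said: {speech_result}"  # placeholder for an AI response
    response = ET.Element("Response")
    say = ET.SubElement(response, "Say", voice="Polly.Joanna")
    say.text = answer
    # Gather again so the conversation continues for another turn
    gather = ET.SubElement(
        response, "Gather",
        input="speech", action="/process-speech", speechTimeout="auto",
    )
    ET.SubElement(gather, "Say").text = "Anything else?"
    return ET.tostring(response, encoding="unicode")

print(build_speech_reply("what are your opening hours"))
```

Looping back into another Gather is what keeps the call conversational; without it, Twilio falls through to whatever TwiML follows and the call ends.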

Bi-Directional Audio with Media Streams

For real-time AI interaction — where the agent can interrupt, respond with low latency, and process audio continuously — you need Twilio Media Streams. This opens a WebSocket connection that streams raw audio in both directions:

import json
import base64
import asyncio
import websockets

@app.post("/media-stream-call")
async def media_stream_call(request: Request):
    """Route the call into a WebSocket media stream."""
    response = VoiceResponse()
    connect = Connect()
    stream = connect.stream(url="wss://yourdomain.com/media-socket")
    stream.parameter(name="caller_id", value="agent-001")
    response.append(connect)
    return Response(content=str(response), media_type="application/xml")

async def handle_media_socket(websocket):
    """Process bi-directional audio over WebSocket."""
    stream_sid = None

    async for message in websocket:
        data = json.loads(message)
        event_type = data.get("event")

        if event_type == "start":
            stream_sid = data["start"]["streamSid"]
            print(f"Stream started: {stream_sid}")

        elif event_type == "media":
            # Incoming audio from the caller (mulaw, 8 kHz); each frame is
            # ~20 ms, so in practice you buffer chunks and run STT on
            # complete speech segments rather than per frame
            audio_payload = base64.b64decode(data["media"]["payload"])
            transcript = await transcribe_audio_chunk(audio_payload)

            if transcript:
                # Generate AI response and convert to audio
                ai_response = await get_ai_response(transcript)
                # audio_bytes must already be 8 kHz mulaw to match the stream
                audio_bytes = await text_to_speech(ai_response)

                # Send audio back to the caller
                media_message = {
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {
                        "payload": base64.b64encode(audio_bytes).decode()
                    },
                }
                await websocket.send(json.dumps(media_message))

        elif event_type == "stop":
            print("Stream ended")
            break

# Serve handle_media_socket with websockets.serve(...) alongside the FastAPI
# app; transcribe_audio_chunk, get_ai_response, and text_to_speech are
# placeholders for your STT, LLM, and TTS integrations.

Media Streams deliver audio as base64-encoded mulaw at 8000 Hz. Your pipeline must decode this, run speech-to-text, generate the AI response, synthesize speech, re-encode to mulaw, and send it back — all within a few hundred milliseconds for natural conversation flow.
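Text-to-speech engines usually emit 16-bit PCM, so the last hop before sending audio back is a PCM-to-mulaw encode (the stdlib audioop module used to do this but was removed in Python 3.13). A minimal pure-Python G.711 u-law encoder, assuming the PCM is already resampled to 8 kHz mono, looks like this:

```python
def linear_to_mulaw(sample: int) -> int:
    """Encode one signed 16-bit PCM sample as a G.711 u-law byte."""
    BIAS, CLIP = 0x84, 32635
    sign = 0x80 if sample < 0 else 0
    magnitude = min(abs(sample), CLIP) + BIAS
    # Find the segment (exponent): position of the highest set bit above bit 7
    exponent, mask = 7, 0x4000
    while exponent > 0 and not magnitude & mask:
        exponent -= 1
        mask >>= 1
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    # u-law bytes are stored inverted
    return ~(sign | (exponent << 4) | mantissa) & 0xFF

def pcm16_to_mulaw(pcm: bytes) -> bytes:
    """Encode little-endian 16-bit mono PCM to a mulaw byte string."""
    samples = (int.from_bytes(pcm[i:i + 2], "little", signed=True)
               for i in range(0, len(pcm), 2))
    return bytes(linear_to_mulaw(s) for s in samples)
```

A per-sample Python loop is fine for 8 kHz telephone audio; for higher throughput you would vectorize with numpy or use your audio library's built-in encoder.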

Configuring the Webhook on Your Twilio Number

Once your server is running, point your Twilio phone number at it:

phone_number = client.incoming_phone_numbers.list(limit=1)[0]
client.incoming_phone_numbers(phone_number.sid).update(
    voice_url="https://yourdomain.com/incoming-call",
    voice_method="POST",
    status_callback="https://yourdomain.com/call-status",
    status_callback_method="POST",
)
print(f"Webhook configured for {phone_number.phone_number}")

The status_callback URL receives events like call initiation, ringing, answered, and completed — useful for logging and analytics.
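For logging, the status handler can reduce the posted form fields to one line per event. The field names CallSid, CallStatus, and CallDuration follow Twilio's documented callback parameters; handle_status_event itself is a hypothetical helper that receives the already-parsed form as a dict:

```python
def handle_status_event(form: dict) -> str:
    """Summarize one Twilio status callback for logging.

    CallStatus cycles through values like queued, ringing, in-progress,
    and completed as the call progresses.
    """
    call_sid = form.get("CallSid", "unknown")
    status = form.get("CallStatus", "unknown")
    line = f"{call_sid}: {status}"
    if status == "completed":
        # CallDuration (seconds) is only present on completed calls
        line += f" ({form.get('CallDuration', '0')}s)"
    return line

print(handle_status_event(
    {"CallSid": "CA123", "CallStatus": "completed", "CallDuration": "42"}
))
```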


Production Considerations

For production deployments, implement these patterns: use connection pooling for your WebSocket server, handle stream reconnections gracefully, implement silence detection to avoid sending empty audio to your STT engine, and add timeout handling for calls where the caller goes silent for too long. Monitor your Twilio usage closely — Media Streams are billed per minute of active streaming.
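Silence detection can be as simple as a byte-band check on the raw mulaw frames before they reach your STT engine. In u-law, samples near zero encode to bytes near 0xFF (positive) or 0x7F (negative), so a frame made up mostly of those bytes is effectively silent. This is a rough sketch, not a production VAD; the band width and threshold are tuning assumptions:

```python
def is_silence(mulaw_chunk: bytes, threshold: float = 0.9) -> bool:
    """Crude silence heuristic for 8-bit mulaw audio frames.

    Counts bytes in the near-zero u-law bands; if their share exceeds
    threshold, treat the frame as silence and skip the STT call.
    """
    near_zero = sum(1 for b in mulaw_chunk if b >= 0xFC or 0x7C <= b <= 0x7F)
    return near_zero / max(len(mulaw_chunk), 1) >= threshold
```

For anything beyond prototyping, an energy-based or model-based voice activity detector will be far more robust against line noise.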

FAQ

What audio format does Twilio Media Streams use?

Twilio Media Streams deliver audio as base64-encoded mulaw (G.711 u-law) at 8000 Hz, mono channel. When sending audio back, you must encode in the same format. Most speech-to-text engines accept mulaw directly, but text-to-speech output often needs conversion from PCM or MP3 to mulaw before sending.

Can I use Twilio Gather instead of Media Streams for simpler use cases?

Yes. The <Gather> TwiML verb with input="speech" handles speech recognition on Twilio's side and delivers transcribed text to your webhook. This is simpler to implement but adds latency (typically 1-3 seconds) and does not support real-time interruption. Use Gather for simple menu navigation and Media Streams for conversational AI.

How do I handle concurrent calls on the same Twilio number?

Twilio automatically handles concurrent calls — each call gets its own webhook request and its own WebSocket stream. Your server must be stateless per-call (use the CallSid as a session key) and handle concurrency. In production, run multiple server instances behind a load balancer and use Redis or similar for shared call state.
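A minimal sketch of per-CallSid session state follows. The CallSessions class is hypothetical; a plain dict works on a single instance, and in production you would back it with Redis under the same keys so all instances share it:

```python
class CallSessions:
    """In-memory per-call state, keyed by Twilio's CallSid."""

    def __init__(self) -> None:
        self._sessions: dict = {}

    def get(self, call_sid: str) -> dict:
        # Create the session on first access for this call
        return self._sessions.setdefault(call_sid, {"history": []})

    def end(self, call_sid: str) -> None:
        # Drop state when the call completes to avoid unbounded growth
        self._sessions.pop(call_sid, None)

sessions = CallSessions()
sessions.get("CA111")["history"].append("hello")
sessions.get("CA222")["history"].append("hi there")  # isolated from CA111
```

Clearing the session from your status-callback handler on the completed event keeps memory bounded across long-running deployments.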


#Twilio #VoiceAI #Telephony #MediaStreams #Python #AIAgents #AgenticAI #LearnAI #AIEngineering

