Twilio Media Streams started life in 2019 as a one-way stream-out feature. Bidirectional went GA in 2023, and as of 2026 it is the substrate underneath ConversationRelay and probably 80% of Twilio-fronted AI voice products. The format is simple, the constraints are real, and once you understand Mark and Clear events, barge-in becomes a one-line change.

Background

Twilio Programmable Voice lets you control calls with TwiML, an XML markup built from verbs such as <Start> and <Connect>. The <Stream> noun inside them opens a WebSocket from Twilio to your server. Audio flows in both directions: media events carry base64-encoded mulaw payloads at 8 kHz and 8 bits per sample (160 bytes per 20 ms frame), and your server can send the same format back to be played to the caller.
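To make the frame arithmetic concrete, here is a minimal sketch that unpacks one inbound media message and checks its duration. The function names are illustrative, not part of any Twilio SDK; the message shape follows the media events described above.

```python
import base64
import json

MULAW_RATE_HZ = 8000  # telephony sample rate; mulaw stores one sample per byte


def decode_media_event(raw: str) -> bytes:
    """Return the raw mulaw bytes carried by one Twilio `media` message."""
    msg = json.loads(raw)
    if msg.get("event") != "media":
        raise ValueError(f"expected a media event, got {msg.get('event')!r}")
    return base64.b64decode(msg["media"]["payload"])


def frame_duration_ms(frame: bytes) -> float:
    """One byte is one sample, so duration is bytes divided by the sample rate."""
    return len(frame) * 1000 / MULAW_RATE_HZ
```

A 160-byte payload works out to 160 * 1000 / 8000 = 20 ms, which matches the frame size quoted above.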

<Start><Stream> is the older one-way variant; <Connect><Stream> is bidirectional and blocks subsequent TwiML until the WebSocket disconnects. The bidirectional version added Mark and Clear events: Mark lets you tag a position in your sent audio buffer and get a confirmation when Twilio plays past it; Clear empties Twilio's outbound buffer for instant interruption when the caller starts speaking.
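One way to put Mark and Clear together is a small tracker that records which marks were sent and which came back. This is a sketch with hypothetical names (`PlaybackTracker` and its methods). It assumes, per Twilio's documented behavior as I understand it, that after a Clear Twilio echoes back any marks still queued, so a mark returned after a Clear does not mean that audio was heard.

```python
class PlaybackTracker:
    """Classify echoed marks as played or interrupted (illustrative only)."""

    def __init__(self) -> None:
        self._pending: list[str] = []   # marks sent, not yet echoed
        self._cleared: set[str] = set() # marks that were queued when Clear went out
        self.played: list[str] = []
        self.interrupted: list[str] = []

    def mark_sent(self, name: str) -> None:
        self._pending.append(name)

    def clear_sent(self) -> None:
        # Everything still pending was in the buffer we just flushed.
        self._cleared.update(self._pending)
        self._pending.clear()

    def mark_received(self, name: str) -> None:
        if name in self._cleared:
            self._cleared.discard(name)
            self.interrupted.append(name)
        elif name in self._pending:
            self._pending.remove(name)
            self.played.append(name)
```

There is an unavoidable race: a mark whose audio finished playing just before the Clear, but whose echo had not yet arrived, is counted as interrupted. For deciding what the caller actually heard, that errs on the conservative side.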


The 8 kHz mulaw default is the friction point. OpenAI Realtime accepts G.711 directly, so for many builders Twilio's native format is fine end-to-end. For better quality you transcode upstream to 16 kHz L16 or Opus.
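For the transcoding path, a minimal pure-Python sketch of G.711 mulaw expansion followed by a naive 2x upsample is shown below. The function names are illustrative. The linear-interpolation upsampler is a simplification (a production bridge would normally use a proper low-pass resampler), and the 16 kHz target follows this article; confirm the sample rate your model endpoint actually expects before wiring it in.

```python
import struct

MULAW_BIAS = 0x84  # bias constant from the G.711 mulaw definition


def mulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mulaw byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                 # mulaw bytes are stored bit-inverted
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS
    return -magnitude if u & 0x80 else magnitude


def decode_frame_to_l16_16k(frame: bytes) -> bytes:
    """Decode a mulaw frame and double its rate by linear interpolation.

    Returns little-endian 16-bit PCM with twice as many samples as the input.
    """
    pcm = [mulaw_to_linear(b) for b in frame]
    out = []
    for i, sample in enumerate(pcm):
        nxt = pcm[i + 1] if i + 1 < len(pcm) else sample
        out.append(sample)
        out.append((sample + nxt) // 2)  # midpoint between neighbours
    return struct.pack(f"<{len(out)}h", *out)
```

A 160-byte Twilio frame becomes 320 samples, or 640 bytes, of L16 at 16 kHz.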

Architecture

graph LR
    A[Caller PSTN] --> B[Twilio Voice]
    B -->|mulaw 8k 20ms frames| C[Your WebSocket Server]
    C -->|JSON media events| D[Audio Decoder]
    D -->|L16 16k| E[OpenAI Realtime]
    E -->|Opus or PCM back| F[Audio Encoder]
    F -->|mulaw 8k frames| C
    C -->|JSON media + mark + clear| B
    B --> A

<Response>
  <Connect>
    <Stream url="wss://bridge.callsphere.ai/realtime"
            track="inbound_track"
            statusCallback="https://callsphere.ai/api/twilio/stream-status">
      <Parameter name="tenant" value="abc123"/>
      <Parameter name="agent" value="healthcare-intake"/>
    </Stream>
  </Connect>
</Response>
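If you generate this TwiML per call rather than serving a static file, a standard-library sketch like the following is enough. `build_stream_twiml` is a hypothetical helper and the URL in the usage note is a placeholder; Twilio's own helper library offers an equivalent builder, which is not used here to keep the example dependency-free.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def build_stream_twiml(
    ws_url: str,
    params: Dict[str, str],
    status_callback: Optional[str] = None,
) -> str:
    """Build a <Response><Connect><Stream> document with custom parameters."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")
    attrs = {"url": ws_url}
    if status_callback:
        attrs["statusCallback"] = status_callback
    stream = ET.SubElement(connect, "Stream", attrs)
    # Custom values travel as <Parameter> children and arrive in the start event.
    for name, value in params.items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": value})
    return ET.tostring(response, encoding="unicode")
```

Calling `build_stream_twiml("wss://example.invalid/realtime", {"tenant": "abc123"})` yields a document with the same shape as the one above, with attribute values XML-escaped by ElementTree.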

// Outbound media event from your server to Twilio (base64 mulaw)
{"event":"media","streamSid":"MZxx","media":{"payload":"PT4+Pj4..."}}
// Mark to track playback position
{"event":"mark","streamSid":"MZxx","mark":{"name":"utterance-42-end"}}
// Clear to interrupt currently buffered audio (barge-in)
{"event":"clear","streamSid":"MZxx"}
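The three outbound messages can be produced by small helpers so that base64 encoding happens in exactly one place. The helper names are illustrative; the JSON shapes mirror the messages shown above.

```python
import base64
import json


def media_message(stream_sid: str, mulaw_frame: bytes) -> str:
    """Wrap raw mulaw bytes as an outbound `media` message."""
    payload = base64.b64encode(mulaw_frame).decode("ascii")
    return json.dumps(
        {"event": "media", "streamSid": stream_sid, "media": {"payload": payload}}
    )


def mark_message(stream_sid: str, name: str) -> str:
    """Tag the current end of the outbound buffer with a named mark."""
    return json.dumps(
        {"event": "mark", "streamSid": stream_sid, "mark": {"name": name}}
    )


def clear_message(stream_sid: str) -> str:
    """Ask Twilio to drop any audio it has buffered but not yet played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```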

CallSphere implementation

CallSphere uses TwiML as the load-bearing primitive across every product. Healthcare AI calls land on a FastAPI service on port 8084 that proxies the bidirectional stream into OpenAI Realtime over WebSocket; we send a Clear event the moment OpenAI's input_audio_buffer.speech_started fires, which gives sub-200 ms barge-in. Sales Calling AI fires up to 5 concurrent outbound calls per tenant, each on its own <Stream>. After-Hours AI uses a different pattern: a simultaneous call and SMS for 120 seconds. Real Estate AI, Salon AI, and IT Helpdesk AI all share the same wiring with per-vertical agent prompts. The platform spans 37 agents, 90+ tools, and 115+ DB tables, with HIPAA and SOC 2 attestations, $149/$499/$1499 plans, a 14-day trial, and a 22% affiliate program.
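The barge-in rule described here can be sketched as a pure function that maps one OpenAI Realtime event to the Twilio messages it should trigger. This is not CallSphere's actual code: `bridge_openai_event` is a hypothetical name, and the audio pass-through assumes the Realtime session was configured for g711_ulaw output so that the base64 `delta` can be forwarded unchanged.

```python
import json
from typing import List


def bridge_openai_event(event: dict, stream_sid: str) -> List[str]:
    """Translate one OpenAI Realtime event into zero or more Twilio messages."""
    kind = event.get("type")
    if kind == "input_audio_buffer.speech_started":
        # Caller started talking: flush whatever Twilio still has queued.
        return [json.dumps({"event": "clear", "streamSid": stream_sid})]
    if kind == "response.audio.delta":
        # Assumption: output format is g711_ulaw, so the payload needs no transcoding.
        return [
            json.dumps(
                {
                    "event": "media",
                    "streamSid": stream_sid,
                    "media": {"payload": event["delta"]},
                }
            )
        ]
    return []
```

Keeping the mapping free of I/O makes the interruption path easy to unit test; the WebSocket loop only has to send whatever strings come back.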

Build steps

Allocate a TwiML endpoint that returns the <Connect><Stream> response with your WebSocket URL.
Build the WebSocket handler: accept connection, parse start event for streamSid and parameters, then loop on media events.
Decode mulaw 8 kHz to L16 16 kHz before sending to OpenAI Realtime. Each Twilio frame is 160 bytes of mulaw, which is 20 ms of audio and expands to 160 linear samples; upsampling to 16 kHz yields 320 L16 samples per frame.
Encode model output back to mulaw 8 kHz; chunk into 20 ms frames; send as media events with the streamSid.
Send Mark events at sentence boundaries; OpenAI sends response.audio.delta events that you align with marks.
On input_audio_buffer.speech_started, send a Clear event immediately to flush Twilio's outbound buffer for natural interruption.
Monitor statusCallback for stream-error and stream-stopped to clean up server-side state.
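The handler and framing steps above can be sketched without committing to a WebSocket library. `StreamSession` is a hypothetical class: feed it each text message Twilio sends, and use `outbound_frames` on audio that is already mulaw-encoded. Padding the final short chunk with mulaw silence (0xFF) is a design choice of this sketch, not something Twilio requires.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import List

FRAME_BYTES = 160        # 20 ms of 8 kHz mulaw
MULAW_SILENCE = b"\xff"  # mulaw encoding of a zero-amplitude sample


@dataclass
class StreamSession:
    stream_sid: str = ""
    params: dict = field(default_factory=dict)
    inbound: bytearray = field(default_factory=bytearray)
    closed: bool = False

    def handle(self, raw: str) -> None:
        """Process one message from Twilio; unknown events are ignored."""
        msg = json.loads(raw)
        event = msg.get("event")
        if event == "start":
            self.stream_sid = msg["start"]["streamSid"]
            self.params = msg["start"].get("customParameters", {})
        elif event == "media":
            self.inbound += base64.b64decode(msg["media"]["payload"])
        elif event == "stop":
            self.closed = True

    def outbound_frames(self, mulaw: bytes) -> List[str]:
        """Split encoded audio into 20 ms media messages for this stream."""
        frames = []
        for i in range(0, len(mulaw), FRAME_BYTES):
            chunk = mulaw[i : i + FRAME_BYTES].ljust(FRAME_BYTES, MULAW_SILENCE)
            payload = base64.b64encode(chunk).decode("ascii")
            frames.append(
                json.dumps(
                    {
                        "event": "media",
                        "streamSid": self.stream_sid,
                        "media": {"payload": payload},
                    }
                )
            )
        return frames
```

In a real service the receive loop of your WebSocket framework would call `handle` for every incoming message and write each string from `outbound_frames` back to the socket.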

Pitfalls

One bidirectional <Stream> per call. Cannot fork to two AI services; must demux server-side.
DTMF inbound only (caller-to-server). Cannot send DTMF outbound from server through <Stream>.
Mulaw payload base64-encoded in JSON; if you forget to base64-decode, you stream garbage and the model says "Hello, hello, are you there?" forever.
Clear events take ~50 ms to take effect; do not assume instant flush.
Bidirectional streams have a 30-second idle timeout; send keepalive media frames or expect disconnects.
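Two of these pitfalls, server-side demux and undecoded payloads, can be handled in one place. The sketch below uses a hypothetical `FrameFanout` class that decodes each payload once, rejects anything that is not valid base64, and hands the raw mulaw bytes to every registered consumer.

```python
import base64
import binascii
from typing import Callable, List


class FrameFanout:
    """Decode each inbound payload once and deliver it to several consumers."""

    def __init__(self) -> None:
        self._sinks: List[Callable[[bytes], None]] = []
        self.dropped = 0  # count of payloads rejected as invalid base64

    def subscribe(self, sink: Callable[[bytes], None]) -> None:
        self._sinks.append(sink)

    def publish(self, payload_b64: str) -> None:
        try:
            # validate=True rejects characters outside the base64 alphabet.
            frame = base64.b64decode(payload_b64, validate=True)
        except binascii.Error:
            self.dropped += 1
            return
        for sink in self._sinks:
            sink(frame)
```

Each sink might forward to a different service, for example one to the realtime model and one to a separate transcription or analytics pipeline.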

FAQ

Should I use ConversationRelay instead of Streams for AI? ConversationRelay packages STT, LLM, TTS into one TwiML verb. Less control, faster build. <Stream> wins when you need custom STT/LLM/TTS, multi-modal, or non-OpenAI vendors.

What is the latency of a Twilio bidirectional Stream? 20-60 ms for the Twilio leg, plus your server hop, plus the model. End-to-end voice-to-voice 600-900 ms is typical with OpenAI Realtime.


Is mulaw lossy enough to hurt ASR? For Whisper and Deepgram on names and digits, yes; ~3-5% absolute WER hit vs G.722 wideband. Transcode upstream if your trunk supports it.

Can I record a Stream call? Yes via Twilio's separate recording API; the Stream itself does not store audio.

Mark vs Clear: when do I use which? Mark for tracking playback progress (used to align tool calls with what the user already heard). Clear for barge-in interruption.


Twilio TwiML Stream Deep Dive: Bidirectional Media for AI Voice in 2026
