Twilio TwiML Stream Deep Dive: Bidirectional Media for AI Voice in 2026
Twilio's <Connect><Stream> verb is the load-bearing primitive behind 80%+ of production AI voice in 2026. It gives you Mark and Clear events for barge-in, mulaw 8 kHz as the base audio format, and a hard limit of one stream per call. Here is how to build on it.
Twilio Media Streams started life in 2019 as a one-way stream-out feature. Bidirectional streaming went GA in 2023, and as of 2026 it is the substrate underneath ConversationRelay and probably 80% of Twilio-fronted AI voice products. The format is simple, the constraints are real, and once you understand Mark and Clear events, barge-in becomes a one-line change.
Background
Twilio Programmable Voice lets you control calls with TwiML, an XML markup whose verbs tell Twilio how to handle a call. <Start><Stream> is the older one-way variant; <Connect><Stream> is bidirectional and blocks subsequent TwiML until the WebSocket disconnects. The bidirectional version added Mark and Clear events: Mark lets you tag a position in your sent audio buffer and get a confirmation when Twilio plays past it; Clear empties Twilio's outbound buffer for instant interruption when the caller starts speaking.
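To make the Mark bookkeeping concrete, here is a minimal Python sketch of how a server might remember which marks it has sent and which ones Twilio has confirmed. The `MarkTracker` class and its method names are hypothetical (they are not part of any Twilio SDK), and the inbound message shape is assumed to mirror the mark JSON shown later in this article.

```python
from collections import deque


class MarkTracker:
    """Hypothetical helper: remembers sent marks and records confirmations."""

    def __init__(self):
        self._pending = deque()  # marks sent to Twilio, not yet confirmed
        self.confirmed = []      # marks Twilio has echoed back, in arrival order

    def sent(self, name):
        """Call right after writing a mark message to the WebSocket."""
        self._pending.append(name)

    def handle_event(self, event):
        """Feed every inbound message here; returns True if it was a mark."""
        if event.get("event") != "mark":
            return False
        name = event["mark"]["name"]
        if name in self._pending:
            self._pending.remove(name)
        self.confirmed.append(name)
        return True

    def pending(self):
        """Marks whose audio the caller has not been confirmed to have heard."""
        return list(self._pending)
```

After a barge-in, whatever is still in `pending()` is a reasonable proxy for audio the caller never heard, which is the information you need before telling the model what was actually said.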
The 8 kHz mulaw default is the friction point. OpenAI Realtime accepts G.711 directly, so for many builders Twilio's native format is fine end-to-end. For better quality you transcode upstream to 16 kHz L16 or Opus.
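As a rough illustration of that transcoding step, the sketch below expands G.711 mu-law bytes to 16-bit linear samples and doubles the sample rate to the 16 kHz L16 target mentioned above. The function names are hypothetical, and the 2x upsampler uses plain linear interpolation with no filtering, which is an assumption made for brevity rather than a recommendation for production audio.

```python
import struct


def ulaw_to_pcm16(byte):
    """Expand one G.711 mu-law byte to a signed 16-bit linear sample."""
    u = ~byte & 0xFF                      # mu-law bytes are stored inverted
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u >> 4) & 0x07)
    magnitude -= 0x84                     # remove the encoder bias
    return -magnitude if u & 0x80 else magnitude


# Precompute all 256 values once so decoding a frame is a table lookup.
ULAW_TABLE = [ulaw_to_pcm16(b) for b in range(256)]


def decode_frame(mulaw):
    """Decode a bytes object of mu-law audio into a list of PCM samples."""
    return [ULAW_TABLE[b] for b in mulaw]


def upsample_2x(samples):
    """8 kHz -> 16 kHz by linear interpolation (naive, unfiltered)."""
    out = []
    for i, s in enumerate(samples):
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append(s)
        out.append((s + nxt) // 2)
    return out


def to_l16_bytes(samples):
    """Pack samples as little-endian signed 16-bit PCM."""
    return struct.pack("<%dh" % len(samples), *samples)
```

A 160-byte Twilio frame decodes to 160 samples, upsamples to 320 samples, and packs to 640 bytes of L16.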
Architecture
graph LR
A[Caller PSTN] --> B[Twilio Voice]
B -->|mulaw 8k 20ms frames| C[Your WebSocket Server]
C -->|JSON media events| D[Audio Decoder]
D -->|L16 16k| E[OpenAI Realtime]
E -->|Opus or PCM back| F[Audio Encoder]
F -->|mulaw 8k frames| C
C -->|JSON media + mark + clear| B
B --> A
<Response>
<Connect>
<Stream url="wss://bridge.callsphere.ai/realtime"
track="inbound_track"
statusCallback="https://callsphere.ai/api/twilio/stream-status">
<Parameter name="tenant" value="abc123"/>
<Parameter name="agent" value="healthcare-intake"/>
</Stream>
</Connect>
</Response>
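If you generate this TwiML from code rather than a static file, a sketch along the following lines works with only the Python standard library. The function name `build_stream_twiml` and the example URLs in the usage note are hypothetical; building the document with an XML library rather than string concatenation means attribute values are escaped for you.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional


def build_stream_twiml(ws_url: str,
                       params: Dict[str, str],
                       status_callback: Optional[str] = None) -> str:
    """Return a <Response><Connect><Stream> document as a string."""
    response = ET.Element("Response")
    connect = ET.SubElement(response, "Connect")

    attrs = {"url": ws_url}
    if status_callback:
        attrs["statusCallback"] = status_callback
    stream = ET.SubElement(connect, "Stream", attrs)

    # Each custom parameter becomes a <Parameter name="..." value="..."/> child.
    for name, value in params.items():
        ET.SubElement(stream, "Parameter", {"name": name, "value": value})

    return ET.tostring(response, encoding="unicode")
```

For example, `build_stream_twiml("wss://example.invalid/realtime", {"tenant": "abc123"})` returns a single-line document that your HTTP endpoint can serve with an XML content type.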
// Outbound media event from your server to Twilio (base64 mulaw)
{"event":"media","streamSid":"MZxx","media":{"payload":"PT4+Pj4..."}}
// Mark to track playback position
{"event":"mark","streamSid":"MZxx","mark":{"name":"utterance-42-end"}}
// Clear to interrupt currently buffered audio (barge-in)
{"event":"clear","streamSid":"MZxx"}
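The three messages above are easy to get subtly wrong by hand, so a small set of builders helps. This is a minimal Python sketch; the function names are hypothetical and the message shapes simply mirror the JSON shown above.

```python
import base64
import json


def media_message(stream_sid: str, mulaw_frame: bytes) -> str:
    """Wrap raw mu-law bytes as a base64 payload in a media event."""
    payload = base64.b64encode(mulaw_frame).decode("ascii")
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": payload},
    })


def mark_message(stream_sid: str, name: str) -> str:
    """Ask Twilio to confirm when playback passes this point."""
    return json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": name},
    })


def clear_message(stream_sid: str) -> str:
    """Ask Twilio to drop audio that is buffered but not yet played."""
    return json.dumps({"event": "clear", "streamSid": stream_sid})
```

Each function returns a string ready to send as a WebSocket text frame, for example `clear_message("MZxx")` on barge-in.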
CallSphere implementation
CallSphere uses TwiML
Build steps
- Allocate a TwiML endpoint that returns the <Connect><Stream> response with your WebSocket URL.
- Build the WebSocket handler: accept connection, parse start event for streamSid and parameters, then loop on media events.
- Decode mulaw 8 kHz to L16 16 kHz before sending to OpenAI Realtime. Each Twilio frame is 160 bytes of mulaw, which is 20 ms of audio and expands to 160 PCM samples; upsampling to 16 kHz yields 320 L16 samples.
- Encode model output back to mulaw 8 kHz; chunk into 20 ms frames; send as media events with the streamSid.
- Send Mark events at sentence boundaries; OpenAI sends response.audio.delta events that you align with marks.
- On OpenAI's input_audio_buffer.speech_started event, send a Clear event immediately to flush Twilio's outbound buffer for natural interruption.
- Monitor statusCallback for stream-error and stream-stopped to clean up server-side state.
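The encode, frame, and mark steps above can be sketched as follows. This is a hedged illustration, not a reference implementation: the function names are hypothetical, the input is assumed to be 16-bit samples already resampled to 8 kHz, and padding a short final frame with mu-law silence is a design choice made here for simplicity.

```python
import base64
import json
from typing import List

BIAS = 0x84             # G.711 mu-law encoder bias
CLIP = 32635            # largest magnitude that still fits after adding BIAS
FRAME_BYTES = 160       # 20 ms at 8 kHz, one byte per sample
ULAW_SILENCE = b"\xff"  # mu-law encoding of a zero sample


def pcm16_to_ulaw(sample: int) -> int:
    """Compress one signed 16-bit sample to a G.711 mu-law byte."""
    sign = 0x80 if sample < 0 else 0x00
    magnitude = min(abs(sample), CLIP) + BIAS
    exponent = magnitude.bit_length() - 8          # segment number, 0..7
    mantissa = (magnitude >> (exponent + 3)) & 0x0F
    return ~(sign | (exponent << 4) | mantissa) & 0xFF


def frame_mulaw(audio: bytes) -> List[bytes]:
    """Split mu-law audio into 160-byte frames, padding the last with silence."""
    frames = []
    for i in range(0, len(audio), FRAME_BYTES):
        chunk = audio[i:i + FRAME_BYTES]
        frames.append(chunk.ljust(FRAME_BYTES, ULAW_SILENCE))
    return frames


def utterance_messages(stream_sid: str, pcm_8k: List[int], mark_name: str) -> List[str]:
    """Media events for one utterance, followed by a mark at its end."""
    mulaw = bytes(pcm16_to_ulaw(s) for s in pcm_8k)
    messages = [
        json.dumps({
            "event": "media",
            "streamSid": stream_sid,
            "media": {"payload": base64.b64encode(frame).decode("ascii")},
        })
        for frame in frame_mulaw(mulaw)
    ]
    messages.append(json.dumps({
        "event": "mark",
        "streamSid": stream_sid,
        "mark": {"name": mark_name},
    }))
    return messages
```

Sending the mark as the last message of each utterance means its confirmation tells you the caller heard the whole sentence, which is what the Clear path needs to know when a barge-in arrives.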
Pitfalls
- One <Connect><Stream> per call. Cannot fork to two AI services; must demux server-side.
- DTMF inbound only (caller-to-server). Cannot send DTMF outbound from server through <Connect><Stream>.
- Mulaw payload base64-encoded in JSON; if you forget to base64-decode, you stream garbage and the model says "Hello, hello, are you there?" forever.
- Clear events take ~50 ms to take effect; do not assume instant flush.
- Bidirectional streams have a 30-second idle timeout; send keepalive media frames or expect disconnects.
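Two of these pitfalls are cheap to guard against in code. The sketch below decodes inbound payloads in one place, so raw base64 text never reaches the model, and builds a 20 ms silence frame you could send as a keepalive. The function names are hypothetical; the start-event field names (`start.streamSid`, `start.customParameters`) follow Twilio's documented message shape as I understand it, and the idle-timeout figure is taken from the list above, so verify both against the WebSocket Messages reference.

```python
import base64
import json

FRAME_BYTES = 160  # 20 ms of 8 kHz mu-law


def parse_inbound(raw: str):
    """Return (event_name, data) for one inbound WebSocket text frame."""
    msg = json.loads(raw)
    kind = msg.get("event")
    if kind == "start":
        start = msg["start"]
        return kind, {
            "streamSid": start["streamSid"],
            # <Parameter> values from the TwiML arrive here.
            "params": start.get("customParameters", {}),
        }
    if kind == "media":
        # Decode exactly once, here, so callers only ever see raw mu-law bytes.
        return kind, base64.b64decode(msg["media"]["payload"])
    return kind, None


def silence_keepalive(stream_sid: str) -> str:
    """A media event carrying one frame of mu-law silence (0xFF bytes)."""
    payload = base64.b64encode(b"\xff" * FRAME_BYTES).decode("ascii")
    return json.dumps({
        "event": "media",
        "streamSid": stream_sid,
        "media": {"payload": payload},
    })
```

Routing every message through a single `parse_inbound` call keeps the base64 boundary in one spot, which makes the "streamed garbage" failure mode much harder to reintroduce.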
FAQ
Should I use ConversationRelay instead of Streams for AI?
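ConversationRelay packages STT and TTS behind one TwiML verb and exchanges text with your application, where the LLM runs. You get less control over the audio path in exchange for a faster build.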
What is the latency of a Twilio bidirectional Stream?
20-60 ms for the Twilio leg, plus your server hop, plus the model. End-to-end voice-to-voice latency of 600-900 ms is typical with OpenAI Realtime.
Is mulaw lossy enough to hurt ASR?
For Whisper and Deepgram on names and digits, yes; ~3-5% absolute WER hit vs G.722 wideband. Transcode upstream if your trunk supports it.
Can I record a Stream call?
Yes, via Twilio's separate recording API; the Stream itself does not store audio.
Mark vs Clear: when do I use which?
Mark for tracking playback progress (used to align tool calls with what the user already heard). Clear for barge-in interruption.