AI Infrastructure · 12 min read

Twilio Voice <Stream> Bidirectional Patterns for AI Agents (2026)

Bidirectional Media Streams carry raw mulaw in both directions over a single WebSocket. We break down the four patterns CallSphere ships in production: proxy-to-OpenAI, sidecar STT, conference fork, and replay-on-reconnect.

TL;DR — Bidirectional <Stream> is the cleanest path from PSTN to a Realtime LLM. Send mulaw 8 kHz both ways, mark every chunk with a sequence number, and gate barge-in on the mark event — not on the audio buffer.

Background

Twilio's <Stream> verb opens a WebSocket from the call leg to your server. In unidirectional mode you only receive audio (good for transcription). In bidirectional mode (<Stream bidirectional="true">) you can also push base64-encoded mulaw frames back into the call. That second direction is what unlocks AI voice agents — you stream OpenAI Realtime / Deepgram Aura / ElevenLabs output straight onto the PSTN line without a second SIP leg.

Anatomy of a stream:

  • start event — once per call, contains streamSid, callSid, accountSid, custom parameters.
  • media events — 20 ms mulaw frames, base64-encoded, ~50 per second per direction.
  • mark events — your own labels. Twilio echoes them back when the corresponding outbound audio finishes playing. This is the only reliable barge-in signal.
  • stop event — leg ended.
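The four events above arrive as JSON text frames on the WebSocket. A minimal classifier over those message shapes, with field names as listed in the anatomy; the sample messages below are illustrative, not real call data:

```javascript
// Classify a raw Twilio Media Streams message and pull out the field
// a handler typically needs next. Shapes follow the event anatomy above.
function classify(raw) {
  const evt = JSON.parse(raw);
  switch (evt.event) {
    case "start": return { kind: "start", streamSid: evt.start.streamSid };
    case "media": return { kind: "media", payload: evt.media.payload }; // base64 mulaw, ~20 ms
    case "mark":  return { kind: "mark", name: evt.mark.name };         // echo of our own label
    case "stop":  return { kind: "stop" };
    default:      return { kind: "unknown" };
  }
}

const start = classify(JSON.stringify({ event: "start", start: { streamSid: "MZxxxx" } }));
const mark  = classify(JSON.stringify({ event: "mark", mark: { name: "tts-end" } }));
```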

Architecture / config

flowchart LR
  PSTN[Caller / PSTN] --> TW[Twilio Voice]
  TW -- TwiML <Stream bidirectional> --> WS[wss://yourapp/stream]
  WS -- inbound mulaw --> STT[STT or Realtime API]
  STT --> LLM[LLM turn]
  LLM --> TTS[TTS or Realtime API]
  TTS -- outbound mulaw --> WS
  WS -- "mark" events --> BARGE[Barge-in detector]
  BARGE -- "clear" --> WS

Four patterns we run in production:

  1. Proxy-to-Realtime — your WS server proxies frames straight into OpenAI Realtime over a second WS. ~120 ms median round trip.
  2. Sidecar STT + LLM + TTS — split STT (Deepgram), LLM (Anthropic / OpenAI Chat), TTS (ElevenLabs streaming). Higher latency (~450 ms) but per-stage observability.
  3. Conference fork — call goes into a Twilio <Conference>, you fork audio to your AI stream, and an AI participant is added back via a TwiML App. Useful when the AI joins as a third participant.
  4. Replay-on-reconnect — buffer last 8 s of inbound + last 4 s of outbound on Redis; on stop followed by a new start with the same callSid, replay so the LLM has continuity.
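Pattern 4 boils down to a bounded per-direction frame buffer. A sketch with an in-memory ring standing in for Redis; the frame math follows from 20 ms frames (50 frames/s), so 8 s inbound is 400 frames and 4 s outbound is 200:

```javascript
// Bounded replay buffer: keeps only the last `maxFrames` base64 frames.
// In production this would be a Redis list keyed by callSid; an array
// stands in here for illustration.
class ReplayBuffer {
  constructor(maxFrames) {
    this.maxFrames = maxFrames;
    this.frames = [];
  }
  push(b64Frame) {
    this.frames.push(b64Frame);
    if (this.frames.length > this.maxFrames) this.frames.shift(); // drop oldest
  }
  drain() {
    const out = this.frames;
    this.frames = [];
    return out; // replay these into the new stream on reconnect
  }
}

const inbound = new ReplayBuffer(400); // 8 s of 20 ms frames
for (let i = 0; i < 450; i++) inbound.push(`frame-${i}`);
```

On a `stop` followed by a new `start` with the same callSid, `drain()` gives you the audio to replay so the LLM keeps context.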

CallSphere implementation

CallSphere runs Twilio across all six verticals. The Healthcare agent fronts a FastAPI service on port :8084 that proxies bidirectional audio into OpenAI Realtime; Sales runs five concurrent outbound calls per account with separate WS workers; the After-hours agent fires a simultaneous voice call + SMS in a 120-second race. Every leg flows through the same /twilio/stream Fastify route, with streamSid keyed into Postgres for replay.

Stack snapshot:

  • 37 specialized agents · 90+ tools · 115+ DB tables · 6 verticals.
  • HIPAA + SOC 2 — TLS to the WS, mulaw recording opt-in per tenant, BAA covers Twilio + OpenAI.
  • $149 / $499 / $1499 plans · 14-day trial · 22% lifetime affiliate.

Build steps with code

<!-- TwiML returned from your /voice webhook -->
<Response>
  <Connect>
    <Stream url="wss://api.callsphere.ai/twilio/stream" bidirectional="true">
      <Parameter name="tenant_id" value="tnt_123"/>
      <Parameter name="agent" value="healthcare-intake"/>
    </Stream>
  </Connect>
</Response>
// Fastify WS handler — frames inbound, mark-gated barge-in
import Fastify from "fastify";
import websocket from "@fastify/websocket";

const app = Fastify();
app.register(websocket);
// `openai` is your Realtime client wrapper, defined elsewhere
app.get("/twilio/stream", { websocket: true }, (conn) => {
  let streamSid = "";
  conn.socket.on("message", async (raw) => {
    const evt = JSON.parse(raw.toString());
    if (evt.event === "start") streamSid = evt.start.streamSid; // capture once per call
    if (evt.event === "media") openai.sendAudio(evt.media.payload); // inbound mulaw → Realtime
    if (evt.event === "mark" && evt.mark.name === "tts-end") openai.flush(); // playback confirmed
  });
  openai.on("audio", (b64) => {
    // streamSid must be echoed on every outbound frame or Twilio drops it
    conn.socket.send(JSON.stringify({ event: "media", streamSid, media: { payload: b64 } }));
    conn.socket.send(JSON.stringify({ event: "mark", streamSid, mark: { name: "tts-end" } }));
  });
});
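The TwiML at the top of this section has to be rendered per tenant by your /voice webhook. A sketch of a templating helper; the helper name is ours, while the URL and parameter values are the ones from the snippet above:

```javascript
// Build the <Connect><Stream> TwiML as a string; tenant/agent values
// come from your own call routing. No escaping is done here, so only
// pass trusted identifiers.
function buildStreamTwiml({ url, tenantId, agent }) {
  return [
    "<Response>",
    "  <Connect>",
    `    <Stream url="${url}" bidirectional="true">`,
    `      <Parameter name="tenant_id" value="${tenantId}"/>`,
    `      <Parameter name="agent" value="${agent}"/>`,
    "    </Stream>",
    "  </Connect>",
    "</Response>",
  ].join("\n");
}

const twiml = buildStreamTwiml({
  url: "wss://api.callsphere.ai/twilio/stream",
  tenantId: "tnt_123",
  agent: "healthcare-intake",
});
```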

Pitfalls

  • Forgetting bidirectional="true" — you'll silently get one-way audio and waste an afternoon.
  • Not echoing streamSid in outbound media — Twilio drops the frame.
  • Using a 16 kHz sample rate — <Stream> is mulaw 8 kHz only on PSTN; resample.
  • Treating audio buffer length as barge-in — race condition. Trust mark events.
  • Logging full base64 frames — explodes Datadog cost; log every 200th frame at most.
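The two mark-related pitfalls reduce to a small state machine: count marks sent, decrement on echoes, and treat caller speech while any mark is outstanding as barge-in (then send Twilio a clear frame). A sketch of the gate itself; VAD and the actual clear send are assumed to live elsewhere:

```javascript
// Barge-in gate: outbound audio is "playing" while any mark we sent has
// not been echoed back by Twilio. Caller speech inside that window is a
// barge-in → send { event: "clear", streamSid } to cut playback.
class BargeInGate {
  constructor() { this.outstanding = 0; }
  onMarkSent()   { this.outstanding++; }                            // queued audio + mark
  onMarkEchoed() { this.outstanding = Math.max(0, this.outstanding - 1); }
  shouldInterrupt(callerIsSpeaking) {
    return callerIsSpeaking && this.outstanding > 0;
  }
}

const gate = new BargeInGate();
gate.onMarkSent();
const interrupted = gate.shouldInterrupt(true); // audio still in flight → barge-in
gate.onMarkEchoed();
const idle = gate.shouldInterrupt(true);        // playback finished → normal turn
```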

FAQ

Q: How many bidirectional streams per Twilio account? Default cap is 100 concurrent; raise via support ticket. We run 800 concurrent in production.


Q: Mulaw vs PCM? PSTN is mulaw 8 kHz. Twilio <Stream> does not transcode for you — your TTS must output mulaw or you must resample server-side.
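"Resample server-side" in practice means downsampling your TTS output to 8 kHz and then G.711 mulaw-encoding each 16-bit sample. The encode step is short; this is the generic G.711 algorithm, not a Twilio-specific API:

```javascript
// G.711 mu-law encode: one signed 16-bit PCM sample → one mulaw byte.
function linearToMulaw(sample) {
  const BIAS = 0x84, CLIP = 32635;
  let s = sample;
  const sign = (s >> 8) & 0x80;      // keep the sign bit
  if (sign) s = -s;                  // work on the magnitude
  if (s > CLIP) s = CLIP;            // clamp before biasing
  s += BIAS;
  let exponent = 7;                  // find the segment (position of top set bit)
  for (let mask = 0x4000; (s & mask) === 0 && exponent > 0; mask >>= 1) exponent--;
  const mantissa = (s >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff; // mulaw bytes are bit-inverted
}

const silence = linearToMulaw(0);       // → 0xFF, the mulaw "silence" byte
const fullScale = linearToMulaw(32767); // → 0x80, full-scale positive
```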

Q: Can I record while streaming? Yes — <Start><Stream/></Start> plus standard <Record> works. Recordings are stored separately.

Q: How do I detect dropped streams? Watch for stop events without prior mark echoes within 5 s. Reconnect with replay buffer.
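That 5-second rule fits a per-call watchdog: track the last mark sent and the last echo received, and flag the leg when an echo is overdue. A sketch with an injectable clock so it is testable; the class and method names are ours:

```javascript
// Stale-stream detector: we expect a mark echo within `windowMs` of the
// last outbound mark; no echo inside the window → assume the leg died.
class StreamWatchdog {
  constructor(windowMs = 5000, now = Date.now) {
    this.windowMs = windowMs;
    this.now = now;
    this.lastSentAt = null;   // last outbound mark sent
    this.lastEchoAt = null;   // last mark echoed by Twilio
  }
  markSent()   { this.lastSentAt = this.now(); }
  markEchoed() { this.lastEchoAt = this.now(); }
  isStale() {
    if (this.lastSentAt === null) return false;  // nothing in flight
    if (this.lastEchoAt !== null && this.lastEchoAt >= this.lastSentAt) return false;
    return this.now() - this.lastSentAt > this.windowMs;
  }
}

let t = 0;
const dog = new StreamWatchdog(5000, () => t);
dog.markSent();              // mark goes out at t = 0
t = 6000;
const stale = dog.isStale(); // 6 s later, still no echo
```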

Q: Latency floor? ~80 ms one-way Twilio→WS in us-east-1. Add LLM + TTS to estimate end-to-end.
