
Building Real-Time Voice Agents with OpenAI Realtime API and WebRTC in 2026

Step-by-step tutorial on building production voice agents using OpenAI's Realtime API with WebRTC, server VAD, PCM16 audio streaming, and Twilio telephony integration.

Why the OpenAI Realtime API Changes Voice Agent Development

Before the Realtime API, building a voice agent required stitching together three separate services: a speech-to-text provider, an LLM for reasoning, and a text-to-speech provider. Each hop added 200-400ms of latency. A typical pipeline hit 1.2-2 seconds of total response time — noticeable enough to break conversational flow.

The OpenAI Realtime API collapses this into a single WebSocket or WebRTC connection. Raw audio goes in, reasoned audio comes out. The model handles speech recognition, reasoning, and speech synthesis internally using GPT-4o's multimodal capabilities. Total response latency drops to 300-500ms, which falls within the range of natural human conversation pauses.

This tutorial walks through building a production voice agent from scratch using the Realtime API with WebRTC for browser-based interactions and Twilio for telephone integration.

Architecture Overview

The system has three components: a browser client using WebRTC, a backend server that manages sessions and ephemeral tokens, and the OpenAI Realtime API endpoint.

Architecture flow:

Browser (WebRTC) <-> OpenAI Realtime API (gpt-4o-realtime)
                         |
                    Function calls
                         |
                  Your Backend Server
                  (tool execution, DB, etc.)

WebRTC provides the transport layer. The browser captures microphone audio, sends it to OpenAI's servers via a peer connection, and receives synthesized audio back. Your backend server handles ephemeral token generation and tool execution when the model calls functions.

Step 1: Generate an Ephemeral Token

Never expose your OpenAI API key to the browser. Instead, create a short-lived ephemeral token on your backend.

// server/routes/session.ts
// server/routes/session.ts
import express from "express";

const router = express.Router();

router.post("/api/session", async (req, res) => {
  const { voice = "alloy", instructions } = req.body;

  try {
    const response = await fetch(
      "https://api.openai.com/v1/realtime/sessions",
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
          "Content-Type": "application/json",
        },
        body: JSON.stringify({
          model: "gpt-4o-realtime-preview-2026-01-21",
          voice,
          modalities: ["text", "audio"],
          instructions:
            instructions ||
            "You are a helpful customer service agent for CallSphere. " +
            "Be concise and professional. Ask clarifying questions when needed.",
          turn_detection: {
            type: "server_vad",
            threshold: 0.5,
            prefix_padding_ms: 300,
            silence_duration_ms: 600,
          },
          tools: [
            {
              type: "function",
              name: "lookup_customer",
              description: "Look up a customer by phone number or account ID",
              parameters: {
                type: "object",
                properties: {
                  phone: { type: "string", description: "Customer phone number" },
                  account_id: { type: "string", description: "Account ID" },
                },
              },
            },
            {
              type: "function",
              name: "schedule_appointment",
              description: "Schedule an appointment for the customer",
              parameters: {
                type: "object",
                properties: {
                  customer_id: { type: "string" },
                  date: { type: "string", description: "ISO 8601 date" },
                  time: { type: "string", description: "HH:MM format" },
                  service_type: { type: "string" },
                },
                required: ["customer_id", "date", "time", "service_type"],
              },
            },
          ],
        }),
      }
    );

    const data = await response.json();
    // data.client_secret.value contains the ephemeral token
    res.json({
      token: data.client_secret.value,
      expires_at: data.client_secret.expires_at,
    });
  } catch (error) {
    console.error("Session creation failed:", error);
    res.status(500).json({ error: "Failed to create session" });
  }
});

export default router;

The ephemeral token expires after 60 seconds — enough time for the browser to establish the WebRTC connection, after which the token is no longer needed.

Step 2: Establish the WebRTC Connection

On the browser side, use the ephemeral token to create a peer connection directly to OpenAI.

// client/voice-agent.ts
class VoiceAgent {
  private pc: RTCPeerConnection | null = null;
  private dc: RTCDataChannel | null = null;
  private audioElement: HTMLAudioElement;

  constructor() {
    this.audioElement = document.createElement("audio");
    this.audioElement.autoplay = true;
  }

  async connect(): Promise<void> {
    // Step 1: Get ephemeral token from our backend
    const sessionRes = await fetch("/api/session", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        voice: "alloy",
        instructions: "You are a helpful voice assistant.",
      }),
    });
    const { token } = await sessionRes.json();

    // Step 2: Create peer connection
    this.pc = new RTCPeerConnection();

    // Step 3: Set up audio playback for model responses
    this.pc.ontrack = (event) => {
      this.audioElement.srcObject = event.streams[0];
    };

    // Step 4: Capture microphone and add track
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    stream.getTracks().forEach((track) => {
      this.pc!.addTrack(track, stream);
    });

    // Step 5: Create data channel for events (function calls, transcripts)
    this.dc = this.pc.createDataChannel("oai-events");
    this.dc.onmessage = (event) => this.handleServerEvent(JSON.parse(event.data));

    // Step 6: Create and set local offer
    const offer = await this.pc.createOffer();
    await this.pc.setLocalDescription(offer);

    // Step 7: Send offer to OpenAI, get answer
    const sdpResponse = await fetch(
      "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2026-01-21",
      {
        method: "POST",
        headers: {
          Authorization: `Bearer ${token}`,
          "Content-Type": "application/sdp",
        },
        body: offer.sdp,
      }
    );

    const answerSdp = await sdpResponse.text();
    await this.pc.setRemoteDescription({ type: "answer", sdp: answerSdp });

    console.log("WebRTC connection established");
  }

  private handleServerEvent(event: any): void {
    switch (event.type) {
      case "response.function_call_arguments.done":
        this.executeFunction(event);
        break;
      case "conversation.item.input_audio_transcription.completed":
        console.log("User said:", event.transcript);
        break;
      case "response.audio_transcript.done":
        console.log("Agent said:", event.transcript);
        break;
      case "error":
        console.error("Realtime API error:", event.error);
        break;
    }
  }

  private async executeFunction(event: any): Promise<void> {
    const { name, arguments: args, call_id } = event;
    let result: any;

    try {
      // Execute the function on your backend
      const response = await fetch(`/api/tools/${name}`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: args,
      });
      result = await response.json();
    } catch (error) {
      result = { error: "Tool execution failed" };
    }

    // Send the result back through the data channel
    this.dc?.send(
      JSON.stringify({
        type: "conversation.item.create",
        item: {
          type: "function_call_output",
          call_id,
          output: JSON.stringify(result),
        },
      })
    );

    // Trigger the model to continue responding
    this.dc?.send(JSON.stringify({ type: "response.create" }));
  }

  disconnect(): void {
    this.dc?.close();
    this.pc?.close();
    this.pc = null;
    this.dc = null;
  }
}

Step 3: Server VAD Configuration

Server-side Voice Activity Detection (VAD) is what makes the conversation feel natural. The model listens for speech, detects when the user stops talking, and automatically generates a response.

The three critical VAD parameters are:

  • threshold (0.0-1.0): Sensitivity for detecting speech. Lower values detect quieter speech but increase false positives from background noise. Default 0.5 works for most environments.
  • prefix_padding_ms: How many milliseconds of audio before detected speech to include. 300ms captures the beginning of words that might otherwise be clipped.
  • silence_duration_ms: How long the user must be silent before the model considers the turn complete. 500-700ms is the sweet spot — shorter causes premature cutoffs, longer feels sluggish.
# Python example: Tuning VAD for different environments
vad_configs = {
    "quiet_office": {
        "type": "server_vad",
        "threshold": 0.4,
        "prefix_padding_ms": 200,
        "silence_duration_ms": 500,
    },
    "noisy_call_center": {
        "type": "server_vad",
        "threshold": 0.7,
        "prefix_padding_ms": 400,
        "silence_duration_ms": 700,
    },
    "phone_line": {
        "type": "server_vad",
        "threshold": 0.5,
        "prefix_padding_ms": 300,
        "silence_duration_ms": 600,
    },
}

Step 4: Twilio Integration for Phone Calls

For telephone-based voice agents, Twilio provides the bridge between PSTN phone calls and your WebSocket-based voice agent. The flow is: caller dials your Twilio number, Twilio opens a WebSocket media stream to your server, your server relays audio between Twilio and OpenAI.

# server/twilio_handler.py
import os
import json
import asyncio
import websockets
from fastapi import FastAPI, WebSocket
from fastapi.responses import Response
from twilio.twiml.voice_response import VoiceResponse, Connect

app = FastAPI()

OPENAI_WS_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2026-01-21"

@app.post("/twilio/incoming")
async def handle_incoming_call():
    """Twilio webhook: return TwiML that connects to our WebSocket."""
    response = VoiceResponse()
    connect = Connect()
    connect.stream(
        url=f"wss://{os.environ['SERVER_HOST']}/twilio/media-stream"
    )
    response.append(connect)
    # Twilio expects an XML body, so set the media type explicitly
    return Response(content=str(response), media_type="application/xml")

@app.websocket("/twilio/media-stream")
async def media_stream(ws: WebSocket):
    """Bridge between Twilio media stream and OpenAI Realtime API."""
    await ws.accept()

    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }

    async with websockets.connect(OPENAI_WS_URL, extra_headers=headers) as openai_ws:
        stream_sid = None

        # Configure the session
        await openai_ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "voice": "alloy",
                "instructions": "You are a phone-based customer service agent.",
                "input_audio_format": "g711_ulaw",
                "output_audio_format": "g711_ulaw",
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.5,
                    "silence_duration_ms": 600,
                },
            },
        }))

        async def relay_twilio_to_openai():
            """Forward Twilio audio to OpenAI."""
            nonlocal stream_sid
            async for message in ws.iter_text():
                data = json.loads(message)
                if data["event"] == "media":
                    await openai_ws.send(json.dumps({
                        "type": "input_audio_buffer.append",
                        "audio": data["media"]["payload"],
                    }))
                elif data["event"] == "start":
                    stream_sid = data["start"]["streamSid"]

        async def relay_openai_to_twilio():
            """Forward OpenAI audio to Twilio."""
            async for message in openai_ws:
                event = json.loads(message)
                if event["type"] == "response.audio.delta":
                    await ws.send_json({
                        "event": "media",
                        "streamSid": stream_sid,
                        "media": {"payload": event["delta"]},
                    })
                elif event["type"] == "response.function_call_arguments.done":
                    result = await execute_tool(event["name"], event["arguments"])
                    await openai_ws.send(json.dumps({
                        "type": "conversation.item.create",
                        "item": {
                            "type": "function_call_output",
                            "call_id": event["call_id"],
                            "output": json.dumps(result),
                        },
                    }))
                    await openai_ws.send(json.dumps({"type": "response.create"}))

        await asyncio.gather(
            relay_twilio_to_openai(),
            relay_openai_to_twilio(),
        )

Note the audio format: Twilio uses G.711 u-law encoding, so you must set input_audio_format and output_audio_format to g711_ulaw. The Realtime API handles the conversion internally.
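If you want to inspect or archive the raw Twilio audio yourself (for example, for the quality monitoring discussed later), you can expand G.711 u-law bytes to linear PCM16 locally. This is a minimal sketch of the standard u-law expansion; the function names are ours, and the stdlib `audioop` module (deprecated since Python 3.11) performs the same conversion:

```python
def ulaw_to_pcm16(byte: int) -> int:
    """Expand one G.711 u-law byte to a signed 16-bit PCM sample."""
    byte = ~byte & 0xFF                  # u-law bytes are stored complemented
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample

def decode_ulaw(payload: bytes) -> list[int]:
    """Decode a whole u-law frame, e.g. a Twilio media payload after base64."""
    return [ulaw_to_pcm16(b) for b in payload]
```

Twilio streams 8 kHz mono, so one second of a call is 8,000 of these samples.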

Step 5: Handling Interruptions

Natural conversations involve interruptions. The Realtime API handles this through the response.cancel event. When server VAD detects the user speaking while the model is generating audio, it automatically truncates the current response.

Your client needs to handle the truncation gracefully:


// In handleServerEvent:
case "response.audio.done":
  // Response completed normally
  this.updateUI({ status: "listening" });
  break;

case "input_audio_buffer.speech_started":
  // User started speaking — model will auto-truncate if responding
  this.updateUI({ status: "user_speaking" });
  break;

case "response.cancelled":
  // Model response was interrupted by user speech
  console.log("Response interrupted by user");
  break;

Production Considerations

Connection resilience: WebRTC connections drop. Implement automatic reconnection with exponential backoff. Cache the conversation history so the agent can resume context after reconnection.
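The reconnection loop can be sketched as follows; the helper name and retry constants are our choices, and `connect` stands in for whatever re-runs your WebRTC handshake:

```python
import asyncio
import random

async def reconnect_with_backoff(connect, max_attempts: int = 6,
                                 base_delay: float = 0.5, cap: float = 15.0):
    """Retry an async `connect` callable with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return await connect()  # e.g. fetch a fresh token, redo the SDP exchange
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            delay = min(cap, base_delay * (2 ** attempt))
            delay *= 0.5 + random.random() / 2  # jitter avoids thundering herds
            await asyncio.sleep(delay)
```

The jitter matters when many clients drop at once (e.g. a server restart): without it, they all retry in lockstep.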

Audio quality monitoring: Track audio levels and report silence or noise issues. A microphone that stops sending audio should trigger a user prompt, not silent confusion.

Cost management: The Realtime API bills per audio minute for both input and output. Implement idle timeout detection — if no speech is detected for 30 seconds, prompt the user or end the session.

Logging and compliance: For regulated industries, capture both the audio stream and the transcript. The Realtime API provides transcript events that you can log without additional STT costs.
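A minimal transcript logger over the two transcript events used earlier in the client code (the helper name and log record shape are ours):

```python
def log_transcript(event: dict, log: list) -> None:
    """Append completed user/agent transcript events to an audit log."""
    if event.get("type") == "conversation.item.input_audio_transcription.completed":
        log.append({"role": "user", "text": event.get("transcript", "")})
    elif event.get("type") == "response.audio_transcript.done":
        log.append({"role": "assistant", "text": event.get("transcript", "")})
    # All other event types (audio deltas, VAD signals, etc.) are ignored
```

In production you would write these records to durable storage with timestamps and the session ID rather than an in-memory list.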

FAQ

What is the latency difference between the WebRTC and WebSocket approaches?

WebRTC provides lower and more consistent latency because it uses UDP-based transport optimized for real-time media. Typical round-trip latency with WebRTC is 300-500ms. The WebSocket approach adds 100-200ms due to TCP overhead and the need to manually handle audio chunking. For browser-based applications, WebRTC is the recommended approach.

Can I use the Realtime API with non-English languages?

Yes. The GPT-4o Realtime model supports over 50 languages for both input and output audio. Set the language in the session instructions. Performance is strongest in English, Spanish, French, German, Japanese, and Mandarin. Less common languages may have higher word error rates.

How do I handle function calls that take more than a few seconds?

For long-running tools, send an intermediate response before the tool completes. You can use the conversation.item.create event to inject a message like "Let me look that up for you" while the tool executes. This prevents awkward silence during database queries or API calls that take 2-5 seconds.
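A sketch of the two frames this involves, following the event shapes used elsewhere in this article (the helper name is ours, and whether the injected text is voiced depends on your session modalities):

```python
import json

def interim_message_frames(text: str) -> list[str]:
    """Build the frames that speak a filler line while a slow tool runs."""
    inject = {
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "assistant",
            "content": [{"type": "text", "text": text}],
        },
    }
    # Ask the model to render a response after the injected item
    return [json.dumps(inject), json.dumps({"type": "response.create"})]
```

Send both frames over the data channel (or WebSocket) as soon as you detect the tool call, then send the real `function_call_output` when the tool finishes.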

What happens when the WebRTC connection drops mid-conversation?

The connection is lost and the session ends. You need to implement reconnection logic on the client side: detect the disconnect via pc.onconnectionstatechange, request a new ephemeral token, re-establish the WebRTC connection, and optionally replay conversation context. The Realtime API does not persist sessions across connections, so your backend should maintain conversation state.


#OpenAIRealtime #WebRTC #VoiceAgents #RealTimeAI #Twilio #ConversationalAI #VoiceDev

