Building Voice Agents with the OpenAI Realtime API: Full Tutorial
Hands-on tutorial for building voice agents with the OpenAI Realtime API — WebSocket setup, PCM16 audio, server VAD, and function calling.
Why this API changed the playbook
Before the Realtime API, building a voice agent meant wiring together Whisper (or Deepgram), an LLM, and a TTS service over three separate connections, then fighting a constant battle with latency and interruption handling. The Realtime API collapses all three into one WebSocket that streams audio in and audio out and surfaces a clean event model for interruptions and tool calls.
This is a hands-on tutorial for building a working voice agent on top of the Realtime API. It does not assume a telephony provider — you can run everything locally with a laptop microphone first, then swap in Twilio later.
mic ──PCM16──► Realtime API ──PCM16──► speaker
                    │
                    ├── session.created
                    ├── input_audio_buffer.speech_started
                    ├── response.audio.delta
                    ├── response.function_call_arguments.done
                    └── response.done
Architecture overview
┌───────────────────────────────┐
│ Node.js client                │
│ • arecord / aplay audio I/O   │
│ • WebSocket to Realtime API   │
│ • tool dispatcher             │
└───────────────┬───────────────┘
                │
                ▼
┌───────────────────────────────┐
│ OpenAI Realtime API           │
│ gpt-4o-realtime-preview-      │
│ 2025-06-03                    │
└───────────────────────────────┘
Prerequisites
- Node.js 20+ or Python 3.11+.
- An OpenAI API key with Realtime access.
- PortAudio (macOS: brew install portaudio; Linux: apt install libportaudio2).
- alsa-utils on Linux for the arecord/aplay commands used below.
- Basic familiarity with WebSocket events.
Step-by-step walkthrough
1. Open the WebSocket and configure the session
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2025-06-03",
  {
    headers: {
      Authorization: "Bearer " + process.env.OPENAI_API_KEY,
      "OpenAI-Beta": "realtime=v1",
    },
  },
);

ws.on("open", () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      voice: "alloy",
      instructions: "You are a friendly receptionist for Acme Clinic.",
      input_audio_format: "pcm16",
      output_audio_format: "pcm16",
      turn_detection: { type: "server_vad", silence_duration_ms: 400, threshold: 0.5 },
      tools: [
        {
          type: "function",
          name: "check_availability",
          description: "Check provider availability",
          parameters: {
            type: "object",
            properties: {
              provider_id: { type: "string" },
              date: { type: "string", description: "YYYY-MM-DD" },
            },
            required: ["provider_id", "date"],
          },
        },
      ],
    },
  }));
});
2. Stream microphone audio
import { spawn } from "child_process";

// arecord (Linux, alsa-utils) pipes raw PCM16 at 24kHz mono to stdout.
// macOS has no arecord; sox's `rec` can produce an equivalent stream.
const mic = spawn("arecord", ["-q", "-f", "S16_LE", "-r", "24000", "-c", "1", "-t", "raw"]);

mic.stdout.on("data", (chunk) => {
  ws.send(JSON.stringify({
    type: "input_audio_buffer.append",
    audio: chunk.toString("base64"),
  }));
});
3. Play back the model's audio
// Reuses the spawn import from step 2.
const speaker = spawn("aplay", ["-q", "-f", "S16_LE", "-r", "24000", "-c", "1"]);

ws.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "response.audio.delta") {
    speaker.stdin.write(Buffer.from(evt.delta, "base64"));
  }
});
4. Handle function calls
ws.on("message", async (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "response.function_call_arguments.done") {
    const args = JSON.parse(evt.arguments);
    let result: unknown;
    if (evt.name === "check_availability") {
      result = await checkAvailability(args.provider_id, args.date);
    }
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: evt.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});
5. Handle interruptions
When the caller starts speaking mid-response, clear the output buffer and cancel the in-flight response.
if (evt.type === "input_audio_buffer.speech_started") {
  // Cancel the in-flight response, and drop any audio already buffered for
  // local playback so the agent goes quiet the moment the caller speaks.
  ws.send(JSON.stringify({ type: "response.cancel" }));
}
6. Log the transcript
The Realtime API emits transcript events for both sides (enable input_audio_transcription in session.update to get the caller's side). Collect them for later analysis.
if (evt.type === "conversation.item.input_audio_transcription.completed") {
  console.log("user:", evt.transcript);
}
if (evt.type === "response.audio_transcript.done") {
  console.log("agent:", evt.transcript);
}
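Console logging is fine for a demo, but analysis wants a structured log. A minimal sketch of an accumulator — TranscriptEntry is our own shape, not an API type:

```typescript
// Fold transcript events into an ordered log for post-call analysis.
type TranscriptEntry = { role: "user" | "agent"; text: string };

function collectTranscript(
  events: Array<{ type: string; transcript?: string }>,
): TranscriptEntry[] {
  const log: TranscriptEntry[] = [];
  for (const evt of events) {
    if (evt.type === "conversation.item.input_audio_transcription.completed" && evt.transcript) {
      log.push({ role: "user", text: evt.transcript });
    }
    if (evt.type === "response.audio_transcript.done" && evt.transcript) {
      log.push({ role: "agent", text: evt.transcript });
    }
  }
  return log;
}
```

Feed it the parsed events from the message handler and persist the result wherever your analytics live.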
Production considerations
- Heartbeats: send a WebSocket ping every 15s to keep the connection alive through proxies.
- Reconnects: on unexpected close, reconnect with exponential backoff and replay the last session config.
- Rate limits: the Realtime API has concurrent session limits per org. Monitor and scale your quota.
- Cost: billing accrues per minute of input and output audio. Hang up on prolonged silence aggressively.
- PII: the transcript contains everything callers say. Encrypt at rest and scope access.
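The reconnect bullet can be sketched as jittered exponential backoff; the 1s base and 30s cap here are our own choices, not values from the API docs:

```typescript
// Jittered exponential backoff for WebSocket reconnects.
function backoffMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  const cap = Math.min(baseMs * 2 ** attempt, maxMs);
  // Full-range jitter would still thunder; keep 50-100% of the cap.
  return cap / 2 + Math.random() * (cap / 2);
}
```

On an unexpected close, wait backoffMs(attempt), reopen the socket, and replay the session.update from step 1 so the new connection carries the same voice, VAD, and tools. For the heartbeat bullet, the ws library exposes ws.ping(), which you can call on a 15-second setInterval.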
CallSphere's real implementation
CallSphere uses the OpenAI Realtime API with gpt-4o-realtime-preview-2025-06-03 as the core of its voice and chat agents. Server VAD is on, audio is PCM16 at 24kHz, and every vertical ships its own tool schema: 14 tools for healthcare (insurance verification, appointment booking, provider lookup, and more), 10 agents for real estate, 4 for salon, 7 for after-hours escalation, 10 plus RAG for IT helpdesk, and an ElevenLabs TTS pod with 5 GPT-4 specialists for sales.
Multi-agent handoffs run through the OpenAI Agents SDK so a single caller can be routed from a triage agent to a specialist mid-call without dropping audio. Post-call analytics are handled by a GPT-4o-mini pipeline that writes sentiment, intent, and lead score into per-vertical Postgres. CallSphere supports 57+ languages and keeps end-to-end response time under one second.
Common pitfalls
- Wrong sample rate: pcm16 is 24kHz; feeding 16kHz audio without resampling plays back fast and pitched up. Resample to 24kHz first.
- Not handling function_call_arguments.done: you will miss tool calls.
- Pushing audio faster than realtime: the API expects near-realtime ingest; bursty pushes confuse VAD.
- Ignoring response.done: you lose the end-of-turn signal.
- No reconnect logic: the socket will drop eventually; plan for it.
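The pacing pitfall is easy to quantify: at 24kHz mono PCM16, one second of audio is exactly 48,000 bytes. A sketch for pacing prerecorded audio at roughly realtime (live mic capture is naturally paced and needs none of this):

```typescript
// 24kHz mono PCM16: 2 bytes per sample => 48,000 bytes per second.
const BYTES_PER_SECOND = 24000 * 2;

function chunkDurationMs(byteLength: number): number {
  return (byteLength / BYTES_PER_SECOND) * 1000;
}

// Pace prerecorded chunks so input_audio_buffer.append arrives near realtime;
// bursting a whole file at once confuses server VAD's silence detection.
async function paceChunks(
  chunks: Uint8Array[],
  send: (chunk: Uint8Array) => void,
): Promise<void> {
  for (const chunk of chunks) {
    send(chunk);
    await new Promise((resolve) => setTimeout(resolve, chunkDurationMs(chunk.byteLength)));
  }
}
```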
FAQ
Can I use this with a phone number?
Yes — bridge Twilio Media Streams to your WebSocket server and forward audio in both directions.
What is the difference between server VAD and client VAD?
Server VAD runs on OpenAI's side and generates speech_started events automatically. Client VAD lets you control turn-taking manually. Start with server VAD.
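For reference, a sketch of the client-managed variant: disable server VAD in session.update, then explicitly commit the input buffer and request a response whenever your own end-of-speech detection fires.

```typescript
// Disable server VAD so the model waits for explicit turn boundaries.
ws.send(JSON.stringify({
  type: "session.update",
  session: { turn_detection: null },
}));

// ...keep streaming input_audio_buffer.append chunks, then close the turn:
ws.send(JSON.stringify({ type: "input_audio_buffer.commit" }));
ws.send(JSON.stringify({ type: "response.create" }));
```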
How do I change the voice mid-call?
Send another session.update with the new voice name, between turns rather than during a response. Note the API locks the voice once the model has produced audio in a session, so in practice set it before the first spoken response or reconnect with a fresh session.
Does it support streaming function outputs back?
Yes — once you send the function_call_output item, the model picks up and continues speaking.
Can I use multiple tools in one turn?
Yes. The model can emit multiple tool calls, and you should respond to each before calling response.create.
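One way to keep that ordering straight is to batch the outputs and append response.create last. A sketch — buildToolOutputs and the handler map are our own names, not an SDK API:

```typescript
// Turn completed tool calls into outbound Realtime API messages:
// one function_call_output item per call, then a single response.create.
type ToolCall = { name: string; call_id: string; arguments: string };
type OutboundMsg = { type: string; item?: unknown };

function buildToolOutputs(
  calls: ToolCall[],
  handlers: Record<string, (args: unknown) => unknown>,
): OutboundMsg[] {
  const items: OutboundMsg[] = calls.map((c) => ({
    type: "conversation.item.create",
    item: {
      type: "function_call_output",
      call_id: c.call_id,
      // Unknown tool names yield a null output rather than crashing the call.
      output: JSON.stringify(handlers[c.name]?.(JSON.parse(c.arguments)) ?? null),
    },
  }));
  return [...items, { type: "response.create" }];
}
```

Collect the response.function_call_arguments.done events for the turn, then send each message this returns in order.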
Next steps
Want to see a full Realtime API deployment in production? Book a demo, explore the technology page, or browse pricing.
#CallSphere #OpenAIRealtime #VoiceAI #Tutorial #WebSocket #FunctionCalling #AIVoiceAgents
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.