Build a Voice Agent with Vercel AI SDK + Twilio (2026)
Use the Vercel AI SDK's transcription and speech functions plus Twilio Media Streams to ship a voice agent from a Next.js app on Vercel (Node runtime). Real Next.js Route Handler, working code, deployed in five minutes.
TL;DR — Vercel AI SDK 5 ships `transcribe()` and `speak()` as first-class functions across providers, plus the `generateText`/`streamText` agent loop. Combined with Twilio Media Streams over WebSockets in a Next.js Route Handler running on the Node runtime (not Edge, which can't serve WebSockets), you get a voice agent deployed to Vercel in five minutes.
What you'll build
A Next.js 15 app with three routes:
- `POST /api/twilio/voice` returns TwiML pointing at a WS endpoint
- `GET /api/twilio/media` (WebSocket upgrade) bridges Twilio audio to a "sandwich" agent
- The sandwich runs `transcribe` (whisper-1) → `streamText` (gpt-5) → `speak` (ElevenLabs)
Deployed to Vercel with one push. Twilio webhook hits the production URL.
Prerequisites
- Vercel project + Twilio account with a number.
- Node 20, Next.js 15, `ai` v5, `@ai-sdk/openai`, `@ai-sdk/elevenlabs`.
- Twilio webhook URL set to your Vercel deployment.
- `OPENAI_API_KEY`, `ELEVENLABS_API_KEY` in Vercel env.
Architecture
```mermaid
flowchart LR
  C[Caller] --> T[Twilio]
  T -->|HTTP TwiML| API["/api/twilio/voice"]
  API -->|Connect + Stream TwiML| T
  T -->|wss media| WS["/api/twilio/media"]
  WS -->|transcribe| W[Whisper]
  W -->|text| LLM[streamText gpt-5]
  LLM -->|text| TTS[speak ElevenLabs]
  TTS --> WS
  WS --> T --> C
```
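Before writing the handler, it helps to know the message shapes Twilio sends over that WebSocket. Here's a sketch of just the fields this tutorial touches; Twilio's Media Streams docs have the full schema:

```ts
// Only the fields used in this tutorial; Twilio sends more per event.
type TwilioStreamEvent =
  | { event: "connected" }
  | { event: "start"; streamSid: string; start: { callSid: string } }
  | { event: "media"; streamSid: string; media: { payload: string } } // payload = base64 mu-law
  | { event: "stop"; streamSid: string };
```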
Step 1 — TwiML route
```ts
// app/api/twilio/voice/route.ts
// Returns TwiML telling Twilio to open a media stream to our WS endpoint.
export async function POST(req: Request) {
  const host = req.headers.get("host");
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://${host}/api/twilio/media" />
  </Connect>
</Response>`;
  return new Response(xml, { headers: { "Content-Type": "text/xml" } });
}
```
Step 2 — WebSocket Route Handler (Node runtime)
```ts
// app/api/twilio/media/route.ts
export const runtime = "nodejs"; // WS upgrades need Node, not Edge

import { WebSocketServer } from "ws";
import {
  experimental_transcribe as transcribe,
  generateText,
  experimental_generateSpeech as speak,
} from "ai";
import { openai } from "@ai-sdk/openai";
import { elevenlabs } from "@ai-sdk/elevenlabs";

let wss: WebSocketServer | null = null;

export function init(server: any) {
  if (wss) return;
  wss = new WebSocketServer({ server, path: "/api/twilio/media" });
  wss.on("connection", handleConn);
}

async function handleConn(ws: any) {
  const buffer: Buffer[] = [];
  let streamSid = "";
  ws.on("message", async (raw: any) => {
    const ev = JSON.parse(raw.toString());
    if (ev.event === "start") streamSid = ev.streamSid;
    if (ev.event === "media") {
      buffer.push(Buffer.from(ev.media.payload, "base64"));
      // ~50 frames of 20ms mu-law audio is roughly 1 second of speech
      if (buffer.length > 50) await respond(ws, streamSid, buffer.splice(0));
    }
  });
}

async function respond(ws: any, sid: string, frames: Buffer[]) {
  const wav = mulawToWav(Buffer.concat(frames));
  const { text } = await transcribe({
    model: openai.transcription("whisper-1"),
    audio: wav,
  });
  if (!text) return;

  const { text: reply } = await generateText({
    model: openai("gpt-5"),
    system: "You are a friendly receptionist. Reply in one sentence.",
    prompt: text,
  });

  const { audio } = await speak({
    model: elevenlabs.speech("eleven_turbo_v2_5"),
    text: reply,
  });

  // Stream synthesized audio back to Twilio as base64 mu-law frames
  for (const chunk of chunked(audio.uint8Array, 320)) {
    ws.send(
      JSON.stringify({
        event: "media",
        streamSid: sid,
        media: { payload: pcmToMulaw(Buffer.from(chunk)).toString("base64") },
      })
    );
  }
}
```

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.
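Back in the code, one wiring detail the Step 2 snippet glosses over: a stock Next.js deployment never hands your Route Handler the underlying HTTP server, so `init(server)` has no caller. A common workaround is a small custom server; this sketch assumes you export `init` from the route file and can run a persistent Node process:

```ts
// server.ts: hypothetical custom-server wiring for the WS upgrade.
import { createServer } from "http";
import next from "next";
import { init } from "./app/api/twilio/media/route";

const app = next({ dev: process.env.NODE_ENV !== "production" });
const handle = app.getRequestHandler();

app.prepare().then(() => {
  const server = createServer((req, res) => handle(req, res));
  init(server); // attach the WebSocketServer from Step 2
  server.listen(process.env.PORT ? Number(process.env.PORT) : 3000);
});
```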
Step 3 — Mu-law transcoding helpers
Twilio sends mu-law 8kHz; Whisper wants WAV/PCM. ElevenLabs returns PCM 16kHz; convert to mu-law 8kHz before sending back.
```ts
function mulawToWav(mulaw: Buffer): Buffer {
  // Decode 8kHz mu-law to 16-bit linear PCM, then prepend a 44-byte WAV header
  // (use pcm-util or write the header manually)
  return makeWavHeader(8000, decodeMulaw(mulaw));
}
```
For production, use @scramjet/audio-utils or a small WASM transcoder.
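If you'd rather stay dependency-free, here is a minimal sketch of the helpers the snippets above assume (`decodeMulaw`, `pcmToMulaw`, `makeWavHeader`, `chunked`), based on the standard G.711 mu-law algorithm. Treat it as illustrative, not production-hardened:

```ts
const BIAS = 0x84;
const CLIP = 32635;

// G.711 mu-law byte -> 16-bit linear PCM sample
function mulawByteToLinear(u: number): number {
  u = ~u & 0xff;
  let t = ((u & 0x0f) << 3) + BIAS;
  t <<= (u & 0x70) >> 4;
  return u & 0x80 ? BIAS - t : t - BIAS;
}

// 16-bit linear PCM sample -> G.711 mu-law byte
function linearToMulawByte(sample: number): number {
  const sign = sample < 0 ? 0x80 : 0;
  if (sample < 0) sample = -sample;
  if (sample > CLIP) sample = CLIP;
  sample += BIAS;
  let exponent = 7;
  for (let mask = 0x4000; (sample & mask) === 0 && exponent > 0; exponent--, mask >>= 1);
  const mantissa = (sample >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff;
}

function decodeMulaw(mulaw: Buffer): Buffer {
  const pcm = Buffer.alloc(mulaw.length * 2);
  for (let i = 0; i < mulaw.length; i++) pcm.writeInt16LE(mulawByteToLinear(mulaw[i]), i * 2);
  return pcm;
}

// Assumes 16kHz 16-bit PCM input (per the note above): naively drop every
// other sample to reach Twilio's 8kHz, then mu-law encode.
function pcmToMulaw(pcm: Buffer): Buffer {
  const out = Buffer.alloc(Math.floor(pcm.length / 4));
  for (let i = 0; i < out.length; i++) out[i] = linearToMulawByte(pcm.readInt16LE(i * 4));
  return out;
}

function makeWavHeader(sampleRate: number, pcm: Buffer): Buffer {
  const h = Buffer.alloc(44);
  h.write("RIFF", 0);
  h.writeUInt32LE(36 + pcm.length, 4);
  h.write("WAVE", 8);
  h.write("fmt ", 12);
  h.writeUInt32LE(16, 16);             // fmt chunk size
  h.writeUInt16LE(1, 20);              // audio format: PCM
  h.writeUInt16LE(1, 22);              // channels: mono
  h.writeUInt32LE(sampleRate, 24);
  h.writeUInt32LE(sampleRate * 2, 28); // byte rate (16-bit mono)
  h.writeUInt16LE(2, 32);              // block align
  h.writeUInt16LE(16, 34);             // bits per sample
  h.write("data", 36);
  h.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([h, pcm]);
}

function* chunked(bytes: Uint8Array, size: number): Generator<Uint8Array> {
  for (let i = 0; i < bytes.length; i += size) yield bytes.subarray(i, i + size);
}
```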
Step 4 — Hot-path streaming with streamText + sentence-by-sentence TTS
Replace generateText with streamText and chunk on sentence boundaries to start TTS earlier:
```ts
// (add streamText to the "ai" imports from Step 2)
const { textStream } = await streamText({ model: openai("gpt-5"), prompt: text });

let buf = "";
for await (const delta of textStream) {
  buf += delta;
  // As soon as a full sentence is buffered, hand it to TTS
  const m = buf.match(/^([^.!?]+[.!?])\s*/);
  if (m) {
    speakAndSend(m[1]);
    buf = buf.slice(m[0].length);
  }
}
if (buf.trim()) speakAndSend(buf); // flush any trailing partial sentence
```

Starting TTS on the first sentence while the model is still generating cuts perceived latency by roughly 40%.
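`speakAndSend` isn't defined above; here's a minimal sketch, assuming `ws`, `sid`, and the Step 3 helpers are in scope as in Step 2:

```ts
// Hypothetical helper: one TTS call per sentence, streamed back to Twilio
// with the same chunking and mu-law encoding as in Step 2.
async function speakAndSend(sentence: string) {
  const { audio } = await speak({
    model: elevenlabs.speech("eleven_turbo_v2_5"),
    text: sentence,
  });
  for (const chunk of chunked(audio.uint8Array, 320)) {
    ws.send(JSON.stringify({
      event: "media",
      streamSid: sid,
      media: { payload: pcmToMulaw(Buffer.from(chunk)).toString("base64") },
    }));
  }
}
```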
Step 5 — Twilio webhook config
In the Twilio console → your number → A Call Comes In → Webhook → https://your-app.vercel.app/api/twilio/voice. Done.
Step 6 — Deploy
```bash
vercel --prod
```
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
WebSocket support requires the Hobby plan or above, with Functions running on the Node runtime, not Edge. Set `export const runtime = "nodejs"` in the route file.
Step 7 — Add tools (function calling)
```ts
import { tool } from "ai";
import { z } from "zod";

const tools = {
  lookup_appointment: tool({
    description: "Get next appointment for a patient",
    inputSchema: z.object({ patient_id: z.string() }),
    execute: async ({ patient_id }) => fetchAppt(patient_id),
  }),
};

const { text: reply } = await generateText({
  model: openai("gpt-5"),
  tools,
  prompt: text,
});
```
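One step won't get you a spoken answer after a tool call: by default `generateText` stops after the first step. To let the model call `lookup_appointment` and then reply, raise the step budget; a sketch using AI SDK 5's `stopWhen`:

```ts
import { generateText, stepCountIs } from "ai";

const { text: reply } = await generateText({
  model: openai("gpt-5"),
  tools,
  // allow up to 3 steps: tool call -> tool result -> final spoken reply
  stopWhen: stepCountIs(3),
  prompt: text,
});
```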
Pitfalls
- Edge runtime can't open outbound WS — must use Node runtime for the Twilio WS endpoint.
- Vercel function timeout is 60s on Hobby; voice calls are longer. Bump to Pro (300s) or move WS to a long-running service.
- Cold starts: enable `fluid: true` on Pro for warm function pools, or use Vercel's new "Functions / Sandbox" preview that holds connections.
- Twilio buffer size: 50 frames is roughly 1 second; tune based on barge-in needs (see the sketch after this list).
- AI SDK `experimental_generateSpeech`: the API name may move out of experimental; pin your SDK version.
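For barge-in, Twilio's Media Streams protocol includes a `clear` message that flushes any audio Twilio has buffered for playback. A minimal sketch; detecting that the caller started speaking (energy threshold, VAD, etc.) is up to you:

```ts
// When the caller interrupts, drop the bot audio Twilio hasn't played yet.
function handleBargeIn(ws: any, streamSid: string) {
  ws.send(JSON.stringify({ event: "clear", streamSid }));
}
```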
How CallSphere does this in production
CallSphere's healthcare voice path uses the OpenAI Realtime API directly over Twilio (not the sandwich pattern) because Realtime cuts roughly 400ms versus an STT → LLM → TTS chain. We use Vercel only for the marketing site; the voice agent itself runs on a FastAPI service (port 8084). The platform spans 37 agents, 90+ tools, 115+ DB tables, and 6 verticals, with plans at $149/$499/$1,499, a 14-day trial, and a 22% affiliate program.
FAQ
Q: Why not use Realtime API directly?
You can — replace the sandwich with the Vercel AI SDK's experimental_realtime (in beta as of May 2026) for native bidirectional streaming. The sandwich pattern is easier to debug; Realtime is faster.
Q: Does this work on Vercel Edge? No, Edge runtime doesn't support WebSocket server upgrades. Use Node runtime.
Q: Latency target? Sandwich pattern: ~1.2s voice-to-voice. Realtime: ~700ms.
Q: ElevenLabs vs OpenAI TTS? ElevenLabs Turbo v2.5 is ~150ms first-byte vs OpenAI TTS-1 at ~300ms. ElevenLabs voices sound better. Cost: about the same.
Q: How do I add a vector store / RAG?
Use `@ai-sdk/openai` embeddings plus Vercel KV for cheap dev; for prod, point at Pinecone or pgvector.
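A minimal retrieval sketch with the AI SDK's `embed` and `cosineSimilarity`; the model name and the in-memory `docs` store are assumptions for illustration:

```ts
import { embed, cosineSimilarity } from "ai";
import { openai } from "@ai-sdk/openai";

// docs: pre-embedded chunks, e.g. loaded from Vercel KV or pgvector
declare const docs: { text: string; embedding: number[] }[];

// Embed the caller's question and pick the closest chunk to feed the prompt.
const { embedding } = await embed({
  model: openai.textEmbeddingModel("text-embedding-3-small"),
  value: "When is my next appointment?",
});

const best = docs
  .map((d) => ({ ...d, score: cosineSimilarity(embedding, d.embedding) }))
  .sort((a, b) => b.score - a.score)[0];
```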
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available; no signup required.