
Build a Voice Agent with Vercel AI SDK + Twilio (2026)

Use the Vercel AI SDK's transcription/speech functions plus Twilio Media Streams to ship a voice agent on Vercel Edge Runtime. Real Next.js Route Handler, working code, deploy in 5 min.

TL;DR — Vercel AI SDK 5 ships transcribe() and speak() as first-class functions across providers, plus the generateText / streamText agent loop. Combined with Twilio Media Streams over WebSockets in a Next.js Route Handler running on Node runtime (not Edge for WS), you get a voice agent deployed to Vercel in five minutes.

What you'll build

A Next.js 15 app with three routes:

  • POST /api/twilio/voice returns TwiML pointing at a WS endpoint
  • GET /api/twilio/media (WebSocket upgrade) bridges Twilio audio to a sandwich agent
  • The sandwich chains transcribe (whisper-1) → streamText (gpt-5) → speak (ElevenLabs)

Deployed to Vercel with one push. Twilio webhook hits the production URL.

Prerequisites

  1. Vercel project + Twilio account with a number.
  2. Node 20, Next.js 15, ai v5, @ai-sdk/openai, @ai-sdk/elevenlabs.
  3. Twilio webhook URL set to your Vercel deployment.
  4. OPENAI_API_KEY, ELEVENLABS_API_KEY in Vercel env.
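A minimal setup sketch covering the packages and env vars listed above (assumes the Vercel CLI is installed and `vercel link` has been run for your project):

```shell
# Install the AI SDK packages plus the ws server used in Step 2
npm install ai @ai-sdk/openai @ai-sdk/elevenlabs ws zod
npm install -D @types/ws

# Register secrets with Vercel, then pull them for local dev
vercel env add OPENAI_API_KEY
vercel env add ELEVENLABS_API_KEY
vercel env pull .env.local
```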

Architecture

```mermaid
flowchart LR
  C[Caller] --> T[Twilio]
  T -->|HTTP TwiML| API["/api/twilio/voice"]
  API -->|"<Connect><Stream>"| T
  T -->|wss media| WS["/api/twilio/media"]
  WS -->|transcribe| W[Whisper]
  W -->|text| LLM[streamText gpt-5]
  LLM -->|text| TTS[speak ElevenLabs]
  TTS --> WS
  WS --> T --> C
```

Step 1 — TwiML route

```ts
// app/api/twilio/voice/route.ts
export async function POST(req: Request) {
  const host = req.headers.get("host");
  // TwiML: hand the call off to the WebSocket media endpoint
  const xml = `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://${host}/api/twilio/media" />
  </Connect>
</Response>`;
  return new Response(xml, { headers: { "content-type": "text/xml" } });
}
```
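Twilio expects a `<Connect><Stream>` TwiML document pointing at the wss endpoint. You can sanity-check the body without placing a call (a sketch; `buildTwiml` is a hypothetical helper, not part of the route):

```typescript
// Build the TwiML body the route returns, for quick unit inspection.
function buildTwiml(host: string): string {
  return `<?xml version="1.0" encoding="UTF-8"?>
<Response>
  <Connect>
    <Stream url="wss://${host}/api/twilio/media" />
  </Connect>
</Response>`;
}
```

If the `url` attribute uses `ws://` instead of `wss://`, or the host is a preview deployment that Twilio can't reach, the call connects and immediately drops, so this is worth asserting in a test.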

Step 2 — WebSocket Route Handler (Node runtime)

```ts
// app/api/twilio/media/route.ts
export const runtime = "nodejs";

import { WebSocketServer } from "ws";
import {
  experimental_transcribe as transcribe,
  generateText,
  experimental_generateSpeech as speak,
} from "ai";
import { openai } from "@ai-sdk/openai";
import { elevenlabs } from "@ai-sdk/elevenlabs";

let wss: WebSocketServer | null = null;

function init(server: any) {
  if (wss) return;
  wss = new WebSocketServer({ server, path: "/api/twilio/media" });
  wss.on("connection", handleConn);
}

async function handleConn(ws: any) {
  const buffer: Buffer[] = [];
  let streamSid = "";
  ws.on("message", async (raw: any) => {
    const ev = JSON.parse(raw.toString());
    if (ev.event === "start") streamSid = ev.streamSid;
    if (ev.event === "media") {
      buffer.push(Buffer.from(ev.media.payload, "base64"));
      // ~50 frames of 20 ms audio ≈ 1 s of speech: flush to the pipeline
      if (buffer.length > 50) await respond(ws, streamSid, buffer.splice(0));
    }
  });
}

async function respond(ws: any, sid: string, frames: Buffer[]) {
  const wav = mulawToWav(Buffer.concat(frames));
  const { text } = await transcribe({
    model: openai.transcription("whisper-1"),
    audio: wav,
  });
  if (!text) return;
  const { text: reply } = await generateText({
    model: openai("gpt-5"),
    system: "You are a friendly receptionist. Reply in one sentence.",
    prompt: text,
  });
  const { audio } = await speak({
    model: elevenlabs.speech("eleven_turbo_v2_5"),
    text: reply,
  });
  for (const chunk of chunked(audio.uint8Array, 320)) {
    ws.send(
      JSON.stringify({
        event: "media",
        streamSid: sid,
        media: { payload: pcmToMulaw(chunk).toString("base64") },
      })
    );
  }
}
```

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
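The `respond()` snippet leans on two small helpers that are easy to get wrong: a fixed-size chunker and a G.711 mu-law encoder. A sketch (standard G.711 encoding; assumes the TTS output has already been resampled to 16-bit little-endian PCM at 8 kHz):

```typescript
// Split a byte buffer into fixed-size frames for Twilio media messages.
function* chunked(buf: Uint8Array, size: number): Generator<Uint8Array> {
  for (let i = 0; i < buf.length; i += size) yield buf.subarray(i, i + size);
}

// Encode one 16-bit linear PCM sample to G.711 mu-law.
function encodeMulawSample(sample: number): number {
  const BIAS = 0x84;
  const sign = sample < 0 ? 0x80 : 0;
  let s = Math.min(Math.abs(sample), 32635) + BIAS;
  let exponent = 7;
  for (let mask = 0x4000; (s & mask) === 0 && exponent > 0; mask >>= 1) exponent--;
  const mantissa = (s >> (exponent + 3)) & 0x0f;
  return ~(sign | (exponent << 4) | mantissa) & 0xff; // mu-law bytes are stored inverted
}

// Convert a 16-bit LE PCM buffer (8 kHz mono) to mu-law bytes.
function pcmToMulaw(pcm: Uint8Array): Buffer {
  const view = new DataView(pcm.buffer, pcm.byteOffset, pcm.byteLength);
  const out = Buffer.alloc(pcm.byteLength >> 1);
  for (let i = 0; i < out.length; i++) out[i] = encodeMulawSample(view.getInt16(i * 2, true));
  return out;
}
```

Note the frame math: 320 bytes of 16-bit PCM encode down to 160 mu-law bytes, which is exactly one 20 ms Twilio frame at 8 kHz.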

Step 3 — Mu-law transcoding helpers

Twilio sends mu-law 8kHz; Whisper wants WAV/PCM. ElevenLabs returns PCM 16kHz; convert to mu-law 8kHz before sending back.

```ts
function mulawToWav(mulaw: Buffer): Buffer {
  // 44-byte WAV header + linear PCM 8kHz mono
  // (use pcm-util or write manually)
  return makeWavHeader(8000, decodeMulaw(mulaw));
}
```

For production, use @scramjet/audio-utils or a small WASM transcoder.
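For reference, the two helpers the snippet above leaves undefined can be written by hand in a few lines (standard G.711 mu-law expansion plus a minimal 44-byte RIFF header; mono 16-bit assumed throughout):

```typescript
const MULAW_BIAS = 0x84;

// Decode one G.711 mu-law byte to a 16-bit linear PCM sample.
function decodeMulawSample(byte: number): number {
  const u = ~byte & 0xff; // mu-law bytes are stored inverted
  const sign = u & 0x80;
  const exponent = (u >> 4) & 0x07;
  const mantissa = u & 0x0f;
  const sample = (((mantissa << 3) + MULAW_BIAS) << exponent) - MULAW_BIAS;
  return sign ? -sample : sample;
}

// Expand a mu-law buffer to 16-bit LE PCM.
function decodeMulaw(mulaw: Buffer): Buffer {
  const pcm = Buffer.alloc(mulaw.length * 2);
  for (let i = 0; i < mulaw.length; i++) pcm.writeInt16LE(decodeMulawSample(mulaw[i]), i * 2);
  return pcm;
}

// Prepend a 44-byte WAV header (mono, 16-bit) to raw PCM.
function makeWavHeader(sampleRate: number, pcm: Buffer): Buffer {
  const h = Buffer.alloc(44);
  h.write("RIFF", 0);
  h.writeUInt32LE(36 + pcm.length, 4);
  h.write("WAVE", 8);
  h.write("fmt ", 12);
  h.writeUInt32LE(16, 16); // fmt chunk size
  h.writeUInt16LE(1, 20); // audio format: linear PCM
  h.writeUInt16LE(1, 22); // mono
  h.writeUInt32LE(sampleRate, 24);
  h.writeUInt32LE(sampleRate * 2, 28); // byte rate = rate * channels * 2
  h.writeUInt16LE(2, 32); // block align
  h.writeUInt16LE(16, 34); // bits per sample
  h.write("data", 36);
  h.writeUInt32LE(pcm.length, 40);
  return Buffer.concat([h, pcm]);
}
```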

Step 4 — Hot-path streaming with streamText + sentence-by-sentence TTS

Replace generateText with streamText and chunk on sentence boundaries to start TTS earlier:

```ts
const { textStream } = streamText({ model: openai("gpt-5"), prompt: text });
let buf = "";
for await (const delta of textStream) {
  buf += delta;
  // Flush as soon as a full sentence is available so TTS can start early
  const m = buf.match(/^([^.!?]+[.!?])\s*/);
  if (m) {
    speakAndSend(m[1]);
    buf = buf.slice(m[0].length);
  }
}
```
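The boundary regex is worth isolating so it can be unit-tested without a live stream (a sketch; `drainSentences` is a hypothetical helper using the same pattern as above):

```typescript
// Pull complete sentences off the front of a growing text buffer.
// Returns the finished sentences plus the unfinished remainder.
function drainSentences(buf: string): [string[], string] {
  const done: string[] = [];
  let m: RegExpMatchArray | null;
  while ((m = buf.match(/^([^.!?]+[.!?])\s*/))) {
    done.push(m[1]);
    buf = buf.slice(m[0].length);
  }
  return [done, buf];
}
```

Caveat: this naive pattern also splits on abbreviations like "Dr." and decimal points, which is usually acceptable for one-sentence receptionist replies but not for long-form speech.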

Cuts perceived latency by ~40%.

Step 5 — Twilio webhook config

In the Twilio console → your number → A Call Comes In → Webhook → https://your-app.vercel.app/api/twilio/voice. Done.

Step 6 — Deploy

```bash
vercel --prod
```

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

WebSocket support requires the Hobby+ plan with Functions running on Node runtime, not Edge. Set runtime: "nodejs" in the route file.

Step 7 — Add tools (function calling)

```ts
import { tool } from "ai";
import { z } from "zod";

const tools = {
  lookup_appointment: tool({
    description: "Get next appointment for a patient",
    inputSchema: z.object({ patient_id: z.string() }),
    execute: async ({ patient_id }) => fetchAppt(patient_id),
  }),
};

const { text: reply } = await generateText({ model: openai("gpt-5"), tools, prompt: text });
```

Pitfalls

  • Edge runtime can't open outbound WS — must use Node runtime for the Twilio WS endpoint.
  • Vercel function timeout is 60s on Hobby; voice calls are longer. Bump to Pro (300s) or move WS to a long-running service.
  • Cold starts: enable Fluid Compute (Pro) for warm function pools, or use Vercel's newer "Functions / Sandbox" preview that holds connections.
  • Twilio buffer size: 50 frames ~= 1 second; tune based on barge-in needs.
  • AI SDK experimental_generateSpeech: API name may move out of experimental; pin SDK version.
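The runtime and timeout pitfalls above map to two route segment config exports in Next.js (a sketch; the 300 s value assumes a Pro plan):

```typescript
// app/api/twilio/media/route.ts (route segment config)
export const runtime = "nodejs"; // Edge can't host the WebSocket server
export const maxDuration = 300; // seconds (Pro-plan ceiling; Hobby caps at 60)
```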

How CallSphere does this in production

CallSphere's Healthcare voice path uses OpenAI Realtime directly over Twilio (not the sandwich pattern) because Realtime cuts ~400ms vs STT+LLM+TTS chains. We use Vercel only for the marketing site. Our voice agent lives on FastAPI :8084. 37 agents, 90+ tools, 115+ DB tables, 6 verticals, $149/$499/$1499, 14-day trial, 22% affiliate.

FAQ

Q: Why not use Realtime API directly? You can — replace the sandwich with the Vercel AI SDK's experimental_realtime (in beta May 2026) for native bidirectional. The sandwich pattern is more debuggable, Realtime is faster.

Q: Does this work on Vercel Edge? No, Edge runtime doesn't support WebSocket server upgrades. Use Node runtime.

Q: Latency target? Sandwich pattern: ~1.2s voice-to-voice. Realtime: ~700ms.

Q: ElevenLabs vs OpenAI TTS? ElevenLabs Turbo v2.5 is ~150ms first-byte vs OpenAI TTS-1 at ~300ms. ElevenLabs voices sound better. Cost: about the same.

Q: How do I add a vector store / RAG? Use @ai-sdk/openai embeddings + Vercel KV for cheap dev; for prod, point at Pinecone or pg-vector.
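A sketch of the retrieval half (the embeddings come from the provider; ranking is plain cosine similarity over stored vectors, and the names here are illustrative):

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k stored chunks most similar to the query embedding.
function topK(query: number[], docs: { id: string; vec: number[] }[], k = 3) {
  return [...docs]
    .sort((x, y) => cosine(query, y.vec) - cosine(query, x.vec))
    .slice(0, k);
}
```

This in-memory version is fine for dev; a production vector store replaces `topK` with an index query but keeps the same shape.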


Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.