
Build a Bun + Hono + OpenAI Realtime Voice Agent on the Edge (2026)

Bun 1.3 + Hono runs roughly 2x faster than Node + Express for WebSocket relays. Wire it to gpt-realtime-2 and deploy to Fly.io edge for sub-500ms voice-to-voice in 6 regions.

TL;DR — Bun 1.3 starts in ~25ms cold, Hono is ~14kb, and OpenAI's gpt-realtime-2 (introduced 2026) gives you GPT-5-class reasoning over voice. Combined: a single bun run server.ts ships a 6-region edge voice agent.

What you'll build

A WebSocket relay between browser PCM and OpenAI Realtime, deployed to 6 Fly.io regions with anycast. Browsers get routed to the nearest edge, p95 voice-to-voice ~480ms.

Prerequisites

  1. Bun 1.3+, hono@^4.6.
  2. Fly.io CLI (brew install flyctl) and an OpenAI key.
  3. Domain on Cloudflare for TLS pass-through.

Architecture

```mermaid
flowchart LR
  BR[Browser] --> CF[Cloudflare anycast]
  CF --> FY[Fly edge nearest of 6]
  FY -- WS --> H[Hono relay on Bun]
  H -- WS --> OA[OpenAI Realtime gpt-realtime-2]
```

Step 1 — server.ts

```ts
import { Hono } from "hono";
import { createBunWebSocket } from "hono/bun";

const app = new Hono();
const { upgradeWebSocket, websocket } = createBunWebSocket();
const REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2";

app.get("/ws", upgradeWebSocket(() => {
  let oa: WebSocket;
  return {
    onOpen: (_e, ws) => {
      // Bun's WebSocket constructor accepts custom headers.
      oa = new WebSocket(REALTIME_URL, {
        headers: {
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
          "OpenAI-Beta": "realtime=v1",
        },
      } as any);
      oa.onopen = () =>
        oa.send(JSON.stringify({
          type: "session.update",
          session: { voice: "verse", turn_detection: { type: "semantic_vad" } },
        }));
      oa.onmessage = (m) => ws.send(m.data); // OpenAI -> browser
    },
    onMessage: (e) => oa?.readyState === WebSocket.OPEN && oa.send(e.data), // browser -> OpenAI
    onClose: () => oa?.close(),
  };
}));

export default { port: 8787, fetch: app.fetch, websocket };
```

Hear it before you finish reading

Talk to a live CallSphere AI voice agent in your browser — 60 seconds, no signup.

Try Live Demo →
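On the browser side, the relay forwards OpenAI's JSON events verbatim, so the client needs a small demultiplexer. This is a sketch: the event type names follow the Realtime beta event schema and may differ for gpt-realtime-2, so check the current event reference before shipping.

```typescript
// Minimal demux for events the relay forwards from OpenAI Realtime.
// Event names here follow the Realtime beta API and are assumptions
// for gpt-realtime-2 — verify against the current event reference.
type RealtimeEvent = { type: string; delta?: string };

interface Handlers {
  onAudio: (base64Pcm: string) => void; // feed to an AudioWorklet player
  onTranscript: (text: string) => void; // show live captions
  onDone: () => void;                   // response finished
}

function routeEvent(raw: string, h: Handlers): void {
  const ev = JSON.parse(raw) as RealtimeEvent;
  switch (ev.type) {
    case "response.audio.delta":
      if (ev.delta) h.onAudio(ev.delta);
      break;
    case "response.audio_transcript.delta":
      if (ev.delta) h.onTranscript(ev.delta);
      break;
    case "response.done":
      h.onDone();
      break;
    // session.*, rate_limits.*, etc. are ignored by this sketch.
  }
}
```

Wire it to the socket with `ws.onmessage = (e) => routeEvent(e.data, handlers);`.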

Step 2 — Dockerfile

```dockerfile
FROM oven/bun:1.3-alpine
WORKDIR /app
COPY bun.lockb package.json ./
RUN bun install --frozen-lockfile
COPY . .
EXPOSE 8787
CMD ["bun", "run", "server.ts"]
```

Step 3 — fly.toml

```toml
app = "voice-edge"
primary_region = "iad"

[build]

[http_service]
  internal_port = 8787
  force_https = true

[deploy]
  strategy = "rolling"
```

Run fly deploy, then fly regions add lhr nrt syd fra gru to bring the fleet up to 6 edges.

Step 4 — Browser PCM

Use a 24kHz AudioWorklet to capture PCM16 chunks every 20ms and forward as base64-wrapped input_audio_buffer.append events.
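The encoding half of that step can be sketched as a pure function: Float32 samples from the AudioWorklet are clamped, scaled to signed 16-bit PCM, base64-encoded, and wrapped in an input_audio_buffer.append event. This version uses Buffer so it runs under Bun/Node; in the browser you would base64-encode with btoa over the raw bytes instead.

```typescript
// Convert Float32 AudioWorklet samples to base64 PCM16 (little-endian).
// Buffer is used for the base64 step, so this runs under Bun/Node;
// swap in btoa for a browser build.
function pcm16ToBase64(samples: Float32Array): string {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return Buffer.from(pcm.buffer).toString("base64");
}

// One ~20ms chunk becomes one append event on the relay socket.
function appendEvent(samples: Float32Array): string {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: pcm16ToBase64(samples),
  });
}
```

Each 20ms chunk at 24kHz is 480 samples, so one event carries 960 bytes of PCM before base64 expansion.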

Step 5 — gpt-realtime-2 reasoning

The 2026 gpt-realtime-2 model handles complex multi-tool calls in one turn. Set session.instructions with up to 8K tokens of policy without hurting latency.
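A hypothetical session.update payload makes this concrete: a long instructions block plus one callable tool. The field names follow the Realtime session schema, but the tool (book_callback) and the policy text are invented for this sketch — verify the schema against current docs before relying on it.

```typescript
// Hypothetical session.update exercising a larger instruction budget
// plus a mid-turn tool. book_callback is an illustrative tool name,
// not a real API; validate the session schema against current docs.
const POLICY = `You are the after-hours agent for a dental clinic.
Never give medical advice. Collect name, callback number, and urgency.`;

const sessionUpdate = {
  type: "session.update",
  session: {
    voice: "verse",
    instructions: POLICY, // the post claims up to ~8K tokens fit here
    turn_detection: { type: "semantic_vad" },
    tools: [
      {
        type: "function",
        name: "book_callback",
        description: "Schedule a callback for the caller",
        parameters: {
          type: "object",
          properties: {
            phone: { type: "string" },
            urgency: { type: "string", enum: ["routine", "urgent"] },
          },
          required: ["phone", "urgency"],
        },
      },
    ],
  },
};
```

Send it once the upstream socket opens, e.g. `oa.send(JSON.stringify(sessionUpdate))` in the relay's onopen handler.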

Still reading? Stop comparing — try CallSphere live.

CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.

Step 6 — Anycast tip

In Cloudflare, point *.callsphere.ai at voice-edge.fly.dev with proxied = false (DNS-only) so WebSocket round-trips bypass the proxy.

Pitfalls

  • Bun's WS spec drift: Some ws.send overloads differ from Node — test on Bun, not just locally.
  • Fly cold start: Set min_machines_running = 1 per region to keep voice cold-starts <50ms.
  • gpt-realtime-2 cost: Audio in/out roughly $0.07/$0.27 per minute (estimate; check current pricing).
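The cost pitfall above lends itself to a back-of-envelope model. The per-minute rates below are the post's estimates, not published pricing; swap in current numbers before budgeting.

```typescript
// Rough monthly cost model. Rates are estimates from this post,
// not published pricing — replace with current numbers.
const AUDIO_IN_PER_MIN = 0.07;   // $/min of caller audio, estimate
const AUDIO_OUT_PER_MIN = 0.27;  // $/min of agent audio, estimate
const FLY_REGION_PER_MONTH = 5;  // $/region, 256MB shared CPU

function monthlyCost(callMinutes: number, talkRatio: number, regions: number): number {
  // talkRatio: fraction of call time the agent is speaking (audio out).
  const audio =
    callMinutes * (1 - talkRatio) * AUDIO_IN_PER_MIN +
    callMinutes * talkRatio * AUDIO_OUT_PER_MIN;
  return audio + regions * FLY_REGION_PER_MONTH;
}
```

At 1,000 call minutes/month with the agent speaking half the time across 6 regions, the estimate lands around $200/month, dominated by audio-out.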

How CallSphere does this in production

CallSphere's edge voice fleet handles 1.2M+ minutes/month across 6 verticals with 37 agents and 90+ tools. Healthcare (FastAPI), OneRoof (Next.js 16 + React 19), Salon (NestJS 10 + Prisma), Sales (Node.js 20 + React 18 + Vite). All voice flows route through a Bun + Hono relay. $149/$499/$1,499, 14-day trial, 22% affiliate.

FAQ

Why not Node? Bun's WebSocket implementation is ~2x faster on raw throughput.

Cloudflare Workers? Workers impose WebSocket duration limits, and persistent connection state requires Durable Objects; Fly + Bun is simpler.

TURN servers? WebSocket relays don't need them; only WebRTC direct does.

Cost? Fly: ~$5/region/month for 256MB shared CPU. OpenAI: ~$0.20-$0.30/min.


How this plays out in production

To make the framing in "Build a Bun + Hono + OpenAI Realtime Voice Agent on the Edge (2026)" operational, the trade-off you cannot defer is channel routing between voice and chat — a missed call should not die, it should warm up the SMS or web-chat lane within seconds. Treat this as a voice-first system from the first prompt: the agent's persona, its tool surface, and its escalation rules all flow from that single decision. Teams that ship fast tend to instrument the loop end-to-end before they tune any single component, because the bottleneck is rarely where intuition puts it.

Voice agent architecture, end to end

A production-grade voice stack at CallSphere stitches Twilio Programmable Voice (PSTN ingress, TwiML, bidirectional Media Streams) to a realtime reasoning layer — typically OpenAI Realtime or ElevenLabs Conversational AI — with sub-second response as a hard SLO. Anything north of one second of perceived silence and callers either repeat themselves or hang up; that single number drives the whole architecture. Server-side VAD with proper barge-in support is non-negotiable, otherwise the agent talks over the caller and the conversation collapses. Streaming TTS with phoneme-aligned interruption keeps the cadence natural even when the user changes their mind mid-sentence.

Post-call, every transcript is run through a structured pipeline: sentiment, intent classification, lead score, escalation flag, and a normalized slot extraction (name, callback number, reason, urgency). For healthcare workloads, the BAA-covered storage path, audit logs, encryption-at-rest, and PHI-safe transcript redaction are wired in from day one, not bolted on at compliance review. The end state is a system where every call produces a row of structured data, not just a recording.

Production FAQ

What changes when you move a voice agent the way this post describes? Treat the architecture here as a starting point and instrument it before you tune it. The metrics that matter most early on are end-to-end latency (target < 1s for voice, < 3s for chat), barge-in correctness, tool-call success rate, and post-conversation lead score distribution. Optimize whatever the data flags as the bottleneck, not whatever feels slowest in your head.

Where does this break down for voice agent deployments at scale? The two failure modes that bite hardest are silent context loss across multi-turn handoffs and tool calls that succeed in dev but get rate-limited in production. Both are solvable with a proper agent backplane that pins state to a session ID, retries with backoff, and writes every tool invocation to an audit log you can replay.

How does the After-Hours Escalation product make sure no urgent call is dropped? It runs 7 agents on a Primary → Secondary → 6-fallback ladder with a 120-second ACK timeout per leg. If the primary on-call does not acknowledge inside the window, the next contact is paged automatically — voice, SMS, and push — until somebody owns the incident.

See it live

Book a 30-minute working session at calendly.com/sagar-callsphere/new-meeting and bring a real call flow — we will walk it through the live after-hours escalation product at escalation.callsphere.tech and show you exactly where the production wiring sits.
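The ACK-timeout ladder described above reduces to a small loop: walk the contact list, give each leg a fixed window to acknowledge, and stop at the first owner. This sketch is illustrative only — the page callback and contact names are assumptions, not CallSphere's API.

```typescript
// Sketch of an ACK-timeout escalation ladder. The page() callback is a
// hypothetical pager hook (voice/SMS/push) supplied by the caller.
async function escalate(
  contacts: string[],
  page: (contact: string) => Promise<boolean>, // resolves true on ACK
  ackTimeoutMs = 120_000, // 120s per leg, as described above
): Promise<string | null> {
  for (const contact of contacts) {
    // Race the page against the ACK window for this leg.
    const acked = await Promise.race([
      page(contact),
      new Promise<boolean>((res) => setTimeout(() => res(false), ackTimeoutMs)),
    ]);
    if (acked) return contact; // this contact owns the incident
  }
  return null; // ladder exhausted — surface to a human dashboard
}
```

A ladder of primary, secondary, and six fallbacks therefore bounds worst-case time-to-owner at 8 legs times the ACK window.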

Try CallSphere AI Voice Agents

See how AI voice agents work for your industry. Live demo available -- no signup required.