Build a Voice Agent with Daily Bots: Hosted Pipecat Cloud (2026)

Daily Bots gives you Pipecat as a managed service: POST a config, get a WebRTC bot. Includes server and RTVI client code, model swaps, and production pitfalls.

TL;DR — Daily Bots is the hosted version of Pipecat. You POST a JSON config to /start, get a Daily room URL back, and connect any browser/iOS/Android client with the RTVI SDK. No bot infrastructure to run yourself.

What you'll build

A short Next.js route handler that spins up a Cartesia-voiced GPT-4o bot (with Deepgram for speech-to-text) in a fresh Daily room, plus a short React client that joins it with mic and speaker, all running through Daily's global SFU.

Architecture

flowchart LR
  CL[Browser RTVI client] -- WebRTC --> RM[Daily room]
  AP[Your /start endpoint] -- POST /bots/start --> DB[Daily Bots API]
  DB -- spawns --> BOT[Hosted Pipecat bot]
  BOT -- audio --> RM --> CL

Step 1 — Get keys

Sign up at dashboard.daily.co for a Daily Bots account (separate from the Daily video API). Add your OpenAI + Cartesia keys in the dashboard secrets vault — Daily Bots references them by name, not value.

Step 2 — Server: POST /start

```ts
// app/api/start/route.ts
export async function POST() {
  const r = await fetch("https://api.daily.co/v1/bots/start", {
    method: "POST",
    headers: {
      // Read the key from the server environment; never ship it to the browser.
      Authorization: `Bearer ${process.env.DAILY_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      bot_profile: "voice_2024_10",
      max_duration: 600,
      services: { stt: "deepgram", llm: "openai", tts: "cartesia" },
      config: [
        { service: "vad", options: [{ name: "params", value: { stop_secs: 0.4 } }] },
        { service: "tts", options: [{ name: "voice", value: "79a125e8-cd45-4c13-8a67-188112f4dd22" }] },
        {
          service: "llm",
          options: [
            { name: "model", value: "gpt-4o" },
            {
              name: "initial_messages",
              value: [{ role: "system", content: "You are a friendly clinic concierge." }],
            },
            { name: "run_on_config", value: true },
          ],
        },
      ],
    }),
  });
  // Forward the upstream status so the client can tell a failed start from a good one.
  return Response.json(await r.json(), { status: r.status });
}
```

Step 3 — Client: RTVI React

```tsx
"use client";
import { RTVIClient } from "@pipecat-ai/client-js";
import { DailyTransport } from "@pipecat-ai/daily-transport";

export function VoiceBot() {
  async function connect() {
    // Ask our own server route to start a bot and return the room details.
    const { room_url, token } = await fetch("/api/start", { method: "POST" })
      .then((r) => r.json());
    const client = new RTVIClient({
      transport: new DailyTransport(),
      params: { baseUrl: room_url, token },
      enableMic: true,
      enableCam: false,
    });
    await client.connect();
  }
  // Assumed markup: a single button that starts the session.
  return <button onClick={connect}>Talk to the bot</button>;
}
```

Step 4 — Swap the LLM live

POST /bots/<id>/action with {"service":"llm","action":"set_model","arguments":[{"name":"model","value":"claude-3-5-sonnet"}]} and the bot hot-swaps providers mid-session.
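As a rough sketch of the request above, the helper below assembles the path and JSON body for a model swap. The names `buildSetModelAction` and `ActionRequest` are hypothetical, and the path and payload shape are copied from the text rather than checked against the API reference, so treat it as an illustration only.

```typescript
// Hypothetical helper: builds the request described above for POST /bots/<id>/action.
interface ActionRequest {
  path: string;
  body: {
    service: string;
    action: string;
    arguments: { name: string; value: string }[];
  };
}

function buildSetModelAction(botId: string, model: string): ActionRequest {
  if (!botId) throw new Error("botId is required");
  return {
    // Encode the id so unexpected characters cannot alter the path.
    path: `/bots/${encodeURIComponent(botId)}/action`,
    body: {
      service: "llm",
      action: "set_model",
      arguments: [{ name: "model", value: model }],
    },
  };
}

// Usage: serialize swap.body with JSON.stringify and POST it to swap.path.
const swap = buildSetModelAction("bot-123", "claude-3-5-sonnet");
```

Keeping the payload in one typed function means a typo in `service` or `action` is fixed in a single place instead of in every call site.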

Step 5 — Function calls

Add tools to the llm config. When the LLM emits a tool call, Daily Bots forwards it to your webhook URL and resumes once you POST the result back.
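The webhook side of that flow can be sketched as a small dispatcher. The payload shape (`tool_call_id`, `function_name`, `arguments`), the result shape, and the `get_opening_hours` tool are all assumptions made for illustration; check the real webhook contract before relying on any of these field names.

```typescript
// Assumed (hypothetical) shape of the tool call forwarded to your webhook.
interface ToolCall {
  tool_call_id: string;
  function_name: string;
  arguments: Record<string, unknown>;
}

// Assumed (hypothetical) shape of the result you POST back.
interface ToolResult {
  tool_call_id: string;
  result: unknown;
}

type ToolHandler = (args: Record<string, unknown>) => unknown;

// Hypothetical tool for the clinic concierge persona used in Step 2.
const toolHandlers: Record<string, ToolHandler> = {
  get_opening_hours: (args) => ({ day: String(args.day), hours: "09:00-17:00" }),
};

function handleToolCall(call: ToolCall): ToolResult {
  // Own-property check so names like "toString" are not treated as tools.
  const known = Object.prototype.hasOwnProperty.call(toolHandlers, call.function_name);
  if (!known) {
    return {
      tool_call_id: call.tool_call_id,
      result: { error: `unknown tool: ${call.function_name}` },
    };
  }
  return {
    tool_call_id: call.tool_call_id,
    result: toolHandlers[call.function_name](call.arguments),
  };
}
```

Returning an explicit error object for unknown tools, rather than throwing, lets the LLM recover in conversation instead of leaving the bot waiting on a result that never arrives.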

Step 6 — Inspect transcripts

Subscribe to the transcript RTVI message type on the client to render live captions, or pull the recording + transcript from the Daily REST API after the call.
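For the live-caption case, one possible approach is to keep finalized utterances and overwrite the in-progress one as new messages arrive. The `TranscriptMessage` shape (`text`, `final`) and the `CaptionBuffer` class are assumptions for this sketch, not the SDK's types.

```typescript
// Assumed (hypothetical) transcript payload: interim results have final = false.
interface TranscriptMessage {
  text: string;
  final: boolean;
}

class CaptionBuffer {
  private finals: string[] = [];
  private interim = "";

  // Final text is appended; interim text replaces the previous interim text.
  push(msg: TranscriptMessage): void {
    const text = msg.text.trim();
    if (msg.final) {
      if (text.length > 0) this.finals.push(text);
      this.interim = "";
    } else {
      this.interim = text;
    }
  }

  // Caption line: all finalized utterances followed by the current interim one.
  render(): string {
    return this.finals
      .concat(this.interim)
      .filter((s) => s.length > 0)
      .join(" ");
  }
}
```

In a React client you would call `push` from the transcript event handler and store `render()` in component state.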

Pitfalls

  • Bot profile pin: Always pin bot_profile: "voice_2024_10" (or newer dated tag) — "voice_latest" can change overnight and break configs.
  • run_on_config: true: Without it, the bot waits silently until the user speaks first — unfriendly for outbound calls.
  • Region selection: Pass { "geo": "us-east" } for SIP/PSTN bridging tasks — round-trip latency matters more than for browser-only bots.
  • Concurrency limits: Default is 5 concurrent bots — file a support ticket before scaling promos.
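The first two pitfalls can be caught before a request is ever sent. Below is a minimal sketch of a pre-flight check; `lintStartConfig` and its types are hypothetical names, and the "dated tag" rule is assumed to mean a `_YYYY_MM` suffix like the one in `voice_2024_10`.

```typescript
interface ServiceOption {
  name: string;
  value: unknown;
}
interface ServiceConfig {
  service: string;
  options: ServiceOption[];
}
interface StartConfig {
  bot_profile: string;
  config: ServiceConfig[];
}

// Hypothetical lint: returns one warning per pitfall found in the start config.
function lintStartConfig(cfg: StartConfig): string[] {
  const warnings: string[] = [];

  // Assumption: pinned profiles end in _YYYY_MM, e.g. "voice_2024_10".
  if (!/_\d{4}_\d{2}$/.test(cfg.bot_profile)) {
    warnings.push(`bot_profile "${cfg.bot_profile}" is not pinned to a dated tag`);
  }

  const llm = cfg.config.filter((c) => c.service === "llm")[0];
  const runOnConfig = llm?.options.filter((o) => o.name === "run_on_config")[0]?.value;
  if (runOnConfig !== true) {
    warnings.push("llm.run_on_config is not true; the bot will wait for the user to speak first");
  }

  return warnings;
}
```

Running a check like this in CI or at server startup surfaces a drifting profile tag before it reaches a live call.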

How CallSphere does this

CallSphere uses Daily Bots for spike-traffic webinar lines and demo lines while running Pipecat directly on k3s for steady traffic. 37 agents · 90+ tools · 115+ DB tables · 6 verticals · $149/$499/$1,499 · 14-day trial · 22% affiliate.

FAQ

Pricing? Per-bot-minute, billed against the model providers you choose plus a Daily margin — typically $0.05-0.20/min.
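To turn that per-minute range into a per-call ceiling, multiply by the session length. The sketch below uses the $0.05 to $0.20 range quoted above as default rates and works in integer cents to avoid floating-point drift; `estimateCostCents` is a hypothetical helper, and it assumes `max_duration` is expressed in seconds.

```typescript
// Hypothetical estimator; rates default to the range quoted in the FAQ (in cents per minute).
function estimateCostCents(
  durationSecs: number,
  lowCentsPerMin = 5,
  highCentsPerMin = 20,
): { low: number; high: number } {
  if (durationSecs < 0) throw new Error("durationSecs must be non-negative");
  const minutes = durationSecs / 60;
  return {
    low: Math.round(minutes * lowCentsPerMin),
    high: Math.round(minutes * highCentsPerMin),
  };
}

// A call that runs the full max_duration of 600 from Step 2 (10 minutes, if seconds):
const perCall = estimateCostCents(600); // { low: 50, high: 200 }
```

So under these assumptions a session that hits the cap costs somewhere between $0.50 and $2.00.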

SIP/PSTN? Yes, via Daily's Pinless SIP — POST {"sip": {"display_name": "..."}} to bridge a phone number into the bot's room.

Recording? Set recording_settings: { type: "cloud" } — MP4 + transcript appear in your S3 in ~30s.

Open-source fallback? Run the same Pipecat config on your own infra — same bot code, just self-hosted.


## How this plays out in production

Building on the discussion above in *Build a Voice Agent with Daily Bots: Hosted Pipecat Cloud (2026)*, the place this gets non-obvious in production is the latency budget: every leg of the audio loop (capture, ASR, reasoning, TTS, transport) eats into the <1s response window callers expect. Treat this as a voice-first system from the first prompt: the agent's persona, its tool surface, and its escalation rules all flow from that single decision. Teams that ship fast tend to instrument the loop end-to-end before they tune any single component, because the bottleneck is rarely where intuition puts it.

## Voice agent architecture, end to end

A production-grade voice stack at CallSphere stitches Twilio Programmable Voice (PSTN ingress, TwiML, bidirectional Media Streams) to a realtime reasoning layer, typically OpenAI Realtime or ElevenLabs Conversational AI, with sub-second response as a hard SLO. Anything north of one second of perceived silence and callers either repeat themselves or hang up; that single number drives the whole architecture. Server-side VAD with proper barge-in support is non-negotiable, otherwise the agent talks over the caller and the conversation collapses. Streaming TTS with phoneme-aligned interruption keeps the cadence natural even when the user changes their mind mid-sentence.

Post-call, every transcript is run through a structured pipeline: sentiment, intent classification, lead score, escalation flag, and a normalized slot extraction (name, callback number, reason, urgency). For healthcare workloads, the BAA-covered storage path, audit logs, encryption-at-rest, and PHI-safe transcript redaction are wired in from day one, not bolted on at compliance review. The end state is a system where every call produces a row of structured data, not just a recording.
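Returning to the latency budget described above, a first instrumentation pass can be as simple as summing the per-leg timings and flagging the slowest one. The function name `summarizeBudget` and the sample numbers are illustrative assumptions, not measurements.

```typescript
// The five legs of the audio loop named in the text.
type Leg = "capture" | "asr" | "reasoning" | "tts" | "transport";

interface BudgetSummary {
  totalMs: number;
  overBudget: boolean;
  slowestLeg: Leg;
}

// Sums per-leg latency and compares it with the response window (1000 ms by default).
function summarizeBudget(legsMs: Record<Leg, number>, budgetMs = 1000): BudgetSummary {
  const legs = Object.keys(legsMs) as Leg[];
  let totalMs = 0;
  let slowestLeg: Leg = legs[0];
  for (const leg of legs) {
    totalMs += legsMs[leg];
    if (legsMs[leg] > legsMs[slowestLeg]) slowestLeg = leg;
  }
  return { totalMs, overBudget: totalMs > budgetMs, slowestLeg };
}

// Illustrative numbers only: 40 + 250 + 450 + 200 + 120 = 1060 ms.
const sample = summarizeBudget({
  capture: 40,
  asr: 250,
  reasoning: 450,
  tts: 200,
  transport: 120,
});
// sample: { totalMs: 1060, overBudget: true, slowestLeg: "reasoning" }
```

Feeding real measurements into a summary like this gives a data-driven answer to which leg to tune first.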
## FAQ

**What changes when you move a voice agent the way *Build a Voice Agent with Daily Bots: Hosted Pipecat Cloud (2026)* describes?** Treat the architecture in this post as a starting point and instrument it before you tune it. The metrics that matter most early on are end-to-end latency (target < 1s for voice, < 3s for chat), barge-in correctness, tool-call success rate, and post-conversation lead score distribution. Optimize whatever the data flags as the bottleneck, not whatever feels slowest in your head.

**Where does this break down for voice agent deployments at scale?** The two failure modes that bite hardest are silent context loss across multi-turn handoffs and tool calls that succeed in dev but get rate-limited in production. Both are solvable with a proper agent backplane that pins state to a session ID, retries with backoff, and writes every tool invocation to an audit log you can replay.

**How does the CallSphere healthcare voice agent handle a typical patient intake?** The healthcare stack runs 14 specialist tools against 20+ database tables, captures intent and slots in real time, and produces a post-call sentiment score, lead score, and escalation flag for every conversation, so the front desk inherits a triaged queue, not a stack of voicemails.

## See it live

Book a 30-minute working session at [calendly.com/sagar-callsphere/new-meeting](https://calendly.com/sagar-callsphere/new-meeting) and bring a real call flow. We will walk it through the live healthcare voice agent at [healthcare.callsphere.tech](https://healthcare.callsphere.tech) and show you exactly where the production wiring sits.