Build a Voice Agent with Cartesia Sonic-3 TTS (40ms First Audio, 2026)
Cartesia Sonic-3 returns first audio in ~40ms with controllable emotion and laughter tags. Wire it into a Pipecat agent — Python code, voice cloning, pitfalls.
TL;DR: Cartesia Sonic-3 is the fastest streaming TTS of 2026, with 40ms time-to-first-audio, fine-grained <volume>/<speed>/<emotion> tags, AI laughter, and a 30-second voice clone. Pair it with any voice agent and you'll cut p95 voice-to-voice latency by 100-200ms.
What you'll build
A Pipecat voice agent that uses Sonic-3 streaming over WebSocket, applies inline emotion tags from the LLM, and clones a brand voice from a 30-second WAV — running on Daily WebRTC.
Architecture
flowchart LR
CL[Caller] --> RM[Daily room]
RM --> ST[Deepgram]
ST --> LL[GPT-4o + emotion markup]
LL --> CR[Cartesia Sonic-3 WS]
CR -- 40ms first audio --> RM --> CL
Step 1 — Install
```bash
pip install "cartesia[websockets]" "pipecat-ai[daily,deepgram,openai,cartesia]"
```
Step 2 — Quick TTS test
```python
from cartesia import Cartesia
import os
import sounddevice as sd
import numpy as np

c = Cartesia(api_key=os.environ["CARTESIA_API_KEY"])
ws = c.tts.websocket()
out = ws.send(
    model_id="sonic-3",
    voice={"id": "79a125e8-cd45-4c13-8a67-188112f4dd22"},
    transcript="
```
Step 3 — Clone a brand voice
```python
# Upload a ~30-second reference clip and register it as a cloned voice.
with open("brand_voice_30s.wav", "rb") as clip:
    voice = c.voices.clone(
        clip=clip,
        name="Sunrise Brand",
        language="en",
        mode="similarity",  # 'similarity' for fidelity, 'stability' for novel sentences
    )
print(voice.id)  # save this UUID
```
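Before uploading, it can help to confirm that the clip really is about 30 seconds long. The sketch below uses only the standard-library `wave` module. The helper names `wav_duration_seconds` and `is_reasonable_clone_clip` and the 25-35 second window are illustrative assumptions, not Cartesia requirements.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
    return frames / float(rate)


def is_reasonable_clone_clip(path: str, low: float = 25.0, high: float = 35.0) -> bool:
    """Check the clip length against an assumed 25-35 s window (hypothetical bounds)."""
    return low <= wav_duration_seconds(path) <= high
```

Calling `is_reasonable_clone_clip("brand_voice_30s.wav")` before the clone step is one way to catch a truncated or overly long recording early.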
Step 4 — Wire into Pipecat
```python
from pipecat.services.cartesia.tts import CartesiaTTSService

tts = CartesiaTTSService(
    api_key=os.environ["CARTESIA_API_KEY"],
    voice_id="
```
Step 5 — Inline emotion from the LLM
Add to the system prompt:
```
Wrap key phrases in emotion tags:
```
Sonic-3 parses the tags and modulates accordingly — no extra API call needed.
Step 6 — LiveKit plugin variant
```python
from livekit.plugins import cartesia

session = AgentSession(
    tts=cartesia.TTS(model="sonic-3", voice="
```
Step 7 — Latency budget
Realistic 2026 budget for end-to-end voice-to-voice:
- STT: 150ms (Deepgram Nova-3)
- LLM TTFB: 250ms (GPT-4o)
- TTS first audio: 40ms (Sonic-3)
- Network round trip: 80ms
- Total: ~520ms p50
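The budget above is a plain sum, which a few lines of Python can reproduce. This is a minimal sketch that assumes the stages run strictly one after another. The names `BUDGET_MS`, `total_ms`, and `share` are illustrative.

```python
# Per-stage p50 estimates in milliseconds, copied from the budget above.
BUDGET_MS = {
    "stt": 150,             # Deepgram Nova-3
    "llm_ttfb": 250,        # GPT-4o
    "tts_first_audio": 40,  # Sonic-3
    "network_rtt": 80,
}


def total_ms(budget: dict[str, int]) -> int:
    """Sum the stage latencies, assuming they run strictly in sequence."""
    return sum(budget.values())


def share(budget: dict[str, int], stage: str) -> float:
    """Fraction of the total budget spent in one stage."""
    return budget[stage] / total_ms(budget)


if __name__ == "__main__":
    print(f"total: {total_ms(BUDGET_MS)} ms")
    for name in BUDGET_MS:
        print(f"{name}: {share(BUDGET_MS, name):.0%}")
```

Swapping in your own measured numbers shows quickly which stage dominates. With the figures above, TTS accounts for 40 of the 520 ms.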
Pitfalls
- Snapshot pinning: `sonic-3` floats; pin `sonic-3-2026-01-12` for production reproducibility.
- Emotion tag escaping: Don't let user transcripts inject unescaped `<emotion>` tags; sanitize them.
- Voice clone licensing: You must have rights to the source clip; Cartesia ToS is strict here.
- PCM vs MP3: For voice agents always use `pcm_f32le`; MP3 adds 50-150ms decode latency.
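For the tag-escaping pitfall, one possible approach is to strip control tags from user-originated text before it reaches the prompt or the TTS input. The sketch below is an assumption-laden example. The tag names come from the TL;DR, the exact grammar Cartesia accepts is not confirmed here, and `strip_control_tags` is a hypothetical helper.

```python
import re

# Tag names taken from the article's TL;DR; the exact grammar Cartesia accepts
# is an assumption, so the pattern is deliberately broad.
_CONTROL_TAG = re.compile(r"</?\s*(?:emotion|volume|speed)\b[^>]*>", re.IGNORECASE)


def strip_control_tags(user_text: str) -> str:
    """Remove emotion/volume/speed tags (opening, closing, or self-closing)."""
    return _CONTROL_TAG.sub("", user_text)
```

Running this on the STT transcript, and not on the LLM's own output, keeps the model's intentional markup intact while dropping anything a caller tried to inject.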
How CallSphere does this
CallSphere voices its 6 verticals with cloned brand voices on Sonic-3, feeding 37 agents · 90+ tools · 115+ DB tables. Voice-to-voice p95 is ~720ms across the fleet. $149/$499/$1,499 · 14-day trial · 22% affiliate.
FAQ
Pricing? ~$15/M characters — competitive with ElevenLabs Turbo, ~3x cheaper than Multilingual v2.
Multilingual? Yes, 15+ languages with native pronunciation; specify language: "es" etc.
SSML? Sonic-3 prefers Cartesia's tag syntax over SSML; both are supported.
Self-hosting? No — cloud-only API, but with regional endpoints in US/EU.
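The pricing answer above can be turned into a rough estimate. This sketch assumes the approximate $15 per million characters figure quoted in the FAQ and a flat per-character rate. The function name `tts_cost_usd` is illustrative, and real billing may differ.

```python
PRICE_PER_MILLION_CHARS_USD = 15.0  # approximate figure quoted in the FAQ


def tts_cost_usd(num_chars: int, price_per_million: float = PRICE_PER_MILLION_CHARS_USD) -> float:
    """Estimated TTS spend for a given number of synthesized characters."""
    if num_chars < 0:
        raise ValueError("num_chars must be non-negative")
    return num_chars / 1_000_000 * price_per_million
```

For example, 2,000 synthesized characters comes to about $0.03 at that rate.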
Sources
- Cartesia Docs - Sonic 3 - https://docs.cartesia.ai/build-with-cartesia/tts-models/latest
- Cartesia Sonic Page - https://cartesia.ai/sonic
- GetStream - Build a Voice AI App with Sonic 3 - https://getstream.io/blog/cartesia-sonic-3-tts/
- LiveKit Docs - Cartesia TTS - https://docs.livekit.io/agents/models/tts/cartesia/