Building Conversational AI with WebRTC and LLMs: Real-Time Voice Agents
A technical guide to building real-time voice AI agents using WebRTC for audio transport, speech-to-text, LLM reasoning, and text-to-speech in a low-latency pipeline.
Voice Is the Next Interface for AI Agents
Text-based AI interactions dominate today, but voice is the natural human communication medium. Building voice AI agents that feel conversational — with low latency, natural turn-taking, and contextual understanding — requires integrating multiple real-time systems: audio transport (WebRTC), speech recognition (STT), language model reasoning (LLM), and speech synthesis (TTS).
The technical challenge is latency. A human-to-human conversation has roughly 200-300ms of silence between turns. To feel natural, a voice AI agent must perceive speech, understand it, reason about a response, generate speech, and deliver audio within a similar window.
Architecture Overview
```
User's Browser
      |
      |  WebRTC (audio stream)
      v
Media Server (audio processing)
      |
      +--> VAD (Voice Activity Detection) --> STT (Speech-to-Text)
      |                                                |
      |                                          LLM Reasoning
      |                                                |
      +<-- Audio Stream <-- TTS (Text-to-Speech) <----+
```
WebRTC: The Audio Transport Layer
WebRTC provides peer-to-peer real-time communication with built-in handling for NAT traversal, codec negotiation, and network adaptation. For voice AI, it solves critical problems:
- Low latency: Sub-100ms audio delivery over UDP with adaptive bitrate
- Echo cancellation: Built-in AEC prevents the agent from hearing its own voice through the user's speakers
- Noise suppression: Reduces background noise before audio reaches the STT model
- Browser support: No plugins required; works in all modern browsers
Server-Side WebRTC with Mediasoup or LiveKit
For production deployments, a media server sits between the user and the AI pipeline:
```javascript
// LiveKit server-side participant (simplified)
import { RoomServiceClient, AccessToken } from 'livekit-server-sdk';
import { Room, RoomEvent } from '@livekit/rtc-node';

const roomService = new RoomServiceClient(LIVEKIT_URL, API_KEY, API_SECRET);

// Create a room for the voice session
await roomService.createRoom({ name: 'voice-session-123' });

// Mint a token so the AI agent can join as a participant
const agentToken = new AccessToken(API_KEY, API_SECRET, { identity: 'ai-agent' });
agentToken.addGrant({ roomJoin: true, room: 'voice-session-123' });

const room = new Room();
await room.connect(LIVEKIT_URL, await agentToken.toJwt());

// Receive audio from the user and feed it into the STT/LLM/TTS pipeline
room.on(RoomEvent.TrackSubscribed, (track) => {
  processAudioStream(track); // processAudioStream: your pipeline's entry point
});
```
Voice Activity Detection (VAD)
VAD determines when the user starts and stops speaking. This is critical for turn-taking:
- Silero VAD: Open-source model with high accuracy and low latency (< 10ms). The most popular choice for voice agent pipelines; a minimal usage sketch follows this list.
- WebRTC's built-in VAD: Lower accuracy but zero additional compute cost.
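As a rough illustration, here is a minimal Silero VAD loop in Python. It assumes the model is loaded via torch.hub and that audio arrives as 16 kHz mono 16-bit PCM in 512-sample frames; the 0.5 threshold and the `audio_frames()` source are placeholders to adapt to your own pipeline.

```python
import torch
import numpy as np

# Load Silero VAD (a small model that runs comfortably on CPU)
model, _ = torch.hub.load("snakers4/silero-vad", "silero_vad")

SAMPLE_RATE = 16000
FRAME_SAMPLES = 512  # Silero VAD expects 512-sample chunks at 16 kHz

def is_speech(pcm_frame: bytes, threshold: float = 0.5) -> bool:
    """Return True if a 512-sample, 16-bit PCM frame contains speech."""
    samples = np.frombuffer(pcm_frame, dtype=np.int16).astype(np.float32) / 32768.0
    prob = model(torch.from_numpy(samples), SAMPLE_RATE).item()
    return prob > threshold

# Example turn-taking loop; audio_frames() is a placeholder for your
# WebRTC audio source yielding 512-sample PCM frames.
# for frame in audio_frames():
#     if is_speech(frame):
#         ...  # user is talking: feed STT, or trigger barge-in handling
```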
Handling Interruptions
Natural conversation includes interruptions. When the user starts speaking while the agent is talking, the pipeline must (as sketched after this list):
- Detect user speech onset via VAD
- Immediately stop TTS playback
- Discard any un-played generated audio
- Process the user's new utterance
- Generate a fresh response that acknowledges the interruption if appropriate
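A hedged sketch of that barge-in logic, assuming an asyncio-based pipeline; `play_tts()` and `clear_audio_queue()` are hypothetical helpers standing in for whatever playback path the agent actually uses.

```python
import asyncio

class AgentSpeech:
    """Tracks the agent's current TTS playback so it can be interrupted."""

    def __init__(self):
        self._playback_task = None

    def start_speaking(self, audio_chunks):
        # play_tts() is a placeholder that streams synthesized audio to the user
        self._playback_task = asyncio.create_task(play_tts(audio_chunks))

    async def interrupt(self):
        """Call when VAD detects user speech onset while the agent is talking."""
        if self._playback_task and not self._playback_task.done():
            self._playback_task.cancel()      # stop TTS playback immediately
            try:
                await self._playback_task
            except asyncio.CancelledError:
                pass
        await clear_audio_queue()             # discard any un-played generated audio
        # The caller then routes the user's new utterance through STT -> LLM as usual
```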
Speech-to-Text Pipeline
Streaming STT
For low latency, STT must process audio incrementally rather than waiting for the complete utterance:
- Deepgram: Streaming API with 200-300ms latency, strong accuracy, and speaker diarization
- OpenAI Whisper (self-hosted): whisper.cpp or faster-whisper for on-premise deployments
- AssemblyAI: Real-time transcription with under 300ms latency
Optimizing STT Latency
- Stream audio in small chunks (20-100ms frames) rather than waiting for silence (see the framing sketch after this list)
- Use endpointing models that detect end-of-utterance faster than fixed silence timeouts
- Pre-warm STT connections to eliminate cold-start latency on the first utterance
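To make the first point concrete, here is a small sketch that slices 16 kHz, 16-bit mono PCM into 20 ms frames before forwarding it; `stt_connection.send()` is a stand-in for whichever provider SDK or websocket you actually use.

```python
SAMPLE_RATE = 16000       # Hz
BYTES_PER_SAMPLE = 2      # 16-bit PCM
FRAME_MS = 20             # 20 ms frames keep transcription latency low
FRAME_BYTES = SAMPLE_RATE * BYTES_PER_SAMPLE * FRAME_MS // 1000  # 640 bytes

async def pump_audio(audio_source, stt_connection):
    """Forward raw PCM to a streaming STT connection in 20 ms frames."""
    pending = b""
    async for chunk in audio_source:          # e.g. decoded WebRTC audio
        pending += chunk
        while len(pending) >= FRAME_BYTES:
            frame, pending = pending[:FRAME_BYTES], pending[FRAME_BYTES:]
            await stt_connection.send(frame)  # placeholder for the provider API
```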
LLM Reasoning Layer
The LLM processes the transcribed text and generates a response. For voice, two optimizations are critical:
Streaming Token Generation
Start TTS on the first generated tokens without waiting for the complete response. This "time to first byte" optimization can reduce perceived latency by 1-3 seconds:
```python
async def stream_llm_to_tts(transcript: str):
    buffer = ""
    async for chunk in llm.stream(messages=[{"role": "user", "content": transcript}]):
        buffer += chunk.text
        # Send to TTS at sentence boundaries for natural speech
        if buffer.endswith((".", "!", "?", ":")):
            audio = await tts.synthesize(buffer)
            await send_audio_to_user(audio)
            buffer = ""
    # Flush any trailing text that didn't end on a sentence boundary
    if buffer.strip():
        audio = await tts.synthesize(buffer)
        await send_audio_to_user(audio)
```
Voice-Optimized Prompting
LLM responses for voice agents should be (an example system prompt follows this list):
- Concise: 1-3 sentences per turn, not paragraphs
- Conversational: Use contractions, simple vocabulary, and natural phrasing
- Action-oriented: Confirm actions clearly ("I've updated your appointment to Thursday at 3 PM")
- Turn-taking aware: End with a question or clear stopping point
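One possible system prompt that encodes these rules (a starting point, not a canonical prompt):

```python
VOICE_SYSTEM_PROMPT = """\
You are a voice assistant on a phone call. Follow these rules:
- Reply in 1-3 short, conversational sentences. Never read out lists or markdown.
- Use contractions and plain vocabulary, as people do when speaking.
- When you take an action, confirm it explicitly (e.g. "I've moved your appointment to Thursday at 3 PM").
- End each turn with a question or a clear stopping point so the caller knows it's their turn.
"""
```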
Text-to-Speech
Low-Latency TTS Options
| Provider | Latency | Quality | Streaming |
|---|---|---|---|
| ElevenLabs | 200-400ms | Very high | Yes |
| OpenAI TTS | 300-500ms | High | Yes |
| Cartesia | 100-200ms | High | Yes |
| XTTS v2 (open source) | 300-600ms | Good | Yes |
Voice Cloning and Consistency
Production voice agents need consistent voice characteristics across sessions. Most TTS providers support voice cloning from a short audio sample (10-30 seconds), allowing organizations to create branded agent voices.
End-to-End Latency Budget
For a natural-feeling conversation, the total pipeline latency should stay close to one second; a realistic per-component budget looks like this:
| Component | Target Latency |
|---|---|
| WebRTC transport | 50-100ms |
| VAD + endpointing | 200-300ms |
| STT transcription | 200-300ms |
| LLM time-to-first-token | 200-400ms |
| TTS time-to-first-audio | 150-300ms |
| Total | 800-1400ms |
Achieving the lower end of this range requires careful optimization at every stage, geographic co-location of services, and streaming throughout the pipeline rather than sequential processing.
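One way to keep a budget like this honest is to timestamp every stage on every turn. A minimal sketch (the stage names and logging are illustrative, not a specific API):

```python
import time
from contextlib import contextmanager

turn_timings = {}

@contextmanager
def stage(name):
    """Record wall-clock milliseconds spent in one pipeline stage for the current turn."""
    start = time.monotonic()
    try:
        yield
    finally:
        turn_timings[name] = (time.monotonic() - start) * 1000

# Per turn, wrap the real calls, e.g.:
#   with stage("stt"):      transcript = await transcribe(utterance_audio)
#   with stage("llm_ttft"): first_token = await first_llm_token(transcript)
#   with stage("tts_ttfa"): first_audio = await first_tts_chunk(first_token)
# then log turn_timings and alert when any stage blows its budget.
```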
Production Considerations
- Fallback handling: When any pipeline component fails, the agent should gracefully communicate the issue rather than going silent (a sketch follows this list)
- Session persistence: Maintain conversation state across WebRTC reconnections (mobile users switching between WiFi and cellular)
- Recording and transcription: Log complete conversations for quality review, with appropriate privacy disclosures
- Scalability: WebRTC media servers need horizontal scaling for concurrent sessions, typically 50-200 sessions per server
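For the fallback point, a hedged sketch of a timeout-plus-fallback wrapper around a single pipeline stage; `speak()` and the message wording are placeholders.

```python
import asyncio

FALLBACK_MESSAGE = "Sorry, I'm having trouble on my end. Could you say that again?"

async def with_fallback(stage_coro, timeout_s=3.0):
    """Run one pipeline stage; if it fails or stalls, say something instead of going silent."""
    try:
        return await asyncio.wait_for(stage_coro, timeout=timeout_s)
    except Exception:
        # speak() is a placeholder for the agent's TTS playback path
        await speak(FALLBACK_MESSAGE)
        return None
```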
Sources: LiveKit Documentation | Deepgram Streaming API | Silero VAD