
Building Conversational AI with WebRTC and LLMs: Real-Time Voice Agents

A technical guide to building real-time voice AI agents using WebRTC for audio transport, speech-to-text, LLM reasoning, and text-to-speech in a low-latency pipeline.

Voice Is the Next Interface for AI Agents

Text-based AI interactions dominate today, but voice is the natural human communication medium. Building voice AI agents that feel conversational — with low latency, natural turn-taking, and contextual understanding — requires integrating multiple real-time systems: audio transport (WebRTC), speech recognition (STT), language model reasoning (LLM), and speech synthesis (TTS).

The technical challenge is latency. A human-to-human conversation has roughly 200-300ms of silence between turns. To feel natural, a voice AI agent must perceive speech, understand it, reason about a response, generate speech, and deliver audio within a similar window.

Architecture Overview

User's Browser
    |
    | WebRTC (audio stream)
    |
Media Server (audio processing)
    |
    +-> VAD (Voice Activity Detection) -> STT (Speech-to-Text)
    |                                         |
    |                                    LLM Reasoning
    |                                         |
    +<- Audio Stream <-- TTS (Text-to-Speech) <-+

WebRTC: The Audio Transport Layer

WebRTC provides peer-to-peer real-time communication with built-in handling for NAT traversal, codec negotiation, and network adaptation. For voice AI, it solves critical problems:

  • Low latency: Sub-100ms audio delivery over UDP with adaptive bitrate
  • Echo cancellation: Built-in AEC prevents the agent from hearing its own voice through the user's speakers
  • Noise suppression: Reduces background noise before audio reaches the STT model
  • Browser support: No plugins required; works in all modern browsers

Server-Side WebRTC with Mediasoup or LiveKit

For production deployments, a media server sits between the user and the AI pipeline:

// LiveKit server-side participant (simplified)
import { RoomServiceClient, AccessToken } from 'livekit-server-sdk';
import { Room } from '@livekit/rtc-node';

const roomService = new RoomServiceClient(LIVEKIT_URL, API_KEY, API_SECRET);

// Create a room for the voice session
await roomService.createRoom({ name: 'voice-session-123' });

// Mint an access token so the AI agent can join as a participant
const token = new AccessToken(API_KEY, API_SECRET, { identity: 'ai-agent' });
token.addGrant({ roomJoin: true, room: 'voice-session-123' });

// AI agent connects to the room as a server-side participant
const room = new Room();
await room.connect(LIVEKIT_URL, await token.toJwt());

// Receive audio from the user and hand it off to the STT pipeline
room.on('trackSubscribed', async (track) => {
    await processAudioStream(track);
});

Voice Activity Detection (VAD)

VAD determines when the user starts and stops speaking. This is critical for turn-taking:

  • Silero VAD: Open-source model with high accuracy and low latency (< 10ms). The most popular choice for voice agent pipelines.
  • WebRTC's built-in VAD: Lower accuracy but zero additional compute cost.
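
As a rough sketch of how Silero VAD slots into the pipeline, the snippet below loads the model through its published torch.hub entry point and runs the streaming VADIterator over fixed-size chunks; the input file and the actions taken on speech start/end are placeholders for the real pipeline.

import torch

# Load Silero VAD via torch.hub; utils bundles helpers including the streaming VADIterator
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
(get_speech_timestamps, _, read_audio, VADIterator, _) = utils

vad = VADIterator(model, sampling_rate=16000)
wav = read_audio('caller.wav', sampling_rate=16000)   # placeholder recording

# Feed 512-sample chunks (32 ms at 16 kHz), the frame size Silero expects
num_chunks = len(wav) // 512
for i in range(num_chunks):
    chunk = wav[i * 512:(i + 1) * 512]
    event = vad(chunk, return_seconds=True)
    if event and 'start' in event:
        print(f"speech started at {event['start']}s")   # barge-in: stop TTS playback here
    elif event and 'end' in event:
        print(f"speech ended at {event['end']}s")       # hand the utterance to STT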

Handling Interruptions

Natural conversation includes interruptions. When the user starts speaking while the agent is talking:

  1. Detect user speech onset via VAD
  2. Immediately stop TTS playback
  3. Discard any un-played generated audio
  4. Process the user's new utterance
  5. Generate a fresh response that acknowledges the interruption if appropriate
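
A minimal sketch of this barge-in flow, assuming hypothetical vad_events, tts_player, and pipeline objects standing in for the surrounding components:

async def handle_turns(vad_events, tts_player, pipeline):
    """Turn-taking loop for one voice session."""
    async for event in vad_events:                        # 1. VAD emits speech on/offset events
        if event.kind == "speech_start":
            if tts_player.is_playing():
                tts_player.stop()                         # 2. stop TTS playback immediately
                tts_player.clear_pending_audio()          # 3. discard un-played audio
        elif event.kind == "speech_end":
            transcript = await pipeline.transcribe(event.audio)  # 4. process the new utterance
            await pipeline.respond(transcript)            # 5. generate a fresh response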

Speech-to-Text Pipeline

Streaming STT

For low latency, STT must process audio incrementally rather than waiting for the complete utterance:

  • Deepgram: Streaming API with 200-300ms latency, strong accuracy, and speaker diarization
  • OpenAI Whisper (self-hosted): whisper.cpp or faster-whisper for on-premise deployments
  • AssemblyAI: Real-time transcription with under 300ms latency
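
For the self-hosted route above, a minimal faster-whisper example; it transcribes one buffered utterance per call rather than holding a streaming connection, and the model size and compute type are illustrative choices:

from faster_whisper import WhisperModel

# A small model with int8 quantization keeps per-utterance latency low on CPU;
# a GPU and a larger model trade compute for accuracy
model = WhisperModel("small", device="cpu", compute_type="int8")

# In a real-time pipeline this runs once per VAD-detected utterance
segments, info = model.transcribe("utterance.wav", language="en", beam_size=1)
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")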

Optimizing STT Latency

  • Stream audio in small chunks (20-100ms frames) rather than waiting for silence
  • Use endpointing models that detect end-of-utterance faster than fixed silence timeouts
  • Pre-warm STT connections to eliminate cold-start latency on the first utterance
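
Framing audio into fixed-size chunks before it hits the STT connection is mechanical; here is a sketch assuming 16-bit PCM at 16 kHz and a hypothetical stt_connection with an async send method:

SAMPLE_RATE = 16000          # Hz
FRAME_MS = 20                # emit a frame every 20 ms
BYTES_PER_SAMPLE = 2         # 16-bit PCM
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE

async def stream_frames(pcm_source, stt_connection):
    """Forward raw PCM to a streaming STT connection in 20 ms frames."""
    buffer = b""
    async for chunk in pcm_source:
        buffer += chunk
        while len(buffer) >= FRAME_BYTES:
            await stt_connection.send(buffer[:FRAME_BYTES])
            buffer = buffer[FRAME_BYTES:]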

LLM Reasoning Layer

The LLM processes the transcribed text and generates a response. For voice, two optimizations are critical:

Streaming Token Generation

Start TTS on the first generated tokens rather than waiting for the complete response. This "time to first audio" optimization can reduce perceived latency by 1-3 seconds:


async def stream_llm_to_tts(transcript: str):
    buffer = ""
    async for chunk in llm.stream(messages=[{"role": "user", "content": transcript}]):
        buffer += chunk.text
        # Send to TTS at sentence boundaries for natural speech
        if buffer.endswith((".", "!", "?", ":")):
            audio = await tts.synthesize(buffer)
            await send_audio_to_user(audio)
            buffer = ""
    # Flush any trailing text that didn't end on a sentence boundary
    if buffer.strip():
        audio = await tts.synthesize(buffer)
        await send_audio_to_user(audio)

Voice-Optimized Prompting

LLM responses for voice agents should be:

  • Concise: 1-3 sentences per turn, not paragraphs
  • Conversational: Use contractions, simple vocabulary, and natural phrasing
  • Action-oriented: Confirm actions clearly ("I've updated your appointment to Thursday at 3 PM")
  • Turn-taking aware: End with a question or clear stopping point
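
Put together, a system prompt encoding these constraints might look like the following (the wording is illustrative, not a canonical prompt):

VOICE_SYSTEM_PROMPT = """You are a voice assistant on a live phone call.
- Keep every reply to one to three short sentences.
- Speak naturally: use contractions and everyday words.
- When you take an action, confirm it plainly ("I've moved your appointment to Thursday at 3 PM").
- End each turn with a question or a clear stopping point so the caller knows it's their turn."""

messages = [
    {"role": "system", "content": VOICE_SYSTEM_PROMPT},
    {"role": "user", "content": transcript},   # transcript comes from the STT stage
]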

Text-to-Speech

Low-Latency TTS Options

Provider                Latency      Quality     Streaming
ElevenLabs              200-400ms    Very high   Yes
OpenAI TTS              300-500ms    High        Yes
Cartesia                100-200ms    High        Yes
XTTS v2 (open source)   300-600ms    Good        Yes
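
As one example of the streaming column above, ElevenLabs exposes a streaming synthesis endpoint over its REST API; in the sketch below the voice ID, API key, and model name are placeholders, and exact request fields may differ by API version:

import requests

ELEVENLABS_API_KEY = "..."          # placeholder credential
VOICE_ID = "your-voice-id"          # placeholder voice

def stream_tts(text: str):
    """Yield audio chunks as ElevenLabs streams synthesized speech back."""
    response = requests.post(
        f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}/stream",
        headers={"xi-api-key": ELEVENLABS_API_KEY, "Content-Type": "application/json"},
        json={"text": text, "model_id": "eleven_turbo_v2"},
        stream=True,
    )
    response.raise_for_status()
    for chunk in response.iter_content(chunk_size=4096):
        if chunk:
            yield chunk   # forward each chunk to the outbound WebRTC audio track as it arrives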

Voice Cloning and Consistency

Production voice agents need consistent voice characteristics across sessions. Most TTS providers support voice cloning from a short audio sample (10-30 seconds), allowing organizations to create branded agent voices.

End-to-End Latency Budget

For a natural-feeling conversation, the total pipeline latency should stay close to one second:

Component                 Target Latency
WebRTC transport          50-100ms
VAD + endpointing         200-300ms
STT transcription         200-300ms
LLM time-to-first-token   200-400ms
TTS time-to-first-audio   150-300ms
Total                     800-1400ms

Achieving the lower end of this range requires careful optimization at every stage, geographic co-location of services, and streaming throughout the pipeline rather than sequential processing.

Production Considerations

  • Fallback handling: When any pipeline component fails, the agent should gracefully communicate the issue rather than going silent
  • Session persistence: Maintain conversation state across WebRTC reconnections (mobile users switching between WiFi and cellular)
  • Recording and transcription: Log complete conversations for quality review, with appropriate privacy disclosures
  • Scalability: WebRTC media servers need horizontal scaling for concurrent sessions, typically 50-200 sessions per server
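
For the first point, the simplest robust pattern is a pre-rendered fallback clip played whenever any stage raises; here is a sketch with a hypothetical pipeline object wrapping STT, LLM, TTS, and audio output:

async def respond_or_apologize(pipeline, utterance_audio, fallback_clip):
    """Answer the caller, or play a canned apology if any stage fails."""
    try:
        transcript = await pipeline.transcribe(utterance_audio)   # STT
        reply = await pipeline.generate(transcript)               # LLM
        await pipeline.speak(reply)                               # TTS -> WebRTC
    except Exception:
        # Never go silent: acknowledge the failure and invite the caller to retry
        await pipeline.play_audio(fallback_clip)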

Sources: LiveKit Documentation | Deepgram Streaming API | Silero VAD
