
Voice AI Architecture: Understanding the STT-LLM-TTS Pipeline

Learn the three-stage pipeline that powers every voice AI agent — speech-to-text, language model reasoning, and text-to-speech — including latency budgets, streaming strategies, and practical implementation patterns.

The Three Stages of a Voice AI Agent

Every voice AI agent — whether it is a customer service bot, a voice assistant, or a conversational IVR — follows the same fundamental pipeline. Audio comes in from a microphone, gets converted to text, passes through a language model for reasoning, and the response gets converted back to speech. This is the STT-LLM-TTS pipeline, and understanding each stage is essential for building responsive voice agents.

The pipeline looks deceptively simple, but each stage introduces latency, and the cumulative delay determines whether your agent feels natural or robotic.

Stage 1: Speech-to-Text (STT)

The STT stage converts raw audio into text that the language model can process. Modern STT engines use transformer-based models trained on hundreds of thousands of hours of multilingual speech data. The diagram below shows where streaming STT sits in a complete call flow, from telephony through intent handling and tool calls to live data sources, and the code that follows it is a minimal streaming setup using the Deepgram Python SDK.

flowchart LR
    CALLER(["Caller"])
    subgraph TEL["Telephony"]
        SIP["Twilio SIP and PSTN"]
    end
    subgraph BRAIN["Business AI Agent"]
        STT["Streaming STT<br/>Deepgram or Whisper"]
        NLU{"Intent and<br/>Entity Extraction"}
        TOOLS["Tool Calls"]
        TTS["Streaming TTS<br/>ElevenLabs or Rime"]
    end
    subgraph DATA["Live Data Plane"]
        CRM[("CRM and Notes")]
        CAL[("Calendar and<br/>Schedule")]
        KB[("Knowledge Base<br/>and Policies")]
    end
    subgraph OUT["Outcomes"]
        O1(["Booking captured"])
        O2(["CRM record created"])
        O3(["Human handoff"])
    end
    CALLER --> SIP --> STT --> NLU
    NLU -->|Lookup| TOOLS
    TOOLS <--> CRM
    TOOLS <--> CAL
    TOOLS <--> KB
    NLU --> TTS --> SIP --> CALLER
    NLU -->|Resolved| O1
    NLU -->|Schedule| O2
    NLU -->|Escalate| O3
    style CALLER fill:#f1f5f9,stroke:#64748b,color:#0f172a
    style NLU fill:#4f46e5,stroke:#4338ca,color:#fff
    style O1 fill:#059669,stroke:#047857,color:#fff
    style O2 fill:#0ea5e9,stroke:#0369a1,color:#fff
    style O3 fill:#f59e0b,stroke:#d97706,color:#1f2937

import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

class STTProcessor:
    def __init__(self, api_key: str):
        self.client = DeepgramClient(api_key)
        self.transcript_buffer = []

    async def start_streaming(self, on_transcript):
        connection = self.client.listen.asynclive.v("1")

        # The Deepgram SDK passes the live connection as the first argument
        # to event handlers, which is why this inner callback takes "self".
        async def on_message(self, result, **kwargs):
            transcript = result.channel.alternatives[0].transcript
            if transcript:
                on_transcript(transcript, result.is_final)

        connection.on(LiveTranscriptionEvents.Transcript, on_message)

        options = LiveOptions(
            model="nova-2",
            language="en",
            encoding="linear16",
            sample_rate=16000,
            interim_results=True,   # Get partial results for faster feedback
            endpointing=300,        # Silence threshold in ms
            vad_events=True,        # Voice activity detection
        )

        await connection.start(options)
        return connection

Key STT considerations include model accuracy (measured by Word Error Rate), streaming versus batch mode, and endpointing — detecting when the user has finished speaking. Streaming STT returns interim results as the user speaks, which enables the pipeline to start LLM processing before the user finishes their sentence.
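
How you consume those interim and final results is up to the application. The sketch below is one illustrative option, not part of the Deepgram SDK: finalized segments go onto a queue for the LLM stage, while the latest interim text is kept around for live captions or pre-warming context.

import asyncio

class TranscriptRouter:
    """Routes interim vs. final STT results (illustrative helper)."""

    def __init__(self):
        self.final_segments: asyncio.Queue = asyncio.Queue()
        self.latest_interim = ""

    def on_transcript(self, text: str, is_final: bool):
        if is_final:
            # Stable text: safe to hand to the LLM stage.
            self.final_segments.put_nowait(text)
            self.latest_interim = ""
        else:
            # Interim text: useful for live captions or pre-warming,
            # but expect it to be revised.
            self.latest_interim = text

Passing router.on_transcript as the callback to start_streaming above wires this into the STT stage.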

Stage 2: Language Model (LLM)

Once text is available, it is sent to a language model for reasoning. The LLM maintains conversation context, interprets intent, calls tools if needed, and generates a response.

import openai

class LLMProcessor:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = openai.AsyncOpenAI()
        self.model = model
        self.messages = []

    async def process_streaming(self, user_text: str):
        self.messages.append({"role": "user", "content": user_text})

        stream = await self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            stream=True,
            max_tokens=200,       # Keep responses concise for voice
            temperature=0.7,
        )

        full_response = []
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                full_response.append(delta)
                yield delta  # Stream tokens to TTS immediately

        self.messages.append({
            "role": "assistant",
            "content": "".join(full_response),
        })

For voice agents, the LLM should generate short, conversational responses. Long paragraphs that work in chat feel unnatural when spoken aloud. System prompts should instruct the model to keep answers under two or three sentences.
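
One simple way to enforce this with the LLMProcessor above is to seed its message history with a system prompt before the first user turn; the exact wording below is only an example:

llm = LLMProcessor()
llm.messages.append({
    "role": "system",
    "content": (
        "You are a voice assistant on a phone call. Answer in at most two or "
        "three short, conversational sentences. Do not read out lists, URLs, "
        "or markdown formatting."
    ),
})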

Stage 3: Text-to-Speech (TTS)

The final stage converts the LLM response into audio. Modern TTS engines produce remarkably natural speech with appropriate prosody, emotion, and pacing.

import httpx

class TTSProcessor:
    def __init__(self, api_key: str, voice_id: str):
        self.api_key = api_key
        self.voice_id = voice_id
        self.base_url = "https://api.elevenlabs.io/v1"

    async def synthesize_streaming(self, text_chunks):
        """Stream TTS as text tokens arrive from LLM."""
        buffer = ""
        async for chunk in text_chunks:
            buffer += chunk
            # Send to TTS at sentence boundaries for natural prosody.
            # Flushing on a bare comma keeps latency low but can yield very
            # short fragments; in practice a minimum buffer length helps.
            if any(buffer.endswith(p) for p in [".", "!", "?", ","]):
                audio = await self._synthesize(buffer.strip())
                yield audio
                buffer = ""
        if buffer.strip():
            yield await self._synthesize(buffer.strip())

    async def _synthesize(self, text: str) -> bytes:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/text-to-speech/{self.voice_id}/stream",
                headers={"xi-api-key": self.api_key},
                json={"text": text, "model_id": "eleven_turbo_v2_5"},
            )
            response.raise_for_status()
            return response.content

Latency Budget Breakdown

A responsive voice agent needs end-to-end latency under 800ms. Here is a typical budget:

  • STT endpointing: 200-400ms (silence detection after user stops)
  • STT final transcription: 100-300ms
  • LLM first token: 200-500ms
  • TTS first audio byte: 100-300ms
  • Network overhead: 50-100ms

Summed naively, those stages add up to roughly 650-1,600 ms, so a strictly sequential pipeline only meets the 800 ms target when every stage lands at the fast end of its range. The key optimization is streaming at every stage. Instead of waiting for each stage to complete, you stream partial results to the next stage: interim STT results can warm up the LLM context, and streaming LLM tokens feed directly into streaming TTS. This overlapping approach can cut perceived latency by 40-60%.
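
To see where your own deployment actually spends this budget, it helps to timestamp each stage boundary. Below is a minimal sketch using time.perf_counter; the stage names are placeholders you would mark at the corresponding points in your pipeline:

import time

class StageTimer:
    """Records per-stage latency in milliseconds (illustrative helper)."""

    def __init__(self):
        self.last = time.perf_counter()
        self.stages = {}

    def mark(self, stage: str):
        now = time.perf_counter()
        self.stages[stage] = (now - self.last) * 1000
        self.last = now

# e.g. timer.mark("stt_final"), timer.mark("llm_first_token"),
#      timer.mark("tts_first_audio")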

Putting It All Together

class VoiceAgentPipeline:
    def __init__(self, stt, llm, tts):
        self.stt = stt
        self.llm = llm
        self.tts = tts

    async def handle_audio(self, audio_stream):
        # STT processes audio and emits a transcript. For clarity this
        # assumes a one-shot transcribe() per utterance; with the streaming
        # STTProcessor above you would consume finalized transcripts from
        # its callback instead.
        transcript = await self.stt.transcribe(audio_stream)

        # LLM streams response tokens
        token_stream = self.llm.process_streaming(transcript)

        # TTS converts tokens to audio as they arrive
        async for audio_chunk in self.tts.synthesize_streaming(token_stream):
            yield audio_chunk  # Send to client immediately
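
Wiring the three processors together might look like the following; the environment variable names are placeholders for your own configuration:

import os

stt = STTProcessor(api_key=os.environ["DEEPGRAM_API_KEY"])
llm = LLMProcessor(model="gpt-4o-mini")
tts = TTSProcessor(
    api_key=os.environ["ELEVENLABS_API_KEY"],
    voice_id=os.environ["ELEVENLABS_VOICE_ID"],
)
pipeline = VoiceAgentPipeline(stt, llm, tts)

# In a real service, handle_audio() is driven by your telephony or WebSocket
# media stream, and each yielded audio chunk is written straight back to it.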

FAQ

What is the biggest bottleneck in the voice AI pipeline?

The LLM stage typically contributes the most latency, especially the time to first token (TTFT). Using smaller models like GPT-4o-mini, or deploying local models with vLLM, can significantly reduce this bottleneck. Streaming the LLM output so TTS can start before the full response is generated is the single most impactful optimization.
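
To compare models on this metric, a rough way to measure TTFT is to timestamp the first streamed content token; here is a sketch using the OpenAI async client, with the prompt and token limit chosen arbitrarily:

import time
import openai

async def measure_ttft(prompt: str, model: str = "gpt-4o-mini") -> float:
    """Return milliseconds until the first content token arrives."""
    client = openai.AsyncOpenAI()
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=50,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return (time.perf_counter() - start) * 1000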

Can I run the entire pipeline locally without cloud APIs?

Yes. You can use Whisper for STT, a local LLM via Ollama or vLLM, and Piper or Coqui TTS for speech synthesis. Local pipelines eliminate network latency entirely but require a GPU-equipped machine for acceptable performance. A machine with an NVIDIA RTX 4090 can run the full pipeline with sub-500ms latency.
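
Because Ollama exposes an OpenAI-compatible endpoint, the LLMProcessor above can be pointed at a local model simply by swapping the client; the model name below is just an example of a model you might have pulled locally:

import openai

class LocalLLMProcessor(LLMProcessor):
    def __init__(self, model: str = "llama3.1"):
        super().__init__(model=model)
        # Ollama serves an OpenAI-compatible API on localhost:11434.
        # The API key is ignored by Ollama but required by the client.
        self.client = openai.AsyncOpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
        )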

How does the pipeline handle overlapping speech or interruptions?

This is called barge-in handling. The STT stage uses Voice Activity Detection (VAD) to detect when the user starts speaking during agent output. When barge-in is detected, the pipeline cancels the current TTS playback, processes the new user input, and generates a fresh response.
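
In an asyncio-based pipeline, barge-in usually maps onto task cancellation: agent playback runs as a task that the VAD event handler cancels. The sketch below illustrates the pattern; the playback coroutine and the VAD hook are assumptions, not part of any particular SDK.

import asyncio

class BargeInController:
    """Cancels in-flight agent speech when the caller starts talking."""

    def __init__(self):
        self.playback_task = None

    def start_playback(self, playback_coro):
        # Run the agent's TTS playback as a cancellable task.
        self.playback_task = asyncio.create_task(playback_coro)

    def on_user_speech_started(self):
        # Called from the VAD event handler when barge-in is detected.
        if self.playback_task and not self.playback_task.done():
            self.playback_task.cancel()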


#VoiceAI #STT #TTS #LLMPipeline #SpeechRecognition #RealTimeAI #AgenticAI #LearnAI #AIEngineering
