
Voice AI Architecture: Understanding the STT-LLM-TTS Pipeline

Learn the three-stage pipeline that powers every voice AI agent — speech-to-text, language model reasoning, and text-to-speech — including latency budgets, streaming strategies, and practical implementation patterns.

The Three Stages of a Voice AI Agent

Every voice AI agent — whether it is a customer service bot, a voice assistant, or a conversational IVR — follows the same fundamental pipeline. Audio comes in from a microphone, gets converted to text, passes through a language model for reasoning, and the response gets converted back to speech. This is the STT-LLM-TTS pipeline, and understanding each stage is essential for building responsive voice agents.

The pipeline looks deceptively simple, but each stage introduces latency, and the cumulative delay determines whether your agent feels natural or robotic.

Stage 1: Speech-to-Text (STT)

The STT stage converts raw audio into text that the language model can process. Modern STT engines use transformer-based models trained on hundreds of thousands of hours of multilingual speech data. The diagram below shows where streaming STT sits in a complete call flow, from telephony through intent handling and tool calls to live data sources, and the code that follows it is a minimal streaming setup using the Deepgram Python SDK.

flowchart LR
    CALLER(["Caller"])
    subgraph TEL["Telephony"]
        SIP["Twilio SIP and PSTN"]
    end
    subgraph BRAIN["Business AI Agent"]
        STT["Streaming STT<br/>Deepgram or Whisper"]
        NLU{"Intent and<br/>Entity Extraction"}
        TOOLS["Tool Calls"]
        TTS["Streaming TTS<br/>ElevenLabs or Rime"]
    end
    subgraph DATA["Live Data Plane"]
        CRM[("CRM and Notes")]
        CAL[("Calendar and<br/>Schedule")]
        KB[("Knowledge Base<br/>and Policies")]
    end
    subgraph OUT["Outcomes"]
        O1(["Booking captured"])
        O2(["CRM record created"])
        O3(["Human handoff"])
    end
    CALLER --> SIP --> STT --> NLU
    NLU -->|Lookup| TOOLS
    TOOLS <--> CRM
    TOOLS <--> CAL
    TOOLS <--> KB
    NLU --> TTS --> SIP --> CALLER
    NLU -->|Resolved| O1
    NLU -->|Schedule| O2
    NLU -->|Escalate| O3
    style CALLER fill:#f1f5f9,stroke:#64748b,color:#0f172a
    style NLU fill:#4f46e5,stroke:#4338ca,color:#fff
    style O1 fill:#059669,stroke:#047857,color:#fff
    style O2 fill:#0ea5e9,stroke:#0369a1,color:#fff
    style O3 fill:#f59e0b,stroke:#d97706,color:#1f2937

import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents, LiveOptions

class STTProcessor:
    def __init__(self, api_key: str):
        self.client = DeepgramClient(api_key)
        self.transcript_buffer = []

    async def start_streaming(self, on_transcript):
        connection = self.client.listen.asynclive.v("1")

        # The Deepgram SDK passes the live connection as the first argument
        # to event handlers, which is why this inner callback takes "self".
        async def on_message(self, result, **kwargs):
            transcript = result.channel.alternatives[0].transcript
            if transcript:
                on_transcript(transcript, result.is_final)

        connection.on(LiveTranscriptionEvents.Transcript, on_message)

        options = LiveOptions(
            model="nova-2",
            language="en",
            encoding="linear16",
            sample_rate=16000,
            interim_results=True,   # Get partial results for faster feedback
            endpointing=300,        # Silence threshold in ms
            vad_events=True,        # Voice activity detection
        )

        await connection.start(options)
        return connection

Key STT considerations include model accuracy (measured by Word Error Rate), streaming versus batch mode, and endpointing — detecting when the user has finished speaking. Streaming STT returns interim results as the user speaks, which enables the pipeline to start LLM processing before the user finishes their sentence.
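
How you consume those interim and final results is up to the application. The sketch below is one illustrative option, not part of the Deepgram SDK: finalized segments go onto a queue for the LLM stage, while the latest interim text is kept around for live captions or pre-warming context.

import asyncio

class TranscriptRouter:
    """Routes interim vs. final STT results (illustrative helper)."""

    def __init__(self):
        self.final_segments: asyncio.Queue = asyncio.Queue()
        self.latest_interim = ""

    def on_transcript(self, text: str, is_final: bool):
        if is_final:
            # Stable text: safe to hand to the LLM stage.
            self.final_segments.put_nowait(text)
            self.latest_interim = ""
        else:
            # Interim text: useful for live captions or pre-warming,
            # but expect it to be revised.
            self.latest_interim = text

Passing router.on_transcript as the callback to start_streaming above wires this into the STT stage.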

Stage 2: Language Model (LLM)

Once text is available, it is sent to a language model for reasoning. The LLM maintains conversation context, interprets intent, calls tools if needed, and generates a response.

import openai

class LLMProcessor:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = openai.AsyncOpenAI()
        self.model = model
        self.messages = []

    async def process_streaming(self, user_text: str):
        self.messages.append({"role": "user", "content": user_text})

        stream = await self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            stream=True,
            max_tokens=200,       # Keep responses concise for voice
            temperature=0.7,
        )

        full_response = []
        async for chunk in stream:
            delta = chunk.choices[0].delta.content
            if delta:
                full_response.append(delta)
                yield delta  # Stream tokens to TTS immediately

        self.messages.append({
            "role": "assistant",
            "content": "".join(full_response),
        })

For voice agents, the LLM should generate short, conversational responses. Long paragraphs that work in chat feel unnatural when spoken aloud. System prompts should instruct the model to keep answers under two or three sentences.
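
One simple way to enforce this with the LLMProcessor above is to seed its message history with a system prompt before the first user turn; the exact wording below is only an example:

llm = LLMProcessor()
llm.messages.append({
    "role": "system",
    "content": (
        "You are a voice assistant on a phone call. Answer in at most two or "
        "three short, conversational sentences. Do not read out lists, URLs, "
        "or markdown formatting."
    ),
})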

Stage 3: Text-to-Speech (TTS)

The final stage converts the LLM response into audio. Modern TTS engines produce remarkably natural speech with appropriate prosody, emotion, and pacing.

import httpx

class TTSProcessor:
    def __init__(self, api_key: str, voice_id: str):
        self.api_key = api_key
        self.voice_id = voice_id
        self.base_url = "https://api.elevenlabs.io/v1"

    async def synthesize_streaming(self, text_chunks):
        """Stream TTS as text tokens arrive from LLM."""
        buffer = ""
        async for chunk in text_chunks:
            buffer += chunk
            # Send to TTS at sentence boundaries for natural prosody.
            # Flushing on a bare comma keeps latency low but can yield very
            # short fragments; in practice a minimum buffer length helps.
            if any(buffer.endswith(p) for p in [".", "!", "?", ","]):
                audio = await self._synthesize(buffer.strip())
                yield audio
                buffer = ""
        if buffer.strip():
            yield await self._synthesize(buffer.strip())

    async def _synthesize(self, text: str) -> bytes:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{self.base_url}/text-to-speech/{self.voice_id}/stream",
                headers={"xi-api-key": self.api_key},
                json={"text": text, "model_id": "eleven_turbo_v2_5"},
            )
            response.raise_for_status()
            return response.content

Latency Budget Breakdown

A responsive voice agent needs end-to-end latency under 800ms. Here is a typical budget:

  • STT endpointing: 200-400ms (silence detection after user stops)
  • STT final transcription: 100-300ms
  • LLM first token: 200-500ms
  • TTS first audio byte: 100-300ms
  • Network overhead: 50-100ms

Summed naively, those stages add up to roughly 650-1,600 ms, so a strictly sequential pipeline only meets the 800 ms target when every stage lands at the fast end of its range. The key optimization is streaming at every stage. Instead of waiting for each stage to complete, you stream partial results to the next stage: interim STT results can warm up the LLM context, and streaming LLM tokens feed directly into streaming TTS. This overlapping approach can cut perceived latency by 40-60%.
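
To see where your own deployment actually spends this budget, it helps to timestamp each stage boundary. Below is a minimal sketch using time.perf_counter; the stage names are placeholders you would mark at the corresponding points in your pipeline:

import time

class StageTimer:
    """Records per-stage latency in milliseconds (illustrative helper)."""

    def __init__(self):
        self.last = time.perf_counter()
        self.stages = {}

    def mark(self, stage: str):
        now = time.perf_counter()
        self.stages[stage] = (now - self.last) * 1000
        self.last = now

# e.g. timer.mark("stt_final"), timer.mark("llm_first_token"),
#      timer.mark("tts_first_audio")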

Putting It All Together

class VoiceAgentPipeline:
    def __init__(self, stt, llm, tts):
        self.stt = stt
        self.llm = llm
        self.tts = tts

    async def handle_audio(self, audio_stream):
        # STT processes audio and emits a transcript. For clarity this
        # assumes a one-shot transcribe() per utterance; with the streaming
        # STTProcessor above you would consume finalized transcripts from
        # its callback instead.
        transcript = await self.stt.transcribe(audio_stream)

        # LLM streams response tokens
        token_stream = self.llm.process_streaming(transcript)

        # TTS converts tokens to audio as they arrive
        async for audio_chunk in self.tts.synthesize_streaming(token_stream):
            yield audio_chunk  # Send to client immediately
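
Wiring the three processors together might look like the following; the environment variable names are placeholders for your own configuration:

import os

stt = STTProcessor(api_key=os.environ["DEEPGRAM_API_KEY"])
llm = LLMProcessor(model="gpt-4o-mini")
tts = TTSProcessor(
    api_key=os.environ["ELEVENLABS_API_KEY"],
    voice_id=os.environ["ELEVENLABS_VOICE_ID"],
)
pipeline = VoiceAgentPipeline(stt, llm, tts)

# In a real service, handle_audio() is driven by your telephony or WebSocket
# media stream, and each yielded audio chunk is written straight back to it.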

FAQ

What is the biggest bottleneck in the voice AI pipeline?

The LLM stage typically contributes the most latency, especially the time to first token (TTFT). Using smaller models like GPT-4o-mini, or deploying local models with vLLM, can significantly reduce this bottleneck. Streaming the LLM output so TTS can start before the full response is generated is the single most impactful optimization.
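
To compare models on this metric, a rough way to measure TTFT is to timestamp the first streamed content token; here is a sketch using the OpenAI async client, with the prompt and token limit chosen arbitrarily:

import time
import openai

async def measure_ttft(prompt: str, model: str = "gpt-4o-mini") -> float:
    """Return milliseconds until the first content token arrives."""
    client = openai.AsyncOpenAI()
    start = time.perf_counter()
    stream = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=50,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return (time.perf_counter() - start) * 1000
    return (time.perf_counter() - start) * 1000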

Can I run the entire pipeline locally without cloud APIs?

Yes. You can use Whisper for STT, a local LLM via Ollama or vLLM, and Piper or Coqui TTS for speech synthesis. Local pipelines eliminate network latency entirely but require a GPU-equipped machine for acceptable performance. A machine with an NVIDIA RTX 4090 can run the full pipeline with sub-500ms latency.
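
Because Ollama exposes an OpenAI-compatible endpoint, the LLMProcessor above can be pointed at a local model simply by swapping the client; the model name below is just an example of a model you might have pulled locally:

import openai

class LocalLLMProcessor(LLMProcessor):
    def __init__(self, model: str = "llama3.1"):
        super().__init__(model=model)
        # Ollama serves an OpenAI-compatible API on localhost:11434.
        # The API key is ignored by Ollama but required by the client.
        self.client = openai.AsyncOpenAI(
            base_url="http://localhost:11434/v1",
            api_key="ollama",
        )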

How does the pipeline handle overlapping speech or interruptions?

This is called barge-in handling. The STT stage uses Voice Activity Detection (VAD) to detect when the user starts speaking during agent output. When barge-in is detected, the pipeline cancels the current TTS playback, processes the new user input, and generates a fresh response.
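
In an asyncio-based pipeline, barge-in usually maps onto task cancellation: agent playback runs as a task that the VAD event handler cancels. The sketch below illustrates the pattern; the playback coroutine and the VAD hook are assumptions, not part of any particular SDK.

import asyncio

class BargeInController:
    """Cancels in-flight agent speech when the caller starts talking."""

    def __init__(self):
        self.playback_task = None

    def start_playback(self, playback_coro):
        # Run the agent's TTS playback as a cancellable task.
        self.playback_task = asyncio.create_task(playback_coro)

    def on_user_speech_started(self):
        # Called from the VAD event handler when barge-in is detected.
        if self.playback_task and not self.playback_task.done():
            self.playback_task.cancel()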


#VoiceAI #STT #TTS #LLMPipeline #SpeechRecognition #RealTimeAI #AgenticAI #LearnAI #AIEngineering
