
Introduction to Voice AI Agents with OpenAI: Architecture and Concepts

Understand the core architecture of voice AI agents — STT to Agent to TTS pipelines, the VoicePipeline SDK approach vs the Realtime API WebRTC approach, and when to use each for production voice applications.

Why Voice Agents Are the Next Frontier

Text-based AI agents have proven their value across customer service, sales, and internal operations. But humans communicate most naturally through speech. Voice AI agents bridge this gap by letting users speak naturally to an intelligent system that listens, reasons, and responds aloud — all in real time.

Building a voice agent is fundamentally different from building a chatbot. You need to handle continuous audio streams, manage turn-taking between human and machine, convert speech to text and text back to speech, and do all of this with latency low enough that the conversation feels natural. A delay of more than 500 milliseconds starts to feel awkward. Beyond one second, users lose confidence in the system.

This post covers the architecture of voice AI agents using OpenAI's tooling, the two primary approaches available today, and the decision framework for choosing between them.

The Core Pipeline: STT, Agent, TTS

Every voice agent follows the same fundamental pipeline, regardless of implementation:

flowchart LR
    CALLER(["Caller"])
    subgraph TEL["Telephony"]
        SIP["Twilio SIP and PSTN"]
    end
    subgraph BRAIN["Business AI Agent"]
        STT["Streaming STT<br/>Deepgram or Whisper"]
        NLU{"Intent and<br/>Entity Extraction"}
        TOOLS["Tool Calls"]
        TTS["Streaming TTS<br/>ElevenLabs or Rime"]
    end
    subgraph DATA["Live Data Plane"]
        CRM[("CRM and Notes")]
        CAL[("Calendar and<br/>Schedule")]
        KB[("Knowledge Base<br/>and Policies")]
    end
    subgraph OUT["Outcomes"]
        O1(["Booking captured"])
        O2(["CRM record created"])
        O3(["Human handoff"])
    end
    CALLER --> SIP --> STT --> NLU
    NLU -->|Lookup| TOOLS
    TOOLS <--> CRM
    TOOLS <--> CAL
    TOOLS <--> KB
    NLU --> TTS --> SIP --> CALLER
    NLU -->|Resolved| O1
    NLU -->|Schedule| O2
    NLU -->|Escalate| O3
    style CALLER fill:#f1f5f9,stroke:#64748b,color:#0f172a
    style NLU fill:#4f46e5,stroke:#4338ca,color:#fff
    style O1 fill:#059669,stroke:#047857,color:#fff
    style O2 fill:#0ea5e9,stroke:#0369a1,color:#fff
    style O3 fill:#f59e0b,stroke:#d97706,color:#1f2937

Speech-to-Text (STT) converts the user's spoken audio into text. The system captures raw audio from a microphone or audio stream, processes it through a speech recognition model, and produces a transcript the agent can understand.

Agent Processing takes the transcribed text, runs it through the LLM-powered agent (complete with tools, guardrails, and conversation history), and produces a text response. This is the same agent logic you would use in a text-based system.

Text-to-Speech (TTS) converts the agent's text response back into audio that gets played to the user. Modern TTS models produce natural-sounding speech with appropriate pacing and intonation.

User speaks
    |
    v
[Microphone / Audio Stream]
    |
    v
[STT Model] --- raw audio --> transcript text
    |
    v
[Agent + Tools + Guardrails] --- text in --> text out
    |
    v
[TTS Model] --- text --> audio bytes
    |
    v
[Speaker / Audio Output]

This pipeline is conceptually simple, but the engineering challenge lies in making each transition fast enough for real-time conversation. Every millisecond matters: the STT model needs to process audio chunks as they arrive, the agent needs to start generating before the full transcript has arrived, and the TTS model needs to stream audio back while the agent is still generating.
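The overlap between stages can be sketched with stubbed-out models. The `fake_stt`, `fake_agent`, and `fake_tts` generators below are stand-ins for real streaming APIs; the point is that each stage consumes partial input and emits partial output, so playback can begin after the first chunk rather than after the last:

```python
import asyncio

# Stand-in streaming stages: each yields partial results as soon as
# they are available instead of waiting for the full input.
async def fake_stt(audio_chunks):
    async for chunk in audio_chunks:
        yield f"word-{chunk}"          # partial transcript per audio chunk

async def fake_agent(words):
    async for word in words:
        yield word.upper()             # respond token-by-token, before STT finishes

async def fake_tts(tokens):
    async for token in tokens:
        yield token.encode()           # audio bytes stream out incrementally

async def main():
    async def mic():
        for i in range(3):
            await asyncio.sleep(0.01)  # simulate audio chunks arriving over time
            yield i

    played = []
    async for audio in fake_tts(fake_agent(fake_stt(mic()))):
        played.append(audio)           # playback starts after the FIRST chunk
    return played

print(asyncio.run(main()))             # [b'WORD-0', b'WORD-1', b'WORD-2']
```

Swapping the stubs for real streaming STT, agent, and TTS clients keeps the same shape; the chaining is what keeps perceived latency close to the first-chunk time instead of the sum of all stages.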

Approach 1: VoicePipeline (SDK-Based)

The OpenAI Agents SDK includes a VoicePipeline class that orchestrates the STT-Agent-TTS pipeline on your server. Audio comes in, gets transcribed locally, passes through your agent, and the response gets synthesized back to audio — all managed by your application code.

from agents.voice import VoicePipeline, SingleAgentVoiceWorkflow
from agents import Agent

agent = Agent(
    name="VoiceAssistant",
    instructions="You are a helpful voice assistant. Keep responses concise.",
)

workflow = SingleAgentVoiceWorkflow(agent)
pipeline = VoicePipeline(workflow=workflow)

How it works under the hood:

  1. Your application captures audio (from a microphone, WebSocket, phone line, etc.)
  2. The pipeline sends audio to the STT model (default: OpenAI Whisper) and receives text
  3. The text feeds into your agent as a normal user message
  4. The agent processes the message, calls tools if needed, and returns text
  5. The pipeline sends the response text to the TTS model and receives audio chunks
  6. Your application plays the audio chunks to the user
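A single turn through those six steps looks roughly like the following. This is a sketch, not a drop-in implementation: `run_voice_turn` is a hypothetical helper, and while `AudioInput` and the `voice_stream_event_audio` event name follow the Agents SDK voice documentation, they should be checked against your installed SDK version:

```python
async def run_voice_turn(pipeline, raw_audio_buffer):
    """One STT -> agent -> TTS turn through a VoicePipeline (sketch).

    Assumes the openai-agents SDK voice extras are installed; class and
    event names may differ between SDK versions.
    """
    from agents.voice import AudioInput  # deferred so the sketch imports cleanly

    audio_input = AudioInput(buffer=raw_audio_buffer)
    result = await pipeline.run(audio_input)

    # Stream synthesized audio chunks back to the caller as they arrive.
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            yield event.data  # raw PCM bytes ready for playback
```

Your transport layer (WebSocket handler, SIP bridge, etc.) would feed microphone audio into `raw_audio_buffer` and forward each yielded chunk to the speaker.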

Key characteristics:

  • Runs on your server — you control the infrastructure
  • Uses the standard Agents SDK agent model with full tool and handoff support
  • Each step is sequential: STT finishes, then the agent runs, then TTS generates
  • Latency is the sum of STT + agent processing + TTS generation
  • Works with any audio transport (WebSocket, SIP, HTTP upload)

Approach 2: Realtime API (WebRTC / WebSocket)

The OpenAI Realtime API takes a fundamentally different approach. Instead of breaking voice into three sequential steps, it uses a single multimodal model that processes audio directly and outputs audio directly — speech-to-speech with no intermediate text step.

User speaks
    |
    v
[Browser / Client]
    |
    v  (WebRTC audio track or WebSocket)
    |
    v
[OpenAI Realtime API — single model]
    |  processes audio natively
    |  generates audio natively
    v
[Browser / Client]
    |
    v
User hears response

With WebRTC, you establish a peer connection between the browser and OpenAI's servers. Audio flows over low-latency UDP, and the model responds in real time with generated speech. There is no separate STT or TTS step — the model handles the entire interaction natively.

Key characteristics:

  • Audio processing happens on OpenAI's infrastructure
  • Speech-to-speech: no intermediate text conversion (lower latency)
  • WebRTC provides the lowest possible latency (sub-300ms round trips)
  • Built-in voice activity detection (VAD) handles turn-taking
  • Supports function calling directly from audio input
  • Requires client-side JavaScript for WebRTC connections
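On the client, the session is configured by sending JSON events over the WebRTC data channel or WebSocket. The event name `session.update` and the `turn_detection`/`server_vad` fields follow the Realtime API reference, but the specific values below (voice, silence threshold) are illustrative choices to adapt:

```python
import json

# session.update configures voice, audio format, and turn detection
# for the Realtime API connection.
session_update = {
    "type": "session.update",
    "session": {
        "voice": "alloy",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "turn_detection": {               # server-side VAD handles turn-taking
            "type": "server_vad",
            "silence_duration_ms": 500,   # how long a pause ends the user's turn
        },
    },
}

payload = json.dumps(session_update)
# Send `payload` over the data channel or WebSocket after connecting.
print(json.loads(payload)["session"]["turn_detection"]["type"])  # server_vad
```

Because turn detection runs server-side, the client's job reduces to streaming microphone audio up and playing response audio down.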

When to Use Each Approach

The two approaches serve different use cases. Here is a decision framework:

Choose VoicePipeline when:

  • You need full control over each pipeline stage (custom STT models, custom TTS voices)
  • Your agent logic is complex with many tools, handoffs, and guardrails
  • You are integrating with telephony systems (SIP, Twilio, etc.)
  • You need to process audio server-side (recording, compliance, logging)
  • Latency requirements are moderate (1-3 seconds is acceptable)
  • You want to reuse existing text-based agents with minimal changes

Choose Realtime API when:

  • Ultra-low latency is critical (sub-500ms)
  • You are building browser-based voice experiences
  • The interaction is conversational and benefits from natural turn-taking
  • You want built-in VAD and interruption handling
  • Your agent logic is relatively straightforward
  • You want to minimize server-side infrastructure

Hybrid approaches are also viable. Some production systems use the Realtime API for the conversational interface but hand off to a VoicePipeline-based agent for complex tool execution. The Realtime API handles the fast back-and-forth, while the pipeline handles the heavy processing.

Latency Breakdown

Understanding where latency accumulates helps you make informed architecture decisions:

Component              VoicePipeline   Realtime API
Audio capture          50-100ms        50-100ms
STT processing         200-500ms       N/A (native)
Network to API         50-150ms        10-50ms (WebRTC)
Agent/LLM processing   300-1500ms      200-800ms
TTS generation         200-500ms       N/A (native)
Audio playback start   50-100ms        50-100ms
Total round trip       850-2850ms      310-1050ms

The Realtime API wins on raw latency because it eliminates two network round trips (STT and TTS are built into the model) and uses UDP-based WebRTC instead of TCP-based HTTP.
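The totals are just per-stage sums, which is easy to sanity-check (ranges taken directly from the table above, in milliseconds):

```python
# Per-stage latency ranges in milliseconds, (min, max), from the table above.
voice_pipeline = {
    "audio capture":  (50, 100),
    "stt":            (200, 500),
    "network":        (50, 150),
    "agent/llm":      (300, 1500),
    "tts":            (200, 500),
    "playback start": (50, 100),
}
realtime_api = {
    "audio capture":            (50, 100),
    "network (webrtc)":         (10, 50),
    "model (speech-to-speech)": (200, 800),
    "playback start":           (50, 100),
}

def total(stages):
    """Sum the best-case and worst-case latencies across all stages."""
    return (sum(lo for lo, _ in stages.values()),
            sum(hi for _, hi in stages.values()))

print(total(voice_pipeline))  # (850, 2850)
print(total(realtime_api))    # (310, 1050)
```

Note that these are sequential worst cases; the streaming techniques described earlier can hide much of the STT and TTS time in the VoicePipeline column.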

Audio Format Fundamentals

Both approaches work with PCM audio at specific sample rates. Understanding these formats is essential for proper integration:

import numpy as np

# Standard format for OpenAI voice APIs
SAMPLE_RATE = 24000    # 24 kHz
CHANNELS = 1           # Mono
DTYPE = np.int16       # 16-bit signed integers
CHUNK_DURATION = 0.1   # 100ms chunks

# Calculate buffer size
samples_per_chunk = int(SAMPLE_RATE * CHUNK_DURATION)  # 2400 samples
bytes_per_chunk = samples_per_chunk * 2  # 4800 bytes (16-bit = 2 bytes)

The 24 kHz sample rate at 16-bit mono is the standard across both VoicePipeline and the Realtime API. If your audio source uses a different format (8 kHz telephone audio, 44.1/48 kHz consumer audio, 16 kHz Whisper-native, etc.), you need to resample before feeding it into the pipeline.
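For an exact integer ratio like 48 kHz down to 24 kHz, naive 2:1 decimation illustrates the rate change. This is a sketch only: production code should low-pass filter before decimating to avoid aliasing (for example with `scipy.signal.resample_poly`):

```python
def downsample_2x(samples):
    """Naive 2:1 decimation: keep every other sample (48 kHz -> 24 kHz).

    Real resamplers apply an anti-aliasing low-pass filter first; this
    sketch only demonstrates the rate change itself.
    """
    return samples[::2]

audio_48k = list(range(10))        # pretend PCM samples captured at 48 kHz
audio_24k = downsample_2x(audio_48k)
print(audio_24k)                   # [0, 2, 4, 6, 8]
```

Non-integer ratios (44.1 kHz to 24 kHz) and upsampling (8 kHz telephony to 24 kHz) need interpolation, so a proper resampling library is the right tool there.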

What Comes Next

This post established the foundational concepts. The following posts in this series dive deep into each approach:

  • Building your first voice agent with VoicePipeline — capturing audio, running the pipeline, and playing responses
  • Configuring STT and TTS models — Whisper settings, voice selection, and streaming synthesis
  • Handling real-time audio streams with StreamedAudioInput and voice activity detection
  • Building browser-based voice agents with WebRTC and the Realtime API

Each post includes complete, runnable code that you can adapt for your own voice agent applications.
