Building Multi-Agent Voice Systems with the OpenAI Agents SDK

Why one agent is not enough

A single agent with fifty tools and a thousand-line system prompt will work — badly. It will hallucinate tool names, forget constraints, and generally underperform a smaller agent focused on one job. Multi-agent systems split the problem: a triage agent that identifies intent, specialist agents that handle each intent deeply, and handoffs that move the conversation between them without losing context.

This post walks through building a multi-agent voice system with the OpenAI Agents SDK, the same pattern CallSphere uses across its real estate, healthcare, and sales verticals.

caller → triage_agent
              │
              ├── buyer_intent ───► buyer_specialist
              ├── seller_intent ──► seller_specialist
              ├── rental_intent ──► rental_specialist
              └── tour_intent ────► tour_coordinator

Architecture overview

┌───────────────────────────────────────┐
│          Session state (shared)       │
│  • caller info                        │
│  • conversation history               │
│  • collected fields                   │
└──────────────┬────────────────────────┘
               │
               ▼
┌───────────────────────────────────────┐
│ Triage agent (thin, routing only)     │
└──────────────┬────────────────────────┘
               │ handoff
    ┌──────────┼──────────┐
    ▼          ▼          ▼
┌───────┐  ┌───────┐  ┌───────┐
│buyer  │  │seller │  │rental │
│agent  │  │agent  │  │agent  │
└───┬───┘  └───┬───┘  └───┬───┘
    │          │          │
    ▼          ▼          ▼
   tools      tools      tools

Prerequisites

Python 3.11+ and the openai-agents package.
An OpenAI API key with access to the Realtime models (the Agents SDK itself is an open-source package and needs no separate access).
Per-agent tool definitions.

Step-by-step walkthrough

1. Define the specialists and the triage agent

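from agents import Agent, Runner, handoff

# search_listings, book_tour, create_valuation_lead, search_rentals, and
# book_showing are function tools assumed to be defined elsewhere
# (for example with the SDK's @function_tool decorator).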

buyer_agent = Agent(
    name="Buyer Specialist",
    instructions="You help home buyers. Ask qualifying questions, check availability, and book tours.",
    tools=[search_listings, book_tour],
)

seller_agent = Agent(
    name="Seller Specialist",
    instructions="You help home sellers. Collect property details and schedule valuation calls.",
    tools=[create_valuation_lead],
)

rental_agent = Agent(
    name="Rental Specialist",
    instructions="You help rental inquiries. Collect preferences and schedule showings.",
    tools=[search_rentals, book_showing],
)

triage = Agent(
    name="Triage",
    instructions=(
        "Greet the caller and identify whether they are buying, selling, or renting. "
        "Hand off to the correct specialist as soon as you know."
    ),
    handoffs=[handoff(buyer_agent), handoff(seller_agent), handoff(rental_agent)],
)

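2. Share session state

from agents import RunContextWrapper

class SessionState:
    def __init__(self, call_id: str, caller_phone: str):
        self.call_id = call_id
        self.caller_phone = caller_phone
        self.collected = {}  # fields gathered so far, visible to every agent

# Tools receive this object wrapped as RunContextWrapper[SessionState]
# and read or update it through ctx.context.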

3. Run the loop

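async def run_call(call_id: str, caller_phone: str, user_turns: list[str]):
    state = SessionState(call_id, caller_phone)
    current_agent = triage  # every call starts at triage
    messages = []
    for user_text in user_turns:
        messages.append({"role": "user", "content": user_text})
        result = await Runner.run(current_agent, input=messages, context=state)
        # Carry the full history, including handoff items, into the next turn.
        messages = result.to_input_list()
        # Resume with whichever agent ended this turn so a handoff persists.
        current_agent = result.last_agent
    return state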

4. Handle handoffs cleanly

The SDK records every transfer between agents as an item in the run result. Use those items to log each handoff and keep the shared state consistent.

flowchart LR
    INPUT(["User input"])
    AGENT["Agent<br/>name plus instructions"]
    HAND{"Handoff to<br/>another agent?"}
    SUB["Sub-agent<br/>specialist"]
    GUARD{"Guardrail<br/>passed?"}
    TOOL["Tool call"]
    SDK[("Tracing<br/>OpenAI dashboard")]
    OUT(["Final output"])
    INPUT --> AGENT --> HAND
    HAND -->|Yes| SUB --> GUARD
    HAND -->|No| GUARD
    GUARD -->|Yes| TOOL --> AGENT
    GUARD -->|Block| OUT
    AGENT --> OUT
    AGENT --> SDK
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style SDK fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff

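from agents import HandoffOutputItem

async def observe(result):
    # result.new_items holds the items produced during this run; a completed
    # handoff appears as a HandoffOutputItem with both agents attached.
    for item in result.new_items:
        if isinstance(item, HandoffOutputItem):
            # The item carries no free-text reason, so log the agent names.
            # log_handoff is your own async logging helper.
            await log_handoff(item.source_agent.name, item.target_agent.name)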

5. Bridge to the Realtime API

Route the user's audio-derived transcripts into the Runner and pipe the final_output back to the TTS side of the Realtime session. Keep one agent-SDK context per call.
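One way to keep a single context per call is a small registry keyed by call ID. The sketch below is illustrative only: CallContext, CallRegistry, and their methods are hypothetical names, not part of the Agents SDK or the Realtime API, and the transcript and audio plumbing is left out.

```python
from dataclasses import dataclass, field


@dataclass
class CallContext:
    """Hypothetical per-call container: one instance per live call."""
    call_id: str
    caller_phone: str
    collected: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # conversation items so far


class CallRegistry:
    """Maps call IDs to contexts so each call reuses exactly one context."""

    def __init__(self) -> None:
        self._calls = {}

    def get_or_create(self, call_id: str, caller_phone: str) -> CallContext:
        # Reuse the existing context if this call is already known.
        if call_id not in self._calls:
            self._calls[call_id] = CallContext(call_id, caller_phone)
        return self._calls[call_id]

    def close(self, call_id: str) -> None:
        # Drop the context when the call ends so state never leaks across calls.
        self._calls.pop(call_id, None)
```

Each incoming transcript would look up its context with get_or_create before being passed to the Runner, and the call teardown handler would call close.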

6. Guardrails per agent

Each specialist gets its own constraints: the buyer agent cannot book valuations, the seller agent cannot search listings. This prevents the combined prompt bloat that kills single-agent systems.
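A simple way to check that separation is an explicit allowlist per agent. This is a minimal sketch, not an SDK feature: ALLOWED_TOOLS, is_tool_allowed, and shared_tools are hypothetical names, and the tool names are taken from the agent definitions in step 1.

```python
from collections import Counter

# Hypothetical allowlist: agent name -> names of the tools it may call.
ALLOWED_TOOLS = {
    "Buyer Specialist": {"search_listings", "book_tour"},
    "Seller Specialist": {"create_valuation_lead"},
    "Rental Specialist": {"search_rentals", "book_showing"},
}


def is_tool_allowed(agent_name: str, tool_name: str) -> bool:
    """Return True only if the tool is on this agent's allowlist."""
    return tool_name in ALLOWED_TOOLS.get(agent_name, set())


def shared_tools(allowlist: dict) -> set:
    """Return tools assigned to more than one agent (a role-separation smell)."""
    counts = Counter(tool for tools in allowlist.values() for tool in tools)
    return {tool for tool, n in counts.items() if n > 1}
```

Running shared_tools(ALLOWED_TOOLS) in a unit test is a cheap way to catch a tool that has crept onto a second specialist.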

Production considerations

State scope: shared session state is fine; shared mutable global state is not.
Handoff loops: add a max-handoff counter; the SDK's max_turns limit will eventually stop a runaway loop, but only after the wasted turns have been paid for.
Tool permissions: agents only see the tools they need.
Telemetry: record which agent handled each turn for post-call analytics.
Handoff summaries: the outgoing agent should summarize what it learned so the incoming agent does not re-ask.
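The handoff counter and the handoff summary can live in one small object stored on the session state. The following is a sketch under assumptions: HandoffTracker, HandoffLimitExceeded, and the default cap of 5 are hypothetical choices, not SDK types or recommended values.

```python
from dataclasses import dataclass, field
from typing import Optional


class HandoffLimitExceeded(RuntimeError):
    """Raised when a call exceeds its handoff budget (likely a loop)."""


@dataclass
class HandoffTracker:
    max_handoffs: int = 5  # assumed cap; tune per vertical
    hops: list = field(default_factory=list)  # (from_agent, to_agent, summary)

    def record(self, from_agent: str, to_agent: str, summary: str) -> None:
        # Refuse the handoff once the budget is spent so the caller can escalate.
        if len(self.hops) >= self.max_handoffs:
            raise HandoffLimitExceeded(
                f"{len(self.hops)} handoffs already recorded for this call"
            )
        self.hops.append((from_agent, to_agent, summary))

    def latest_summary(self) -> Optional[str]:
        # What the outgoing agent learned, for the incoming agent to read.
        return self.hops[-1][2] if self.hops else None
```

Catching HandoffLimitExceeded is a natural place to route the call to a human instead of letting the agents keep bouncing it.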

CallSphere's real implementation

CallSphere uses the OpenAI Agents SDK for every multi-agent vertical. Real estate runs 10 agents (triage, buyer, seller, rental, tour coordinator, qualification, finance, showing, negotiation, handoff-to-human). Healthcare combines 14 tools behind a lighter triage/specialist split. Salon runs 4 agents (receptionist, booking, upsell, recovery). After-hours escalation has 7 tools around an urgency-classifier triage. IT helpdesk pairs 10 tools with RAG behind a triage agent. The sales pod uses 5 GPT-4 specialists plus ElevenLabs TTS.

The voice plane under all of them is the OpenAI Realtime API with gpt-4o-realtime-preview-2025-06-03, PCM16 at 24kHz, and server VAD. Handoffs happen inside a single Realtime session so there is no audio drop between agents. A GPT-4o-mini post-call pipeline writes per-agent metrics so customers can see which specialist is closing and which is leaking. CallSphere supports 57+ languages with sub-second end-to-end latency.
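The audio format fixes the byte budget for every chunk you stream. The arithmetic below is a small sketch that assumes mono audio; the function names are illustrative, not part of any API.

```python
SAMPLE_RATE_HZ = 24_000   # 24kHz, as stated above
BYTES_PER_SAMPLE = 2      # PCM16 = 16 bits per sample
CHANNELS = 1              # assumption: mono audio


def pcm16_chunk_bytes(duration_ms: int) -> int:
    """Bytes needed to hold duration_ms of audio in this format."""
    samples = SAMPLE_RATE_HZ * duration_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def pcm16_duration_ms(num_bytes: int) -> float:
    """Playback time in milliseconds for num_bytes of audio."""
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
    return num_bytes * 1000 / bytes_per_second


print(pcm16_chunk_bytes(20))    # 960 bytes for a 20 ms chunk
print(pcm16_chunk_bytes(1000))  # 48000 bytes per second
```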

Common pitfalls

Too many agents: 3-10 is a sweet spot; 20 is usually over-decomposed.
Specialists that re-ask basics: use handoff summaries.
Shared tools across specialists: defeats the point of role separation.
Handoff loops: cap the count and escalate on loop.
Ignoring per-agent evals: regressions hide in aggregate metrics.
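To see why aggregate metrics hide regressions, break the success rate down by agent. This sketch uses a hypothetical record shape, (agent_name, succeeded), and the sample data in the usage note is made up for illustration.

```python
from collections import defaultdict


def success_rate_by_agent(turns):
    """turns: iterable of (agent_name, succeeded) pairs -> {agent: rate}."""
    totals = defaultdict(lambda: [0, 0])  # agent -> [successes, total]
    for agent, succeeded in turns:
        totals[agent][0] += int(succeeded)
        totals[agent][1] += 1
    return {agent: wins / n for agent, (wins, n) in totals.items()}


def overall_success_rate(turns):
    """Single aggregate rate across all agents."""
    turns = list(turns)
    return sum(int(ok) for _, ok in turns) / len(turns)
```

With nine successful buyer turns, one failed buyer turn, one successful rental turn, and one failed rental turn, the aggregate rate is 10/12 (about 83%), while the per-agent view shows the rental agent at 50%.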

FAQ

Can I use this without the Realtime API?

Yes. The Agents SDK is transport-agnostic; Realtime is just one front-end.

How do I A/B test a single agent in a multi-agent graph?

Version the agent separately and route X% of triage handoffs to the new version.
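One way to route a fixed percentage is to hash the call ID into a bucket, so the same call always lands on the same version. This is a sketch of that idea; pick_variant and the "candidate"/"control" labels are hypothetical, and the triage handoff would use the result to choose which agent version to target.

```python
import hashlib


def pick_variant(call_id: str, rollout_percent: int) -> str:
    """Deterministically assign a call to 'candidate' or 'control'."""
    digest = hashlib.sha256(call_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a stable bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 100
    return "candidate" if bucket < rollout_percent else "control"
```

Hashing rather than random sampling keeps a caller on one version for the whole call, which matters when the agent is re-selected on every turn.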

What is a reasonable number of tools per specialist?

3-10. Past 15 the model starts confusing tool signatures.

How do I handle human escalation?

Add a transfer_to_human tool on every specialist and a dedicated escalation agent.
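The body of such a tool can be very small. Below is a hedged sketch of the logic only: EscalationState and the return message are hypothetical, and in a real system the function would be registered as a tool on each specialist and would also trigger the actual call transfer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EscalationState:
    """Hypothetical slice of session state that tracks escalation."""
    call_id: str
    escalated: bool = False
    reason: Optional[str] = None


def transfer_to_human(state: EscalationState, reason: str) -> str:
    """Flag the call for a human and return what the agent should say."""
    if not state.escalated:
        # Keep the first reason so repeated requests do not overwrite it.
        state.escalated = True
        state.reason = reason
    return "Let me connect you with a member of our team."
```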

Does handoff cost extra tokens?

Yes, but less than the equivalent monolithic prompt.

Next steps

Want to see a 10-agent real-estate stack running live? Book a demo, read the technology page, or see pricing.

