Learn Agentic AI

Building a Unified AI Agent API: One API for Chat, Voice, and Task Agents

Design a single unified API that serves chat, voice, and task-based AI agents through a common interface. Learn channel abstraction, response normalization, and how to handle the unique requirements of each modality without code duplication.

The Problem with Separate Agent APIs

Many organizations start with one API for their chatbot, another for their voice agent, and yet another for task automation. Each API has its own authentication, session management, error handling, and data models. Within months, you are maintaining three codebases that do fundamentally the same thing — send user input to an AI agent and return a response — but with incompatible interfaces.

A unified API consolidates these into a single interface with channel-specific adapters. The core logic — agent routing, conversation management, tool execution — lives in one place. Channel-specific concerns like voice transcription or chat formatting are handled at the edges.

The Unified Request Model

Design a request model that accommodates all channels through a common structure with channel-specific extensions:

flowchart LR
    CLIENT(["Client SDK"])
    GW["API Gateway<br/>auth plus rate limit"]
    APP["FastAPI app<br/>handlers and DI"]
    VAL["Pydantic validation"]
    SVC["Service layer<br/>business logic"]
    DB[(Database)]
    QUEUE[(Background queue)]
    OBS[(Tracing)]
    CLIENT --> GW --> APP --> VAL --> SVC
    SVC --> DB
    SVC --> QUEUE
    SVC --> OBS
    SVC --> CLIENT
    style GW fill:#4f46e5,stroke:#4338ca,color:#fff
    style APP fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
from pydantic import BaseModel, Field
from typing import Any, Optional, Literal
from enum import Enum

class Channel(str, Enum):
    CHAT = "chat"
    VOICE = "voice"
    TASK = "task"
    EMAIL = "email"

class InputContent(BaseModel):
    text: Optional[str] = None
    audio_url: Optional[str] = None
    audio_base64: Optional[str] = None
    attachments: list[dict] = Field(default_factory=list)

class UnifiedRequest(BaseModel):
    channel: Channel
    session_id: str
    agent_id: str
    input: InputContent
    context: dict[str, Any] = Field(default_factory=dict)
    response_format: Literal["text", "ssml", "audio", "structured"] = "text"
    stream: bool = False

class ToolCallOutput(BaseModel):
    call_id: str
    tool_name: str
    arguments: dict[str, Any]

class UnifiedResponse(BaseModel):
    session_id: str
    agent_id: str
    channel: Channel
    text: Optional[str] = None
    ssml: Optional[str] = None
    audio_url: Optional[str] = None
    tool_calls: list[ToolCallOutput] = Field(default_factory=list)
    metadata: dict[str, Any] = Field(default_factory=dict)
    usage: dict[str, int] = Field(default_factory=dict)

A chat client sends {"channel": "chat", "input": {"text": "Hello"}}. A voice client sends {"channel": "voice", "input": {"audio_base64": "..."}}. A task agent sends {"channel": "task", "input": {"text": "Analyze this dataset"}}. The same endpoint handles all three.
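A quick way to sanity-check this contract is to validate those example payloads against a trimmed-down copy of the models. This is a minimal sketch — only the fields needed for the demo are re-declared, and the `"sms"` payload is a made-up negative case:

```python
from enum import Enum
from typing import Any, Optional

from pydantic import BaseModel, Field, ValidationError

class Channel(str, Enum):
    CHAT = "chat"
    VOICE = "voice"
    TASK = "task"

class InputContent(BaseModel):
    text: Optional[str] = None
    audio_base64: Optional[str] = None

class UnifiedRequest(BaseModel):
    channel: Channel
    session_id: str
    agent_id: str
    input: InputContent
    context: dict[str, Any] = Field(default_factory=dict)

# The example payloads from the text all validate against one model.
chat_req = UnifiedRequest.model_validate({
    "channel": "chat", "session_id": "s1", "agent_id": "support",
    "input": {"text": "Hello"},
})
voice_req = UnifiedRequest.model_validate({
    "channel": "voice", "session_id": "s2", "agent_id": "support",
    "input": {"audio_base64": "UklGRg=="},
})

# Unknown channels are rejected at the boundary, before any agent code runs.
try:
    UnifiedRequest.model_validate({
        "channel": "sms", "session_id": "s3", "agent_id": "support", "input": {},
    })
    rejected = False
except ValidationError:
    rejected = True

print(chat_req.channel, voice_req.channel, rejected)
```

Because `Channel` is a closed enum, adding a new channel is a one-line schema change that every client sees at validation time.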

Channel Adapters

Each channel has preprocessing and postprocessing needs. Adapters handle these transformations:

from abc import ABC, abstractmethod

class ChannelAdapter(ABC):
    @abstractmethod
    async def preprocess(self, request: UnifiedRequest) -> str:
        """Convert channel-specific input to plain text for the agent."""
        pass

    @abstractmethod
    async def postprocess(
        self, text: str, request: UnifiedRequest
    ) -> dict:
        """Convert agent text output to channel-specific format."""
        pass

class ChatAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        return request.input.text or ""

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        return {"text": text}

class VoiceAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        # transcribe_audio is an STT helper (e.g. a Whisper call) defined elsewhere
        if request.input.audio_base64:
            return await transcribe_audio(request.input.audio_base64)
        return request.input.text or ""

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        # text_to_ssml and synthesize_speech are TTS helpers defined elsewhere
        if request.response_format == "ssml":
            return {"ssml": text_to_ssml(text)}
        if request.response_format == "audio":
            audio_url = await synthesize_speech(text)
            return {"audio_url": audio_url, "text": text}
        return {"text": text}

class TaskAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        # Tasks may include structured instructions
        parts = [request.input.text or ""]
        for attachment in request.input.attachments:
            parts.append(f"[Attachment: {attachment.get('name', 'file')}]")
        return "\n".join(parts)

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        if request.response_format == "structured":
            return {"text": text, "metadata": {"structured": True}}
        return {"text": text}

ADAPTERS: dict[Channel, ChannelAdapter] = {
    Channel.CHAT: ChatAdapter(),
    Channel.VOICE: VoiceAdapter(),
    Channel.TASK: TaskAdapter(),
}

The Unified Endpoint

The main endpoint delegates to the appropriate adapter, runs the agent, and normalizes the response:

from fastapi import FastAPI, HTTPException

app = FastAPI(title="Unified Agent API")

@app.post("/v1/agent/invoke")
async def invoke_agent(request: UnifiedRequest) -> UnifiedResponse:
    adapter = ADAPTERS.get(request.channel)
    if adapter is None:
        # Channel.EMAIL exists in the enum but has no registered adapter yet
        raise HTTPException(status_code=400, detail=f"No adapter for channel: {request.channel.value}")

    # Preprocess: convert channel input to text
    user_text = await adapter.preprocess(request)

    # Load conversation history
    history = await get_session_messages(request.session_id)

    # Run the agent
    agent_result = await run_agent(
        agent_id=request.agent_id,
        user_message=user_text,
        history=history,
        context=request.context,
    )

    # Postprocess: convert text to channel-appropriate format
    output = await adapter.postprocess(agent_result["text"], request)

    # Save to session history
    await save_message(request.session_id, "user", user_text)
    await save_message(request.session_id, "assistant", agent_result["text"])

    return UnifiedResponse(
        session_id=request.session_id,
        agent_id=request.agent_id,
        channel=request.channel,
        tool_calls=[
            ToolCallOutput(**tc) for tc in agent_result.get("tool_calls", [])
        ],
        usage=agent_result.get("usage", {}),
        **output,
    )
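The `**output` unpack at the end works because every adapter returns only keys that exist on `UnifiedResponse`; fields an adapter did not set simply stay `None`. A trimmed-down sketch of that merge:

```python
from typing import Optional

from pydantic import BaseModel

class MiniResponse(BaseModel):
    session_id: str
    text: Optional[str] = None
    ssml: Optional[str] = None
    audio_url: Optional[str] = None

# Adapter outputs for two different channels...
chat_output = {"text": "Hi there"}
voice_output = {"audio_url": "https://example.com/reply.mp3", "text": "Hi there"}

# ...feed the same constructor; keys an adapter didn't set stay None.
r_chat = MiniResponse(session_id="s1", **chat_output)
r_voice = MiniResponse(session_id="s1", **voice_output)
print(r_chat.audio_url, r_voice.audio_url)
```

This is why the adapter contract matters: an adapter that invents a key not on the response model would raise at construction time, which is a useful failure mode — the mismatch surfaces in tests rather than as silently dropped data.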

Streaming Across Channels

Streaming works differently per channel. Chat needs Server-Sent Events. Voice needs audio chunks. Tasks may not need streaming at all:


from fastapi import HTTPException
from fastapi.responses import StreamingResponse
import json

@app.post("/v1/agent/stream")
async def stream_agent(request: UnifiedRequest):
    adapter = ADAPTERS.get(request.channel)
    if adapter is None:
        raise HTTPException(status_code=400, detail=f"No adapter for channel: {request.channel.value}")
    user_text = await adapter.preprocess(request)
    history = await get_session_messages(request.session_id)

    async def event_stream():
        full_text = ""
        async for chunk in stream_agent_response(
            agent_id=request.agent_id,
            user_message=user_text,
            history=history,
        ):
            full_text += chunk["text"]
            output = await adapter.postprocess(chunk["text"], request)
            event_data = json.dumps({
                "session_id": request.session_id,
                "chunk": output,
                "done": chunk.get("done", False),
            })
            yield f"data: {event_data}\n\n"

        await save_message(request.session_id, "user", user_text)
        await save_message(request.session_id, "assistant", full_text)

    return StreamingResponse(event_stream(), media_type="text/event-stream")
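On the client side, each SSE frame is a `data:` line followed by a blank line. A minimal stdlib parser for the frames this endpoint emits (the sample stream below is illustrative):

```python
import json
from typing import Any

def parse_sse_events(raw: str) -> list[dict[str, Any]]:
    """Split a decoded SSE stream into parsed JSON payloads (data-only frames)."""
    events = []
    for frame in raw.split("\n\n"):
        frame = frame.strip()
        if frame.startswith("data: "):
            events.append(json.loads(frame[len("data: "):]))
    return events

raw = (
    'data: {"chunk": {"text": "Hel"}, "done": false}\n\n'
    'data: {"chunk": {"text": "lo"}, "done": true}\n\n'
)
events = parse_sse_events(raw)
full_text = "".join(e["chunk"]["text"] for e in events)
print(full_text, events[-1]["done"])  # Hello True
```

A real client would read the response body incrementally rather than buffering it, but the frame format — and the `done` flag that marks the final chunk — is the same.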

FAQ

How do I handle channel-specific features like voice barge-in or chat typing indicators?

Add channel-specific metadata to the context field of the request and response. For voice barge-in, the client sends {"context": {"voice_barge_in": true}}. The voice adapter checks this flag and adjusts response behavior. Keep these features in the adapter layer, not in core agent logic.
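A sketch of keeping that flag in the adapter layer — the flag name follows the example above, but the `interruptible` synthesis option is illustrative, not from any specific TTS API:

```python
from typing import Any

def voice_synthesis_options(context: dict[str, Any]) -> dict[str, Any]:
    """Translate channel-specific context flags into synthesis options.
    This lives in the voice adapter; core agent logic never sees these flags."""
    options: dict[str, Any] = {"interruptible": False}
    if context.get("voice_barge_in"):
        options["interruptible"] = True
    return options

print(voice_synthesis_options({"voice_barge_in": True}))  # {'interruptible': True}
print(voice_synthesis_options({}))                        # {'interruptible': False}
```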

Should the unified API normalize all responses to text, or preserve rich formats?

Always generate text as the canonical format, then let adapters transform it. The agent produces text. The chat adapter returns it as-is. The voice adapter converts it to SSML or audio. The task adapter may parse it into structured JSON. This keeps agent logic channel-agnostic.
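For the task adapter's structured case, one defensive approach is to try parsing the canonical text as JSON and fall back to plain text. A sketch — in practice the agent should also be prompted to emit JSON whenever structured output is requested:

```python
import json
from typing import Any

def to_structured(text: str) -> dict[str, Any]:
    """Parse agent text into a structured payload if it is a JSON object,
    otherwise wrap it as plain text so the caller always gets a dict."""
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return {"data": parsed, "structured": True}
    except json.JSONDecodeError:
        pass
    return {"text": text, "structured": False}

print(to_structured('{"rows": 42}'))  # {'data': {'rows': 42}, 'structured': True}
print(to_structured("plain answer"))  # {'text': 'plain answer', 'structured': False}
```

The fallback matters: a model that occasionally returns prose instead of JSON degrades gracefully rather than crashing the task pipeline.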

How do I route to different agent implementations based on channel?

Add routing logic in the endpoint that selects the agent based on both agent_id and channel. A customer service agent might use a faster model for chat and a more capable model for complex task requests. Store this mapping in configuration rather than code.
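A minimal sketch of configuration-driven routing; the agent and model names here are purely illustrative:

```python
# (agent_id, channel) -> model; in practice loaded from config or a database.
AGENT_ROUTING: dict[tuple[str, str], str] = {
    ("customer-service", "chat"): "fast-model",
    ("customer-service", "task"): "capable-model",
}

def resolve_model(agent_id: str, channel: str, default: str = "fast-model") -> str:
    """Pick a model for the (agent, channel) pair, falling back to a default."""
    return AGENT_ROUTING.get((agent_id, channel), default)

print(resolve_model("customer-service", "task"))   # capable-model
print(resolve_model("customer-service", "voice"))  # fast-model
```

Keeping this mapping in data rather than code means a new channel, or a model swap for one channel, is a config change that needs no redeploy of the endpoint logic.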


#UnifiedAPI #AIAgents #APIDesign #FastAPI #MultiChannel #AgenticAI #LearnAI #AIEngineering
