Learn Agentic AI

Building a Unified AI Agent API: One API for Chat, Voice, and Task Agents

Design a single unified API that serves chat, voice, and task-based AI agents through a common interface. Learn channel abstraction, response normalization, and how to handle the unique requirements of each modality without code duplication.

The Problem with Separate Agent APIs

Many organizations start with one API for their chatbot, another for their voice agent, and yet another for task automation. Each API has its own authentication, session management, error handling, and data models. Within months, you are maintaining three codebases that do fundamentally the same thing — send user input to an AI agent and return a response — but with incompatible interfaces.

A unified API consolidates these into a single interface with channel-specific adapters. The core logic — agent routing, conversation management, tool execution — lives in one place. Channel-specific concerns like voice transcription or chat formatting are handled at the edges.

The Unified Request Model

Design a request model that accommodates all channels through a common structure with channel-specific extensions:

flowchart LR
    CLIENT(["Client SDK"])
    GW["API Gateway<br/>auth plus rate limit"]
    APP["FastAPI app<br/>handlers and DI"]
    VAL["Pydantic validation"]
    SVC["Service layer<br/>business logic"]
    DB[(Database)]
    QUEUE[(Background queue)]
    OBS[(Tracing)]
    CLIENT --> GW --> APP --> VAL --> SVC
    SVC --> DB
    SVC --> QUEUE
    SVC --> OBS
    SVC --> CLIENT
    style GW fill:#4f46e5,stroke:#4338ca,color:#fff
    style APP fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
from pydantic import BaseModel, Field
from typing import Any, Optional, Literal
from enum import Enum

class Channel(str, Enum):
    CHAT = "chat"
    VOICE = "voice"
    TASK = "task"
    EMAIL = "email"

class InputContent(BaseModel):
    text: Optional[str] = None
    audio_url: Optional[str] = None
    audio_base64: Optional[str] = None
    attachments: list[dict] = Field(default_factory=list)

class UnifiedRequest(BaseModel):
    channel: Channel
    session_id: str
    agent_id: str
    input: InputContent
    context: dict[str, Any] = Field(default_factory=dict)
    response_format: Literal["text", "ssml", "audio", "structured"] = "text"
    stream: bool = False

class ToolCallOutput(BaseModel):
    call_id: str
    tool_name: str
    arguments: dict[str, Any]

class UnifiedResponse(BaseModel):
    session_id: str
    agent_id: str
    channel: Channel
    text: Optional[str] = None
    ssml: Optional[str] = None
    audio_url: Optional[str] = None
    tool_calls: list[ToolCallOutput] = Field(default_factory=list)
    metadata: dict[str, Any] = Field(default_factory=dict)
    usage: dict[str, int] = Field(default_factory=dict)

A chat client sends {"channel": "chat", "input": {"text": "Hello"}}. A voice client sends {"channel": "voice", "input": {"audio_base64": "..."}}. A task agent sends {"channel": "task", "input": {"text": "Analyze this dataset"}}. The same endpoint handles all three.
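A quick way to sanity-check this contract is to validate those example payloads against a trimmed-down copy of the models. This is a minimal sketch — only the fields needed for the demo are re-declared, and the `"sms"` payload is a made-up negative case:

```python
from enum import Enum
from typing import Any, Optional

from pydantic import BaseModel, Field, ValidationError

class Channel(str, Enum):
    CHAT = "chat"
    VOICE = "voice"
    TASK = "task"

class InputContent(BaseModel):
    text: Optional[str] = None
    audio_base64: Optional[str] = None

class UnifiedRequest(BaseModel):
    channel: Channel
    session_id: str
    agent_id: str
    input: InputContent
    context: dict[str, Any] = Field(default_factory=dict)

# The example payloads from the text all validate against one model.
chat_req = UnifiedRequest.model_validate({
    "channel": "chat", "session_id": "s1", "agent_id": "support",
    "input": {"text": "Hello"},
})
voice_req = UnifiedRequest.model_validate({
    "channel": "voice", "session_id": "s2", "agent_id": "support",
    "input": {"audio_base64": "UklGRg=="},
})

# Unknown channels are rejected at the boundary, before any agent code runs.
try:
    UnifiedRequest.model_validate({
        "channel": "sms", "session_id": "s3", "agent_id": "support", "input": {},
    })
    rejected = False
except ValidationError:
    rejected = True

print(chat_req.channel, voice_req.channel, rejected)
```

Because `Channel` is a closed enum, adding a new channel is a one-line schema change that every client sees at validation time.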

Channel Adapters

Each channel has preprocessing and postprocessing needs. Adapters handle these transformations:

from abc import ABC, abstractmethod

class ChannelAdapter(ABC):
    @abstractmethod
    async def preprocess(self, request: UnifiedRequest) -> str:
        """Convert channel-specific input to plain text for the agent."""
        pass

    @abstractmethod
    async def postprocess(
        self, text: str, request: UnifiedRequest
    ) -> dict:
        """Convert agent text output to channel-specific format."""
        pass

class ChatAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        return request.input.text or ""

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        return {"text": text}

class VoiceAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        # transcribe_audio is an STT helper (e.g. a Whisper call) defined elsewhere
        if request.input.audio_base64:
            return await transcribe_audio(request.input.audio_base64)
        return request.input.text or ""

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        # text_to_ssml and synthesize_speech are TTS helpers defined elsewhere
        if request.response_format == "ssml":
            return {"ssml": text_to_ssml(text)}
        if request.response_format == "audio":
            audio_url = await synthesize_speech(text)
            return {"audio_url": audio_url, "text": text}
        return {"text": text}

class TaskAdapter(ChannelAdapter):
    async def preprocess(self, request: UnifiedRequest) -> str:
        # Tasks may include structured instructions
        parts = [request.input.text or ""]
        for attachment in request.input.attachments:
            parts.append(f"[Attachment: {attachment.get('name', 'file')}]")
        return "\n".join(parts)

    async def postprocess(self, text: str, request: UnifiedRequest) -> dict:
        if request.response_format == "structured":
            return {"text": text, "metadata": {"structured": True}}
        return {"text": text}

ADAPTERS: dict[Channel, ChannelAdapter] = {
    Channel.CHAT: ChatAdapter(),
    Channel.VOICE: VoiceAdapter(),
    Channel.TASK: TaskAdapter(),
}

The Unified Endpoint

The main endpoint delegates to the appropriate adapter, runs the agent, and normalizes the response:

from fastapi import FastAPI, HTTPException

app = FastAPI(title="Unified Agent API")

@app.post("/v1/agent/invoke")
async def invoke_agent(request: UnifiedRequest) -> UnifiedResponse:
    adapter = ADAPTERS.get(request.channel)
    if adapter is None:
        # Channel.EMAIL exists in the enum but has no registered adapter yet
        raise HTTPException(status_code=400, detail=f"No adapter for channel: {request.channel.value}")

    # Preprocess: convert channel input to text
    user_text = await adapter.preprocess(request)

    # Load conversation history
    history = await get_session_messages(request.session_id)

    # Run the agent
    agent_result = await run_agent(
        agent_id=request.agent_id,
        user_message=user_text,
        history=history,
        context=request.context,
    )

    # Postprocess: convert text to channel-appropriate format
    output = await adapter.postprocess(agent_result["text"], request)

    # Save to session history
    await save_message(request.session_id, "user", user_text)
    await save_message(request.session_id, "assistant", agent_result["text"])

    return UnifiedResponse(
        session_id=request.session_id,
        agent_id=request.agent_id,
        channel=request.channel,
        tool_calls=[
            ToolCallOutput(**tc) for tc in agent_result.get("tool_calls", [])
        ],
        usage=agent_result.get("usage", {}),
        **output,
    )
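The `**output` unpack at the end works because every adapter returns only keys that exist on `UnifiedResponse`; fields an adapter did not set simply stay `None`. A trimmed-down sketch of that merge:

```python
from typing import Optional

from pydantic import BaseModel

class MiniResponse(BaseModel):
    session_id: str
    text: Optional[str] = None
    ssml: Optional[str] = None
    audio_url: Optional[str] = None

# Adapter outputs for two different channels...
chat_output = {"text": "Hi there"}
voice_output = {"audio_url": "https://example.com/reply.mp3", "text": "Hi there"}

# ...feed the same constructor; keys an adapter didn't set stay None.
r_chat = MiniResponse(session_id="s1", **chat_output)
r_voice = MiniResponse(session_id="s1", **voice_output)
print(r_chat.audio_url, r_voice.audio_url)
```

This is why the adapter contract matters: an adapter that invents a key not on the response model would raise at construction time, which is a useful failure mode — the mismatch surfaces in tests rather than as silently dropped data.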

Streaming Across Channels

Streaming works differently per channel. Chat needs Server-Sent Events. Voice needs audio chunks. Tasks may not need streaming at all:


from fastapi import HTTPException
from fastapi.responses import StreamingResponse
import json

@app.post("/v1/agent/stream")
async def stream_agent(request: UnifiedRequest):
    adapter = ADAPTERS.get(request.channel)
    if adapter is None:
        raise HTTPException(status_code=400, detail=f"No adapter for channel: {request.channel.value}")
    user_text = await adapter.preprocess(request)
    history = await get_session_messages(request.session_id)

    async def event_stream():
        full_text = ""
        async for chunk in stream_agent_response(
            agent_id=request.agent_id,
            user_message=user_text,
            history=history,
        ):
            full_text += chunk["text"]
            output = await adapter.postprocess(chunk["text"], request)
            event_data = json.dumps({
                "session_id": request.session_id,
                "chunk": output,
                "done": chunk.get("done", False),
            })
            yield f"data: {event_data}\n\n"

        await save_message(request.session_id, "user", user_text)
        await save_message(request.session_id, "assistant", full_text)

    return StreamingResponse(event_stream(), media_type="text/event-stream")
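On the client side, each SSE frame is a `data:` line followed by a blank line. A minimal stdlib parser for the frames this endpoint emits (the sample stream below is illustrative):

```python
import json
from typing import Any

def parse_sse_events(raw: str) -> list[dict[str, Any]]:
    """Split a decoded SSE stream into parsed JSON payloads (data-only frames)."""
    events = []
    for frame in raw.split("\n\n"):
        frame = frame.strip()
        if frame.startswith("data: "):
            events.append(json.loads(frame[len("data: "):]))
    return events

raw = (
    'data: {"chunk": {"text": "Hel"}, "done": false}\n\n'
    'data: {"chunk": {"text": "lo"}, "done": true}\n\n'
)
events = parse_sse_events(raw)
full_text = "".join(e["chunk"]["text"] for e in events)
print(full_text, events[-1]["done"])  # Hello True
```

A real client would read the response body incrementally rather than buffering it, but the frame format — and the `done` flag that marks the final chunk — is the same.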

FAQ

How do I handle channel-specific features like voice barge-in or chat typing indicators?

Add channel-specific metadata to the context field of the request and response. For voice barge-in, the client sends {"context": {"voice_barge_in": true}}. The voice adapter checks this flag and adjusts response behavior. Keep these features in the adapter layer, not in core agent logic.
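A sketch of keeping that flag in the adapter layer — the flag name follows the example above, but the `interruptible` synthesis option is illustrative, not from any specific TTS API:

```python
from typing import Any

def voice_synthesis_options(context: dict[str, Any]) -> dict[str, Any]:
    """Translate channel-specific context flags into synthesis options.
    This lives in the voice adapter; core agent logic never sees these flags."""
    options: dict[str, Any] = {"interruptible": False}
    if context.get("voice_barge_in"):
        options["interruptible"] = True
    return options

print(voice_synthesis_options({"voice_barge_in": True}))  # {'interruptible': True}
print(voice_synthesis_options({}))                        # {'interruptible': False}
```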

Should the unified API normalize all responses to text, or preserve rich formats?

Always generate text as the canonical format, then let adapters transform it. The agent produces text. The chat adapter returns it as-is. The voice adapter converts it to SSML or audio. The task adapter may parse it into structured JSON. This keeps agent logic channel-agnostic.
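For the task adapter's structured case, one defensive approach is to try parsing the canonical text as JSON and fall back to plain text. A sketch — in practice the agent should also be prompted to emit JSON whenever structured output is requested:

```python
import json
from typing import Any

def to_structured(text: str) -> dict[str, Any]:
    """Parse agent text into a structured payload if it is a JSON object,
    otherwise wrap it as plain text so the caller always gets a dict."""
    try:
        parsed = json.loads(text)
        if isinstance(parsed, dict):
            return {"data": parsed, "structured": True}
    except json.JSONDecodeError:
        pass
    return {"text": text, "structured": False}

print(to_structured('{"rows": 42}'))  # {'data': {'rows': 42}, 'structured': True}
print(to_structured("plain answer"))  # {'text': 'plain answer', 'structured': False}
```

The fallback matters: a model that occasionally returns prose instead of JSON degrades gracefully rather than crashing the task pipeline.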

How do I route to different agent implementations based on channel?

Add routing logic in the endpoint that selects the agent based on both agent_id and channel. A customer service agent might use a faster model for chat and a more capable model for complex task requests. Store this mapping in configuration rather than code.
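A minimal sketch of configuration-driven routing; the agent and model names here are purely illustrative:

```python
# (agent_id, channel) -> model; in practice loaded from config or a database.
AGENT_ROUTING: dict[tuple[str, str], str] = {
    ("customer-service", "chat"): "fast-model",
    ("customer-service", "task"): "capable-model",
}

def resolve_model(agent_id: str, channel: str, default: str = "fast-model") -> str:
    """Pick a model for the (agent, channel) pair, falling back to a default."""
    return AGENT_ROUTING.get((agent_id, channel), default)

print(resolve_model("customer-service", "task"))   # capable-model
print(resolve_model("customer-service", "voice"))  # fast-model
```

Keeping this mapping in data rather than code means a new channel, or a model swap for one channel, is a config change that needs no redeploy of the endpoint logic.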


#UnifiedAPI #AIAgents #APIDesign #FastAPI #MultiChannel #AgenticAI #LearnAI #AIEngineering
