
Deploying AI Agents with FastAPI: REST Endpoints for Agent Interactions

Learn how to expose AI agents through production-grade FastAPI REST endpoints with async request handling, Pydantic validation, structured error responses, and streaming support.

Why FastAPI Is the Go-To Framework for Agent APIs

Building an AI agent is one challenge. Making it accessible to users, frontends, and other services over HTTP is another. FastAPI has become the dominant choice for serving AI agents in production because it is natively async, generates OpenAPI docs automatically, validates inputs with Pydantic, and handles concurrent requests efficiently — all qualities you need when wrapping long-running LLM calls behind an API.

In this guide, you will build a complete FastAPI service that exposes an AI agent through REST endpoints, handles errors gracefully, and returns structured responses.

Project Structure

Before the code, here is the request path the service implements: a client call passes through a gateway, into the FastAPI app, through Pydantic validation, and into a service layer that fans out to storage, background work, and tracing:

flowchart LR
    CLIENT(["Client SDK"])
    GW["API Gateway<br/>auth + rate limit"]
    APP["FastAPI app<br/>handlers and DI"]
    VAL["Pydantic validation"]
    SVC["Service layer<br/>business logic"]
    DB[(Database)]
    QUEUE[(Background queue)]
    OBS[(Tracing)]
    CLIENT --> GW --> APP --> VAL --> SVC
    SVC --> DB
    SVC --> QUEUE
    SVC --> OBS
    SVC --> CLIENT
    style GW fill:#4f46e5,stroke:#4338ca,color:#fff
    style APP fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b

A clean project layout mirrors that separation, keeping agent logic apart from the HTTP layer:

agent_service/
  app/
    __init__.py
    main.py          # FastAPI application
    routes/
      agent.py       # Agent endpoints
    models/
      schemas.py     # Request/response models
    services/
      agent_runner.py # Agent execution logic
    config.py        # Settings management

Defining Request and Response Models

Start with Pydantic models that enforce a contract between clients and your agent service:

# app/models/schemas.py
from pydantic import BaseModel, Field
from typing import Optional
from enum import Enum

class AgentRole(str, Enum):
    assistant = "assistant"
    researcher = "researcher"
    coder = "coder"

class AgentRequest(BaseModel):
    message: str = Field(..., min_length=1, max_length=4000)
    session_id: Optional[str] = Field(None, description="Resume existing session")
    agent_role: AgentRole = AgentRole.assistant
    temperature: float = Field(0.7, ge=0.0, le=2.0)

class AgentResponse(BaseModel):
    session_id: str
    reply: str
    tokens_used: int
    model: str
    processing_time_ms: float

class ErrorResponse(BaseModel):
    error: str
    detail: Optional[str] = None
    request_id: str

Pydantic validates every incoming request automatically. A client sending temperature: 5.0 gets a clear 422 error without your agent ever being invoked.
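
To see the contract enforced end to end, a short test with FastAPI's TestClient (assuming the app wiring shown later in this guide) confirms the rejection happens before the agent runs:

# tests/test_validation.py -- illustrative test against the app defined below
from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)

def test_out_of_range_temperature_is_rejected():
    resp = client.post(
        "/api/v1/agent/chat",
        json={"message": "hi", "temperature": 5.0},  # violates le=2.0
    )
    assert resp.status_code == 422  # body explains which field failed and why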

Building the Agent Runner Service

Wrap your agent logic in a service class that the route layer calls:

# app/services/agent_runner.py
import time
import uuid

from agents import Agent, ModelSettings, Runner

class AgentRunnerService:
    def __init__(self):
        # In-memory session store; swap for Redis or a database in production.
        self.sessions: dict[str, list] = {}

    async def run(self, message: str, session_id: str | None,
                  role: str, temperature: float) -> dict:
        sid = session_id or str(uuid.uuid4())
        history = self.sessions.get(sid, [])

        agent = Agent(
            name=role,
            instructions=f"You are a helpful {role} agent.",
            model="gpt-4o",
            # Sampling parameters live on ModelSettings, not on Agent itself.
            model_settings=ModelSettings(temperature=temperature),
        )

        # Prepend prior turns so the agent sees the whole conversation.
        input_items = history + [{"role": "user", "content": message}]

        start = time.perf_counter()
        result = await Runner.run(agent, input_items)
        elapsed_ms = (time.perf_counter() - start) * 1000

        self.sessions[sid] = result.to_input_list()

        return {
            "session_id": sid,
            "reply": result.final_output,
            "tokens_used": result.raw_responses[-1].usage.total_tokens,
            "model": "gpt-4o",
            "processing_time_ms": round(elapsed_ms, 2),
        }
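
A quick way to exercise the service outside HTTP, useful in tests and notebooks. This sketch assumes OPENAI_API_KEY is set in the environment; the file name is illustrative:

# scratch.py -- drive AgentRunnerService directly, no HTTP involved
import asyncio

from app.services.agent_runner import AgentRunnerService

async def main():
    svc = AgentRunnerService()
    first = await svc.run("What is FastAPI?", None, "assistant", 0.7)
    # Reuse the returned session_id to continue the same conversation.
    follow = await svc.run("Why does it suit agents?", first["session_id"],
                           "assistant", 0.7)
    print(follow["reply"])

asyncio.run(main())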

Creating the FastAPI Endpoint

Wire the service into async route handlers:

# app/routes/agent.py
from fastapi import APIRouter, HTTPException
from app.models.schemas import AgentRequest, AgentResponse, ErrorResponse
from app.services.agent_runner import AgentRunnerService

router = APIRouter(prefix="/api/v1/agent", tags=["Agent"])
runner_service = AgentRunnerService()

@router.post(
    "/chat",
    response_model=AgentResponse,
    responses={500: {"model": ErrorResponse}},
)
async def chat(request: AgentRequest):
    try:
        result = await runner_service.run(
            message=request.message,
            session_id=request.session_id,
            role=request.agent_role.value,
            temperature=request.temperature,
        )
        return AgentResponse(**result)
    except Exception as e:
        # Log the exception server-side; echoing str(e) can leak internals,
        # so prefer an opaque message in production.
        raise HTTPException(status_code=500, detail=str(e))
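
Once the router is mounted (next section), calling the endpoint is an ordinary HTTP POST. A client sketch with httpx, assuming the service listens on localhost:8000:

# client_example.py -- hypothetical client for the /chat endpoint
import httpx

resp = httpx.post(
    "http://localhost:8000/api/v1/agent/chat",
    json={
        "message": "Summarize our refund policy in two sentences.",
        "agent_role": "assistant",
    },
    timeout=60.0,  # agent calls are slow; httpx defaults to 5 seconds
)
resp.raise_for_status()
data = resp.json()
print(data["session_id"], data["reply"])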

Application Entry Point with Lifespan Events

Use FastAPI lifespan events to initialize and clean up resources:

# app/main.py
from contextlib import asynccontextmanager
from fastapi import FastAPI
from app.routes.agent import router as agent_router

@asynccontextmanager
async def lifespan(app: FastAPI):
    print("Agent service starting up")
    yield
    print("Agent service shutting down")

app = FastAPI(
    title="AI Agent Service",
    version="1.0.0",
    lifespan=lifespan,
)
app.include_router(agent_router)

@app.get("/health")
async def health():
    return {"status": "ok"}

Run it with: uvicorn app.main:app --host 0.0.0.0 --port 8000
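
Streaming Agent Responses

For token-by-token output, you can pair FastAPI's StreamingResponse with the Agents SDK's streamed runner. A minimal server-sent-events sketch, assuming Runner.run_streamed and the raw-response event types from the OpenAI SDK; the route module name is illustrative:

# app/routes/agent_stream.py -- SSE sketch for streaming replies
from fastapi import APIRouter
from fastapi.responses import StreamingResponse
from openai.types.responses import ResponseTextDeltaEvent

from agents import Agent, Runner
from app.models.schemas import AgentRequest

stream_router = APIRouter(prefix="/api/v1/agent", tags=["Agent"])

@stream_router.post("/chat/stream")
async def chat_stream(request: AgentRequest):
    agent = Agent(
        name=request.agent_role.value,
        instructions=f"You are a helpful {request.agent_role.value} agent.",
        model="gpt-4o",
    )
    result = Runner.run_streamed(agent, request.message)

    async def event_source():
        # Forward only raw text deltas; other event types carry tool calls etc.
        async for event in result.stream_events():
            if event.type == "raw_response_event" and isinstance(
                event.data, ResponseTextDeltaEvent
            ):
                yield f"data: {event.data.delta}\n\n"
        yield "data: [DONE]\n\n"

    return StreamingResponse(event_source(), media_type="text/event-stream")

Mount it alongside the existing router with app.include_router(stream_router) in app/main.py.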


Adding Rate Limiting and Timeouts

Protect your agent endpoints from abuse and runaway LLM calls. Start with a hard per-request timeout; a rate-limiting sketch follows the timeout example below:

# app/routes/agent.py -- /chat handler hardened with a per-request timeout
import asyncio

from fastapi import HTTPException

AGENT_TIMEOUT_SECONDS = 30

@router.post("/chat", response_model=AgentResponse)
async def chat(request: AgentRequest):
    try:
        result = await asyncio.wait_for(
            runner_service.run(
                message=request.message,
                session_id=request.session_id,
                role=request.agent_role.value,
                temperature=request.temperature,
            ),
            timeout=AGENT_TIMEOUT_SECONDS,
        )
        return AgentResponse(**result)
    except asyncio.TimeoutError:
        # wait_for cancels the pending agent task before raising.
        raise HTTPException(status_code=504, detail="Agent timed out")
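
For the rate-limiting half, one option is the third-party slowapi package. A sketch, assuming a per-client-IP limit is appropriate and reusing the schema and service imports from the earlier sections; the endpoint shown is illustrative:

# app/main.py additions -- per-IP rate limiting with slowapi
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/api/v1/agent/chat-limited")
@limiter.limit("10/minute")  # responds 429 beyond ten requests per minute per IP
async def chat_limited(request: Request, payload: AgentRequest):
    # slowapi requires the Request parameter to resolve the client address.
    result = await runner_service.run(
        message=payload.message,
        session_id=payload.session_id,
        role=payload.agent_role.value,
        temperature=payload.temperature,
    )
    return AgentResponse(**result)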

FAQ

How do I handle long-running agent tasks that exceed HTTP timeout limits?

Return an immediate 202 Accepted response with a task ID, then process the agent call in a background worker. Clients poll a GET /tasks/{task_id} endpoint or subscribe to a WebSocket for the result. This pattern is standard for any LLM call that may take more than 30 seconds.
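
A minimal sketch of that pattern using FastAPI's BackgroundTasks and an in-memory task store (a production deployment would use a durable queue; the module and names here are illustrative):

# app/routes/tasks.py -- 202 Accepted plus polling, in-memory store
import uuid

from fastapi import APIRouter, BackgroundTasks, HTTPException

from app.models.schemas import AgentRequest
from app.routes.agent import runner_service

task_router = APIRouter(prefix="/api/v1", tags=["Tasks"])
tasks: dict[str, dict] = {}  # task_id -> {"status": ..., "result": ...}

async def execute_agent(task_id: str, req: AgentRequest):
    try:
        result = await runner_service.run(
            message=req.message,
            session_id=req.session_id,
            role=req.agent_role.value,
            temperature=req.temperature,
        )
        tasks[task_id] = {"status": "done", "result": result}
    except Exception as exc:
        tasks[task_id] = {"status": "failed", "result": str(exc)}

@task_router.post("/agent/tasks", status_code=202)
async def submit_task(req: AgentRequest, background: BackgroundTasks):
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "pending", "result": None}
    background.add_task(execute_agent, task_id, req)  # runs after the response
    return {"task_id": task_id}

@task_router.get("/tasks/{task_id}")
async def get_task(task_id: str):
    if task_id not in tasks:
        raise HTTPException(status_code=404, detail="Unknown task")
    return tasks[task_id]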

Should I use sync or async endpoints for AI agents?

Prefer async. LLM API calls are I/O-bound: they spend most of their time waiting for network responses, so async endpoints let FastAPI handle hundreds of concurrent agent requests on a single process. Sync (def) endpoints do not block the event loop outright, since FastAPI runs them in a thread pool, but concurrency is then capped by the pool size and each in-flight LLM call ties up a thread for its entire duration.

How do I version my agent API when prompts or models change?

Use URL path versioning (/api/v1/agent, /api/v2/agent) for breaking changes to the request/response schema. For non-breaking changes like prompt tweaks or model upgrades, use feature flags or the agent role parameter so clients can opt into new behavior without changing their integration code.
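
A sketch of the routing side, assuming a hypothetical v2 module that ships the breaking schema change:

# app/main.py -- path-versioned routers; app.routes.agent_v2 is hypothetical
from app.routes.agent import router as agent_v1_router      # serves /api/v1/agent
# from app.routes.agent_v2 import router as agent_v2_router # serves /api/v2/agent

app.include_router(agent_v1_router)
# app.include_router(agent_v2_router)  # enable once v2 clients are ready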


#FastAPI #AIAgents #RESTAPI #Python #Deployment #AgenticAI #LearnAI #AIEngineering
