FastAPI Middleware for AI Agents: Logging, Auth, and Rate Limiting

Build a production middleware stack for AI agent APIs in FastAPI. Covers structured request logging, Bearer token authentication, sliding window rate limiting, and CORS configuration for agent frontends.

The Middleware Stack for AI Agent APIs

Middleware sits between the incoming HTTP request and your endpoint handler. For AI agent backends, a proper middleware stack handles cross-cutting concerns: logging every request for debugging, authenticating callers before they reach agent endpoints, rate limiting to prevent LLM cost overruns, and adding CORS headers for browser-based agent frontends.

FastAPI middleware wraps your endpoint like layers of an onion, applied in reverse order of addition: the last middleware added becomes the outermost layer, so it sees the request first and the response last.
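A minimal sketch to see the ordering in action (the middleware names here are illustrative):

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def inner(request: Request, call_next):
    print("inner: request")             # added first, so it is the inner layer
    response = await call_next(request)
    print("inner: response")
    return response

@app.middleware("http")
async def outer(request: Request, call_next):
    print("outer: request")             # added last, so it runs first
    response = await call_next(request)
    print("outer: response")
    return response

A request to any endpoint prints outer: request, inner: request, then inner: response, outer: response on the way out.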

Structured Request Logging

Every AI agent request should be logged with enough context to debug issues in production. Before the middleware itself, here is where the stack sits in the overall request path:

flowchart LR
    CLIENT(["Client SDK"])
    GW["API Gateway<br/>auth plus rate limit"]
    APP["FastAPI app<br/>handlers and DI"]
    VAL["Pydantic validation"]
    SVC["Service layer<br/>business logic"]
    DB[(Database)]
    QUEUE[(Background queue)]
    OBS[(Tracing)]
    CLIENT --> GW --> APP --> VAL --> SVC
    SVC --> DB
    SVC --> QUEUE
    SVC --> OBS
    SVC --> CLIENT
    style GW fill:#4f46e5,stroke:#4338ca,color:#fff
    style APP fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b

The logging middleware captures timing, status codes, and request metadata for every request:

import time
import uuid
import logging
from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger("agent_api")

@app.middleware("http")
async def logging_middleware(request: Request, call_next):
    request_id = str(uuid.uuid4())[:8]
    request.state.request_id = request_id

    start_time = time.monotonic()

    # Log request
    logger.info(
        "request_started",
        extra={
            "request_id": request_id,
            "method": request.method,
            "path": request.url.path,
            "client_ip": request.client.host,
        },
    )

    try:
        response = await call_next(request)
        duration_ms = (time.monotonic() - start_time) * 1000

        logger.info(
            "request_completed",
            extra={
                "request_id": request_id,
                "status_code": response.status_code,
                "duration_ms": round(duration_ms, 2),
                "path": request.url.path,
            },
        )

        response.headers["X-Request-ID"] = request_id
        response.headers["X-Response-Time"] = f"{duration_ms:.0f}ms"
        return response

    except Exception as e:
        duration_ms = (time.monotonic() - start_time) * 1000
        logger.error(
            "request_failed",
            extra={
                "request_id": request_id,
                "error": str(e),
                "duration_ms": round(duration_ms, 2),
            },
        )
        raise

The X-Request-ID header lets clients and support teams correlate frontend errors with backend logs.
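One practical note: the extra fields only reach your log output if the formatter emits them; the stdlib default formatter drops unknown record attributes. A minimal sketch of a JSON formatter that surfaces the fields this middleware sets:

import json
import logging

class JsonFormatter(logging.Formatter):
    # Fields passed via extra= become attributes on the LogRecord
    EXTRA_FIELDS = ("request_id", "method", "path", "client_ip",
                    "status_code", "duration_ms", "error")

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "event": record.getMessage()}
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("agent_api").addHandler(handler)
logging.getLogger("agent_api").setLevel(logging.INFO)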

Token-Based Authentication Middleware

AI agent APIs should authenticate every request. This middleware validates Bearer tokens and attaches user context to the request:

from fastapi import Request
from fastapi.responses import JSONResponse
import jwt  # PyJWT

# settings is assumed to be your app's config object exposing jwt_secret
SKIP_AUTH_PATHS = {"/health", "/docs", "/openapi.json"}

@app.middleware("http")
async def auth_middleware(request: Request, call_next):
    if request.url.path in SKIP_AUTH_PATHS:
        return await call_next(request)

    auth_header = request.headers.get("Authorization")
    if not auth_header or not auth_header.startswith("Bearer "):
        return JSONResponse(
            status_code=401,
            content={"error": "Missing or invalid auth token"},
        )

    token = auth_header.split(" ", 1)[1]

    try:
        payload = jwt.decode(
            token,
            settings.jwt_secret,
            algorithms=["HS256"],
        )
        request.state.user_id = payload["sub"]
        request.state.user_tier = payload.get("tier", "free")
    except jwt.ExpiredSignatureError:
        return JSONResponse(
            status_code=401,
            content={"error": "Token expired"},
        )
    except jwt.InvalidTokenError:
        return JSONResponse(
            status_code=401,
            content={"error": "Invalid token"},
        )

    return await call_next(request)

Notice this uses JSONResponse instead of raising HTTPException. FastAPI's exception handlers are installed inside the middleware stack, so an HTTPException raised from middleware never reaches them; it surfaces as an unhandled 500 instead of a clean 401. Returning a response directly is safer.
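To exercise the middleware locally, you can mint a short-lived token with PyJWT. A sketch, assuming the same settings.jwt_secret the middleware decodes with:

import time
import jwt  # PyJWT

token = jwt.encode(
    {
        "sub": "user_123",              # becomes request.state.user_id
        "tier": "pro",                  # becomes request.state.user_tier
        "exp": int(time.time()) + 900,  # expires in 15 minutes
    },
    settings.jwt_secret,
    algorithm="HS256",
)
# e.g. curl -H "Authorization: Bearer <token>" http://localhost:8000/your-endpoint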

Sliding Window Rate Limiting

AI agent APIs are expensive because every request triggers LLM calls. Rate limiting prevents abuse and cost overruns. This implementation uses Redis for a sliding window algorithm:

import redis.asyncio as redis

redis_client = redis.from_url("redis://localhost:6379/2")

RATE_LIMITS = {
    "free": {"requests": 20, "window_seconds": 3600},
    "pro": {"requests": 200, "window_seconds": 3600},
    "enterprise": {"requests": 2000, "window_seconds": 3600},
}

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    if request.url.path in SKIP_AUTH_PATHS:
        return await call_next(request)

    user_id = getattr(request.state, "user_id", "anonymous")
    user_tier = getattr(request.state, "user_tier", "free")
    limits = RATE_LIMITS.get(user_tier, RATE_LIMITS["free"])

    key = f"ratelimit:{user_id}"
    now = time.time()
    window_start = now - limits["window_seconds"]

    pipe = redis_client.pipeline()
    # Remove old entries outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Count remaining entries
    pipe.zcard(key)
    # Add current request, with a unique member so concurrent requests don't collide
    member = f"{now}:{uuid.uuid4().hex[:8]}"
    pipe.zadd(key, {member: now})
    # Set expiry on the key
    pipe.expire(key, limits["window_seconds"])
    results = await pipe.execute()

    request_count = results[1]

    if request_count >= limits["requests"]:
        # Worst-case wait: the full window (a tighter value is sketched below)
        retry_after = int(limits["window_seconds"])
        return JSONResponse(
            status_code=429,
            content={
                "error": "Rate limit exceeded",
                "limit": limits["requests"],
                "window": f"{limits['window_seconds']}s",
                "retry_after": retry_after,
            },
            headers={"Retry-After": str(retry_after)},
        )

    response = await call_next(request)
    remaining = limits["requests"] - request_count - 1
    response.headers["X-RateLimit-Limit"] = str(limits["requests"])
    response.headers["X-RateLimit-Remaining"] = str(max(0, remaining))
    return response

The Redis sorted set stores one member per request, scored by its timestamp. On each new request, entries older than the window are pruned, the remaining count is checked, and the new request is added. This gives an accurate sliding window rather than a fixed window that resets abruptly at interval boundaries. One ordering caveat: the limiter reads the request.state.user_id and request.state.user_tier values set by the auth middleware, so auth must execute first. With decorator-registered middleware, later additions become outer layers, so define the rate limiter earlier in the file than the auth middleware.
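The Retry-After above is a worst-case value (the full window). For a tighter figure, you can derive it from the oldest entry still in the window. A sketch reusing the same sorted set and redis_client:

async def compute_retry_after(key: str, window_seconds: int) -> int:
    # The oldest entry is the next to fall out of the window; one request
    # slot frees up as soon as it expires.
    oldest = await redis_client.zrange(key, 0, 0, withscores=True)
    if not oldest:
        return 1
    _, oldest_ts = oldest[0]
    return max(1, int(oldest_ts + window_seconds - time.time()))

Call it in the 429 branch and use the result for both the JSON body and the Retry-After header.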

CORS Configuration

Browser-based agent frontends need proper CORS headers:

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "https://app.yourdomain.com",
        "http://localhost:3000",
    ],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["Authorization", "Content-Type"],
    expose_headers=[
        "X-Request-ID",
        "X-RateLimit-Remaining",
    ],
)

Add CORS middleware last so it is the outermost layer and properly handles preflight OPTIONS requests before any other middleware runs.
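Putting the four pieces together: if you register the functions explicitly rather than with the @app.middleware decorators (drop those decorators from the definitions to avoid double registration), the whole stack reads like this. A sketch, giving an execution order, outermost first, of CORS, logging, auth, rate limiting:

# Later additions wrap earlier ones, so register innermost-first
app.middleware("http")(rate_limit_middleware)   # innermost: runs last
app.middleware("http")(auth_middleware)
app.middleware("http")(logging_middleware)
app.add_middleware(                             # outermost: runs first
    CORSMiddleware,
    allow_origins=["https://app.yourdomain.com", "http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["Authorization", "Content-Type"],
    expose_headers=["X-Request-ID", "X-RateLimit-Remaining"],
)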

FAQ

What is the correct order for middleware in a FastAPI AI agent API?

The stack should execute in this order: CORS (outermost, handles preflight), logging (captures all requests, including rejected ones), authentication (rejects unauthenticated requests early), then rate limiting (checks limits for authenticated users). Since FastAPI middleware wraps in reverse order of addition, register them innermost-first in code (rate limiting, then auth, then logging, then CORS last) so that CORS executes first. This ensures OPTIONS preflight requests get CORS headers without triggering auth or rate limiting.

Should I use middleware or Dependencies for authentication?

Middleware is better when every endpoint needs authentication because it runs automatically without any per-endpoint configuration. Dependencies are better when only some endpoints need auth, or when different endpoints need different auth levels. A common pattern is using middleware for basic token validation and a dependency for fine-grained permission checks on specific endpoints.
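That combined pattern might look like this: the middleware has already validated the token and populated request.state, and a small dependency enforces a per-endpoint requirement (the endpoint path and tier ladder here are illustrative):

from fastapi import Depends, HTTPException, Request

def require_tier(minimum: str):
    order = {"free": 0, "pro": 1, "enterprise": 2}

    async def checker(request: Request) -> str:
        # Set by the auth middleware earlier in the stack
        tier = getattr(request.state, "user_tier", "free")
        if order.get(tier, 0) < order[minimum]:
            raise HTTPException(status_code=403, detail=f"Requires {minimum} tier")
        return tier

    return checker

@app.post("/agents/batch-run")  # hypothetical endpoint
async def batch_run(tier: str = Depends(require_tier("pro"))):
    return {"status": "accepted", "tier": tier}

Raising HTTPException is fine here: dependencies run inside the router, where FastAPI's exception handlers convert it into a clean 403.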

How do I handle rate limiting for streaming endpoints?

Count the initial request, not individual streamed chunks. A streaming response that sends 500 tokens is still one API request from a rate limiting perspective. However, you may want to track token usage separately for billing purposes. Use the logging middleware to record total tokens consumed per request and apply token-based quotas as a separate check from request-count rate limiting.
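A sketch of that separate token-based check, assuming your LLM client reports tokens consumed per response and reusing the redis_client from the rate limiter:

import time

# Hypothetical daily token budgets per tier
TOKEN_QUOTAS = {"free": 50_000, "pro": 500_000, "enterprise": 5_000_000}

async def record_token_usage(user_id: str, tier: str, tokens_used: int) -> bool:
    # One atomic counter per user per UTC day; the TTL handles the daily reset
    key = f"tokens:{user_id}:{time.strftime('%Y-%m-%d', time.gmtime())}"
    total = await redis_client.incrby(key, tokens_used)
    await redis_client.expire(key, 86400)
    # Usage is recorded first, then compared: True means still within quota
    return total <= TOKEN_QUOTAS.get(tier, TOKEN_QUOTAS["free"])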


#FastAPI #Middleware #Authentication #RateLimiting #AIAgents #AgenticAI #LearnAI #AIEngineering
