FastAPI Middleware for AI Agents: Logging, Auth, and Rate Limiting

Build a production middleware stack for AI agent APIs in FastAPI. Covers structured request logging, Bearer token authentication, sliding window rate limiting, and CORS configuration for agent frontends.

The Middleware Stack for AI Agent APIs

Middleware sits between the incoming HTTP request and your endpoint handler. For AI agent backends, a proper middleware stack handles cross-cutting concerns: logging every request for debugging, authenticating callers before they reach agent endpoints, rate limiting to prevent LLM cost overruns, and adding CORS headers for browser-based agent frontends.

FastAPI middleware wraps your endpoint like layers of an onion, applied in reverse order of addition: the last middleware added becomes the outermost layer, so it sees the request first and the response last.
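A minimal sketch to see the ordering in action (the middleware names here are illustrative):

from fastapi import FastAPI, Request

app = FastAPI()

@app.middleware("http")
async def inner(request: Request, call_next):
    print("inner: request")             # added first, so it is the inner layer
    response = await call_next(request)
    print("inner: response")
    return response

@app.middleware("http")
async def outer(request: Request, call_next):
    print("outer: request")             # added last, so it runs first
    response = await call_next(request)
    print("outer: response")
    return response

A request to any endpoint prints outer: request, inner: request, then inner: response, outer: response on the way out.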

Structured Request Logging

Every AI agent request should be logged with enough context to debug issues in production. Before the middleware itself, here is where the stack sits in the overall request path:

flowchart LR
    CLIENT(["Client SDK"])
    GW["API Gateway<br/>auth plus rate limit"]
    APP["FastAPI app<br/>handlers and DI"]
    VAL["Pydantic validation"]
    SVC["Service layer<br/>business logic"]
    DB[(Database)]
    QUEUE[(Background queue)]
    OBS[(Tracing)]
    CLIENT --> GW --> APP --> VAL --> SVC
    SVC --> DB
    SVC --> QUEUE
    SVC --> OBS
    SVC --> CLIENT
    style GW fill:#4f46e5,stroke:#4338ca,color:#fff
    style APP fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b

The logging middleware captures timing, status codes, and request metadata for every request:

import time
import uuid
import logging
from fastapi import FastAPI, Request

app = FastAPI()
logger = logging.getLogger("agent_api")

@app.middleware("http")
async def logging_middleware(request: Request, call_next):
    request_id = str(uuid.uuid4())[:8]
    request.state.request_id = request_id

    start_time = time.monotonic()

    # Log request
    logger.info(
        "request_started",
        extra={
            "request_id": request_id,
            "method": request.method,
            "path": request.url.path,
            "client_ip": request.client.host,
        },
    )

    try:
        response = await call_next(request)
        duration_ms = (time.monotonic() - start_time) * 1000

        logger.info(
            "request_completed",
            extra={
                "request_id": request_id,
                "status_code": response.status_code,
                "duration_ms": round(duration_ms, 2),
                "path": request.url.path,
            },
        )

        response.headers["X-Request-ID"] = request_id
        response.headers["X-Response-Time"] = f"{duration_ms:.0f}ms"
        return response

    except Exception as e:
        duration_ms = (time.monotonic() - start_time) * 1000
        logger.error(
            "request_failed",
            extra={
                "request_id": request_id,
                "error": str(e),
                "duration_ms": round(duration_ms, 2),
            },
        )
        raise

The X-Request-ID header lets clients and support teams correlate frontend errors with backend logs.
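One practical note: the extra fields only reach your log output if the formatter emits them; the stdlib default formatter drops unknown record attributes. A minimal sketch of a JSON formatter that surfaces the fields this middleware sets:

import json
import logging

class JsonFormatter(logging.Formatter):
    # Fields passed via extra= become attributes on the LogRecord
    EXTRA_FIELDS = ("request_id", "method", "path", "client_ip",
                    "status_code", "duration_ms", "error")

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "event": record.getMessage()}
        for field in self.EXTRA_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger("agent_api").addHandler(handler)
logging.getLogger("agent_api").setLevel(logging.INFO)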

Token-Based Authentication Middleware

AI agent APIs should authenticate every request. This middleware validates Bearer tokens and attaches user context to the request:

from fastapi import Request
from fastapi.responses import JSONResponse
import jwt  # PyJWT

# settings is assumed to be your app's config object exposing jwt_secret
SKIP_AUTH_PATHS = {"/health", "/docs", "/openapi.json"}

@app.middleware("http")
async def auth_middleware(request: Request, call_next):
    if request.url.path in SKIP_AUTH_PATHS:
        return await call_next(request)

    auth_header = request.headers.get("Authorization")
    if not auth_header or not auth_header.startswith("Bearer "):
        return JSONResponse(
            status_code=401,
            content={"error": "Missing or invalid auth token"},
        )

    token = auth_header.split(" ", 1)[1]

    try:
        payload = jwt.decode(
            token,
            settings.jwt_secret,
            algorithms=["HS256"],
        )
        request.state.user_id = payload["sub"]
        request.state.user_tier = payload.get("tier", "free")
    except jwt.ExpiredSignatureError:
        return JSONResponse(
            status_code=401,
            content={"error": "Token expired"},
        )
    except jwt.InvalidTokenError:
        return JSONResponse(
            status_code=401,
            content={"error": "Invalid token"},
        )

    return await call_next(request)

Notice this uses JSONResponse instead of raising HTTPException. FastAPI's exception handlers are installed inside the middleware stack, so an HTTPException raised from middleware never reaches them; it surfaces as an unhandled 500 instead of a clean 401. Returning a response directly is safer.
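To exercise the middleware locally, you can mint a short-lived token with PyJWT. A sketch, assuming the same settings.jwt_secret the middleware decodes with:

import time
import jwt  # PyJWT

token = jwt.encode(
    {
        "sub": "user_123",              # becomes request.state.user_id
        "tier": "pro",                  # becomes request.state.user_tier
        "exp": int(time.time()) + 900,  # expires in 15 minutes
    },
    settings.jwt_secret,
    algorithm="HS256",
)
# e.g. curl -H "Authorization: Bearer <token>" http://localhost:8000/your-endpoint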

Sliding Window Rate Limiting

AI agent APIs are expensive because every request triggers LLM calls. Rate limiting prevents abuse and cost overruns. This implementation uses Redis for a sliding window algorithm:

import redis.asyncio as redis

redis_client = redis.from_url("redis://localhost:6379/2")

RATE_LIMITS = {
    "free": {"requests": 20, "window_seconds": 3600},
    "pro": {"requests": 200, "window_seconds": 3600},
    "enterprise": {"requests": 2000, "window_seconds": 3600},
}

@app.middleware("http")
async def rate_limit_middleware(request: Request, call_next):
    if request.url.path in SKIP_AUTH_PATHS:
        return await call_next(request)

    user_id = getattr(request.state, "user_id", "anonymous")
    user_tier = getattr(request.state, "user_tier", "free")
    limits = RATE_LIMITS.get(user_tier, RATE_LIMITS["free"])

    key = f"ratelimit:{user_id}"
    now = time.time()
    window_start = now - limits["window_seconds"]

    pipe = redis_client.pipeline()
    # Remove old entries outside the window
    pipe.zremrangebyscore(key, 0, window_start)
    # Count remaining entries
    pipe.zcard(key)
    # Add current request, with a unique member so concurrent requests don't collide
    member = f"{now}:{uuid.uuid4().hex[:8]}"
    pipe.zadd(key, {member: now})
    # Set expiry on the key
    pipe.expire(key, limits["window_seconds"])
    results = await pipe.execute()

    request_count = results[1]

    if request_count >= limits["requests"]:
        # Worst-case wait: the full window (a tighter value is sketched below)
        retry_after = int(limits["window_seconds"])
        return JSONResponse(
            status_code=429,
            content={
                "error": "Rate limit exceeded",
                "limit": limits["requests"],
                "window": f"{limits['window_seconds']}s",
                "retry_after": retry_after,
            },
            headers={"Retry-After": str(retry_after)},
        )

    response = await call_next(request)
    remaining = limits["requests"] - request_count - 1
    response.headers["X-RateLimit-Limit"] = str(limits["requests"])
    response.headers["X-RateLimit-Remaining"] = str(max(0, remaining))
    return response

The Redis sorted set stores one member per request, scored by its timestamp. On each new request, entries older than the window are pruned, the remaining count is checked, and the new request is added. This gives an accurate sliding window rather than a fixed window that resets abruptly at interval boundaries. One ordering caveat: the limiter reads the request.state.user_id and request.state.user_tier values set by the auth middleware, so auth must execute first. With decorator-registered middleware, later additions become outer layers, so define the rate limiter earlier in the file than the auth middleware.
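The Retry-After above is a worst-case value (the full window). For a tighter figure, you can derive it from the oldest entry still in the window. A sketch reusing the same sorted set and redis_client:

async def compute_retry_after(key: str, window_seconds: int) -> int:
    # The oldest entry is the next to fall out of the window; one request
    # slot frees up as soon as it expires.
    oldest = await redis_client.zrange(key, 0, 0, withscores=True)
    if not oldest:
        return 1
    _, oldest_ts = oldest[0]
    return max(1, int(oldest_ts + window_seconds - time.time()))

Call it in the 429 branch and use the result for both the JSON body and the Retry-After header.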

CORS Configuration

Browser-based agent frontends need proper CORS headers:

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=[
        "https://app.yourdomain.com",
        "http://localhost:3000",
    ],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["Authorization", "Content-Type"],
    expose_headers=[
        "X-Request-ID",
        "X-RateLimit-Remaining",
    ],
)

Add CORS middleware last so it is the outermost layer and properly handles preflight OPTIONS requests before any other middleware runs.
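Putting the four pieces together: if you register the functions explicitly rather than with the @app.middleware decorators (drop those decorators from the definitions to avoid double registration), the whole stack reads like this. A sketch, giving an execution order, outermost first, of CORS, logging, auth, rate limiting:

# Later additions wrap earlier ones, so register innermost-first
app.middleware("http")(rate_limit_middleware)   # innermost: runs last
app.middleware("http")(auth_middleware)
app.middleware("http")(logging_middleware)
app.add_middleware(                             # outermost: runs first
    CORSMiddleware,
    allow_origins=["https://app.yourdomain.com", "http://localhost:3000"],
    allow_credentials=True,
    allow_methods=["GET", "POST", "PUT", "DELETE"],
    allow_headers=["Authorization", "Content-Type"],
    expose_headers=["X-Request-ID", "X-RateLimit-Remaining"],
)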

FAQ

What is the correct order for middleware in a FastAPI AI agent API?

The stack should execute in this order: CORS (outermost, handles preflight), logging (captures all requests, including rejected ones), authentication (rejects unauthenticated requests early), then rate limiting (checks limits for authenticated users). Since FastAPI middleware wraps in reverse order of addition, register them innermost-first in code (rate limiting, then auth, then logging, then CORS last) so that CORS executes first. This ensures OPTIONS preflight requests get CORS headers without triggering auth or rate limiting.

Should I use middleware or Dependencies for authentication?

Middleware is better when every endpoint needs authentication because it runs automatically without any per-endpoint configuration. Dependencies are better when only some endpoints need auth, or when different endpoints need different auth levels. A common pattern is using middleware for basic token validation and a dependency for fine-grained permission checks on specific endpoints.
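That combined pattern might look like this: the middleware has already validated the token and populated request.state, and a small dependency enforces a per-endpoint requirement (the endpoint path and tier ladder here are illustrative):

from fastapi import Depends, HTTPException, Request

def require_tier(minimum: str):
    order = {"free": 0, "pro": 1, "enterprise": 2}

    async def checker(request: Request) -> str:
        # Set by the auth middleware earlier in the stack
        tier = getattr(request.state, "user_tier", "free")
        if order.get(tier, 0) < order[minimum]:
            raise HTTPException(status_code=403, detail=f"Requires {minimum} tier")
        return tier

    return checker

@app.post("/agents/batch-run")  # hypothetical endpoint
async def batch_run(tier: str = Depends(require_tier("pro"))):
    return {"status": "accepted", "tier": tier}

Raising HTTPException is fine here: dependencies run inside the router, where FastAPI's exception handlers convert it into a clean 403.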

How do I handle rate limiting for streaming endpoints?

Count the initial request, not individual streamed chunks. A streaming response that sends 500 tokens is still one API request from a rate limiting perspective. However, you may want to track token usage separately for billing purposes. Use the logging middleware to record total tokens consumed per request and apply token-based quotas as a separate check from request-count rate limiting.
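A sketch of that separate token-based check, assuming your LLM client reports tokens consumed per response and reusing the redis_client from the rate limiter:

import time

# Hypothetical daily token budgets per tier
TOKEN_QUOTAS = {"free": 50_000, "pro": 500_000, "enterprise": 5_000_000}

async def record_token_usage(user_id: str, tier: str, tokens_used: int) -> bool:
    # One atomic counter per user per UTC day; the TTL handles the daily reset
    key = f"tokens:{user_id}:{time.strftime('%Y-%m-%d', time.gmtime())}"
    total = await redis_client.incrby(key, tokens_used)
    await redis_client.expire(key, 86400)
    # Usage is recorded first, then compared: True means still within quota
    return total <= TOKEN_QUOTAS.get(tier, TOKEN_QUOTAS["free"])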


#FastAPI #Middleware #Authentication #RateLimiting #AIAgents #AgenticAI #LearnAI #AIEngineering
