Kubernetes for AI Agents: Scaling Agent Workloads with K8s

Why Kubernetes for AI Agent Workloads

A single FastAPI container running your AI agent handles one user well. But production workloads demand horizontal scaling, automatic recovery from crashes, rolling updates without downtime, and resource isolation. Kubernetes provides all of this through declarative configuration — you describe the desired state, and K8s continuously reconciles reality to match.

AI agents present unique scaling challenges: requests are long-running (seconds to minutes per LLM call), memory usage spikes with large context windows, and traffic patterns are bursty. Kubernetes gives you the primitives to handle all of these.

The Deployment Manifest

A Deployment defines how many replicas of your agent pod to run and how to update them:

flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions<br/>build plus test"]
    REG[("Container registry<br/>GHCR or ECR")]
    HELM["Helm chart<br/>values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment<br/>rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA<br/>CPU and queue depth"]
    POD[("Inference pods<br/>GPU node pool")]
    USERS(["Production traffic"])
    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS
    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff

# k8s/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: agent-service
  namespace: ai-agents
  labels:
    app: agent-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: agent-service
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: agent-service
    spec:
      containers:
        - name: agent
          image: registry.example.com/agent-service:1.0.0
          ports:
            - containerPort: 8000
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: agent-secrets
                  key: openai-api-key
            - name: AGENT_MODEL
              value: "gpt-4o"
          resources:
            requests:
              cpu: "250m"
              memory: "512Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 15
            timeoutSeconds: 5
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 5
            periodSeconds: 10

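Key decisions here: resource requests reserve a minimum allocation, and the scheduler uses them to decide which node has room for each pod. Limits cap what a single agent can consume during a large context window request: a container that exceeds its memory limit is OOM-killed and restarted, while one that reaches its CPU limit is throttled.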
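To make the effect of requests on scheduling concrete, here is a rough sketch that estimates how many agent pods fit on one node when packing by requests alone. The node size (4 CPU, 16Gi allocatable) is an assumed example, the function names are hypothetical, and the parsers handle only the quantity suffixes used in this manifest.

```python
# Binary memory suffixes; only a subset of what Kubernetes accepts.
_MEMORY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as "250m" or "4" into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as "512Mi" into bytes."""
    for suffix, factor in _MEMORY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain bytes


def pods_per_node(node_cpu: str, node_memory: str, pod_cpu: str, pod_memory: str) -> int:
    """Pods that fit on a node, limited by whichever resource runs out first."""
    cpu_fit = parse_cpu_millis(node_cpu) // parse_cpu_millis(pod_cpu)
    memory_fit = parse_memory(node_memory) // parse_memory(pod_memory)
    return min(cpu_fit, memory_fit)


# Assumed node: 4 CPU and 16Gi allocatable.
print(pods_per_node("4", "16Gi", "250m", "512Mi"))  # packed by requests: 16
print(pods_per_node("4", "16Gi", "1000m", "1Gi"))   # if every pod ran at its limits: 4
```

The gap between the two numbers is the overcommit you accept by setting limits above requests: 16 pods can be scheduled, but only 4 can run at their CPU limit at the same time.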

Service for Internal Traffic

A Service gives your agent pods a stable DNS name and load balances traffic:

# k8s/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: agent-service
  namespace: ai-agents
spec:
  selector:
    app: agent-service
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
  type: ClusterIP

Other services in the cluster reach the agent at http://agent-service.ai-agents.svc.cluster.local.
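As an illustration, another in-cluster service could build that DNS name and call the agent like this. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the /health path is the one the probes in the Deployment use.

```python
import json
import urllib.request


def service_url(name: str, namespace: str, port: int = 80, path: str = "") -> str:
    """Build the cluster-internal URL for a Service."""
    host = f"{name}.{namespace}.svc.cluster.local"
    netloc = host if port == 80 else f"{host}:{port}"
    return f"http://{netloc}{path}"


def check_agent_health(timeout: float = 5.0) -> dict:
    """Call the agent's health endpoint from inside the cluster.

    urlopen raises urllib.error.HTTPError for a 503 response, so callers
    should treat that exception as "agent not healthy".
    """
    url = service_url("agent-service", "ai-agents", path="/health")
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

The URL only resolves from pods inside the cluster, so run check_agent_health from another workload rather than from your laptop.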

Secrets Management

Store your API keys in Kubernetes Secrets, not in Deployment manifests:

kubectl create secret generic agent-secrets \
  --namespace ai-agents \
  --from-literal=openai-api-key=sk-proj-your-key-here

Reference them in your Deployment with valueFrom.secretKeyRef as shown above.
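On the application side, the secret arrives as an ordinary environment variable. A small sketch of reading it at startup and failing fast when it is missing might look like the following; load_settings and MissingSecretError are hypothetical names, and the gpt-4o fallback simply mirrors the AGENT_MODEL value in the manifest.

```python
import os
from typing import Mapping


class MissingSecretError(RuntimeError):
    """Raised when a required secret was not injected into the pod."""


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read agent configuration from the environment set by the Deployment."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # Crashing at startup surfaces the problem in the pod status
        # instead of on the first user request.
        raise MissingSecretError(
            "OPENAI_API_KEY is not set; check the agent-secrets Secret"
        )
    return {
        "openai_api_key": api_key,
        "agent_model": env.get("AGENT_MODEL", "gpt-4o"),
    }
```

Failing at startup means a misconfigured rollout never becomes ready, so the rolling update stops before it replaces your healthy pods.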

Horizontal Pod Autoscaler

Scale pods automatically based on CPU utilization or custom metrics:

# k8s/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: agent-service-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: agent-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300

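The scaleDown.stabilizationWindowSeconds: 300 setting prevents thrashing. Agent traffic is bursty, and you do not want Kubernetes removing pods only to recreate them a minute later. Keep in mind that LLM-backed agents spend most of their time waiting on network I/O, so CPU utilization can lag real load; a custom metric such as in-flight requests per pod is often a more direct signal.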
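To see how these numbers interact, here is a simplified sketch of the replica calculation the HPA documentation describes: desired = ceil(current * currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default), then clamped to minReplicas and maxReplicas. The function names are hypothetical, and the sketch ignores details the real controller handles, such as not-ready pods, missing metrics, and stabilization windows.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,
    target_utilization: float = 60,
    min_replicas: int = 2,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Apply the basic HPA formula with the values from k8s/hpa.yaml."""
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, desired))


def apply_scale_up_policy(current_replicas: int, desired: int, max_added_pods: int = 2) -> int:
    """Cap a scale-up at the Pods policy: at most 2 new pods per 60-second period."""
    if desired > current_replicas:
        return min(desired, current_replicas + max_added_pods)
    return desired


# 4 pods averaging 120% of their CPU request against a 60% target.
wanted = desired_replicas(4, 120)
print(wanted)                            # 8
print(apply_scale_up_policy(4, wanted))  # 6 in the first period
```

With the Pods policy in place, a large burst takes several 60-second periods to reach the desired count, which is the trade-off for avoiding sudden jumps in replica count.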

Health Check Endpoint Design

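Your /health endpoint should verify the dependencies the agent cannot serve traffic without. Because the Deployment above points both the liveness and readiness probes at this path, a failing dependency check takes the pod out of rotation and, if it keeps failing, restarts the container: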

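import redis.asyncio as redis
from fastapi import FastAPI
from fastapi.responses import JSONResponse

app = FastAPI()
# Assumed connection details; point this at your own Redis instance.
redis_client = redis.Redis(host="redis", port=6379)


@app.get("/health")
async def health():
    # Collect one status entry per dependency.
    checks = {}
    try:
        await redis_client.ping()
        checks["redis"] = "ok"
    except Exception:
        checks["redis"] = "down"

    # Any failed check marks the service degraded, which the probes see as a 503.
    overall = "ok" if all(v == "ok" for v in checks.values()) else "degraded"
    status_code = 200 if overall == "ok" else 503
    return JSONResponse(
        content={"status": overall, "checks": checks},
        status_code=status_code,
    )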
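One thing to watch: the readiness probe in the Deployment does not set timeoutSeconds, so the Kubernetes default of 1 second applies, and a dependency that hangs will fail the probe no matter what the handler would eventually return. A possible way to keep the endpoint fast is to bound each check with its own timeout. The sketch below uses only asyncio; run_checks and the 0.5-second budget are assumptions, not part of the original endpoint.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# A check is any zero-argument callable that returns an awaitable.
Check = Callable[[], Awaitable[object]]


async def run_checks(checks: Dict[str, Check], timeout: float = 0.5) -> Dict[str, str]:
    """Run all dependency checks concurrently, each bounded by `timeout` seconds."""

    async def run_one(check: Check) -> str:
        try:
            await asyncio.wait_for(check(), timeout)
            return "ok"
        except asyncio.TimeoutError:
            return "timeout"
        except Exception:
            return "down"

    names = list(checks)
    results = await asyncio.gather(*(run_one(checks[name]) for name in names))
    return dict(zip(names, results))
```

Inside the handler, this would replace the try/except block with something like checks = await run_checks({"redis": redis_client.ping}), and the rest of the endpoint stays the same.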

Applying the Manifests

kubectl create namespace ai-agents

kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/hpa.yaml

# Watch the rollout
kubectl rollout status deployment/agent-service -n ai-agents

# Check pods
kubectl get pods -n ai-agents -l app=agent-service

FAQ

How should I set resource limits for AI agent pods?

Start by profiling your agent under realistic load. Most Python-based agents with FastAPI use 200-500 MB of RAM at baseline. Set memory requests at your p50 usage and limits at your p99. For CPU, LLM-backed agents are I/O-bound, so a 250m-500m CPU request is typically sufficient. Monitor with kubectl top pods (which requires the Metrics Server to be installed in the cluster) and adjust based on actual usage patterns.
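As a rough sketch of the p50/p99 rule, the helper below turns a list of observed memory readings into a request and limit. The function names are hypothetical, the readings are made-up numbers in MiB, and nearest-rank is just one of several percentile definitions you could use.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[int], pct: int) -> int:
    """Nearest-rank percentile: the smallest sample covering pct% of the data."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_memory(samples_mib: Sequence[int]) -> Dict[str, str]:
    """Memory request at p50 and limit at p99, as Kubernetes quantities."""
    return {
        "request": f"{percentile(samples_mib, 50)}Mi",
        "limit": f"{percentile(samples_mib, 99)}Mi",
    }


# Made-up readings: 100 samples spread evenly from 301 to 400 MiB.
readings = list(range(301, 401))
print(suggest_memory(readings))  # {'request': '350Mi', 'limit': '399Mi'}
```

In practice you would likely round the limit up and leave some headroom, since a pod that crosses its memory limit is killed rather than slowed down.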

What happens to in-flight agent requests during a rolling update?

Kubernetes sends a SIGTERM to the old pod and waits for terminationGracePeriodSeconds (default 30 seconds) before forcefully killing it. Handle SIGTERM in your FastAPI app by completing in-flight requests and rejecting new ones. Set the grace period longer than your maximum expected agent response time to prevent dropped requests.
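The drain behavior described here can be sketched without any framework. The class below counts in-flight requests, rejects new ones once draining starts, and lets shutdown code wait until the count reaches zero. InFlightTracker and ShuttingDown are hypothetical names, and the signal hookup works only on Unix event loops. Servers such as Uvicorn install their own SIGTERM handling, so in a real FastAPI app this logic would more likely live in a shutdown hook than in a raw signal handler.

```python
import asyncio
import signal
from typing import Awaitable, Callable


class ShuttingDown(RuntimeError):
    """Raised for requests that arrive after draining has started."""


class InFlightTracker:
    def __init__(self) -> None:
        self._count = 0
        self._draining = False
        self._idle = asyncio.Event()
        self._idle.set()  # nothing in flight yet

    def start_draining(self) -> None:
        """Stop accepting new work; called when SIGTERM arrives."""
        self._draining = True

    async def run(self, handler: Callable[[], Awaitable[object]]) -> object:
        """Run one request, or reject it if the pod is shutting down."""
        if self._draining:
            raise ShuttingDown("pod is terminating")
        self._count += 1
        self._idle.clear()
        try:
            return await handler()
        finally:
            self._count -= 1
            if self._count == 0:
                self._idle.set()

    async def wait_idle(self, timeout: float) -> bool:
        """Wait for in-flight requests to finish; False if the timeout expires."""
        try:
            await asyncio.wait_for(self._idle.wait(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


def install_sigterm_handler(tracker: InFlightTracker) -> None:
    """Start draining on SIGTERM (Unix only; must run inside the event loop)."""
    loop = asyncio.get_running_loop()
    loop.add_signal_handler(signal.SIGTERM, tracker.start_draining)
```

The timeout you pass to wait_idle should stay below terminationGracePeriodSeconds, otherwise Kubernetes kills the pod before the drain finishes.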

Should I use one pod per agent type or multiplex agents in a single pod?

For most teams, a single service that handles all agent types is simpler to operate. Route to different agent behaviors via a request parameter. Only split into separate Deployments when agent types have significantly different resource profiles — for example, a coding agent that needs 4 GB of RAM versus a simple Q&A agent that needs 512 MB.
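Routing by request parameter can be as small as a dictionary lookup. The sketch below is illustrative only: the agent names, handler functions, and route helper are all hypothetical stand-ins for whatever your agents actually do.

```python
from typing import Callable, Dict

AgentHandler = Callable[[str], str]


def qa_agent(prompt: str) -> str:
    """Placeholder for a simple Q&A agent."""
    return f"[qa] {prompt}"


def coding_agent(prompt: str) -> str:
    """Placeholder for a heavier coding agent."""
    return f"[coding] {prompt}"


# One service, many agent behaviors: the request names which one it wants.
AGENTS: Dict[str, AgentHandler] = {
    "qa": qa_agent,
    "coding": coding_agent,
}


def route(agent_type: str, prompt: str) -> str:
    """Dispatch a request to the agent named by the `agent_type` parameter."""
    try:
        handler = AGENTS[agent_type]
    except KeyError:
        raise ValueError(f"unknown agent type: {agent_type!r}") from None
    return handler(prompt)


print(route("qa", "What is a Deployment?"))  # [qa] What is a Deployment?
```

If one entry in the registry later needs far more memory than the others, that handler is the natural candidate to move into its own Deployment.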

