Kubernetes Fundamentals for AI Engineers: Pods, Deployments, and Services
Master the core Kubernetes concepts every AI engineer needs — Pods, Deployments, and Services — with practical YAML manifests for deploying AI agent workloads to production clusters.
Why Kubernetes Matters for AI Agents
AI agents are not simple request-response APIs. They maintain conversation state, call external tools, spawn sub-agents, and consume unpredictable amounts of GPU and memory. Running them on a single server works for demos, but production demands orchestration — automatic restarts, scaling, rolling updates, and service discovery. Kubernetes provides all of this as a declarative platform.
This guide covers the three foundational Kubernetes resources you need to deploy any AI agent: Pods, Deployments, and Services.
Pods: The Smallest Deployable Unit
A Pod is one or more containers that share networking and storage. For an AI agent, a Pod typically contains the agent process itself and optionally a sidecar for logging or metrics.
The delivery pipeline these resources fit into, from git push to production traffic:

```mermaid
flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions<br/>build plus test"]
    REG[("Container registry<br/>GHCR or ECR")]
    HELM["Helm chart<br/>values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment<br/>rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA<br/>CPU and queue depth"]
    POD[("Inference pods<br/>GPU node pool")]
    USERS(["Production traffic"])
    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS
    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff
```
```yaml
# ai-agent-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-agent
  labels:
    app: ai-agent
    tier: inference
spec:
  containers:
    - name: agent
      image: myregistry/ai-agent:1.0.0
      ports:
        - containerPort: 8000
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "2Gi"
          cpu: "1000m"
      env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: ai-secrets
              key: openai-api-key
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 10
        periodSeconds: 5
```
The resources block is critical for AI workloads. Without memory limits, a single agent loading a large model can starve other Pods on the node. The readinessProbe prevents Kubernetes from routing traffic to an agent that is still loading its model weights.
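The manifest references a Secret named ai-secrets for the API key. A minimal sketch of how that Secret could be defined (the key name matches the secretKeyRef above; the value is a placeholder, and in practice you would create it with `kubectl create secret generic` or an external secrets manager rather than a committed file):

```yaml
# ai-secrets.yaml — illustrative sketch only
apiVersion: v1
kind: Secret
metadata:
  name: ai-secrets
  namespace: ai-agents
type: Opaque
stringData:
  openai-api-key: "sk-..."   # placeholder; never commit real keys
```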
Deployments: Declarative Replica Management
You should never create Pods directly in production. A Deployment manages a set of identical Pod replicas through an underlying ReplicaSet and handles rolling updates.
```yaml
# ai-agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
  namespace: ai-agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-agent
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ai-agent
        version: v1.0.0
    spec:
      containers:
        - name: agent
          image: myregistry/ai-agent:1.0.0
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
```
Setting maxUnavailable: 0 ensures zero downtime during updates. Kubernetes creates new Pods first, waits for them to pass readiness checks, then terminates old Pods one at a time.
Services: Stable Network Endpoints
Pods get ephemeral IP addresses that change on restart. A Service provides a stable DNS name and load balances across healthy Pod replicas.
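To make the DNS convention concrete, here is a small hedged helper that composes the in-cluster URL for a Service; the function name is our own illustration, not a Kubernetes API:

```python
def cluster_url(service: str, namespace: str, port: int = 80, path: str = "") -> str:
    """Build the in-cluster URL for a Service, following the
    <service>.<namespace>.svc.cluster.local DNS convention."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}{path}"

# Example: reaching the agent Service from another Pod in the cluster
print(cluster_url("ai-agent-svc", "ai-agents", path="/agent/invoke"))
```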
```yaml
# ai-agent-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ai-agent-svc
  namespace: ai-agents
spec:
  type: ClusterIP
  selector:
    app: ai-agent
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
```
Other services in the cluster reach your agent at ai-agent-svc.ai-agents.svc.cluster.local. For external access, use a LoadBalancer type or an Ingress resource with TLS termination.
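For the external-access path, a minimal Ingress sketch pointing at the Service above; the hostname, TLS secret name, and ingress class are assumptions for illustration:

```yaml
# ai-agent-ingress.yaml — illustrative sketch
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-agent-ingress
  namespace: ai-agents
spec:
  ingressClassName: nginx            # assumes an NGINX ingress controller
  tls:
    - hosts:
        - agent.example.com
      secretName: agent-tls          # TLS certificate stored as a Secret
  rules:
    - host: agent.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ai-agent-svc
                port:
                  number: 80
```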
Connecting From Python
Your Python application code does not change when moving to Kubernetes. The agent process simply listens on the configured port.
```python
from fastapi import FastAPI
import os

app = FastAPI()


@app.get("/health")
async def health():
    # Answers the Kubernetes readinessProbe
    return {"status": "healthy"}


@app.post("/agent/invoke")
async def invoke_agent(request: dict):
    # Agent logic here — Kubernetes handles scaling
    api_key = os.environ["OPENAI_API_KEY"]
    return {"response": "Agent processed your request"}
```
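One small convention worth adopting: read the listen port from the environment so the containerPort in the manifest and the application stay in sync. A hedged sketch; the PORT variable name is our choice, not something Kubernetes injects:

```python
import os


def listen_port(default: int = 8000) -> int:
    """Return the port the agent should bind to.

    Kubernetes does not set this automatically; the Deployment can
    supply it through an `env` entry instead of hard-coding it.
    """
    return int(os.environ.get("PORT", default))


# e.g. uvicorn.run(app, host="0.0.0.0", port=listen_port())
```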
FAQ
When should I use a StatefulSet instead of a Deployment for AI agents?
Use a StatefulSet when your agent requires stable network identifiers or persistent storage that must survive Pod rescheduling. For example, agents that maintain a local vector store or checkpoint their conversation history to disk benefit from StatefulSets. Stateless agents that fetch all context from external databases should use Deployments.
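A minimal sketch of the StatefulSet shape described above, with a per-replica volume for local state; the names, mount path, and storage size are illustrative assumptions:

```yaml
# ai-agent-statefulset.yaml — illustrative sketch
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ai-agent
  namespace: ai-agents
spec:
  serviceName: ai-agent-headless     # headless Service gives each Pod stable DNS
  replicas: 2
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent
    spec:
      containers:
        - name: agent
          image: myregistry/ai-agent:1.0.0
          volumeMounts:
            - name: agent-state
              mountPath: /var/lib/agent   # local vector store / checkpoints
  volumeClaimTemplates:
    - metadata:
        name: agent-state
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```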
How many replicas should I run for an AI agent Deployment?
Start with at least two replicas for high availability. Monitor CPU and memory utilization under realistic load, then scale. A common pattern is to set replicas to three for production and combine this with a Horizontal Pod Autoscaler that adjusts between two and ten replicas based on request latency or queue depth.
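The two-to-ten replica pattern above can be sketched as a CPU-based HorizontalPodAutoscaler; scaling on queue depth or request latency instead would require a custom or external metrics adapter, and the 70% target here is an illustrative assumption:

```yaml
# ai-agent-hpa.yaml — illustrative sketch
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```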
Can I run GPU workloads in Kubernetes Pods?
Yes. Install the NVIDIA device plugin for Kubernetes, then request GPUs in your resource spec with nvidia.com/gpu: 1 under limits. Kubernetes schedules the Pod only on nodes that have available GPUs. This is essential for agents that run local inference with models like Llama or Mistral.
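As a fragment of the container spec, the GPU request looks like this; it assumes the NVIDIA device plugin is already installed on the cluster:

```yaml
# Fragment of a container spec requesting one GPU
resources:
  limits:
    nvidia.com/gpu: 1   # GPUs are specified under limits, not requests
```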
#Kubernetes #AIDeployment #DevOps #Containers #Infrastructure #AgenticAI #LearnAI #AIEngineering