Kubernetes Fundamentals for AI Engineers: Pods, Deployments, and Services
Master the core Kubernetes concepts every AI engineer needs — Pods, Deployments, and Services — with practical YAML manifests for deploying AI agent workloads to production clusters.
Why Kubernetes Matters for AI Agents
AI agents are not simple request-response APIs. They maintain conversation state, call external tools, spawn sub-agents, and consume unpredictable amounts of GPU and memory. Running them on a single server works for demos, but production demands orchestration — automatic restarts, scaling, rolling updates, and service discovery. Kubernetes provides all of this as a declarative platform.
This guide covers the three foundational Kubernetes resources you need to deploy any AI agent: Pods, Deployments, and Services.
Pods: The Smallest Deployable Unit
A Pod is one or more containers that share networking and storage. For an AI agent, a Pod typically contains the agent process itself and optionally a sidecar for logging or metrics.
The delivery pipeline these resources fit into, from git push to production traffic:

```mermaid
flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions<br/>build plus test"]
    REG[("Container registry<br/>GHCR or ECR")]
    HELM["Helm chart<br/>values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment<br/>rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA<br/>CPU and queue depth"]
    POD[("Inference pods<br/>GPU node pool")]
    USERS(["Production traffic"])
    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS
    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff
```
```yaml
# ai-agent-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: ai-agent
  labels:
    app: ai-agent
    tier: inference
spec:
  containers:
    - name: agent
      image: myregistry/ai-agent:1.0.0
      ports:
        - containerPort: 8000
      resources:
        requests:
          memory: "512Mi"
          cpu: "250m"
        limits:
          memory: "2Gi"
          cpu: "1000m"
      env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: ai-secrets
              key: openai-api-key
      readinessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 10
        periodSeconds: 5
```
The resources block is critical for AI workloads. Without memory limits, a single agent loading a large model can starve other Pods on the node. The readinessProbe prevents Kubernetes from routing traffic to an agent that is still loading its model weights.
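The manifest references a Secret named ai-secrets for the API key. A minimal sketch of how that Secret could be defined (the key name matches the secretKeyRef above; the value is a placeholder, and in practice you would create it with `kubectl create secret generic` or an external secrets manager rather than a committed file):

```yaml
# ai-secrets.yaml — illustrative sketch only
apiVersion: v1
kind: Secret
metadata:
  name: ai-secrets
  namespace: ai-agents
type: Opaque
stringData:
  openai-api-key: "sk-..."   # placeholder; never commit real keys
```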
Deployments: Declarative Replica Management
You should never create Pods directly in production. A Deployment manages a set of identical Pod replicas through an underlying ReplicaSet and handles rolling updates.
```yaml
# ai-agent-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ai-agent
  namespace: ai-agents
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ai-agent
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: ai-agent
        version: v1.0.0
    spec:
      containers:
        - name: agent
          image: myregistry/ai-agent:1.0.0
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "512Mi"
              cpu: "250m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
```
Setting maxUnavailable: 0 ensures zero downtime during updates. Kubernetes creates new Pods first, waits for them to pass readiness checks, then terminates old Pods one at a time.
Services: Stable Network Endpoints
Pods get ephemeral IP addresses that change on restart. A Service provides a stable DNS name and load balances across healthy Pod replicas.
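To make the DNS convention concrete, here is a small hedged helper that composes the in-cluster URL for a Service; the function name is our own illustration, not a Kubernetes API:

```python
def cluster_url(service: str, namespace: str, port: int = 80, path: str = "") -> str:
    """Build the in-cluster URL for a Service, following the
    <service>.<namespace>.svc.cluster.local DNS convention."""
    return f"http://{service}.{namespace}.svc.cluster.local:{port}{path}"

# Example: reaching the agent Service from another Pod in the cluster
print(cluster_url("ai-agent-svc", "ai-agents", path="/agent/invoke"))
```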
```yaml
# ai-agent-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: ai-agent-svc
  namespace: ai-agents
spec:
  type: ClusterIP
  selector:
    app: ai-agent
  ports:
    - port: 80
      targetPort: 8000
      protocol: TCP
```
Other services in the cluster reach your agent at ai-agent-svc.ai-agents.svc.cluster.local. For external access, use a LoadBalancer type or an Ingress resource with TLS termination.
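For the external-access path, a minimal Ingress sketch pointing at the Service above; the hostname, TLS secret name, and ingress class are assumptions for illustration:

```yaml
# ai-agent-ingress.yaml — illustrative sketch
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ai-agent-ingress
  namespace: ai-agents
spec:
  ingressClassName: nginx            # assumes an NGINX ingress controller
  tls:
    - hosts:
        - agent.example.com
      secretName: agent-tls          # TLS certificate stored as a Secret
  rules:
    - host: agent.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ai-agent-svc
                port:
                  number: 80
```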
Connecting From Python
Your Python application code does not change when moving to Kubernetes. The agent process simply listens on the configured port.
```python
from fastapi import FastAPI
import os

app = FastAPI()


@app.get("/health")
async def health():
    # Answers the Kubernetes readinessProbe
    return {"status": "healthy"}


@app.post("/agent/invoke")
async def invoke_agent(request: dict):
    # Agent logic here — Kubernetes handles scaling
    api_key = os.environ["OPENAI_API_KEY"]
    return {"response": "Agent processed your request"}
```
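One small convention worth adopting: read the listen port from the environment so the containerPort in the manifest and the application stay in sync. A hedged sketch; the PORT variable name is our choice, not something Kubernetes injects:

```python
import os


def listen_port(default: int = 8000) -> int:
    """Return the port the agent should bind to.

    Kubernetes does not set this automatically; the Deployment can
    supply it through an `env` entry instead of hard-coding it.
    """
    return int(os.environ.get("PORT", default))


# e.g. uvicorn.run(app, host="0.0.0.0", port=listen_port())
```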
FAQ
When should I use a StatefulSet instead of a Deployment for AI agents?
Use a StatefulSet when your agent requires stable network identifiers or persistent storage that must survive Pod rescheduling. For example, agents that maintain a local vector store or checkpoint their conversation history to disk benefit from StatefulSets. Stateless agents that fetch all context from external databases should use Deployments.
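A minimal sketch of the StatefulSet shape described above, with a per-replica volume for local state; the names, mount path, and storage size are illustrative assumptions:

```yaml
# ai-agent-statefulset.yaml — illustrative sketch
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ai-agent
  namespace: ai-agents
spec:
  serviceName: ai-agent-headless     # headless Service gives each Pod stable DNS
  replicas: 2
  selector:
    matchLabels:
      app: ai-agent
  template:
    metadata:
      labels:
        app: ai-agent
    spec:
      containers:
        - name: agent
          image: myregistry/ai-agent:1.0.0
          volumeMounts:
            - name: agent-state
              mountPath: /var/lib/agent   # local vector store / checkpoints
  volumeClaimTemplates:
    - metadata:
        name: agent-state
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
```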
How many replicas should I run for an AI agent Deployment?
Start with at least two replicas for high availability. Monitor CPU and memory utilization under realistic load, then scale. A common pattern is to set replicas to three for production and combine this with a Horizontal Pod Autoscaler that adjusts between two and ten replicas based on request latency or queue depth.
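The two-to-ten replica pattern above can be sketched as a CPU-based HorizontalPodAutoscaler; scaling on queue depth or request latency instead would require a custom or external metrics adapter, and the 70% target here is an illustrative assumption:

```yaml
# ai-agent-hpa.yaml — illustrative sketch
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```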
Can I run GPU workloads in Kubernetes Pods?
Yes. Install the NVIDIA device plugin for Kubernetes, then request GPUs in your resource spec with nvidia.com/gpu: 1 under limits. Kubernetes schedules the Pod only on nodes that have available GPUs. This is essential for agents that run local inference with models like Llama or Mistral.
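As a fragment of the container spec, the GPU request looks like this; it assumes the NVIDIA device plugin is already installed on the cluster:

```yaml
# Fragment of a container spec requesting one GPU
resources:
  limits:
    nvidia.com/gpu: 1   # GPUs are specified under limits, not requests
```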
#Kubernetes #AIDeployment #DevOps #Containers #Infrastructure #AgenticAI #LearnAI #AIEngineering