AI Agent Deployment on Kubernetes: Scaling Patterns for Production
A practical guide to deploying and scaling AI agents on Kubernetes — from GPU scheduling and model serving to autoscaling strategies and cost-effective resource management.
Why Kubernetes for AI Agents
Kubernetes has become the default platform for deploying AI agents in production. Its container orchestration, auto-scaling, service discovery, and declarative configuration model align well with the requirements of multi-agent systems. But deploying AI workloads on Kubernetes requires patterns that differ from traditional web application deployments.
AI agents have unique resource requirements: GPU access for local model inference, high memory for context windows, variable latency requirements, and bursty compute patterns. This guide covers the patterns that work.
Deployment Architecture
Separating Agent Logic from Model Serving
The most maintainable architecture separates agent orchestration logic from model inference:
```mermaid
flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions<br/>build plus test"]
    REG[("Container registry<br/>GHCR or ECR")]
    HELM["Helm chart<br/>values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment<br/>rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA<br/>CPU and queue depth"]
    POD[("Inference pods<br/>GPU node pool")]
    USERS(["Production traffic"])

    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS

    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff
```
```yaml
# Agent deployment - CPU-only, handles orchestration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: customer-support-agent
spec:
  replicas: 3
  selector:
    matchLabels:
      app: customer-support-agent
  template:
    metadata:
      labels:
        app: customer-support-agent
    spec:
      containers:
      - name: agent
        image: myregistry/support-agent:v2.1
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
        env:
        - name: LLM_ENDPOINT
          value: "http://model-server:8000/v1"
        - name: REDIS_URL
          value: "redis://agent-cache:6379"
```
```yaml
# Model server deployment - GPU-enabled
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest  # pin a specific tag in production
        args: ["--model", "mistralai/Mistral-7B-Instruct-v0.3"]
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
      nodeSelector:
        gpu-type: a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
```
This separation lets you scale agent logic independently from model inference, upgrade models without redeploying agents, and share model servers across multiple agent types.
GPU Scheduling Strategies
GPU resources are expensive. Maximize utilization with these approaches:
- Time-sharing with MPS (Multi-Process Service): Run multiple inference workloads on the same GPU. Works well when individual requests do not saturate GPU compute
- Fractional GPUs: Use tools like nvidia-device-plugin with time-slicing or MIG (Multi-Instance GPU) on A100s to partition a single GPU into multiple smaller allocations
- Spot/Preemptible nodes: Run non-latency-critical workloads (batch processing, evaluation, fine-tuning) on spot instances for 60-70% cost savings
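As an illustration of the time-slicing approach, the NVIDIA device plugin reads a sharing config from a ConfigMap. A minimal sketch, assuming the plugin is deployed in kube-system and configured to load this ConfigMap (the name, namespace, and replica count are all examples — they depend on how the plugin was installed):

```yaml
# Sketch: time-slicing config for the NVIDIA device plugin.
# With replicas: 4, each physical GPU is advertised to Kubernetes
# as 4 schedulable nvidia.com/gpu resources.
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config   # name depends on your plugin install
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4
```

Note that time-slicing, unlike MIG, provides no memory or fault isolation between the workloads sharing a GPU — a misbehaving pod can exhaust VRAM for its neighbors.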
Auto-Scaling Patterns
Horizontal Pod Autoscaler (HPA)
Standard CPU/memory-based HPA does not work well for AI workloads because inference is GPU-bound, not CPU-bound. Use custom metrics instead:
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: model-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: model-server
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: inference_queue_depth
      target:
        type: AverageValue
        averageValue: "5"   # Scale up when queue > 5 per pod
  - type: Pods
    pods:
      metric:
        name: gpu_utilization
      target:
        type: AverageValue
        averageValue: "80"  # Scale up when GPU > 80% utilized
```
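Pods-type custom metrics are not available out of the box; they require a custom metrics adapter such as prometheus-adapter. A hedged sketch of an adapter rule exposing inference_queue_depth, assuming the model server already exports it as a Prometheus gauge labeled with pod and namespace:

```yaml
# Sketch: prometheus-adapter rule (Helm values.yaml fragment) exposing
# the inference_queue_depth gauge through the custom metrics API.
rules:
  custom:
  - seriesQuery: 'inference_queue_depth{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "inference_queue_depth"
    metricsQuery: 'avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])'
```

The averaging window (2m here) trades responsiveness against flapping; tune it alongside the HPA's stabilization settings.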
KEDA (Kubernetes Event-Driven Autoscaling)
KEDA is particularly useful for event-driven agent architectures. Scale agent pods based on message queue depth:
```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: agent-scaler
spec:
  scaleTargetRef:
    name: customer-support-agent
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: redis-streams
    metadata:
      address: agent-cache:6379
      stream: agent-tasks
      consumerGroup: support-agents
      lagCount: "10"  # Scale when 10+ messages pending
```
Networking and Service Mesh
gRPC for Model Serving
Use gRPC instead of REST for internal model serving. Its binary protocol (Protocol Buffers), HTTP/2 multiplexing, and native streaming cut serialization and connection overhead — often reducing latency by 30-40% compared to JSON-over-REST for inference workloads.
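One Kubernetes-level caveat: gRPC rides on long-lived HTTP/2 connections, so a standard ClusterIP Service balances connections, not requests — one busy client can pin all its traffic to a single pod. A headless Service (sketch below; names assume the model-server Deployment from earlier) lets gRPC clients resolve individual pod IPs and balance per-request client-side, or an L7-aware mesh can do the balancing for you:

```yaml
# Sketch: headless Service so gRPC clients can balance across pods.
apiVersion: v1
kind: Service
metadata:
  name: model-server
spec:
  clusterIP: None          # headless: DNS returns all pod IPs
  selector:
    app: model-server
  ports:
  - name: grpc
    port: 8000
    targetPort: 8000
    appProtocol: grpc      # hints L7 proxies/meshes to treat traffic as gRPC
```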
Health Checks
AI model servers need custom health checks that go beyond TCP port checks:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 120  # Models take time to load
  periodSeconds: 30
readinessProbe:
  httpGet:
    path: /health/ready     # Model loaded and warm
    port: 8000
  initialDelaySeconds: 180
  periodSeconds: 10
```
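Large initialDelaySeconds values waste time whenever the model loads faster than the worst case. On Kubernetes 1.18+, a startupProbe expresses the same "slow start" budget while letting the other probes take over the moment the model is up (a sketch, reusing the /health endpoint above):

```yaml
# Sketch: startupProbe tolerates up to 30 x 10s = 300s of model loading;
# liveness/readiness probes only start once it succeeds, so their
# initialDelaySeconds can drop to near zero.
startupProbe:
  httpGet:
    path: /health
    port: 8000
  failureThreshold: 30
  periodSeconds: 10
```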
Cost Optimization
- Right-size GPU instances: Profile your model's actual VRAM and compute requirements. Many teams over-provision by 50% or more
- Use node pools: Separate GPU and CPU node pools to avoid paying GPU prices for CPU-only workloads
- Implement scale-to-zero: For low-traffic agent types, use KEDA to scale to zero pods when idle
- Cache aggressively: Redis or Memcached for embedding caches, prompt caches, and response caches
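The scale-to-zero item maps to two fields on a KEDA ScaledObject like the one shown earlier; a sketch (the cooldown value is illustrative — 300s is KEDA's default):

```yaml
# Sketch: scale-to-zero for an idle agent type via KEDA.
spec:
  minReplicaCount: 0     # remove all pods when no triggers are active
  cooldownPeriod: 300    # seconds to wait after the last event before scaling to zero
```

Budget for the cold-start penalty: the first message after an idle period waits for pod scheduling plus model or cache warmup.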
Observability Stack
Deploy alongside your agents:
- Prometheus + Grafana: GPU utilization, inference latency, queue depth, token throughput
- OpenTelemetry Collector: Distributed tracing across multi-agent pipelines
- Loki or Elasticsearch: Structured logging for conversation debugging
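If the cluster runs the Prometheus Operator, scraping the model server is one ServiceMonitor away. A sketch, assuming a Service port named http exposing the server's metrics endpoint (vLLM, for instance, serves Prometheus metrics at /metrics on its API port):

```yaml
# Sketch: ServiceMonitor scraping model-server metrics (Prometheus Operator CRD).
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: model-server
spec:
  selector:
    matchLabels:
      app: model-server
  endpoints:
  - port: http        # assumes a Service port with this name; adjust to your setup
    path: /metrics
    interval: 15s
```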
The key to successful Kubernetes deployment of AI agents is treating model serving as infrastructure (stable, shared, GPU-optimized) and agent logic as application code (frequently deployed, independently scaled, CPU-based).