Learn Agentic AI

Horizontal Pod Autoscaling for AI Agents: Scaling Based on Custom Metrics

Configure Kubernetes Horizontal Pod Autoscaler for AI agent workloads using CPU, memory, and custom metrics. Learn KEDA integration and scale-to-zero patterns for cost optimization.

Why AI Agents Need Autoscaling

AI agent workloads are inherently bursty. A customer support agent might handle 10 requests per minute during quiet hours and 500 during a product launch. Running enough replicas for peak load wastes money during idle periods. Running too few causes timeouts and dropped requests. Horizontal Pod Autoscaling (HPA) dynamically adjusts replica count based on observed metrics.

Basic HPA with CPU Metrics

Before the configuration itself, here is where the HPA sits in a typical deployment pipeline for agent workloads:

flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions<br/>build plus test"]
    REG[("Container registry<br/>GHCR or ECR")]
    HELM["Helm chart<br/>values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment<br/>rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA<br/>CPU and queue depth"]
    POD[("Inference pods<br/>GPU node pool")]
    USERS(["Production traffic"])
    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS
    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff

The simplest HPA scales based on average CPU utilization across all Pods:
# ai-agent-hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 1
          periodSeconds: 120

The behavior section is critical for AI agents. Scale-up is aggressive — add up to four Pods per minute when load spikes. Scale-down is conservative — remove one Pod every two minutes with a five-minute stabilization window to avoid flapping during variable traffic.
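Under the hood, the HPA derives its recommendation from the ratio of observed to target utilization. A minimal sketch of that core formula in Python, including the default 10% tolerance (the behavior policies and min/max bounds are then applied on top of this raw number):

```python
import math

def hpa_desired_replicas(current_replicas: int,
                         current_value: float,
                         target_value: float,
                         tolerance: float = 0.1) -> int:
    """Sketch of the HPA core formula:
    desired = ceil(currentReplicas * currentValue / targetValue)."""
    ratio = current_value / target_value
    # Within the default 10% tolerance, the HPA leaves the count unchanged
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 replicas at 90% average CPU against a 60% target -> scale to 6
print(hpa_desired_replicas(4, 90, 60))   # 6
# 63% against a 60% target is within tolerance -> stay at 4
print(hpa_desired_replicas(4, 63, 60))   # 4
```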

Custom Metrics with Prometheus

CPU utilization is a poor proxy for AI agent load. A better metric is request queue depth or average response latency. Export custom metrics from your agent:

from prometheus_client import Histogram, Gauge, start_http_server

# Track active agent sessions
active_sessions = Gauge(
    "ai_agent_active_sessions",
    "Number of active agent sessions",
)

# Track response latency
response_latency = Histogram(
    "ai_agent_response_seconds",
    "Time to generate agent response",
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

def handle_request(request):
    active_sessions.inc()
    try:
        with response_latency.time():  # records an observation on exit
            return generate_response(request)  # your agent's inference call
    finally:
        active_sessions.dec()

# Expose /metrics on a separate port for Prometheus to scrape
start_http_server(9090)

Configure HPA to use the custom metric via the Prometheus adapter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ai-agent-hpa-custom
  namespace: ai-agents
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ai-agent
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: ai_agent_active_sessions
        target:
          type: AverageValue
          averageValue: "10"

This configuration maintains an average of 10 active sessions per Pod. When sessions increase, Kubernetes adds replicas. When sessions drop, it removes them.
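For the Pods metric above to exist, the Prometheus adapter needs a rule that exposes ai_agent_active_sessions through the custom metrics API. A sketch of one such rule for the adapter's config (the 2m averaging window is an assumption; tune it to your scrape interval):

```yaml
rules:
  - seriesQuery: 'ai_agent_active_sessions{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "ai_agent_active_sessions"
      as: "ai_agent_active_sessions"
    metricsQuery: 'avg_over_time(<<.Series>>{<<.LabelMatchers>>}[2m])'
```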

KEDA: Event-Driven Autoscaling

KEDA (Kubernetes Event-Driven Autoscaling) extends HPA with scalers for queues, databases, and external services. It also supports scale-to-zero, which standard HPA does not.

Install KEDA:

helm repo add kedacore https://kedacore.github.io/charts
helm install keda kedacore/keda --namespace keda --create-namespace

Create a ScaledObject that scales based on a Redis queue:

# ai-agent-keda.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: ai-agent-scaler
  namespace: ai-agents
spec:
  scaleTargetRef:
    name: ai-agent
  pollingInterval: 10
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 30
  triggers:
    - type: redis
      metadata:
        address: redis-host:6379
        listName: agent-task-queue
        listLength: "5"
        activationListLength: "1"

With minReplicaCount: 0, the Deployment scales to zero Pods when the queue is empty, and activates when at least one message appears. This saves significant cost for agents that handle periodic batch workloads.
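The trigger above translates queue length into replicas roughly as follows, sketched in Python for the listLength of 5 and activation threshold of 1 used in the ScaledObject:

```python
import math

def keda_desired_replicas(queue_length: int,
                          list_length: int = 5,
                          activation: int = 1,
                          min_replicas: int = 0,
                          max_replicas: int = 30) -> int:
    """Sketch: below the activation threshold the workload stays scaled to
    zero; otherwise scale to ceil(queueLength / listLength), clamped."""
    if queue_length < activation:
        return min_replicas
    return max(1, min(max_replicas, math.ceil(queue_length / list_length)))

print(keda_desired_replicas(0))     # 0 -> scaled to zero
print(keda_desired_replicas(12))    # 3
print(keda_desired_replicas(500))   # 30 (clamped to maxReplicaCount)
```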


Scale-to-Zero Pattern for AI Agents

Scale-to-zero works well for batch agents but requires careful handling of cold starts and graceful shutdown, so that in-flight work is not lost when KEDA scales the Deployment down:

import asyncio
import json
import signal

import redis.asyncio as aioredis  # redis-py's asyncio client

class GracefulAgent:
    def __init__(self, redis_url: str = "redis://redis-host:6379"):
        self.running = True
        self.redis = aioredis.from_url(redis_url)
        # Finish the in-flight task, then exit cleanly on SIGTERM
        signal.signal(signal.SIGTERM, self._shutdown)

    def _shutdown(self, signum, frame):
        self.running = False

    async def process_queue(self):
        """Process tasks until the shutdown signal arrives."""
        while self.running:
            task = await self.fetch_from_queue(timeout=5)
            if task:
                await self.handle_task(task)

    async def fetch_from_queue(self, timeout: int) -> dict | None:
        # BRPOP blocks for up to `timeout` seconds, returning None when idle
        item = await self.redis.brpop("agent-task-queue", timeout=timeout)
        return json.loads(item[1]) if item else None

    async def handle_task(self, task: dict):
        # Agent processing logic goes here
        ...

if __name__ == "__main__":
    asyncio.run(GracefulAgent().process_queue())

FAQ

What metrics should I use for autoscaling AI agents?

Avoid relying solely on CPU. The best metrics depend on your agent type. For synchronous request-response agents, use request latency (p95) or concurrent connections. For queue-based agents, use queue depth divided by processing rate. For WebSocket-based conversational agents, use active session count. Combine multiple metrics — Kubernetes scales to the highest recommendation from any single metric.
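That "highest recommendation wins" rule can be sketched directly: compute a desired replica count per metric, then take the maximum.

```python
import math

def desired_from_metric(current_replicas: int, value: float, target: float) -> int:
    # Per-metric HPA recommendation: ceil(current * observed / target)
    return math.ceil(current_replicas * value / target)

def hpa_multi_metric(current_replicas: int,
                     metrics: list[tuple[float, float]]) -> int:
    """Each metric is (observed, target); the HPA takes the max recommendation."""
    return max(desired_from_metric(current_replicas, v, t) for v, t in metrics)

# 5 replicas: CPU at 50% (target 60) recommends 5,
# sessions at 15 per Pod (target 10) recommend 8 -> the HPA scales to 8
print(hpa_multi_metric(5, [(50, 60), (15, 10)]))   # 8
```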

How do I prevent autoscaling from causing cost overruns?

Set hard maxReplicas limits, implement resource quotas at the namespace level, and configure PodDisruptionBudgets. Use cloud provider billing alerts as a safety net. With KEDA, the cooldownPeriod prevents premature scale-up oscillation that can multiply Pod count unnecessarily.

What is the cold start time for a scaled-to-zero AI agent?

Cold start includes container pull time, application startup, model loading, and the time to pass health checks. For a well-optimized AI agent image without local models, expect 5 to 15 seconds. Pre-pulled images on nodes reduce this to 2 to 5 seconds. If cold start latency is unacceptable, set minReplicaCount: 1 to keep one warm replica.
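A startupProbe gives slow model loading room to finish before liveness checks kick in, while the readinessProbe keeps traffic away until the agent can serve. A hedged excerpt of what this might look like on the ai-agent container (the image, paths, port, and timings are assumptions):

```yaml
containers:
  - name: ai-agent
    image: ghcr.io/example/ai-agent:latest
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 2
      failureThreshold: 30   # allow up to 60s for startup and model loading
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5
```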


#Kubernetes #Autoscaling #KEDA #AIAgents #CostOptimization #AgenticAI #LearnAI #AIEngineering

