
Kubernetes Operators for AI Agents: Custom Controllers for Agent Lifecycle Management

Build a Kubernetes Operator for AI agent lifecycle management using Custom Resource Definitions, reconciliation loops, and status management to automate agent provisioning and scaling.

What Is a Kubernetes Operator

A Kubernetes Operator extends the Kubernetes API with custom resources and controllers that encode domain-specific operational knowledge. Instead of manually creating Deployments, Services, ConfigMaps, and HPAs for each AI agent, you define an AIAgent custom resource and let the Operator reconcile all the underlying infrastructure automatically.

This transforms agent deployment from "create six YAML files and apply them in the right order" to "declare what agent you want and let the Operator handle the rest."

Custom Resource Definition (CRD)

Before the CRD itself, it helps to see where an Operator-managed agent sits in the wider delivery pipeline:

flowchart LR
    GIT(["Git push"])
    CI["GitHub Actions<br/>build plus test"]
    REG[("Container registry<br/>GHCR or ECR")]
    HELM["Helm chart<br/>values per env"]
    K8S{"Kubernetes cluster"}
    DEP["Deployment<br/>rolling update"]
    SVC["Service plus Ingress"]
    HPA["HPA<br/>CPU and queue depth"]
    POD[("Inference pods<br/>GPU node pool")]
    USERS(["Production traffic"])
    GIT --> CI --> REG --> HELM --> K8S
    K8S --> DEP --> POD
    K8S --> SVC --> POD
    K8S --> HPA --> POD
    SVC --> USERS
    style CI fill:#4f46e5,stroke:#4338ca,color:#fff
    style POD fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style USERS fill:#059669,stroke:#047857,color:#fff

First, define what an AIAgent resource looks like:
# crd-aiagent.yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: aiagents.ai.example.com
spec:
  group: ai.example.com
  versions:
    - name: v1alpha1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              required: ["model", "replicas"]
              properties:
                model:
                  type: string
                  description: "LLM model to use"
                replicas:
                  type: integer
                  minimum: 1
                  maximum: 100
                temperature:
                  type: number
                  default: 0.7
                maxTokens:
                  type: integer
                  default: 4096
                image:
                  type: string
                tools:
                  type: array
                  items:
                    type: string
                autoscaling:
                  type: object
                  properties:
                    enabled:
                      type: boolean
                      default: false
                    minReplicas:
                      type: integer
                    maxReplicas:
                      type: integer
            status:
              type: object
              properties:
                phase:
                  type: string
                readyReplicas:
                  type: integer
                lastUpdated:
                  type: string
                conditions:
                  type: array
                  items:
                    type: object
                    properties:
                      type:
                        type: string
                      status:
                        type: string
                      message:
                        type: string
      subresources:
        status: {}
      additionalPrinterColumns:
        - name: Model
          type: string
          jsonPath: .spec.model
        - name: Replicas
          type: integer
          jsonPath: .spec.replicas
        - name: Phase
          type: string
          jsonPath: .status.phase
  scope: Namespaced
  names:
    plural: aiagents
    singular: aiagent
    kind: AIAgent
    shortNames:
      - aia
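The API server enforces this schema at admission time, rejecting AIAgent objects that violate it. As a rough client-side illustration of what those rules buy you, here is the same validation in plain Python (a hypothetical helper, not part of the Operator):

```python
def validate_agent_spec(spec: dict) -> list[str]:
    """Mirror the CRD's OpenAPI rules: required fields and replica bounds."""
    errors = []
    for field in ("model", "replicas"):
        if field not in spec:
            errors.append(f"spec.{field} is required")
    replicas = spec.get("replicas")
    if isinstance(replicas, int) and not (1 <= replicas <= 100):
        errors.append("spec.replicas must be between 1 and 100")
    return errors

print(validate_agent_spec({"model": "gpt-4o", "replicas": 200}))
# → ['spec.replicas must be between 1 and 100']
```

In practice you let the API server do this; the helper just makes the schema's guarantees concrete.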

Apply the CRD and now you can create AIAgent resources:

# my-support-agent.yaml
apiVersion: ai.example.com/v1alpha1
kind: AIAgent
metadata:
  name: support-agent
  namespace: ai-agents
spec:
  model: "gpt-4o"
  replicas: 3
  temperature: 0.5
  maxTokens: 2048
  image: "myregistry/support-agent:2.0.0"
  tools:
    - "knowledge-base-search"
    - "ticket-creator"
    - "calendar-lookup"
  autoscaling:
    enabled: true
    minReplicas: 2
    maxReplicas: 15

Building the Operator in Python with Kopf

Kopf is a Python framework for building Kubernetes Operators. It handles watch streams, retry logic, and status updates.

# operator.py
import kopf
import kubernetes
from kubernetes import client

@kopf.on.startup()
def configure(**kwargs):
    """Load credentials for the official client: in-cluster when deployed,
    kubeconfig when running locally with `kopf run`."""
    try:
        kubernetes.config.load_incluster_config()
    except kubernetes.config.ConfigException:
        kubernetes.config.load_kube_config()

@kopf.on.create("ai.example.com", "v1alpha1", "aiagents")
def create_agent(spec, name, namespace, logger, patch, **kwargs):
    """Reconcile when a new AIAgent is created. A sync handler: Kopf runs it
    in a thread pool, so the blocking client calls don't stall the event loop."""
    logger.info(f"Creating AI agent: {name}")

    apps_v1 = client.AppsV1Api()
    core_v1 = client.CoreV1Api()

    # Create ConfigMap with agent settings
    configmap = client.V1ConfigMap(
        metadata=client.V1ObjectMeta(
            name=f"{name}-config",
            namespace=namespace,
        ),
        data={
            "MODEL_NAME": spec.get("model", "gpt-4o"),
            "TEMPERATURE": str(spec.get("temperature", 0.7)),
            "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
            "TOOLS": ",".join(spec.get("tools", [])),
        },
    )
    kopf.adopt(configmap)  # set owner reference so deletion cascades
    core_v1.create_namespaced_config_map(namespace, configmap)

    # Create Deployment
    deployment = build_deployment(name, namespace, spec)
    kopf.adopt(deployment)
    apps_v1.create_namespaced_deployment(namespace, deployment)

    # Create Service
    service = build_service(name, namespace, spec)
    kopf.adopt(service)
    core_v1.create_namespaced_service(namespace, service)

    # Write status directly; a handler return value would land under
    # status.create_agent, which the structural schema above would prune.
    patch.status["phase"] = "Running"
    patch.status["readyReplicas"] = 0

def build_deployment(name: str, namespace: str, spec: dict):
    """Build a Deployment object from AIAgent spec."""
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(
            name=name,
            namespace=namespace,
        ),
        spec=client.V1DeploymentSpec(
            replicas=spec.get("replicas", 1),
            selector=client.V1LabelSelector(
                match_labels={"aiagent": name}
            ),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(
                    labels={"aiagent": name}
                ),
                spec=client.V1PodSpec(
                    containers=[
                        client.V1Container(
                            name="agent",
                            image=spec["image"],
                            ports=[client.V1ContainerPort(
                                container_port=8000
                            )],
                            env_from=[
                                client.V1EnvFromSource(
                                    config_map_ref=client.V1ConfigMapEnvSource(
                                        name=f"{name}-config"
                                    )
                                )
                            ],
                        )
                    ]
                ),
            ),
        ),
    )

def build_service(name: str, namespace: str, spec: dict):
    return client.V1Service(
        metadata=client.V1ObjectMeta(
            name=f"{name}-svc",
            namespace=namespace,
        ),
        spec=client.V1ServiceSpec(
            selector={"aiagent": name},
            ports=[client.V1ServicePort(
                port=80, target_port=8000
            )],
        ),
    )
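One gap worth noting: the CRD defines an autoscaling block, but the create handler above never creates an HPA from it. A hedged sketch of a matching builder, written as a plain-dict manifest (the helper name and the 70% CPU target are my own choices); the handler could create it via AutoscalingV2Api when spec.autoscaling.enabled is true:

```python
def build_hpa(name: str, namespace: str, spec: dict) -> dict:
    """Sketch an autoscaling/v2 HPA from the AIAgent autoscaling block (CPU only)."""
    auto = spec.get("autoscaling", {})
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{name}-hpa", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1", "kind": "Deployment", "name": name,
            },
            "minReplicas": auto.get("minReplicas", 1),
            "maxReplicas": auto.get("maxReplicas", 10),
            "metrics": [{
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 70},
                },
            }],
        },
    }
```

Dict manifests work with both kopf.adopt() and the official client's create methods, so this slots into the create handler alongside the Deployment and Service.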

Handling Updates with the Reconciliation Loop

When someone changes the AIAgent spec, the Operator detects the diff and updates resources:

@kopf.on.update("ai.example.com", "v1alpha1", "aiagents")
def update_agent(spec, name, namespace, diff, patch, logger, **kwargs):
    """Reconcile when an AIAgent spec changes."""
    apps_v1 = client.AppsV1Api()
    core_v1 = client.CoreV1Api()

    # Kopf diff items are (operation, field, old, new) four-tuples
    for op, field, old_val, new_val in diff:
        logger.info(f"{op}: {'.'.join(field)} from {old_val} to {new_val}")

    # Update ConfigMap (including TOOLS, so tool changes propagate too)
    configmap_patch = {
        "data": {
            "MODEL_NAME": spec.get("model", "gpt-4o"),
            "TEMPERATURE": str(spec.get("temperature", 0.7)),
            "MAX_TOKENS": str(spec.get("maxTokens", 4096)),
            "TOOLS": ",".join(spec.get("tools", [])),
        }
    }
    core_v1.patch_namespaced_config_map(
        f"{name}-config", namespace, configmap_patch
    )

    # Update Deployment replicas and image; this is a strategic merge
    # patch, so the container is matched by its name
    deployment_patch = {
        "spec": {
            "replicas": spec.get("replicas", 1),
            "template": {
                "spec": {
                    "containers": [{
                        "name": "agent",
                        "image": spec["image"],
                    }]
                }
            }
        }
    }
    apps_v1.patch_namespaced_deployment(
        name, namespace, deployment_patch
    )

    patch.status["phase"] = "Updating"
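For reference, each item in Kopf's diff is a four-tuple of (operation, field path, old value, new value), where the field path is a tuple of strings. A small illustrative helper for pulling out which spec fields changed:

```python
def changed_spec_fields(diff) -> list[str]:
    """Dotted paths of changed fields under .spec, from a Kopf-style diff."""
    return [
        ".".join(field)
        for op, field, old, new in diff
        if field and field[0] == "spec"
    ]

example_diff = [
    ("change", ("spec", "replicas"), 3, 5),
    ("change", ("spec", "image"), "v1", "v2"),
]
print(changed_spec_fields(example_diff))  # → ['spec.replicas', 'spec.image']
```

A production handler could use this to skip the Deployment patch when only ConfigMap-backed fields changed.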

Status Management

Update the custom resource status to reflect the actual state:

from datetime import datetime, timezone

@kopf.timer("ai.example.com", "v1alpha1", "aiagents", interval=30)
def monitor_agent(spec, name, namespace, patch, logger, **kwargs):
    """Periodically check agent health and update status."""
    apps_v1 = client.AppsV1Api()

    try:
        deployment = apps_v1.read_namespaced_deployment(name, namespace)
        ready = deployment.status.ready_replicas or 0
        desired = deployment.spec.replicas

        phase = "Running" if ready == desired else "Scaling"

        patch.status["readyReplicas"] = ready
        patch.status["phase"] = phase
        patch.status["lastUpdated"] = datetime.now(timezone.utc).isoformat()
    except kubernetes.client.exceptions.ApiException as e:
        patch.status["phase"] = "Error"
        logger.error(f"Failed to read deployment: {e}")
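The status schema in the CRD also defines a conditions list, which the timer above never populates. A sketch of an upsert helper matching the CRD's type/status/message shape (the helper itself is hypothetical; the standard Kubernetes convention also stamps lastTransitionTime, which would need adding to the schema):

```python
def set_condition(conditions: list, ctype: str, status: str, message: str) -> list:
    """Upsert a condition by type, matching the CRD's type/status/message shape."""
    for cond in conditions:
        if cond["type"] == ctype:
            cond["status"] = status
            cond["message"] = message
            return conditions
    conditions.append({"type": ctype, "status": status, "message": message})
    return conditions

conds = set_condition([], "Ready", "True", "3/3 replicas ready")
print(conds)
# → [{'type': 'Ready', 'status': 'True', 'message': '3/3 replicas ready'}]
```

The timer could then write patch.status["conditions"] = set_condition(existing_conditions, "Ready", ..., ...) alongside the phase.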

Using the Operator

Once deployed, managing agents becomes declarative:


# Create an agent
kubectl apply -f my-support-agent.yaml

# List all agents
kubectl get aiagents -n ai-agents

# Scale an agent (edit the spec)
kubectl patch aiagent support-agent -n ai-agents \
  --type=merge -p '{"spec": {"replicas": 5}}'

# Delete an agent (cleans up all child resources)
kubectl delete aiagent support-agent -n ai-agents

FAQ

When should I build an Operator versus using Helm charts?

Use Helm when your deployment is a one-time packaging problem — you need to template and parameterize YAML. Build an Operator when you need ongoing lifecycle management — automatic scaling adjustments, health monitoring, backup scheduling, or coordinated multi-resource updates that respond to runtime conditions. Operators encode operational knowledge that Helm charts cannot express.

How do I test a Kubernetes Operator locally?

Use kind (Kubernetes in Docker) or minikube to run a local cluster. Kopf supports running outside the cluster with kopf run operator.py which connects to your kubeconfig context. Write integration tests that create custom resources and assert the expected child resources appear. Use pytest with the kubernetes client library to verify Deployment, Service, and ConfigMap creation.
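Beyond cluster-level integration tests, the cheapest wins come from unit-testing pure decision logic. For example, if the timer's phase computation were extracted into a function (a hypothetical refactor of the code above), it can be tested with no cluster at all:

```python
def compute_phase(ready: int, desired: int) -> str:
    """Pure version of the timer's phase logic: Running only when fully scaled."""
    return "Running" if ready == desired else "Scaling"

# Plain asserts here; in a real suite these would be pytest test functions.
assert compute_phase(3, 3) == "Running"
assert compute_phase(1, 3) == "Scaling"
print("ok")
```

The timer then becomes a thin I/O wrapper around logic you can exercise exhaustively in CI.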

What happens to child resources when the custom resource is deleted?

When you call kopf.adopt() on child resources, Kubernetes sets owner references. Deleting the parent AIAgent triggers garbage collection of all owned Deployments, Services, and ConfigMaps automatically. This prevents orphaned resources. Without adoption, you must handle cleanup manually in a @kopf.on.delete handler.
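Concretely, adoption boils down to stamping an ownerReferences entry on each child's metadata. A hand-built illustration of that entry (kopf.adopt() also handles namespaces and labels, so treat this as a simplified sketch):

```python
def owner_reference(owner: dict) -> dict:
    """Build the ownerReferences entry that ties a child to its AIAgent parent."""
    return {
        "apiVersion": owner["apiVersion"],
        "kind": owner["kind"],
        "name": owner["metadata"]["name"],
        "uid": owner["metadata"]["uid"],
        "controller": True,
        "blockOwnerDeletion": True,
    }

parent = {
    "apiVersion": "ai.example.com/v1alpha1",
    "kind": "AIAgent",
    "metadata": {"name": "support-agent", "uid": "d9f3-example-uid"},
}
print(owner_reference(parent)["kind"])  # → AIAgent
```

The uid field is what makes garbage collection safe: a reference only matches the exact parent instance, not a later resource with the same name.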


#KubernetesOperators #CRD #AIAgents #CustomControllers #Automation #AgenticAI #LearnAI #AIEngineering
