Learn Agentic AI

Error Recovery Patterns: Self-Healing Agents That Fix Their Own Mistakes

Build AI agents that detect their own errors, apply correction strategies, and learn from failures through feedback loops. Covers error detection, self-correction, escalation paths, and continuous improvement.

Beyond Crash and Retry: Agents That Correct Themselves

Traditional error handling stops at retry and abort. But LLM-powered agents have a unique capability that conventional software does not — they can reason about their own failures. When a tool call returns an error, the agent can read the error message, understand what went wrong, and try a different approach. This self-healing capability is what separates fragile demos from production-grade agents.

The challenge is building structured self-healing that is reliable, bounded, and observable.

The Self-Healing Loop

A self-healing agent wraps its execution in a loop that detects errors, diagnoses the cause, and applies a correction strategy.

flowchart LR
    INPUT(["User intent"])
    PARSE["Parse plus<br/>classify"]
    PLAN["Plan and tool<br/>selection"]
    AGENT["Agent loop<br/>LLM plus tools"]
    GUARD{"Guardrails<br/>and policy"}
    EXEC["Execute and<br/>verify result"]
    OBS[("Trace and metrics")]
    OUT(["Outcome plus<br/>next action"])
    INPUT --> PARSE --> PLAN --> AGENT --> GUARD
    GUARD -->|Pass| EXEC --> OUT
    GUARD -->|Fail| AGENT
    AGENT --> OBS
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OBS fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional
import logging

logger = logging.getLogger("agent.self_heal")

class RecoveryAction(Enum):
    RETRY_SAME = "retry_same"
    RETRY_MODIFIED = "retry_modified"
    USE_ALTERNATIVE = "use_alternative"
    ASK_USER = "ask_user"
    ESCALATE = "escalate"
    ABORT = "abort"

@dataclass
class ErrorDiagnosis:
    error_type: str
    root_cause: str
    recovery_action: RecoveryAction
    modified_args: Optional[dict] = None
    alternative_tool: Optional[str] = None
    user_message: Optional[str] = None

@dataclass
class HealingAttempt:
    diagnosis: ErrorDiagnosis
    success: bool
    result: Optional[dict] = None

class SelfHealingAgent:
    def __init__(self, llm_client, tool_registry: dict, max_healing_attempts: int = 3):
        self.llm = llm_client
        self.tools = tool_registry
        self.max_healing_attempts = max_healing_attempts
        self.healing_history: list[HealingAttempt] = []

    async def execute_with_healing(
        self, tool_name: str, args: dict, context: str = "",
    ) -> dict:
        """Execute a tool call with self-healing on failure."""
        # First attempt
        try:
            return await self._call_tool(tool_name, args)
        except Exception as first_error:
            logger.warning(f"Tool {tool_name} failed: {first_error}")
            # Capture inside the except block: Python unbinds the
            # `except ... as` name once the block exits.
            last_error: Exception = first_error

        # Self-healing loop
        for attempt in range(self.max_healing_attempts):
            diagnosis = await self._diagnose_error(
                tool_name, args, last_error, context,
            )
            logger.info(
                f"Healing attempt {attempt + 1}: {diagnosis.recovery_action.value}"
            )

            if diagnosis.recovery_action == RecoveryAction.ABORT:
                raise RuntimeError(f"Unrecoverable: {diagnosis.root_cause}")

            if diagnosis.recovery_action == RecoveryAction.ASK_USER:
                return {"needs_input": True, "message": diagnosis.user_message}

            if diagnosis.recovery_action == RecoveryAction.ESCALATE:
                return {"escalated": True, "reason": diagnosis.root_cause}

            try:
                result = await self._apply_recovery(diagnosis, tool_name, args)
                self.healing_history.append(
                    HealingAttempt(diagnosis=diagnosis, success=True, result=result)
                )
                return result
            except Exception as exc:
                last_error = exc
                self.healing_history.append(
                    HealingAttempt(diagnosis=diagnosis, success=False)
                )

        raise RuntimeError(
            f"Failed after {self.max_healing_attempts} healing attempts"
        ) from last_error
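The control flow above can be exercised end to end with a stripped-down sketch. The `flaky_tool` and `heal_once` names are invented for illustration, and the loop is collapsed to a single RETRY_MODIFIED strategy rather than the full LLM-driven diagnosis:

```python
import asyncio

async def flaky_tool(args: dict) -> dict:
    # Hypothetical tool: rejects a malformed region code.
    if args.get("region") == "??":
        raise ValueError("unknown region code")
    return {"region": args["region"], "temp_c": 21}

async def heal_once(tool, args: dict, fixed_args: dict, max_attempts: int = 3) -> dict:
    """Minimal analogue of execute_with_healing: retry with corrected args."""
    try:
        return await tool(args)
    except Exception as first_error:
        last_error = first_error  # captured inside the except block
    for _ in range(max_attempts):
        try:
            return await tool({**args, **fixed_args})  # RETRY_MODIFIED
        except Exception as exc:
            last_error = exc
    raise RuntimeError("healing exhausted") from last_error

result = asyncio.run(heal_once(flaky_tool, {"region": "??"}, {"region": "EU"}))
print(result)  # {'region': 'EU', 'temp_c': 21}
```

The first call fails, the corrected arguments succeed, and the caller never sees the intermediate error, which is the whole point of the pattern.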

LLM-Powered Error Diagnosis

The agent uses its LLM to analyze the error and determine the best recovery strategy.

    async def _diagnose_error(
        self, tool_name: str, args: dict, error: Exception, context: str,
    ) -> ErrorDiagnosis:
        """Use the LLM to diagnose the error and recommend recovery."""
        diagnosis_prompt = f"""A tool call failed. Diagnose the error and recommend a recovery action.

Tool: {tool_name}
Arguments: {args}
Error: {type(error).__name__}: {error}
Context: {context}

Previous healing attempts for this request:
{self._format_history()}

Choose ONE recovery action:
- RETRY_SAME: The error looks transient; retry with the same arguments
- RETRY_MODIFIED: Fix the arguments and retry (provide corrected args)
- USE_ALTERNATIVE: Use a different tool (specify which)
- ASK_USER: Need clarification from the user (provide a question)
- ESCALATE: This needs human operator intervention
- ABORT: This cannot be recovered

Respond in this exact format:
ACTION: <action>
ROOT_CAUSE: <brief explanation>
MODIFIED_ARGS: <JSON if RETRY_MODIFIED, else null>
ALTERNATIVE_TOOL: <tool name if USE_ALTERNATIVE, else null>
USER_MESSAGE: <question if ASK_USER, else null>"""

        response = await self.llm.complete(diagnosis_prompt)
        return self._parse_diagnosis(response)
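`_parse_diagnosis` is referenced but not shown. One way to implement it, assuming the model honors the exact `ACTION: ... / ROOT_CAUSE: ...` format the prompt requests (the enum and dataclass are repeated here so the sketch runs standalone; malformed output falls back to ABORT as a safety default):

```python
import json
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class RecoveryAction(Enum):
    RETRY_SAME = "retry_same"
    RETRY_MODIFIED = "retry_modified"
    USE_ALTERNATIVE = "use_alternative"
    ASK_USER = "ask_user"
    ESCALATE = "escalate"
    ABORT = "abort"

@dataclass
class ErrorDiagnosis:
    error_type: str
    root_cause: str
    recovery_action: RecoveryAction
    modified_args: Optional[dict] = None
    alternative_tool: Optional[str] = None
    user_message: Optional[str] = None

def parse_diagnosis(response: str) -> ErrorDiagnosis:
    """Parse the KEY: value lines the prompt asks for."""
    fields: dict[str, str] = {}
    for line in response.splitlines():
        if ":" in line:
            key, _, value = line.partition(":")  # split on the first colon only
            fields[key.strip().upper()] = value.strip()

    try:
        action = RecoveryAction[fields.get("ACTION", "ABORT").upper()]
    except KeyError:
        action = RecoveryAction.ABORT  # fail safe on an unknown action

    def opt(key: str) -> Optional[str]:
        value = fields.get(key)
        return None if value in (None, "null", "") else value

    modified_args = None
    if action is RecoveryAction.RETRY_MODIFIED and opt("MODIFIED_ARGS"):
        try:
            modified_args = json.loads(opt("MODIFIED_ARGS"))
        except json.JSONDecodeError:
            modified_args = None  # ignore unparseable corrections

    return ErrorDiagnosis(
        error_type="tool_error",
        root_cause=fields.get("ROOT_CAUSE", "unknown"),
        recovery_action=action,
        modified_args=modified_args,
        alternative_tool=opt("ALTERNATIVE_TOOL"),
        user_message=opt("USER_MESSAGE"),
    )
```

A production parser would be stricter (or use structured/JSON output from the model), but the fail-safe default matters in any variant: an unparseable diagnosis must never be treated as permission to act.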

Structured Recovery Strategies

Each recovery action maps to a concrete execution path.

    async def _apply_recovery(
        self, diagnosis: ErrorDiagnosis, original_tool: str, original_args: dict,
    ) -> dict:
        if diagnosis.recovery_action == RecoveryAction.RETRY_SAME:
            return await self._call_tool(original_tool, original_args)

        elif diagnosis.recovery_action == RecoveryAction.RETRY_MODIFIED:
            modified = {**original_args, **(diagnosis.modified_args or {})}
            return await self._call_tool(original_tool, modified)

        elif diagnosis.recovery_action == RecoveryAction.USE_ALTERNATIVE:
            alt_tool = diagnosis.alternative_tool
            if alt_tool not in self.tools:
                raise ValueError(f"Alternative tool '{alt_tool}' not found")
            return await self._call_tool(alt_tool, original_args)

        raise ValueError(f"Unhandled recovery: {diagnosis.recovery_action}")

    async def _call_tool(self, tool_name: str, args: dict) -> dict:
        tool_fn = self.tools.get(tool_name)
        if not tool_fn:
            raise ValueError(f"Tool '{tool_name}' not registered")
        return await tool_fn(args)

    def _format_history(self) -> str:
        if not self.healing_history:
            return "None"
        lines = []
        for h in self.healing_history:
            lines.append(
                f"- {h.diagnosis.recovery_action.value}: "
                f"{'succeeded' if h.success else 'failed'} "
                f"(cause: {h.diagnosis.root_cause})"
            )
        return "\n".join(lines)

Feedback Loop for Continuous Improvement

Track which error patterns the agent encounters and how successfully it recovers. This data informs prompt improvements and tool hardening.

from collections import defaultdict

class HealingMetrics:
    def __init__(self):
        self.error_counts: dict[str, int] = defaultdict(int)
        self.recovery_success: dict[str, list[bool]] = defaultdict(list)

    def record(self, error_type: str, recovery_action: str, success: bool):
        key = f"{error_type}:{recovery_action}"
        self.error_counts[error_type] += 1
        self.recovery_success[key].append(success)

    def success_rate(self, error_type: str, recovery_action: str) -> float:
        key = f"{error_type}:{recovery_action}"
        results = self.recovery_success.get(key, [])
        if not results:
            return 0.0
        return sum(results) / len(results)

    def report(self) -> dict:
        report = {}
        for key, results in self.recovery_success.items():
            rate = sum(results) / len(results) if results else 0
            report[key] = {
                "attempts": len(results),
                "success_rate": round(rate, 2),
            }
        return report
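Feeding the metrics store and reading the report back might look like the following. The class is restated in compressed form so the snippet runs on its own, and the timeout/value-error sample data is invented:

```python
from collections import defaultdict

class HealingMetrics:
    """Same shape as the class above, repeated so this snippet is standalone."""
    def __init__(self):
        self.error_counts: dict[str, int] = defaultdict(int)
        self.recovery_success: dict[str, list[bool]] = defaultdict(list)

    def record(self, error_type: str, recovery_action: str, success: bool):
        key = f"{error_type}:{recovery_action}"
        self.error_counts[error_type] += 1
        self.recovery_success[key].append(success)

    def report(self) -> dict:
        return {
            key: {"attempts": len(r), "success_rate": round(sum(r) / len(r), 2)}
            for key, r in self.recovery_success.items()
        }

metrics = HealingMetrics()
# Invented sample data: two timeouts healed by retry, one not.
metrics.record("TimeoutError", "retry_modified", True)
metrics.record("TimeoutError", "retry_modified", True)
metrics.record("TimeoutError", "retry_modified", False)
metrics.record("ValueError", "use_alternative", True)

print(metrics.report())
# {'TimeoutError:retry_modified': {'attempts': 3, 'success_rate': 0.67},
#  'ValueError:use_alternative': {'attempts': 1, 'success_rate': 1.0}}
```

A low success rate for a given `error_type:action` pair is the signal to act on: either harden the tool so the error stops occurring, or steer the diagnosis prompt away from the failing strategy.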

Guardrails: Preventing Infinite Healing Loops

Always cap the number of healing attempts, track token spend during recovery, and prevent the agent from trying the same failed strategy twice.


class HealingGuardrails:
    def __init__(self, max_attempts: int = 3, max_token_budget: int = 5000):
        self.max_attempts = max_attempts
        self.max_token_budget = max_token_budget
        self.tokens_used = 0
        self.tried_strategies: set[str] = set()

    def can_continue(self, attempt: int, proposed_action: str) -> bool:
        if attempt >= self.max_attempts:
            return False
        if self.tokens_used >= self.max_token_budget:
            return False
        if proposed_action in self.tried_strategies:
            return False
        return True

    def record_attempt(self, action: str, tokens: int):
        self.tried_strategies.add(action)
        self.tokens_used += tokens
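Slotting the guardrails into the healing loop could look like this. The class is restated so the snippet runs standalone, and the sequence of proposed actions is invented; note how the repeated `retry_modified` proposal is rejected without spending tokens:

```python
class HealingGuardrails:
    """Restated from above so this snippet is standalone."""
    def __init__(self, max_attempts: int = 3, max_token_budget: int = 5000):
        self.max_attempts = max_attempts
        self.max_token_budget = max_token_budget
        self.tokens_used = 0
        self.tried_strategies: set[str] = set()

    def can_continue(self, attempt: int, proposed_action: str) -> bool:
        if attempt >= self.max_attempts:
            return False
        if self.tokens_used >= self.max_token_budget:
            return False
        if proposed_action in self.tried_strategies:
            return False
        return True

    def record_attempt(self, action: str, tokens: int):
        self.tried_strategies.add(action)
        self.tokens_used += tokens

guards = HealingGuardrails(max_attempts=3, max_token_budget=1000)

decisions = []
# Invented sequence: the diagnosis proposes retry_modified twice in a row.
for attempt, action in enumerate(["retry_modified", "retry_modified", "use_alternative"]):
    ok = guards.can_continue(attempt, action)
    decisions.append(ok)
    if ok:
        guards.record_attempt(action, tokens=400)

print(decisions)  # [True, False, True]
```

The middle `False` is the loop-breaker: a strategy that already failed is never retried, which forces the diagnosis step to propose something new or give up.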

FAQ

Is it safe to let the LLM decide how to fix its own errors?

Yes, with guardrails. The LLM's diagnosis should be constrained to a fixed set of recovery actions (the RecoveryAction enum). The agent code validates the proposed action and prevents unsafe operations like modifying arguments in ways that bypass business rules. The LLM provides intelligence; the code provides safety boundaries.

How do I prevent the agent from looping between two failing strategies?

Track all attempted strategies in a set and reject any strategy that has already been tried. The HealingGuardrails class above implements this. Additionally, include the full healing history in the diagnosis prompt so the LLM knows which approaches have already failed and can choose a different path.

When should self-healing escalate to a human?

Escalate when the error involves ambiguous user intent (the agent is unsure what the user wants), when the failure involves financial or irreversible actions, or when the maximum healing attempts are exhausted. The escalation path should capture the full context — original request, error, all healing attempts — so the human reviewer can resolve the issue without asking the user to repeat themselves.
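The escalation payload described above can be captured in a small structure. The field names here are an illustrative assumption, not part of the article's code:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class EscalationPacket:
    """Everything a human reviewer needs to resolve the issue in one pass."""
    original_request: str
    tool_name: str
    error: str
    healing_attempts: list[dict] = field(default_factory=list)

# Invented example values for illustration.
packet = EscalationPacket(
    original_request="Book a follow-up appointment for next Tuesday",
    tool_name="calendar_create_event",
    error="PermissionError: calendar is read-only",
    healing_attempts=[
        {"action": "retry_modified", "success": False},
        {"action": "use_alternative", "success": False},
    ],
)
print(asdict(packet)["tool_name"])  # calendar_create_event
```

Serializing the whole packet (via `asdict`) into the ticket or alert is what spares the user from repeating themselves to the human reviewer.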


#SelfHealing #ErrorRecovery #FeedbackLoops #AIAgents #Python #AgenticAI #LearnAI #AIEngineering
