Learn Agentic AI

Error Handling in LangGraph: Retry Nodes, Fallback Paths, and Recovery

Build resilient LangGraph workflows with try/except patterns in nodes, fallback conditional edges, configurable retry logic, and dead-end recovery strategies for production agent systems.

Errors Are Inevitable in Agent Systems

Agent workflows interact with external systems — LLM APIs, databases, web services, file systems. Any of these can fail. API rate limits, network timeouts, malformed LLM outputs, and tool execution errors are not edge cases — they are normal operating conditions. Production LangGraph workflows must handle errors gracefully rather than crashing and losing all accumulated state.

Error Handling Inside Nodes

The graph this article builds has a clear recovery shape: a node that catches its own exceptions, a router that inspects the error state, a retry loop, and a fallback path.

flowchart TD
    IN(["User input"])
    AGENT["agent node<br/>call_llm with try/except"]
    ROUTE{"check_error<br/>routes on state"}
    RETRY["retry node<br/>back off, clear error"]
    FALLBACK["fallback node<br/>graceful degradation"]
    RESPOND["respond node"]
    DONE(["Final response"])
    IN --> AGENT --> ROUTE
    ROUTE -->|retry| RETRY --> AGENT
    ROUTE -->|fallback| FALLBACK --> DONE
    ROUTE -->|continue| RESPOND --> DONE
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style FALLBACK fill:#f59e0b,stroke:#d97706,color:#1f2937
    style DONE fill:#059669,stroke:#047857,color:#fff

The first line of defense is try/except blocks within node functions:
from typing import TypedDict, Annotated
from langgraph.graph import StateGraph, START, END
from langgraph.graph.message import add_messages
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage

class State(TypedDict):
    messages: Annotated[list, add_messages]
    error: str
    retry_count: int

llm = ChatOpenAI(model="gpt-4o-mini")

def call_llm(state: State) -> dict:
    try:
        response = llm.invoke(state["messages"])
        return {
            "messages": [response],
            "error": "",
            "retry_count": state.get("retry_count", 0),
        }
    except Exception as e:
        return {
            "error": str(e),
            "retry_count": state.get("retry_count", 0) + 1,
        }

By catching exceptions and writing error information to state, you keep the graph running and let downstream nodes or routing logic decide how to recover.

Fallback Edges Based on Error State

Use conditional edges to route to different nodes depending on whether an error occurred:

from typing import Literal

def check_error(state: State) -> Literal["retry", "fallback", "continue"]:
    if state.get("error"):
        if state.get("retry_count", 0) < 3:
            return "retry"
        return "fallback"
    return "continue"

import time

def retry_node(state: State) -> dict:
    """Wait briefly and clear the error before retrying."""
    time.sleep(1)  # Back off before the next attempt
    return {"error": ""}

def fallback_node(state: State) -> dict:
    """Provide a graceful degradation response."""
    return {
        "messages": [AIMessage(
            content="I encountered an issue processing your request. "
            "Here is what I can tell you based on available information."
        )],
        "error": "",
    }

builder = StateGraph(State)
builder.add_node("agent", call_llm)
builder.add_node("retry", retry_node)
builder.add_node("fallback", fallback_node)
builder.add_node("respond", lambda state: {})  # Pass-through node; no state updates

builder.add_edge(START, "agent")
builder.add_conditional_edges("agent", check_error, {
    "retry": "retry",
    "fallback": "fallback",
    "continue": "respond",
})
builder.add_edge("retry", "agent")  # Loop back for retry
builder.add_edge("fallback", END)
builder.add_edge("respond", END)

graph = builder.compile()

This pattern gives the agent three attempts before falling back to a graceful degradation response.
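To see why three attempts is the limit, here is a plain-Python trace of the loop — a simulation only, with no LLM or graph involved, assuming every agent attempt fails:

```python
def check_error(error: str, retry_count: int) -> str:
    # Same routing rule as the graph's conditional edge
    if error:
        return "retry" if retry_count < 3 else "fallback"
    return "continue"

# Simulate an agent node that always fails
error, retries, path = "", 0, []
node = "agent"
while node not in ("fallback", "respond"):
    if node == "agent":
        error, retries = "boom", retries + 1   # attempt fails, count increments
        node = check_error(error, retries)
        path.append(f"agent(attempt {retries})")
    elif node == "retry":
        error = ""                             # retry node clears the error
        node = "agent"                         # loop back to the agent

path.append(node)
print(path)
# ['agent(attempt 1)', 'agent(attempt 2)', 'agent(attempt 3)', 'fallback']
```

After the third failed attempt `retry_count` reaches 3, the `retry_count < 3` guard fails, and the router sends the run to the fallback node instead of looping again.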


Exponential Backoff Retry

For more sophisticated retry logic, implement exponential backoff:

import time

def smart_retry(state: State) -> dict:
    count = state.get("retry_count", 0)
    delay = min(2 ** count, 30)  # 1s, 2s, 4s, 8s... max 30s
    time.sleep(delay)
    return {"error": ""}

This prevents overwhelming a failing service with rapid retries while still recovering quickly from transient errors.
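As a quick sanity check, this is the delay schedule the formula produces, assuming `retry_count` starts at 0:

```python
def backoff_delay(count: int, cap: int = 30) -> int:
    # Exponential backoff: doubles each attempt, capped at `cap` seconds
    return min(2 ** count, cap)

print([backoff_delay(c) for c in range(7)])
# [1, 2, 4, 8, 16, 30, 30]
```

In production you may also want to add random jitter to these delays so that many failing workers do not all retry at the same instant.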

Tool Error Recovery

Tools fail frequently — APIs return errors, queries time out, external services go down. Build error handling directly into your tools:

from langchain_core.tools import tool
import httpx

@tool
def fetch_data(url: str) -> str:
    """Fetch data from a URL with error handling."""
    try:
        response = httpx.get(url, timeout=10)
        response.raise_for_status()
        return response.text[:2000]
    except httpx.TimeoutException:
        return "ERROR: Request timed out. The server may be slow or unreachable."
    except httpx.HTTPStatusError as e:
        return f"ERROR: HTTP {e.response.status_code}. The resource may not exist."
    except Exception as e:
        return f"ERROR: {type(e).__name__}: {e}"

Returning error strings instead of raising exceptions lets the LLM see the error and decide how to proceed — perhaps by trying a different URL or rephrasing the query.
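This pattern generalizes: a small decorator — illustrative, not part of LangChain — can convert any tool function's exceptions into error strings in one place:

```python
import functools

def errors_as_strings(fn):
    """Wrap a function so exceptions come back as readable strings."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        try:
            return fn(*args, **kwargs)
        except Exception as e:
            # The LLM sees this text and can decide how to proceed
            return f"ERROR: {type(e).__name__}: {e}"
    return wrapper

@errors_as_strings
def divide(a: float, b: float) -> str:
    return str(a / b)

print(divide(1, 0))  # ERROR: ZeroDivisionError: division by zero
```

Applied beneath `@tool`, a wrapper like this keeps every tool's error-reporting format consistent without repeating try/except in each tool body.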

Dead-End Detection

Sometimes the agent gets stuck in a loop without making progress. Detect this by tracking state changes:


def detect_stall(state: State) -> Literal["continue", "abort"]:
    messages = state["messages"]
    if len(messages) < 4:
        return "continue"

    # Check if last 3 AI messages are similar (stuck in a loop)
    recent_ai = [
        m.content for m in messages[-6:]
        if isinstance(m, AIMessage)
    ][-3:]

    if len(recent_ai) == 3 and len(set(recent_ai)) == 1:
        return "abort"
    return "continue"

def abort_node(state: State) -> dict:
    return {
        "messages": [AIMessage(
            content="I appear to be stuck. Let me summarize what I have so far "
            "and suggest a different approach."
        )]
    }
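The stall check itself reduces to a set comparison. As a standalone sketch, with plain strings standing in for AI message contents:

```python
def is_stalled(recent_ai: list[str]) -> bool:
    # Stalled when the last three AI replies are identical
    last_three = recent_ai[-3:]
    return len(last_three) == 3 and len(set(last_three)) == 1

print(is_stalled(["plan A", "plan B", "plan B"]))                 # False
print(is_stalled(["retrying...", "retrying...", "retrying..."]))  # True
```

Exact equality is a coarse signal — models rarely repeat themselves verbatim — so a stricter variant might compare embeddings or edit distance between recent replies instead.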

Combining Checkpointing with Error Recovery

Checkpointing and error handling work together for maximum resilience:

from langgraph.checkpoint.memory import MemorySaver

memory = MemorySaver()
graph = builder.compile(checkpointer=memory)

config = {"configurable": {"thread_id": "resilient-session"}}

try:
    result = graph.invoke(
        {"messages": [HumanMessage(content="Process this complex request")]},
        config,
    )
except Exception:
    # Graph crashed — but state is checkpointed
    # Resume from last successful node
    result = graph.invoke(None, config)

Even if the entire process crashes, the checkpointed state lets you resume from the last successful node rather than restarting the entire workflow.

FAQ

Should I catch all exceptions in every node?

No. Catch exceptions that you can meaningfully handle — API errors, timeouts, validation failures. Let unexpected errors (programming bugs, out-of-memory) propagate so they surface during development rather than being silently swallowed.

How do I log errors without exposing them to the user?

Write errors to a separate state field like error_log that your response formatting node ignores. Alternatively, use Python logging within nodes to send error details to your observability stack while returning user-friendly messages to state.
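A minimal sketch of that split — the node and message text here are illustrative:

```python
import logging

logger = logging.getLogger("agent.nodes")

def guarded_node(state: dict) -> dict:
    try:
        raise TimeoutError("upstream API timed out")  # simulated failure
    except Exception:
        # Full traceback goes to your observability stack...
        logger.exception("guarded_node failed")
        # ...while state only carries a user-safe summary
        return {"error": "The service is temporarily unavailable."}

print(guarded_node({}))
```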

Can I set a global timeout for the entire graph execution?

LangGraph does not have a built-in global timeout. Implement it at the application level by running graph.ainvoke() inside an asyncio.wait_for() with your desired timeout. If the timeout triggers, the checkpointed state is still available for later resumption.
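A sketch of that application-level timeout — `SlowGraph` is a stand-in for a compiled graph, used here only so the snippet runs without LangGraph:

```python
import asyncio

async def run_with_timeout(graph, inputs, config, seconds: float = 60.0):
    # Wrap graph execution in an application-level timeout; any
    # checkpointed state stays resumable under the same thread_id.
    try:
        return await asyncio.wait_for(graph.ainvoke(inputs, config), timeout=seconds)
    except asyncio.TimeoutError:
        return None  # Caller may resume later with graph.ainvoke(None, config)

class SlowGraph:
    # Stand-in for a compiled LangGraph graph (illustrative only)
    async def ainvoke(self, inputs, config):
        await asyncio.sleep(5)
        return {"done": True}

print(asyncio.run(run_with_timeout(SlowGraph(), {}, {}, seconds=0.1)))  # None
```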


#LangGraph #ErrorHandling #RetryLogic #FaultTolerance #Python #AgenticAI #LearnAI #AIEngineering

