
Error Tracking in AI Agent Systems: Sentry, PagerDuty, and Custom Alerting

Implement comprehensive error tracking for AI agent systems with error classification, severity-based alert routing to Sentry and PagerDuty, and incident response workflows tailored to LLM failure modes.

Agent Error Modes Are Different

Traditional applications have well-understood failure modes: null pointer exceptions, connection timeouts, authentication failures. AI agents add an entirely new category of errors that are harder to detect and classify. An LLM might return a syntactically valid response that calls a nonexistent tool. A tool call might succeed with HTTP 200 but return data the agent misinterprets. The agent might enter an infinite loop of tool calls without ever producing a final answer.

These failure modes require error tracking that goes beyond exception monitoring. You need to classify errors by type, route alerts based on severity and impact, and build incident response workflows that account for the probabilistic nature of LLM behavior.

Classifying Agent Errors

Define a taxonomy of error types so your alerting can be granular. Not all agent errors deserve the same response.

The taxonomy below is the first link in a larger incident response loop that runs from detection through a blameless review and ends with an updated runbook and a new eval:

flowchart LR
    INC(["Production incident"])
    DETECT["Detect<br/>alerts plus user reports"]
    MIT["Mitigate<br/>rollback or feature flag"]
    RES["Resolve"]
    DOC["Timeline doc<br/>events plus actions"]
    RCA{"5 whys plus<br/>causal graph"}
    AI["Action items<br/>owner plus due date"]
    SHARE(["Blameless review"])
    LEARN[("Runbook plus<br/>eval added")]
    INC --> DETECT --> MIT --> RES --> DOC --> RCA --> AI --> SHARE --> LEARN
    style RCA fill:#4f46e5,stroke:#4338ca,color:#fff
    style LEARN fill:#059669,stroke:#047857,color:#fff

from enum import Enum

class AgentErrorType(Enum):
    # Infrastructure errors - immediate attention
    LLM_API_UNREACHABLE = "llm_api_unreachable"
    DATABASE_CONNECTION_FAILED = "database_connection_failed"
    TOOL_SERVER_DOWN = "tool_server_down"

    # LLM behavior errors - investigate if frequent
    LLM_INVALID_TOOL_CALL = "llm_invalid_tool_call"
    LLM_REFUSED_REQUEST = "llm_refused_request"
    LLM_INFINITE_LOOP = "llm_infinite_loop"
    LLM_CONTEXT_OVERFLOW = "llm_context_overflow"

    # Tool execution errors - may need tool-specific fixes
    TOOL_EXECUTION_FAILED = "tool_execution_failed"
    TOOL_TIMEOUT = "tool_timeout"
    TOOL_INVALID_RESPONSE = "tool_invalid_response"

    # Validation errors - usually indicates prompt issues
    OUTPUT_VALIDATION_FAILED = "output_validation_failed"
    GUARDRAIL_TRIGGERED = "guardrail_triggered"

class AgentError(Exception):
    def __init__(
        self,
        error_type: AgentErrorType,
        message: str,
        severity: str = "error",
        context: dict = None,
    ):
        super().__init__(message)
        self.error_type = error_type
        self.severity = severity
        self.context = context or {}
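
As a hedged sketch of how this taxonomy gets raised in practice (not from the article), a tool wrapper can translate low-level failures into typed errors; execute_tool stands in for whatever dispatch function your agent already has:

import asyncio

async def call_tool_checked(tool_name: str, arguments: dict, timeout_s: float = 10.0):
    """Convert raw tool failures into typed AgentErrors with context for alert routing."""
    try:
        # execute_tool is assumed: the function that actually runs the tool
        return await asyncio.wait_for(execute_tool(tool_name, arguments), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise AgentError(
            AgentErrorType.TOOL_TIMEOUT,
            f"Tool '{tool_name}' exceeded {timeout_s}s",
            severity="warning",
            context={"tool_name": tool_name, "timeout_s": timeout_s},
        )
    except Exception as exc:
        raise AgentError(
            AgentErrorType.TOOL_EXECUTION_FAILED,
            f"Tool '{tool_name}' failed: {exc}",
            context={"tool_name": tool_name},
        )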

Integrating Sentry for Error Tracking

Sentry captures exceptions with full stack traces, groups them by root cause, and tracks their frequency over time. Configure it to enrich agent errors with custom context.

import sentry_sdk
from sentry_sdk import set_tag, set_context, capture_exception

sentry_sdk.init(
    dsn="https://<public-key>@<org-id>.ingest.sentry.io/<project-id>",
    traces_sample_rate=0.1,
    environment="production",
    release="agent-service@<version>",
)

async def handle_agent_error(error: AgentError, conversation_id: str, user_id: str):
    """Report agent errors to Sentry with rich context."""
    set_tag("error_type", error.error_type.value)
    set_tag("severity", error.severity)
    set_tag("agent_name", error.context.get("agent_name", "unknown"))

    set_context("agent", {
        "conversation_id": conversation_id,
        "user_id": user_id,
        "error_type": error.error_type.value,
        "model": error.context.get("model"),
        "tool_name": error.context.get("tool_name"),
        "step": error.context.get("step"),
    })

    capture_exception(error)
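
A minimal sketch of where this handler sits in the request path; run_agent_turn is a placeholder for your existing agent loop:

async def process_message(message: str, conversation_id: str, user_id: str) -> str:
    try:
        return await run_agent_turn(message, conversation_id)  # assumed: your agent loop
    except AgentError as err:
        await handle_agent_error(err, conversation_id, user_id)
        raise
    except Exception as exc:
        # Unclassified failures still reach Sentry, just without the agent taxonomy tags
        capture_exception(exc)
        raise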

Building a Custom Alert Router

Different error types warrant different responses. Infrastructure errors need PagerDuty pages. LLM behavior errors need Slack notifications. Validation errors need logging for later analysis.

from dataclasses import dataclass
import httpx

@dataclass
class AlertConfig:
    pagerduty_key: str
    slack_webhook: str
    email_endpoint: str

class AlertRouter:
    def __init__(self, config: AlertConfig):
        self.config = config
        self.client = httpx.AsyncClient()

    async def route_alert(self, error: AgentError, conversation_id: str):
        error_type = error.error_type

        # Critical infrastructure errors -> PagerDuty
        if error_type in (
            AgentErrorType.LLM_API_UNREACHABLE,
            AgentErrorType.DATABASE_CONNECTION_FAILED,
            AgentErrorType.TOOL_SERVER_DOWN,
        ):
            await self._page_oncall(error, conversation_id)
            await self._notify_slack(error, conversation_id, channel="#incidents")

        # LLM behavior errors -> Slack warning
        elif error_type in (
            AgentErrorType.LLM_INFINITE_LOOP,
            AgentErrorType.LLM_CONTEXT_OVERFLOW,
        ):
            await self._notify_slack(error, conversation_id, channel="#agent-alerts")

        # Tool errors -> Slack if frequent
        elif error_type in (
            AgentErrorType.TOOL_EXECUTION_FAILED,
            AgentErrorType.TOOL_TIMEOUT,
        ):
            if await self._error_rate_exceeds_threshold(error_type, threshold=10, window_minutes=5):
                await self._notify_slack(error, conversation_id, channel="#agent-alerts")

    async def _page_oncall(self, error: AgentError, conversation_id: str):
        await self.client.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": self.config.pagerduty_key,
                "event_action": "trigger",
                "payload": {
                    "summary": f"Agent error: {error.error_type.value} - {str(error)}",
                    "severity": "critical",
                    "source": "agent-service",
                    "custom_details": {
                        "conversation_id": conversation_id,
                        **error.context,
                    },
                },
            },
        )

    async def _notify_slack(self, error: AgentError, conversation_id: str, channel: str):
        await self.client.post(
            self.config.slack_webhook,
            json={
                "channel": channel,
                "text": f"*Agent Error*: {error.error_type.value}\n"
                        f"Message: {str(error)}\n"
                        f"Conversation: {conversation_id}",
            },
        )
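
The router calls _error_rate_exceeds_threshold, which is not shown here. A minimal in-memory sketch, assuming a single-process deployment (multiple replicas would need a shared store such as Redis):

import time
from collections import defaultdict, deque

class ErrorRateTracker:
    """Sliding-window counter for per-error-type rates."""

    def __init__(self):
        self._events: dict[AgentErrorType, deque] = defaultdict(deque)

    def record_and_check(self, error_type: AgentErrorType, threshold: int, window_minutes: int) -> bool:
        now = time.monotonic()
        window = self._events[error_type]
        window.append(now)
        # Drop timestamps that have aged out of the window
        cutoff = now - window_minutes * 60
        while window and window[0] < cutoff:
            window.popleft()
        return len(window) > threshold

With a tracker created in AlertRouter's __init__, _error_rate_exceeds_threshold reduces to a one-line delegation to record_and_check.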

Detecting Agent-Specific Failure Patterns

Some agent failures do not raise exceptions. Detect them with runtime checks.

MAX_TOOL_CALLS_PER_TURN = 10
MAX_AGENT_TURNS = 25

class AgentLoopGuard:
    def __init__(self):
        self.tool_call_count = 0
        self.turn_count = 0
        self.seen_tool_calls = []

    def check_tool_call(self, tool_name: str, arguments: dict):
        self.tool_call_count += 1
        call_signature = f"{tool_name}:{hash(str(sorted(arguments.items())))}"

        # Detect a loop: the same tool call (name + args) seen within the last 3 calls
        if call_signature in self.seen_tool_calls[-3:]:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Repeated tool call detected: {tool_name}",
                severity="critical",
                context={"tool_name": tool_name, "lookback_window": 3},
            )

        if self.tool_call_count > MAX_TOOL_CALLS_PER_TURN:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Tool call limit exceeded: {self.tool_call_count}",
                severity="error",
            )

        self.seen_tool_calls.append(call_signature)

    def check_turn(self):
        self.turn_count += 1
        if self.turn_count > MAX_AGENT_TURNS:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Agent turn limit exceeded: {self.turn_count}",
                severity="error",
            )
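
A sketch of how the guard threads through an agent loop; llm_step, execute_tool, and the shape of step are assumptions standing in for whatever your agent framework provides:

async def run_turn_with_guard(messages: list[dict]) -> str:
    guard = AgentLoopGuard()
    while True:
        guard.check_turn()  # raises AgentError once MAX_AGENT_TURNS is exceeded
        step = await llm_step(messages)  # assumed: one LLM call returning tool calls or a final answer
        if step.final_answer is not None:
            return step.final_answer
        for call in step.tool_calls:
            guard.check_tool_call(call.name, call.arguments)  # raises AgentError on repeats and limits
            result = await execute_tool(call.name, call.arguments)
            messages.append({"role": "tool", "name": call.name, "content": result})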

FAQ

How do I avoid alert fatigue with AI agents?

Use rate-based alerting instead of per-error alerting. A single tool failure is normal — tools can be temporarily unavailable. Page oncall only when the error rate for a given type exceeds a threshold within a time window. For LLM behavior errors, alert on percentage of conversations affected rather than raw count. Review and tune thresholds weekly during the first month of deployment.
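
As a hedged illustration of the percentage-based rule (the counters themselves come from wherever you already aggregate conversation metrics):

def behavior_errors_need_alert(affected_conversations: int, total_conversations: int, threshold_pct: float = 2.0) -> bool:
    """Alert on LLM behavior errors only when they touch more than threshold_pct of conversations."""
    if total_conversations == 0:
        return False
    return (affected_conversations / total_conversations) * 100 >= threshold_pct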

Should I retry LLM calls automatically before raising an error?

Yes, but with limits. Retry transient errors like rate limits (HTTP 429) and server errors (HTTP 500-503) with exponential backoff, up to 3 attempts. Do not retry content policy violations (HTTP 400) or context length errors — these will fail again with the same input. Track retry counts in your error metadata so you can monitor retry rates.
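
A minimal sketch of that retry policy; the make_request callable and the status_code attribute are assumptions, since the exact exception shape varies by client library:

import asyncio
import random

RETRYABLE_STATUS = {429, 500, 502, 503}

async def call_llm_with_retries(make_request, max_attempts: int = 3):
    """Retry transient LLM API failures with exponential backoff; fail fast on everything else."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_request()
        except Exception as exc:
            status = getattr(exc, "status_code", None)  # attribute name depends on your SDK
            if status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            # Exponential backoff with jitter: roughly 1s, 2s, 4s
            await asyncio.sleep(2 ** (attempt - 1) + random.random())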

How do I handle errors gracefully so the user gets a useful response?

Implement a fallback chain. If the primary model fails, try a fallback model. If all LLM calls fail, return a static message like "I am having trouble processing your request. Please try again in a moment." Never expose raw error messages or stack traces to users. Log the full error details for your engineering team and return a user-friendly message with a reference ID they can share with support.
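
A sketch of that fallback chain; the model names and the call_model wrapper are placeholders, and the reference ID is a short UUID attached to both the logged error and the user-facing message:

import uuid
from sentry_sdk import capture_exception

FALLBACK_MODELS = ["primary-model", "fallback-model"]  # placeholder identifiers

async def answer_with_fallback(prompt: str) -> str:
    reference_id = uuid.uuid4().hex[:8]
    for model in FALLBACK_MODELS:
        try:
            return await call_model(model, prompt)  # assumed: your existing per-model LLM wrapper
        except AgentError as err:
            err.context["reference_id"] = reference_id
            capture_exception(err)  # full details go to engineering, not the user
    # Every model failed: return a static, user-safe message with the reference ID
    return (
        "I am having trouble processing your request. Please try again in a moment. "
        f"(Reference: {reference_id})"
    )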


#ErrorTracking #Sentry #PagerDuty #Alerting #IncidentResponse #AgenticAI #LearnAI #AIEngineering
