
Error Tracking in AI Agent Systems: Sentry, PagerDuty, and Custom Alerting

Implement comprehensive error tracking for AI agent systems with error classification, severity-based alert routing to Sentry and PagerDuty, and incident response workflows tailored to LLM failure modes.

Agent Error Modes Are Different

Traditional applications have well-understood failure modes: null pointer exceptions, connection timeouts, authentication failures. AI agents add an entirely new category of errors that are harder to detect and classify. An LLM might return a syntactically valid response that calls a nonexistent tool. A tool call might succeed with HTTP 200 but return data the agent misinterprets. The agent might enter an infinite loop of tool calls without ever producing a final answer.

These failure modes require error tracking that goes beyond exception monitoring. You need to classify errors by type, route alerts based on severity and impact, and build incident response workflows that account for the probabilistic nature of LLM behavior.

Classifying Agent Errors

Define a taxonomy of error types so your alerting can be granular. Not all agent errors deserve the same response.

The taxonomy below is the first link in a larger incident response loop that runs from detection through a blameless review and ends with an updated runbook and a new eval:

flowchart LR
    INC(["Production incident"])
    DETECT["Detect<br/>alerts plus user reports"]
    MIT["Mitigate<br/>rollback or feature flag"]
    RES["Resolve"]
    DOC["Timeline doc<br/>events plus actions"]
    RCA{"5 whys plus<br/>causal graph"}
    AI["Action items<br/>owner plus due date"]
    SHARE(["Blameless review"])
    LEARN[("Runbook plus<br/>eval added")]
    INC --> DETECT --> MIT --> RES --> DOC --> RCA --> AI --> SHARE --> LEARN
    style RCA fill:#4f46e5,stroke:#4338ca,color:#fff
    style LEARN fill:#059669,stroke:#047857,color:#fff

from enum import Enum

class AgentErrorType(Enum):
    # Infrastructure errors - immediate attention
    LLM_API_UNREACHABLE = "llm_api_unreachable"
    DATABASE_CONNECTION_FAILED = "database_connection_failed"
    TOOL_SERVER_DOWN = "tool_server_down"

    # LLM behavior errors - investigate if frequent
    LLM_INVALID_TOOL_CALL = "llm_invalid_tool_call"
    LLM_REFUSED_REQUEST = "llm_refused_request"
    LLM_INFINITE_LOOP = "llm_infinite_loop"
    LLM_CONTEXT_OVERFLOW = "llm_context_overflow"

    # Tool execution errors - may need tool-specific fixes
    TOOL_EXECUTION_FAILED = "tool_execution_failed"
    TOOL_TIMEOUT = "tool_timeout"
    TOOL_INVALID_RESPONSE = "tool_invalid_response"

    # Validation errors - usually indicates prompt issues
    OUTPUT_VALIDATION_FAILED = "output_validation_failed"
    GUARDRAIL_TRIGGERED = "guardrail_triggered"

class AgentError(Exception):
    def __init__(
        self,
        error_type: AgentErrorType,
        message: str,
        severity: str = "error",
        context: dict = None,
    ):
        super().__init__(message)
        self.error_type = error_type
        self.severity = severity
        self.context = context or {}
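
As a hedged sketch of how this taxonomy gets raised in practice (not from the article), a tool wrapper can translate low-level failures into typed errors; execute_tool stands in for whatever dispatch function your agent already has:

import asyncio

async def call_tool_checked(tool_name: str, arguments: dict, timeout_s: float = 10.0):
    """Convert raw tool failures into typed AgentErrors with context for alert routing."""
    try:
        # execute_tool is assumed: the function that actually runs the tool
        return await asyncio.wait_for(execute_tool(tool_name, arguments), timeout=timeout_s)
    except asyncio.TimeoutError:
        raise AgentError(
            AgentErrorType.TOOL_TIMEOUT,
            f"Tool '{tool_name}' exceeded {timeout_s}s",
            severity="warning",
            context={"tool_name": tool_name, "timeout_s": timeout_s},
        )
    except Exception as exc:
        raise AgentError(
            AgentErrorType.TOOL_EXECUTION_FAILED,
            f"Tool '{tool_name}' failed: {exc}",
            context={"tool_name": tool_name},
        )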

Integrating Sentry for Error Tracking

Sentry captures exceptions with full stack traces, groups them by root cause, and tracks their frequency over time. Configure it to enrich agent errors with custom context.

import sentry_sdk
from sentry_sdk import set_tag, set_context, capture_exception

sentry_sdk.init(
    dsn="https://<public-key>@<org-id>.ingest.sentry.io/<project-id>",
    traces_sample_rate=0.1,
    environment="production",
    release="agent-service@<version>",
)

async def handle_agent_error(error: AgentError, conversation_id: str, user_id: str):
    """Report agent errors to Sentry with rich context."""
    set_tag("error_type", error.error_type.value)
    set_tag("severity", error.severity)
    set_tag("agent_name", error.context.get("agent_name", "unknown"))

    set_context("agent", {
        "conversation_id": conversation_id,
        "user_id": user_id,
        "error_type": error.error_type.value,
        "model": error.context.get("model"),
        "tool_name": error.context.get("tool_name"),
        "step": error.context.get("step"),
    })

    capture_exception(error)
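
A minimal sketch of where this handler sits in the request path; run_agent_turn is a placeholder for your existing agent loop:

async def process_message(message: str, conversation_id: str, user_id: str) -> str:
    try:
        return await run_agent_turn(message, conversation_id)  # assumed: your agent loop
    except AgentError as err:
        await handle_agent_error(err, conversation_id, user_id)
        raise
    except Exception as exc:
        # Unclassified failures still reach Sentry, just without the agent taxonomy tags
        capture_exception(exc)
        raise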

Building a Custom Alert Router

Different error types warrant different responses. Infrastructure errors need PagerDuty pages. LLM behavior errors need Slack notifications. Validation errors need logging for later analysis.

from dataclasses import dataclass
import httpx

@dataclass
class AlertConfig:
    pagerduty_key: str
    slack_webhook: str
    email_endpoint: str

class AlertRouter:
    def __init__(self, config: AlertConfig):
        self.config = config
        self.client = httpx.AsyncClient()

    async def route_alert(self, error: AgentError, conversation_id: str):
        error_type = error.error_type

        # Critical infrastructure errors -> PagerDuty
        if error_type in (
            AgentErrorType.LLM_API_UNREACHABLE,
            AgentErrorType.DATABASE_CONNECTION_FAILED,
            AgentErrorType.TOOL_SERVER_DOWN,
        ):
            await self._page_oncall(error, conversation_id)
            await self._notify_slack(error, conversation_id, channel="#incidents")

        # LLM behavior errors -> Slack warning
        elif error_type in (
            AgentErrorType.LLM_INFINITE_LOOP,
            AgentErrorType.LLM_CONTEXT_OVERFLOW,
        ):
            await self._notify_slack(error, conversation_id, channel="#agent-alerts")

        # Tool errors -> Slack if frequent
        elif error_type in (
            AgentErrorType.TOOL_EXECUTION_FAILED,
            AgentErrorType.TOOL_TIMEOUT,
        ):
            if await self._error_rate_exceeds_threshold(error_type, threshold=10, window_minutes=5):
                await self._notify_slack(error, conversation_id, channel="#agent-alerts")

    async def _page_oncall(self, error: AgentError, conversation_id: str):
        await self.client.post(
            "https://events.pagerduty.com/v2/enqueue",
            json={
                "routing_key": self.config.pagerduty_key,
                "event_action": "trigger",
                "payload": {
                    "summary": f"Agent error: {error.error_type.value} - {str(error)}",
                    "severity": "critical",
                    "source": "agent-service",
                    "custom_details": {
                        "conversation_id": conversation_id,
                        **error.context,
                    },
                },
            },
        )

    async def _notify_slack(self, error: AgentError, conversation_id: str, channel: str):
        await self.client.post(
            self.config.slack_webhook,
            json={
                "channel": channel,
                "text": f"*Agent Error*: {error.error_type.value}\n"
                        f"Message: {str(error)}\n"
                        f"Conversation: {conversation_id}",
            },
        )
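
The router calls _error_rate_exceeds_threshold, which is not shown here. A minimal in-memory sketch, assuming a single-process deployment (multiple replicas would need a shared store such as Redis):

import time
from collections import defaultdict, deque

class ErrorRateTracker:
    """Sliding-window counter for per-error-type rates."""

    def __init__(self):
        self._events: dict[AgentErrorType, deque] = defaultdict(deque)

    def record_and_check(self, error_type: AgentErrorType, threshold: int, window_minutes: int) -> bool:
        now = time.monotonic()
        window = self._events[error_type]
        window.append(now)
        # Drop timestamps that have aged out of the window
        cutoff = now - window_minutes * 60
        while window and window[0] < cutoff:
            window.popleft()
        return len(window) > threshold

With a tracker created in AlertRouter's __init__, _error_rate_exceeds_threshold reduces to a one-line delegation to record_and_check.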

Detecting Agent-Specific Failure Patterns

Some agent failures do not raise exceptions. Detect them with runtime checks.

MAX_TOOL_CALLS_PER_TURN = 10
MAX_AGENT_TURNS = 25

class AgentLoopGuard:
    def __init__(self):
        self.tool_call_count = 0
        self.turn_count = 0
        self.seen_tool_calls = []

    def check_tool_call(self, tool_name: str, arguments: dict):
        self.tool_call_count += 1
        call_signature = f"{tool_name}:{hash(str(sorted(arguments.items())))}"

        # Detect a loop: the same tool call (name + args) seen within the last 3 calls
        if call_signature in self.seen_tool_calls[-3:]:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Repeated tool call detected: {tool_name}",
                severity="critical",
                context={"tool_name": tool_name, "lookback_window": 3},
            )

        if self.tool_call_count > MAX_TOOL_CALLS_PER_TURN:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Tool call limit exceeded: {self.tool_call_count}",
                severity="error",
            )

        self.seen_tool_calls.append(call_signature)

    def check_turn(self):
        self.turn_count += 1
        if self.turn_count > MAX_AGENT_TURNS:
            raise AgentError(
                AgentErrorType.LLM_INFINITE_LOOP,
                f"Agent turn limit exceeded: {self.turn_count}",
                severity="error",
            )
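
A sketch of how the guard threads through an agent loop; llm_step, execute_tool, and the shape of step are assumptions standing in for whatever your agent framework provides:

async def run_turn_with_guard(messages: list[dict]) -> str:
    guard = AgentLoopGuard()
    while True:
        guard.check_turn()  # raises AgentError once MAX_AGENT_TURNS is exceeded
        step = await llm_step(messages)  # assumed: one LLM call returning tool calls or a final answer
        if step.final_answer is not None:
            return step.final_answer
        for call in step.tool_calls:
            guard.check_tool_call(call.name, call.arguments)  # raises AgentError on repeats and limits
            result = await execute_tool(call.name, call.arguments)
            messages.append({"role": "tool", "name": call.name, "content": result})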

FAQ

How do I avoid alert fatigue with AI agents?

Use rate-based alerting instead of per-error alerting. A single tool failure is normal — tools can be temporarily unavailable. Page oncall only when the error rate for a given type exceeds a threshold within a time window. For LLM behavior errors, alert on percentage of conversations affected rather than raw count. Review and tune thresholds weekly during the first month of deployment.
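
As a hedged illustration of the percentage-based rule (the counters themselves come from wherever you already aggregate conversation metrics):

def behavior_errors_need_alert(affected_conversations: int, total_conversations: int, threshold_pct: float = 2.0) -> bool:
    """Alert on LLM behavior errors only when they touch more than threshold_pct of conversations."""
    if total_conversations == 0:
        return False
    return (affected_conversations / total_conversations) * 100 >= threshold_pct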

Should I retry LLM calls automatically before raising an error?

Yes, but with limits. Retry transient errors like rate limits (HTTP 429) and server errors (HTTP 500-503) with exponential backoff, up to 3 attempts. Do not retry content policy violations (HTTP 400) or context length errors — these will fail again with the same input. Track retry counts in your error metadata so you can monitor retry rates.
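
A minimal sketch of that retry policy; the make_request callable and the status_code attribute are assumptions, since the exact exception shape varies by client library:

import asyncio
import random

RETRYABLE_STATUS = {429, 500, 502, 503}

async def call_llm_with_retries(make_request, max_attempts: int = 3):
    """Retry transient LLM API failures with exponential backoff; fail fast on everything else."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await make_request()
        except Exception as exc:
            status = getattr(exc, "status_code", None)  # attribute name depends on your SDK
            if status not in RETRYABLE_STATUS or attempt == max_attempts:
                raise
            # Exponential backoff with jitter: roughly 1s, 2s, 4s
            await asyncio.sleep(2 ** (attempt - 1) + random.random())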

How do I handle errors gracefully so the user gets a useful response?

Implement a fallback chain. If the primary model fails, try a fallback model. If all LLM calls fail, return a static message like "I am having trouble processing your request. Please try again in a moment." Never expose raw error messages or stack traces to users. Log the full error details for your engineering team and return a user-friendly message with a reference ID they can share with support.
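
A sketch of that fallback chain; the model names and the call_model wrapper are placeholders, and the reference ID is a short UUID attached to both the logged error and the user-facing message:

import uuid
from sentry_sdk import capture_exception

FALLBACK_MODELS = ["primary-model", "fallback-model"]  # placeholder identifiers

async def answer_with_fallback(prompt: str) -> str:
    reference_id = uuid.uuid4().hex[:8]
    for model in FALLBACK_MODELS:
        try:
            return await call_model(model, prompt)  # assumed: your existing per-model LLM wrapper
        except AgentError as err:
            err.context["reference_id"] = reference_id
            capture_exception(err)  # full details go to engineering, not the user
    # Every model failed: return a static, user-safe message with the reference ID
    return (
        "I am having trouble processing your request. Please try again in a moment. "
        f"(Reference: {reference_id})"
    )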


#ErrorTracking #Sentry #PagerDuty #Alerting #IncidentResponse #AgenticAI #LearnAI #AIEngineering
