Learn Agentic AI

Logging Best Practices for AI Agents: Structured Logs for Debugging and Audit

Implement structured logging for AI agent systems with correlation IDs, log levels, sensitive data redaction, and queryable JSON output that makes debugging production agent issues fast and audit-ready.

Why Standard Logging Falls Short for Agents

A typical web application logs a request, processes it, and logs a response. An AI agent might process a single user message through five or more steps: prompt construction, memory retrieval, LLM inference, tool calls, response validation, and memory storage. Each step can fail independently, and the failure modes are fundamentally different from traditional applications — an LLM might return a valid HTTP 200 response that contains completely wrong instructions for a tool call.

Standard print() statements or unstructured log lines make it nearly impossible to reconstruct what happened during a conversation. Structured logging with correlation IDs, consistent fields, and sensitive data redaction transforms your logs from a wall of text into a queryable debugging and audit system.

Setting Up Structured Logging with structlog

The structlog library produces JSON log lines with consistent fields that are easy to parse and query in log aggregation tools like Elasticsearch, Loki, or CloudWatch.

import structlog
import uuid

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.BoundLogger,
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)

def get_logger(agent_name: str, conversation_id: str | None = None):
    """Create a logger bound with agent context."""
    if conversation_id is None:
        conversation_id = str(uuid.uuid4())
    return structlog.get_logger().bind(
        agent_name=agent_name,
        conversation_id=conversation_id,
    )

Every log line produced by this logger automatically includes the agent name, conversation ID, timestamp, and log level — all as structured JSON fields.


Correlation IDs Across Agent Steps

A single conversation generates logs across multiple functions and sometimes multiple services. Bind a conversation ID at the start and pass the logger through each step so every log line is linked.

async def handle_conversation(user_message: str, user_id: str):
    conversation_id = str(uuid.uuid4())
    log = get_logger("support-agent", conversation_id).bind(user_id=user_id)

    log.info("conversation_started", message_length=len(user_message))

    # Memory retrieval
    log.info("memory_retrieval_started")
    memories = await retrieve_memories(user_message)
    log.info("memory_retrieval_completed", results_count=len(memories))

    # LLM call
    log.info("llm_call_started", model="gpt-4o")
    response = await call_llm(user_message, memories)
    message = response.choices[0].message
    log.info(
        "llm_call_completed",
        model="gpt-4o",
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        finish_reason=response.choices[0].finish_reason,
    )

    # Tool execution
    if message.tool_calls:
        for tool_call in message.tool_calls:
            log.info(
                "tool_call_started",
                tool_name=tool_call.function.name,
            )
            try:
                result = await execute_tool(tool_call)
                log.info(
                    "tool_call_completed",
                    tool_name=tool_call.function.name,
                    result_length=len(str(result)),
                )
            except Exception as e:
                log.error(
                    "tool_call_failed",
                    tool_name=tool_call.function.name,
                    error=str(e),
                )
                raise

    log.info("conversation_completed")
    return message.content

The resulting log output looks like this — every line shares the same conversation_id, making it trivial to filter in your log aggregation tool:

{"event": "conversation_started", "agent_name": "support-agent", "conversation_id": "a1b2c3d4...", "user_id": "user_789", "message_length": 142, "level": "info", "timestamp": "2026-03-17T10:30:00Z"}
{"event": "llm_call_completed", "agent_name": "support-agent", "conversation_id": "a1b2c3d4...", "model": "gpt-4o", "prompt_tokens": 1250, "completion_tokens": 340, "level": "info", "timestamp": "2026-03-17T10:30:02Z"}
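Even without an aggregation tool, JSON lines are easy to filter with the standard library alone. A small sketch (the sample records and IDs below are illustrative, not real output):

```python
import json

# Two sample log lines as they would appear in a JSONL log file
raw_logs = """\
{"event": "conversation_started", "conversation_id": "a1b2c3d4", "level": "info"}
{"event": "llm_call_completed", "conversation_id": "ffff0000", "level": "info"}
"""

def filter_by_conversation(lines, conversation_id):
    """Yield parsed log records belonging to one conversation."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("conversation_id") == conversation_id:
            yield record

matches = list(filter_by_conversation(raw_logs.splitlines(), "a1b2c3d4"))
# Only the first sample record carries the requested conversation_id
print([m["event"] for m in matches])
```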

Redacting Sensitive Data

Agent logs often contain user messages, PII, or API keys embedded in tool call arguments. Build a redaction processor that strips sensitive fields before they hit your log backend.

import re

SENSITIVE_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"(sk-|pk-|api[_-]?key[=:]\s*)[a-zA-Z0-9]{20,}"),
}

def redact_sensitive_data(logger, method_name, event_dict):
    """structlog processor that redacts PII from top-level string values.

    Note: nested dicts are not traversed. Flatten sensitive payloads
    before logging them, or extend this processor to recurse.
    """
    for key, value in event_dict.items():
        if isinstance(value, str):
            for pattern_name, pattern in SENSITIVE_PATTERNS.items():
                value = pattern.sub(f"[REDACTED_{pattern_name.upper()}]", value)
            event_dict[key] = value
    return event_dict

# Add to the full structlog processors list, before JSONRenderer
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        redact_sensitive_data,  # Runs before serialization
        structlog.processors.JSONRenderer(),
    ],
)
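Because the processor is just a function over the event dict, it can be sanity-checked without touching the logging pipeline at all. The snippet below repeats the email and phone patterns so it runs standalone:

```python
import re

# Same patterns and processor as above, repeated so this snippet is standalone
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
}

def redact_sensitive_data(logger, method_name, event_dict):
    for key, value in event_dict.items():
        if isinstance(value, str):
            for pattern_name, pattern in SENSITIVE_PATTERNS.items():
                value = pattern.sub(f"[REDACTED_{pattern_name.upper()}]", value)
            event_dict[key] = value
    return event_dict

# Hand-built event dict with PII embedded in a string value
event = {
    "event": "tool_call_started",
    "tool_args": "email=jane@example.com phone=555-123-4567",
}
redacted = redact_sensitive_data(None, "info", event)
# The email and phone number are replaced with [REDACTED_*] placeholders
print(redacted["tool_args"])
```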

Choosing Log Levels for Agent Events

Use consistent log levels across your agent codebase. A clear convention prevents important signals from being buried in noise.


Level When to Use
DEBUG Prompt contents, full LLM responses, tool arguments
INFO Step start/completion, token counts, conversation lifecycle
WARNING Retries, fallback model usage, slow LLM responses
ERROR Tool failures, LLM errors, validation failures
CRITICAL Agent loop crashes, data corruption, auth failures

In production, set the level to INFO and enable DEBUG only when actively investigating an issue. This keeps log volume manageable while preserving enough context for post-incident analysis.

FAQ

Should I log the full LLM prompt and response?

Log full prompts and responses at DEBUG level only. At INFO level, log metadata like token counts, model name, and finish reason. Full prompts can contain PII and consume significant storage — a single conversation might generate megabytes of prompt text. For audit scenarios, consider writing full prompts to a separate, access-controlled store with shorter retention.

How do I correlate logs across multiple agents in a multi-agent system?

Use two IDs: a conversation_id that is unique per user conversation and a trace_id that follows the request across agent handoffs. When your triage agent calls a specialist agent, pass both IDs in the request. This lets you filter by conversation to see the full user interaction or by trace to see the technical execution path.

What log aggregation tools work best for agent logs?

Any tool that supports structured JSON logs works well. Grafana Loki is lightweight and integrates directly with Grafana dashboards. Elasticsearch with Kibana provides powerful full-text search across log fields. For cloud-native setups, AWS CloudWatch Logs Insights or Google Cloud Logging both support JSON field queries natively.


