Learn Agentic AI

Logging Best Practices for AI Agents: Structured Logs for Debugging and Audit

Implement structured logging for AI agent systems with correlation IDs, log levels, sensitive data redaction, and queryable JSON output that makes debugging production agent issues fast and audit-ready.

Why Standard Logging Falls Short for Agents

A typical web application logs a request, processes it, and logs a response. An AI agent might process a single user message through five or more steps: prompt construction, memory retrieval, LLM inference, tool calls, response validation, and memory storage. Each step can fail independently, and the failure modes are fundamentally different from traditional applications — an LLM might return a valid HTTP 200 response that contains completely wrong instructions for a tool call.

Standard print() statements or unstructured log lines make it nearly impossible to reconstruct what happened during a conversation. Structured logging with correlation IDs, consistent fields, and sensitive data redaction transforms your logs from a wall of text into a queryable debugging and audit system.

Setting Up Structured Logging with structlog

The structlog library produces JSON log lines with consistent fields that are easy to parse and query in log aggregation tools like Elasticsearch, Loki, or CloudWatch.

import structlog
import uuid

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.BoundLogger,
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
)

def get_logger(agent_name: str, conversation_id: str | None = None):
    """Create a logger bound with agent context."""
    if conversation_id is None:
        conversation_id = str(uuid.uuid4())
    return structlog.get_logger().bind(
        agent_name=agent_name,
        conversation_id=conversation_id,
    )

Every log line produced by this logger automatically includes the agent name, conversation ID, timestamp, and log level — all as structured JSON fields.


Correlation IDs Across Agent Steps

A single conversation generates logs across multiple functions and sometimes multiple services. Bind a conversation ID at the start and pass the logger through each step so every log line is linked.

async def handle_conversation(user_message: str, user_id: str):
    conversation_id = str(uuid.uuid4())
    log = get_logger("support-agent", conversation_id).bind(user_id=user_id)

    log.info("conversation_started", message_length=len(user_message))

    # Memory retrieval
    log.info("memory_retrieval_started")
    memories = await retrieve_memories(user_message)
    log.info("memory_retrieval_completed", results_count=len(memories))

    # LLM call
    log.info("llm_call_started", model="gpt-4o")
    response = await call_llm(user_message, memories)
    message = response.choices[0].message
    log.info(
        "llm_call_completed",
        model="gpt-4o",
        prompt_tokens=response.usage.prompt_tokens,
        completion_tokens=response.usage.completion_tokens,
        finish_reason=response.choices[0].finish_reason,
    )

    # Tool execution
    if message.tool_calls:
        for tool_call in message.tool_calls:
            log.info(
                "tool_call_started",
                tool_name=tool_call.function.name,
            )
            try:
                result = await execute_tool(tool_call)
                log.info(
                    "tool_call_completed",
                    tool_name=tool_call.function.name,
                    result_length=len(str(result)),
                )
            except Exception as e:
                log.error(
                    "tool_call_failed",
                    tool_name=tool_call.function.name,
                    error=str(e),
                )
                raise

    log.info("conversation_completed")
    return message.content

The resulting log output looks like this — every line shares the same conversation_id, making it trivial to filter in your log aggregation tool:

{"event": "conversation_started", "agent_name": "support-agent", "conversation_id": "a1b2c3d4...", "user_id": "user_789", "message_length": 142, "level": "info", "timestamp": "2026-03-17T10:30:00Z"}
{"event": "llm_call_completed", "agent_name": "support-agent", "conversation_id": "a1b2c3d4...", "model": "gpt-4o", "prompt_tokens": 1250, "completion_tokens": 340, "level": "info", "timestamp": "2026-03-17T10:30:02Z"}
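Even without an aggregation tool, JSON lines are easy to filter with the standard library alone. A small sketch (the sample records and IDs below are illustrative, not real output):

```python
import json

# Two sample log lines as they would appear in a JSONL log file
raw_logs = """\
{"event": "conversation_started", "conversation_id": "a1b2c3d4", "level": "info"}
{"event": "llm_call_completed", "conversation_id": "ffff0000", "level": "info"}
"""

def filter_by_conversation(lines, conversation_id):
    """Yield parsed log records belonging to one conversation."""
    for line in lines:
        if not line.strip():
            continue
        record = json.loads(line)
        if record.get("conversation_id") == conversation_id:
            yield record

matches = list(filter_by_conversation(raw_logs.splitlines(), "a1b2c3d4"))
# Only the first sample record carries the requested conversation_id
print([m["event"] for m in matches])
```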

Redacting Sensitive Data

Agent logs often contain user messages, PII, or API keys embedded in tool call arguments. Build a redaction processor that strips sensitive fields before they hit your log backend.

import re

SENSITIVE_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "api_key": re.compile(r"(sk-|pk-|api[_-]?key[=:]\s*)[a-zA-Z0-9]{20,}"),
}

def redact_sensitive_data(logger, method_name, event_dict):
    """structlog processor that redacts PII from top-level string values.

    Note: nested dicts are not traversed. Flatten sensitive payloads
    before logging them, or extend this processor to recurse.
    """
    for key, value in event_dict.items():
        if isinstance(value, str):
            for pattern_name, pattern in SENSITIVE_PATTERNS.items():
                value = pattern.sub(f"[REDACTED_{pattern_name.upper()}]", value)
            event_dict[key] = value
    return event_dict

# Add to the full structlog processors list, before JSONRenderer
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        redact_sensitive_data,  # Runs before serialization
        structlog.processors.JSONRenderer(),
    ],
)
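Because the processor is just a function over the event dict, it can be sanity-checked without touching the logging pipeline at all. The snippet below repeats the email and phone patterns so it runs standalone:

```python
import re

# Same patterns and processor as above, repeated so this snippet is standalone
SENSITIVE_PATTERNS = {
    "email": re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}"),
    "phone": re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b"),
}

def redact_sensitive_data(logger, method_name, event_dict):
    for key, value in event_dict.items():
        if isinstance(value, str):
            for pattern_name, pattern in SENSITIVE_PATTERNS.items():
                value = pattern.sub(f"[REDACTED_{pattern_name.upper()}]", value)
            event_dict[key] = value
    return event_dict

# Hand-built event dict with PII embedded in a string value
event = {
    "event": "tool_call_started",
    "tool_args": "email=jane@example.com phone=555-123-4567",
}
redacted = redact_sensitive_data(None, "info", event)
# The email and phone number are replaced with [REDACTED_*] placeholders
print(redacted["tool_args"])
```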

Choosing Log Levels for Agent Events

Use consistent log levels across your agent codebase. A clear convention prevents important signals from being buried in noise.


Level When to Use
DEBUG Prompt contents, full LLM responses, tool arguments
INFO Step start/completion, token counts, conversation lifecycle
WARNING Retries, fallback model usage, slow LLM responses
ERROR Tool failures, LLM errors, validation failures
CRITICAL Agent loop crashes, data corruption, auth failures

In production, set the level to INFO and enable DEBUG only when actively investigating an issue. This keeps log volume manageable while preserving enough context for post-incident analysis.

FAQ

Should I log the full LLM prompt and response?

Log full prompts and responses at DEBUG level only. At INFO level, log metadata like token counts, model name, and finish reason. Full prompts can contain PII and consume significant storage — a single conversation might generate megabytes of prompt text. For audit scenarios, consider writing full prompts to a separate, access-controlled store with shorter retention.

How do I correlate logs across multiple agents in a multi-agent system?

Use two IDs: a conversation_id that is unique per user conversation and a trace_id that follows the request across agent handoffs. When your triage agent calls a specialist agent, pass both IDs in the request. This lets you filter by conversation to see the full user interaction or by trace to see the technical execution path.

What log aggregation tools work best for agent logs?

Any tool that supports structured JSON logs works well. Grafana Loki is lightweight and integrates directly with Grafana dashboards. Elasticsearch with Kibana provides powerful full-text search across log fields. For cloud-native setups, AWS CloudWatch Logs Insights or Google Cloud Logging both support JSON field queries natively.


