
Built-in Tracing in OpenAI Agents SDK: Visualize and Debug Workflows

Learn how the OpenAI Agents SDK automatically traces every agent run with agent_span, generation_span, and function_span, and how to visualize traces in the OpenAI dashboard for debugging.

Why Tracing Matters for Agentic Systems

When you build a traditional API, debugging is straightforward: you read the request, follow the handler logic, and inspect the response. Agentic systems shatter that simplicity. A single user query might trigger an orchestrator agent, which delegates to two specialist agents, each calling three tools, with the orchestrator looping back for a second pass based on intermediate results. Without tracing, debugging this is like navigating a cave without a flashlight.

Tracing gives you a structured, hierarchical record of everything that happened during an agent run. You see which agent was active, what LLM calls were made, which tools were invoked, what arguments were passed, and how long each step took. OpenAI's Agents SDK ships with automatic tracing built in, so you get this visibility without writing a single line of instrumentation code.

How Auto-Tracing Works

Every call to Runner.run() automatically creates a trace — a top-level container that groups all the spans generated during that execution. Within the trace, the SDK creates three types of spans:

  • agent_span: Created whenever an agent becomes active. If your orchestrator hands off to a research agent, you will see separate agent spans for each.
  • generation_span: Created for every LLM API call. This captures the model name, input messages, output, token counts, and latency.
  • function_span: Created whenever a tool function is invoked. This records the tool name, input arguments, and return value.

The overall run flow that produces these spans:

flowchart LR
    INPUT(["User input"])
    AGENT["Agent<br/>name plus instructions"]
    HAND{"Handoff to<br/>another agent?"}
    SUB["Sub-agent<br/>specialist"]
    GUARD{"Guardrail<br/>passed?"}
    TOOL["Tool call"]
    SDK[("Tracing<br/>OpenAI dashboard")]
    OUT(["Final output"])
    INPUT --> AGENT --> HAND
    HAND -->|Yes| SUB --> GUARD
    HAND -->|No| GUARD
    GUARD -->|Yes| TOOL --> AGENT
    GUARD -->|Block| OUT
    AGENT --> OUT
    AGENT --> SDK
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style SDK fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff

Here is a minimal example that produces a fully traced run:

from agents import Agent, Runner, function_tool

@function_tool
def get_weather(city: str) -> str:
    """Fetch current weather for a city."""
    return f"72F and sunny in {city}"

@function_tool
def get_population(city: str) -> str:
    """Fetch population data for a city."""
    return f"{city} has a population of 1.5 million"

agent = Agent(
    name="City Info Agent",
    instructions="You provide city information using the available tools.",
    tools=[get_weather, get_population],
)

result = Runner.run_sync(agent, "Tell me about Austin, Texas")
print(result.final_output)

When this code runs, the SDK automatically generates a trace with the following hierarchy:

Trace: "Agent run"
  +-- agent_span: City Info Agent
       +-- generation_span: gpt-4o (initial reasoning)
       +-- function_span: get_weather(city="Austin, Texas")
       +-- function_span: get_population(city="Austin, Texas")
       +-- generation_span: gpt-4o (final synthesis)

You did not annotate anything. The SDK intercepted every meaningful step and recorded it.
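You can also group several related runs under a single trace using the SDK's trace context manager, so that one logical workflow spanning multiple Runner calls appears as one hierarchy in the dashboard. A minimal sketch, reusing the agent defined above:

```python
# Sketch: grouping two runs under one parent trace with the SDK's
# `trace` context manager. Assumes `agent` is the City Info Agent
# defined earlier.
from agents import Runner, trace

with trace("City comparison"):  # one parent trace for both runs
    austin = Runner.run_sync(agent, "Tell me about Austin, Texas")
    denver = Runner.run_sync(agent, "Tell me about Denver, Colorado")

print(austin.final_output)
print(denver.final_output)
```

Without the context manager, each Runner.run_sync call would produce its own separate trace.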

Viewing Traces in the OpenAI Dashboard

Traces generated by the Agents SDK are automatically sent to the OpenAI platform. Navigate to the Traces section in the OpenAI dashboard to see a timeline view of every run. Each trace can be expanded to reveal the full span hierarchy.


The dashboard provides several critical debugging views:

Timeline View — Shows spans arranged on a horizontal time axis. This immediately reveals where your agent is spending time. If a tool call takes 3 seconds while everything else takes milliseconds, you spot the bottleneck instantly.

Span Detail View — Click any span to see its full payload. For a generation_span, you see the exact messages sent to the model, the completion returned, the token count, and the model used. For a function_span, you see the arguments and return value.

Trace Metadata — Each trace carries metadata including a unique trace ID, the total duration, the workflow name, and any custom tags you attach. This makes it easy to filter traces in the dashboard.

Controlling Trace Behavior

By default, every Runner.run() call is traced. You can customize this behavior:

from agents import Runner, RunConfig

# Disable tracing for a specific run
result = Runner.run_sync(agent, "Hello", run_config=RunConfig(tracing_disabled=True))

# Set a custom workflow name for easier filtering
result = Runner.run_sync(
    agent,
    "Tell me about Austin",
    run_config=RunConfig(workflow_name="city-info-lookup"),
)

Setting a meaningful workflow_name is strongly recommended for production systems. Instead of seeing dozens of generic "Agent run" traces, you see "lead-qualification," "support-ticket-triage," and "document-summarization," making it trivial to filter and compare.
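RunConfig also accepts fields for correlating and tagging traces. A sketch, assuming RunConfig's group_id and trace_metadata fields:

```python
# Sketch: attaching a group ID and custom metadata to a run so related
# traces (e.g. all turns of one conversation) can be filtered together.
# The field names assume RunConfig's `group_id` and `trace_metadata`;
# the ID and tag values here are hypothetical.
from agents import Runner, RunConfig

result = Runner.run_sync(
    agent,
    "Tell me about Austin",
    run_config=RunConfig(
        workflow_name="city-info-lookup",
        group_id="conversation-8721",         # links traces across turns
        trace_metadata={"user_tier": "pro"},  # custom filterable tags
    ),
)
```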

Tracing Multi-Agent Handoffs

Tracing becomes especially valuable with handoffs. When one agent transfers control to another, the trace captures the full chain:

from agents import Agent, Runner, handoff

# search_web and read_document are assumed to be @function_tool-decorated
# helpers defined elsewhere.

research_agent = Agent(
    name="Research Agent",
    instructions="You perform deep research on topics.",
    tools=[search_web, read_document],
)

summary_agent = Agent(
    name="Summary Agent",
    instructions="You summarize research findings concisely.",
)

orchestrator = Agent(
    name="Orchestrator",
    instructions="Route research requests to the research agent, then summarize.",
    handoffs=[handoff(research_agent), handoff(summary_agent)],
)

result = Runner.run_sync(orchestrator, "Research quantum computing trends")

The resulting trace looks like:

Trace: "Agent run"
  +-- agent_span: Orchestrator
       +-- generation_span: gpt-4o (routing decision)
       +-- agent_span: Research Agent
            +-- generation_span: gpt-4o (research planning)
            +-- function_span: search_web(query="quantum computing 2026 trends")
            +-- function_span: read_document(url="...")
            +-- generation_span: gpt-4o (synthesis)
       +-- agent_span: Summary Agent
            +-- generation_span: gpt-4o (summarization)

This hierarchical view shows you exactly how control flowed between agents, which tools were invoked at each stage, and how long each agent held the conversation.
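The ASCII trees above follow directly from the nested span structure. As an illustration (not an SDK API), here is a small helper that renders a nested dict of spans in the same style:

```python
# Illustrative helper (not part of the SDK): render a nested span
# structure in the "+--" tree style shown above.
def render_spans(span: dict, depth: int = 0) -> list[str]:
    prefix = "  " * depth + ("+-- " if depth else "")
    lines = [f"{prefix}{span['kind']}: {span['name']}"]
    for child in span.get("children", []):
        lines.extend(render_spans(child, depth + 1))
    return lines

trace_tree = {
    "kind": "Trace", "name": "Agent run",
    "children": [
        {"kind": "agent_span", "name": "Orchestrator",
         "children": [
             {"kind": "generation_span", "name": "gpt-4o (routing decision)"},
             {"kind": "agent_span", "name": "Research Agent"},
         ]},
    ],
}

print("\n".join(render_spans(trace_tree)))
```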

Debugging Common Issues with Traces

Traces expose several common failure patterns that are otherwise difficult to diagnose:


Infinite Loops — If an agent keeps calling the same tool with identical arguments, the trace shows a repeating pattern of function_span entries. You can set max_turns on the Runner to prevent runaway execution and use the trace to identify why the agent is looping.
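A minimal sketch of that guard, assuming the max_turns parameter on Runner.run_sync and the SDK's MaxTurnsExceeded exception:

```python
# Sketch: capping agent turns to stop runaway loops. Assumes `max_turns`
# on Runner.run_sync and the MaxTurnsExceeded exception from the SDK;
# `agent` is the one defined earlier.
from agents import Runner
from agents.exceptions import MaxTurnsExceeded

try:
    result = Runner.run_sync(agent, "Tell me about Austin", max_turns=5)
except MaxTurnsExceeded:
    # The trace for this run shows the repeating function_span pattern
    # that caused the loop.
    print("Agent exceeded the turn budget; inspect the trace.")
```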

Wrong Agent Routing — In a multi-agent system, the trace reveals which agent handled each turn. If the orchestrator routes a billing question to the technical support agent, you see it immediately in the agent_span hierarchy.

Token Bloat — Generation spans include token counts. If a single LLM call consumes 15,000 tokens when you expected 2,000, the trace highlights the problem. This often points to overly verbose tool outputs being fed back into the conversation.
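One mitigation (illustrative, not an SDK feature) is to cap tool output size before it re-enters the conversation:

```python
# Illustrative helper (not an SDK feature): truncate verbose tool output
# before it is fed back into the conversation, using a rough
# 4-characters-per-token estimate rather than a real tokenizer.
def cap_tool_output(text: str, max_tokens: int = 500) -> str:
    max_chars = max_tokens * 4  # rough heuristic
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + "\n[output truncated]"

print(cap_tool_output("x" * 10_000, max_tokens=100))
```

Call it at the end of each tool function; the trace's generation spans will then show the token counts dropping back to the expected range.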

Slow Tool Calls — The timeline view shows duration for each span. A function_span that takes 8 seconds while others take 200 milliseconds points you directly to the external service or database query that needs optimization.

Best Practices for Production Tracing

  1. Always set workflow_name — Generic trace names become useless at scale. Name your workflows after the user intent they serve.

  2. Use trace IDs for correlation — Pass the trace ID into your application logs so you can cross-reference agent behavior with your existing observability stack.

  3. Monitor trace duration trends — A trace that averaged 2 seconds last week but now averages 6 seconds signals a regression, even if no errors are thrown.

  4. Review traces during incidents — When users report unexpected agent behavior, the trace is the first place to look. It shows you exactly what the agent did, not what you assumed it would do.

  5. Sample in high-traffic environments — If your agent handles thousands of requests per minute, trace a representative sample rather than every request to manage storage costs.
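The sampling decision in point 5 should be deterministic so that retries of the same request are traced consistently. A sketch of one approach: hash a stable request ID into a uniform bucket and compare it to the sample rate. The hashing scheme is our own; RunConfig(tracing_disabled=...) is the SDK knob it feeds.

```python
# Illustrative sketch: deterministic trace sampling. Hashing a stable
# request ID keeps the trace/no-trace decision consistent across retries
# of the same request. The scheme is our own, not an SDK feature.
import hashlib

def should_trace(request_id: str, sample_rate: float = 0.1) -> bool:
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

# Usage with the SDK (sketch):
# run_config = RunConfig(tracing_disabled=not should_trace(request_id))
```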

Built-in tracing transforms agent debugging from guesswork into inspection. The OpenAI Agents SDK makes this effortless by auto-instrumenting every agent run, LLM call, and tool invocation without requiring you to modify your application code.
