AI Infrastructure · 12 min read

OpenTelemetry GenAI Conventions for AI Agents in 2026

The OTel GenAI semantic conventions exited experimental for client spans in early 2026. Here's how CallSphere instruments 37 voice and chat agents with gen_ai.* attributes that work across Datadog, Honeycomb, and Grafana.

TL;DR — In 2026 you don't write custom span attributes for "model name" anymore. You use gen_ai.request.model and your traces work in every backend that supports OTel.

What goes wrong

```mermaid
flowchart LR
  Browser["Browser / Phone"] -- "WebSocket /ws" --> LB["Load Balancer<br/>sticky session"]
  LB --> Pod1["Node A · Socket.IO"]
  LB --> Pod2["Node B · Socket.IO"]
  Pod1 -- "pub/sub" --> Redis[("Redis cluster")]
  Pod2 -- "pub/sub" --> Redis
  Pod1 --> AI["AI Worker · OpenAI Realtime"]
  Pod2 --> AI
```

CallSphere reference architecture

For two years every team rolled its own LLM-tracing schema. model, llm.model, openai.model, anthropic.model — all meant the same thing, none queried the same way. A platform team that wanted to chart "tokens spent per model per service" had to write a per-vendor adapter for every framework. By late 2025, the OTel GenAI SIG stabilized client spans and metrics, and most agent frameworks (OpenAI Agents SDK, LangChain, LlamaIndex, AutoGen) shipped emitters by Q1 2026.

The trap is that the agent spec is still experimental, and most production systems are true agents, not single LLM calls. If you only instrument the chat-completions span, you miss the tool-call planning, the handoffs between sub-agents, and the agent loop itself. You end up with a trace that looks fast and an experience that feels slow.

How to monitor

Use three layers of OTel GenAI conventions:

  1. gen_ai.client spans (stable) — one per LLM round-trip. Attributes: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons.
  2. gen_ai.agent spans (experimental) — one per agent invocation. Attributes: gen_ai.agent.name, gen_ai.agent.id, gen_ai.agent.description.
  3. gen_ai.tool.* events — attached to agent spans. Captures every tool call the agent makes and its result.

Standard metrics in 2026: gen_ai.client.token.usage (histogram), gen_ai.client.operation.duration (histogram). Datadog, Honeycomb, Grafana, and OpenObserve all auto-detect these.
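Because the metric names are standard, dashboard queries are portable too. A hedged PromQL sketch for a "tokens per model per service" panel, assuming the collector's Prometheus exporter flattens dots to underscores and exports service.name as a label (both depend on your pipeline config):

```promql
# Token throughput per model per service, 5-minute rate.
# gen_ai_client_token_usage_sum is the histogram's sum series.
sum by (gen_ai_request_model, service_name) (
  rate(gen_ai_client_token_usage_sum[5m])
)
```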

CallSphere stack

We run 37 agents across six verticals on k3s with Cloudflare Tunnel. Every agent emits OTel GenAI spans through an OpenTelemetry Collector deployed as a DaemonSet. The collector tail-samples to 5% (100% for errors and slow turns) and forwards to two backends:

  • Honeycomb for tracing (developer ergonomics on agent traces)
  • Prometheus + Grafana for SLO dashboards
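A minimal collector sketch of that two-backend fan-out. Values here are illustrative; the Honeycomb OTLP endpoint and header are the publicly documented ones, but the pipeline details will differ per cluster:

```yaml
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp/honeycomb:
    endpoint: api.honeycomb.io:443
    headers:
      x-honeycomb-team: ${env:HONEYCOMB_API_KEY}
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/honeycomb]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```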

The Healthcare FastAPI service on :8084 decorates each route with our @trace_genai_agent decorator that auto-emits parent agent span and child client spans. The Real Estate 6-container pod sends spans across NATS subjects and reuses the trace context header so a single call shows as one trace across all six containers. Sales WebSocket workers (PM2) batch-export every 5 seconds. The After-hours Bull/Redis queue worker emits one trace per job — Bull's job ID becomes the trace ID prefix.
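The article doesn't show the Bull-job-to-trace mapping itself; one hedged way to do it is a deterministic 128-bit trace ID whose high bits encode the job ID (`trace_id_for_job` is an illustrative helper, assuming numeric Bull job IDs):

```python
import hashlib

def trace_id_for_job(job_id: str) -> str:
    """Derive a deterministic 32-hex-char (128-bit) trace ID whose
    prefix encodes the Bull job ID, so any trace can be looked up
    directly from the queue's job record."""
    # High 64 bits: the numeric job ID, zero-padded to 16 hex chars.
    prefix = format(int(job_id), "016x")
    # Low 64 bits: a stable hash of the job ID, so the full ID stays
    # unique even if numeric IDs are reused across queues.
    suffix = hashlib.sha256(job_id.encode()).hexdigest()[:16]
    return prefix + suffix

tid = trace_id_for_job("42")
# tid is 32 hex chars and starts with "000000000000002a" (0x2a == 42)
```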

Plans on /pricing include trace export to your own OTel collector at the $499 tier; $1499 enterprise gets a dedicated tenant in our Honeycomb. Try it on the 14-day trial.

Implementation

  1. Install the OTel SDK for your framework. For Python:
```bash
pip install opentelemetry-distro \
  opentelemetry-instrumentation-openai \
  opentelemetry-exporter-otlp
```
  2. Wrap your agent loop with explicit agent spans:
```python
from opentelemetry import trace

tracer = trace.get_tracer("callsphere.healthcare")

def run_agent(user_input: str):
    with tracer.start_as_current_span(
        "gen_ai.agent.invoke",
        attributes={
            "gen_ai.agent.name": "healthcare_intake",
            "gen_ai.agent.id": "hc-intake-v3",
            "gen_ai.system": "openai",
        },
    ) as span:
        # Tool calls and LLM calls happen inside this span;
        # auto-instrumentation adds the child gen_ai.client spans.
        result = agent_loop(user_input)
        span.set_attribute("gen_ai.completion.text", result.text[:512])
        return result
```
  3. Configure the collector to keep only the semconv attributes you chart:
```yaml
processors:
  transform:
    metric_statements:
      - context: datapoint
        statements:
          - keep_keys(attributes, ["gen_ai.request.model", "gen_ai.system"])
```
  4. Build dashboards on the standard names. A "tokens per model per route" panel that uses gen_ai.request.model works for OpenAI, Anthropic, and Cohere with no code changes.

  5. Tail-sample. Keep 100% of error traces, 100% of traces with first-token latency over 1500 ms, and 5% of everything else. Tail-sampling at the collector cuts trace storage by roughly 95%.
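That sampling policy maps onto the collector's tail_sampling processor roughly like this (a sketch; the decision wait and threshold values are illustrative):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-turns
        type: latency
        latency: {threshold_ms: 1500}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}
```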


FAQ

Q: Are GenAI agent spans stable yet? A: Client spans and metrics are stable. Agent and framework spans are experimental but have been very stable in practice through Q1 2026.

Q: Do I need a vendor SDK on top of OTel? A: No. OTel + auto-instrumentation covers 80% of needs. Add a vendor SDK (Langfuse, LangSmith) if you want their UI on top — they all consume OTel.

Q: How do I keep PII out of the spans? A: Use the collector's redaction processor or run Microsoft Presidio in a sidecar before export. Our /industries/healthcare build does this in the collector.

Q: Will my Datadog APM see this? A: Yes. Datadog LLM Observability natively maps OTel GenAI semconv to its product UI as of late 2025.

Q: What about voice-specific attributes? A: We add callsphere.audio.first_token_ms and callsphere.audio.barge_in_count as custom attributes — namespaced so they don't collide with future OTel additions.
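That namespacing rule is easy to enforce mechanically. A pure-Python sketch of a guard we could run before setting attributes (`add_custom_attrs` and the prefix list are illustrative names, not an OTel API):

```python
# Prefixes reserved by the OTel GenAI semantic conventions.
RESERVED_PREFIXES = ("gen_ai.",)

def add_custom_attrs(span_attrs: dict, custom: dict,
                     namespace: str = "callsphere.") -> dict:
    """Merge vendor-specific attributes into a span's attribute dict,
    rejecting any key that squats on the reserved gen_ai.* namespace
    or escapes the vendor namespace."""
    for key in custom:
        if any(key.startswith(p) for p in RESERVED_PREFIXES):
            raise ValueError(f"{key} collides with an OTel GenAI attribute")
        if not key.startswith(namespace):
            raise ValueError(f"{key} must be namespaced under {namespace}")
    return {**span_attrs, **custom}

attrs = add_custom_attrs(
    {"gen_ai.request.model": "gpt-4o"},
    {"callsphere.audio.first_token_ms": 420,
     "callsphere.audio.barge_in_count": 1},
)
```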

