
Building Custom Agent Dashboards: Visualizing Conversations, Costs, and Latency

Build production-grade Grafana dashboards for AI agent systems that visualize conversation throughput, per-model costs, LLM latency percentiles, and tool usage patterns using Prometheus metrics.

The Key Metrics Every Agent Dashboard Needs

Generic application dashboards track request rate, error rate, and latency. Agent dashboards need those plus metrics unique to LLM workloads: token consumption, cost per conversation, tool call success rates, and conversation completion rates. Without these, you are flying blind on the dimensions that matter most for agent reliability and cost control.

The foundation is a metrics collection layer that captures these signals at the right granularity, and a visualization layer that makes patterns visible at a glance.

Exposing Prometheus Metrics from Your Agent

Use the prometheus_client library to define counters, histograms, and gauges that capture agent-specific signals.

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Conversation metrics
conversations_total = Counter(
    "agent_conversations_total",
    "Total conversations started",
    ["agent_name", "status"],
)

# LLM call metrics
llm_call_duration = Histogram(
    "agent_llm_call_duration_seconds",
    "LLM call latency in seconds",
    ["model", "agent_name"],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0],
)

tokens_used = Counter(
    "agent_tokens_total",
    "Total tokens consumed",
    ["model", "token_type"],  # token_type: prompt or completion
)

# Tool metrics
tool_calls_total = Counter(
    "agent_tool_calls_total",
    "Total tool invocations",
    ["tool_name", "status"],
)

# Active conversations gauge
active_conversations = Gauge(
    "agent_active_conversations",
    "Currently active conversations",
    ["agent_name"],
)

# Start metrics server on port 9090
start_http_server(9090)
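To sanity-check what the exporter will serve before wiring up Prometheus, prometheus_client's generate_latest renders a registry in the text exposition format. A minimal sketch (it uses a throwaway CollectorRegistry so it does not clash with the module-level metrics above):

```python
from prometheus_client import Counter, CollectorRegistry, generate_latest

registry = CollectorRegistry()  # isolated registry just for this demo
conversations = Counter(
    "agent_conversations_total",
    "Total conversations started",
    ["agent_name", "status"],
    registry=registry,
)
conversations.labels(agent_name="support", status="completed").inc()

# Render the same text format that /metrics serves
exposition = generate_latest(registry).decode()
print(exposition)
```

The output contains one sample line per label combination, e.g. `agent_conversations_total{agent_name="support",status="completed"} 1.0` — the exact series names your PromQL queries will reference.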

Instrumenting the Agent Loop

Wrap the core agent operations to emit metrics on every call.

import time

# Assumes `llm_client` is an OpenAI-compatible async client created elsewhere
async def instrumented_llm_call(model: str, messages: list, agent_name: str):
    start = time.perf_counter()
    try:
        response = await llm_client.chat.completions.create(
            model=model, messages=messages
        )
        duration = time.perf_counter() - start
        llm_call_duration.labels(model=model, agent_name=agent_name).observe(duration)
        tokens_used.labels(model=model, token_type="prompt").inc(
            response.usage.prompt_tokens
        )
        tokens_used.labels(model=model, token_type="completion").inc(
            response.usage.completion_tokens
        )
        return response
    except Exception:
        # Record latency for failed calls too, so error spikes stay visible
        duration = time.perf_counter() - start
        llm_call_duration.labels(model=model, agent_name=agent_name).observe(duration)
        raise

async def instrumented_tool_call(tool_name: str, arguments: dict):
    try:
        result = await execute_tool(tool_name, arguments)
        tool_calls_total.labels(tool_name=tool_name, status="success").inc()
        return result
    except Exception:
        tool_calls_total.labels(tool_name=tool_name, status="error").inc()
        raise

async def run_conversation(user_id: str, message: str, agent_name: str):
    active_conversations.labels(agent_name=agent_name).inc()
    try:
        result = await agent.run(message)
        conversations_total.labels(agent_name=agent_name, status="completed").inc()
        return result
    except Exception:
        conversations_total.labels(agent_name=agent_name, status="failed").inc()
        raise
    finally:
        active_conversations.labels(agent_name=agent_name).dec()
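The usage object on the response also makes per-call cost estimation straightforward. A minimal sketch, assuming a hand-maintained pricing table — the gpt-4o rates below mirror the dashboard example later and will drift, so keep them in sync with your provider's current price list:

```python
# Per-1M-token prices in USD; illustrative values, not authoritative
PRICING = {
    "gpt-4o": {"prompt": 2.50, "completion": 10.00},
}

def estimate_cost_usd(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimated USD cost of a single LLM call."""
    rates = PRICING[model]
    return (prompt_tokens * rates["prompt"]
            + completion_tokens * rates["completion"]) / 1_000_000

# 1,000 prompt + 500 completion tokens on gpt-4o
print(estimate_cost_usd("gpt-4o", 1_000, 500))  # → 0.0075
```

You can emit this as its own Counter (e.g. a hypothetical `agent_cost_usd_total` labeled by model) if you would rather track dollars directly instead of deriving cost from token rates in PromQL.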

Building the Grafana Dashboard

Configure Prometheus as a Grafana data source, then create panels using PromQL queries for each KPI.
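Prometheus first needs a scrape job pointing at the endpoint that start_http_server(9090) exposes. A minimal fragment — the job name and target host are illustrative, not prescribed:

```yaml
# prometheus.yml (scrape config; hostname and job name are placeholders)
scrape_configs:
  - job_name: "agent-metrics"
    scrape_interval: 15s
    static_configs:
      - targets: ["agent-host:9090"]  # the port passed to start_http_server()
```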

Conversation throughput — conversations started per minute over time (rate() returns per-second values, so scale by 60):

rate(agent_conversations_total[5m]) * 60

LLM latency P95 — the 95th percentile response time by model (aggregate the bucket rates by le and model before taking the quantile):

histogram_quantile(0.95, sum by (le, model) (rate(agent_llm_call_duration_seconds_bucket[5m])))

Token burn rate — tokens per minute, split by prompt vs completion (again scaled from rate()'s per-second output):

rate(agent_tokens_total[5m]) * 60

Cost estimation panel — multiply token rates by per-token pricing using a recording rule or Grafana transformation. The constants below are gpt-4o's rates expressed in USD per token ($2.50 per 1M prompt tokens, $10 per 1M completion tokens); the result is USD per second:

rate(agent_tokens_total{token_type="prompt", model="gpt-4o"}[5m]) * 0.0000025
+
rate(agent_tokens_total{token_type="completion", model="gpt-4o"}[5m]) * 0.00001

Tool error rate — percentage of tool calls that fail:

sum by (tool_name) (rate(agent_tool_calls_total{status="error"}[5m]))
/ sum by (tool_name) (rate(agent_tool_calls_total[5m]))

The sum by (tool_name) is required: dividing the raw series directly would match the status="error" series only against itself (the label sets must be identical), so the ratio would always be 1.

Setting Up Alerts

Define Prometheus alerting rules that fire when agent KPIs breach thresholds.

# prometheus-alerts.yaml
groups:
  - name: agent_alerts
    rules:
      - alert: HighLLMLatency
        expr: histogram_quantile(0.95, rate(agent_llm_call_duration_seconds_bucket[5m])) > 5
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency exceeds 5 seconds"

      - alert: HighToolErrorRate
        expr: >
          sum by (tool_name) (rate(agent_tool_calls_total{status="error"}[10m]))
          / sum by (tool_name) (rate(agent_tool_calls_total[10m])) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Tool error rate above 10%"

FAQ

How many Prometheus labels should I use per metric?

Keep label cardinality low. Labels like model, agent_name, and status are fine because they have a small, bounded set of values. Never use labels with high cardinality like user_id or conversation_id — these will cause Prometheus memory and performance issues. Track per-user data in a separate analytics database instead.
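To see why this matters, the sketch below labels a counter by a hypothetical user_id and shows that every distinct value becomes its own time series in the exposition output — 1,000 users means 1,000 series for one metric:

```python
from prometheus_client import Counter, CollectorRegistry, generate_latest

registry = CollectorRegistry()
per_user = Counter(
    "demo_requests_total", "Requests by user",
    ["user_id"],  # high-cardinality label — an anti-pattern, shown on purpose
    registry=registry,
)

for uid in range(1000):
    per_user.labels(user_id=str(uid)).inc()

# Count the sample lines this metric now exports
series = [line for line in generate_latest(registry).decode().splitlines()
          if line.startswith("demo_requests_total{")]
print(len(series))  # → 1000
```

Prometheus must store and index each of those series separately, which is why unbounded labels blow up memory.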

Should I track metrics in the agent code or use a sidecar?

Instrument directly in the agent code for LLM-specific metrics like token counts and tool call results, because only the application has that context. Use a sidecar or service mesh for infrastructure metrics like HTTP request rate and network latency. The two approaches complement each other.

How do I estimate costs when using multiple models?

Create a pricing lookup that maps model names to per-token costs, then apply it as a Grafana transformation or Prometheus recording rule. Update the pricing table whenever your provider changes rates. Some teams store costs in a database and join with token metrics in Grafana for more flexibility.


#Dashboards #Grafana #Prometheus #Monitoring #AIAgents #AgenticAI #LearnAI #AIEngineering
