
OpenAI Agents SDK Performance Tuning: Reducing Latency and Token Usage in Production

Optimize your OpenAI Agents SDK deployments for production with techniques for connection reuse, prompt compression, tool result caching, parallel tool execution, and token budget management.

Where Agents Spend Time and Tokens

Before optimizing, you need to understand the cost profile of an agent run. There are three main sources of latency and token usage: model calls (the LLM inference itself), tool execution (network calls, database queries, computation), and conversation history (accumulated tokens from multi-turn sessions).

Each requires a different optimization strategy. This guide covers practical techniques for each category.
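
A quick first measurement is to time a run end to end and count how many model calls it made. A minimal sketch; model and tool time interleave inside the run loop, so this tells you where to look rather than giving an exact split across the three cost sources:

import time
from agents import Runner

async def timed_run(agent, message: str):
    # Wall-clock time for the whole run, plus the number of model calls it took
    start = time.perf_counter()
    result = await Runner.run(agent, input=message)
    elapsed = time.perf_counter() - start
    print(f"{elapsed:.2f}s total, {len(result.raw_responses)} model calls")
    return result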

Connection Reuse and Client Management

Creating a new HTTP client for every model call adds 50-200ms of overhead for TLS handshake and connection setup. Reuse clients across requests.

flowchart LR
    INPUT(["User input"])
    AGENT["Agent<br/>name plus instructions"]
    HAND{"Handoff to<br/>another agent?"}
    SUB["Sub-agent<br/>specialist"]
    GUARD{"Guardrail<br/>passed?"}
    TOOL["Tool call"]
    SDK[("Tracing<br/>OpenAI dashboard")]
    OUT(["Final output"])
    INPUT --> AGENT --> HAND
    HAND -->|Yes| SUB --> GUARD
    HAND -->|No| GUARD
    GUARD -->|Yes| TOOL --> AGENT
    GUARD -->|Block| OUT
    AGENT --> OUT
    AGENT --> SDK
    style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
    style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
    style SDK fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
from agents import Agent, Runner, set_default_openai_client
from openai import AsyncOpenAI
import httpx

# BAD: building a new client inside the handler forces a fresh TLS
# handshake and connection pool on every request
async def handle_slow(message: str):
    client = AsyncOpenAI(http_client=httpx.AsyncClient())
    set_default_openai_client(client)
    result = await Runner.run(agent, input=message)
    return result.final_output

# GOOD: one shared client with connection pooling, created at module load
_shared_client = AsyncOpenAI(
    http_client=httpx.AsyncClient(
        limits=httpx.Limits(
            max_connections=50,
            max_keepalive_connections=20,
            keepalive_expiry=30,
        ),
        timeout=httpx.Timeout(30.0, connect=5.0),
    )
)
set_default_openai_client(_shared_client)  # every run reuses this pool

agent = Agent(
    name="fast_agent",
    instructions="You are a helpful assistant.",
)

Prompt Optimization: Fewer Tokens, Same Quality

Every token in your agent's instructions costs money and adds latency. Compress your prompts without losing clarity.

# VERBOSE: 89 tokens
verbose_instructions = """
You are a customer support agent for our company. Your role is to help
customers with their questions and concerns. You should always be polite,
professional, and helpful. When you don't know the answer to a question,
you should let the customer know that you will escalate their issue to
a senior support agent who can help them further.
"""

# COMPRESSED: 42 tokens — same behavior
compressed_instructions = """Customer support agent. Be polite and professional.
If unsure, escalate to senior support. Use tools to look up account info."""

# STRUCTURED: Clear format reduces ambiguity, saving re-prompt tokens
structured_instructions = """Role: Customer support agent
Behavior: Polite, professional, concise
Tools: Use search_account before answering account questions
Escalation: Hand off to senior_agent if issue is unresolved after 2 attempts
Format: Reply in 1-3 sentences unless user asks for detail"""

optimized_agent = Agent(
    name="support",
    instructions=structured_instructions,
)
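
To check that a rewrite really shrinks the prompt, you can count tokens locally before deploying. A small sketch using the tiktoken library, which is a separate install rather than part of the Agents SDK; o200k_base is the tokenizer used by the GPT-4o family in recent tiktoken versions:

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

print(count_tokens(verbose_instructions))     # roughly the verbose count above
print(count_tokens(compressed_instructions))  # roughly the compressed count above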

Tool Result Caching

If a tool returns the same data for the same inputs, cache it. This saves both tool execution time and the tokens spent on redundant tool calls.

from agents import function_tool
import hashlib
import httpx
import json
import time

class ToolCache:
    def __init__(self, ttl_seconds: int = 300):
        self._cache: dict[str, tuple[str, float]] = {}
        self.ttl = ttl_seconds

    def get(self, key: str) -> str | None:
        if key in self._cache:
            value, timestamp = self._cache[key]
            if time.monotonic() - timestamp < self.ttl:
                return value
            del self._cache[key]
        return None

    def set(self, key: str, value: str):
        self._cache[key] = (value, time.monotonic())

    def make_key(self, tool_name: str, **kwargs) -> str:
        raw = json.dumps({"tool": tool_name, **kwargs}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

cache = ToolCache(ttl_seconds=600)

@function_tool
async def get_product_info(product_id: str) -> str:
    """Get product information by ID."""
    cache_key = cache.make_key("get_product_info", product_id=product_id)
    cached = cache.get(cache_key)
    if cached:
        return cached

    # Actual lookup (expensive)
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://api.example.com/products/{product_id}")
        result = resp.text

    cache.set(cache_key, result)
    return result

Conversation History Trimming

Long conversations accumulate tokens fast. Trim history to keep costs under control.

from agents.items import TResponseInputItem

class ConversationTrimmer:
    def __init__(self, max_turns: int = 20, max_chars: int = 50000):
        self.max_turns = max_turns
        self.max_chars = max_chars

    def trim(self, history: list[TResponseInputItem]) -> list[TResponseInputItem]:
        # Keep system messages and the most recent turns
        system_msgs = [m for m in history if isinstance(m, dict) and m.get("role") == "system"]
        non_system = [m for m in history if not (isinstance(m, dict) and m.get("role") == "system")]

        # Keep last N turns
        trimmed = non_system[-self.max_turns * 2:]  # 2 items per turn (user + assistant)

        # Truncate if still too long
        result = system_msgs + trimmed
        total_chars = sum(len(str(m)) for m in result)

        while total_chars > self.max_chars and len(result) > len(system_msgs) + 2:
            result.pop(len(system_msgs))  # Remove oldest non-system message
            total_chars = sum(len(str(m)) for m in result)

        return result

trimmer = ConversationTrimmer(max_turns=15, max_chars=40000)
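
Here is one way the trimmer might slot into a multi-turn loop, assuming you carry history forward with result.to_input_list(), the SDK helper that converts a finished run back into input items:

from agents import Agent, Runner

support_agent = Agent(name="support", instructions="Customer support agent.")

async def chat_turn(history: list, user_message: str):
    # Trim before each run so old turns stop inflating the prompt
    trimmed = trimmer.trim(history)
    trimmed.append({"role": "user", "content": user_message})

    result = await Runner.run(support_agent, input=trimmed)

    # to_input_list() returns the full transcript (tool calls included),
    # ready to be carried into the next turn
    return result.final_output, result.to_input_list()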

Parallel Tool Execution

When the agent calls multiple tools that are independent, execute them concurrently.

import asyncio
from agents import function_tool

@function_tool
async def get_user_orders(user_id: str) -> str:
    """Fetch user order history."""
    await asyncio.sleep(0.5)  # Simulates API call
    return f"3 orders for user {user_id}"

@function_tool
async def get_user_profile(user_id: str) -> str:
    """Fetch user profile."""
    await asyncio.sleep(0.3)  # Simulates API call
    return f"Profile for user {user_id}: Premium tier"

@function_tool
async def get_user_tickets(user_id: str) -> str:
    """Fetch user support tickets."""
    await asyncio.sleep(0.4)  # Simulates API call
    return f"2 open tickets for user {user_id}"

# The SDK handles parallel tool execution automatically when the
# model requests multiple tools in a single response. To encourage
# this, say so explicitly in the agent's instructions:

parallel_agent = Agent(
    name="support",
    instructions="""Customer support agent.
    When looking up user information, call get_user_profile,
    get_user_orders, and get_user_tickets simultaneously.""",
    tools=[get_user_orders, get_user_profile, get_user_tickets],
)
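
Instructions only nudge the model. If your installed SDK version exposes it, ModelSettings also carries a parallel_tool_calls flag that asks the model to batch independent calls into one response; treat the field name as an assumption and check your version. The same agent, with the flag set explicitly:

from agents import Agent, ModelSettings

parallel_agent = Agent(
    name="support",
    instructions="""Customer support agent.
    When looking up user information, call get_user_profile,
    get_user_orders, and get_user_tickets simultaneously.""",
    tools=[get_user_orders, get_user_profile, get_user_tickets],
    # assumption: parallel_tool_calls is available in your SDK version
    model_settings=ModelSettings(parallel_tool_calls=True),
)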

Token Budget Management

Set hard limits on token usage per agent run to prevent cost overruns.


from agents import ModelSettings

budget_agent = Agent(
    name="budget_agent",
    instructions="Be concise. Answer in 2-3 sentences maximum.",
    model_settings=ModelSettings(
        max_tokens=500,         # Limit output tokens
        temperature=0.3,        # Lower temperature = more deterministic = fewer retries
    ),
)
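
Note that max_tokens caps a single model response, not the whole run. For a per-run ceiling you can bound the loop with max_turns and inspect accumulated usage afterward. A sketch, assuming each entry in result.raw_responses exposes a usage object with input_tokens and output_tokens, as described in the FAQ below:

from agents import Runner
from agents.exceptions import MaxTurnsExceeded

TOKEN_BUDGET = 8000  # illustrative per-run ceiling

async def run_with_budget(message: str) -> str:
    try:
        # max_turns bounds how many model calls the loop may make
        result = await Runner.run(budget_agent, input=message, max_turns=5)
    except MaxTurnsExceeded:
        return "This request needs more steps than the budget allows."

    # assumption: each raw response carries a usage object with token counts
    total = sum(
        r.usage.input_tokens + r.usage.output_tokens
        for r in result.raw_responses
    )
    if total > TOKEN_BUDGET:
        # The tokens are already spent, so this is a tripwire for alerting
        print(f"run exceeded token budget: {total} > {TOKEN_BUDGET}")
    return result.final_output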

FAQ

What is the biggest performance win for most agent systems?

Connection reuse and prompt compression together typically cut latency by 30-50%. Connection reuse eliminates TLS overhead on every model call, and shorter prompts reduce both input token costs and time-to-first-token. Start with these two before investing in more complex optimizations.

How do I measure token usage per agent run?

The SDK returns usage information in the RunResult. Access result.raw_responses to get token counts from each model call. Sum up input_tokens and output_tokens across all responses to get total usage for the run. Log these to your metrics system to track trends.
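
A minimal sketch of that summation, under the same assumption that each raw response exposes a usage object with those fields:

from agents import Runner

async def run_and_log(agent, message: str):
    result = await Runner.run(agent, input=message)
    # One entry per model call; sum input and output tokens across the run
    total_in = sum(r.usage.input_tokens for r in result.raw_responses)
    total_out = sum(r.usage.output_tokens for r in result.raw_responses)
    # Replace print with your metrics client (StatsD, Prometheus, etc.)
    print(f"{len(result.raw_responses)} model calls, "
          f"{total_in} input / {total_out} output tokens")
    return result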

Should I use a smaller model for simple tasks?

Yes. Route simple queries (greetings, FAQ answers, status checks) to faster, cheaper models like GPT-4o-mini while keeping complex reasoning on GPT-4o or Claude. Use the custom model provider pattern to dynamically select models based on task complexity detected by a lightweight classifier.
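
A trimmed-down version of that routing, with a keyword heuristic standing in for the lightweight classifier; the model names and hint list are illustrative, not prescriptive:

from agents import Agent, Runner

simple_agent = Agent(
    name="simple",
    instructions="Answer greetings, FAQs, and status checks in one or two sentences.",
    model="gpt-4o-mini",
)
complex_agent = Agent(
    name="complex",
    instructions="Handle multi-step reasoning and account troubleshooting.",
    model="gpt-4o",
)

SIMPLE_HINTS = ("hi", "hello", "hours", "status", "price")  # crude stand-in classifier

async def route(message: str):
    target = simple_agent if message.lower().startswith(SIMPLE_HINTS) else complex_agent
    return await Runner.run(target, input=message)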


#OpenAIAgentsSDK #Performance #Optimization #Latency #TokenUsage #Production #AgenticAI #LearnAI #AIEngineering
