OpenAI Agents SDK Performance Tuning: Reducing Latency and Token Usage in Production
Optimize your OpenAI Agents SDK deployments for production with techniques for connection reuse, prompt compression, tool result caching, parallel tool execution, and token budget management.
Where Agents Spend Time and Tokens
Before optimizing, you need to understand the cost profile of an agent run. There are three main sources of latency and token usage: model calls (the LLM inference itself), tool execution (network calls, database queries, computation), and conversation history (accumulated tokens from multi-turn sessions).
Each requires a different optimization strategy. This guide covers practical techniques for each category.
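Before tuning anything, measure. Below is a minimal sketch of a phase timer — pure Python, no SDK calls, with illustrative phase names — that you can wrap around model calls and tool executions to see where a run actually spends its time:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

class PhaseTimer:
    """Accumulates wall-clock time per named phase of an agent run."""
    def __init__(self):
        self.totals: dict[str, float] = defaultdict(float)

    @contextmanager
    def phase(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start

    def report(self) -> dict[str, float]:
        return dict(self.totals)

timer = PhaseTimer()
with timer.phase("model_call"):
    time.sleep(0.01)  # stand-in for LLM inference
with timer.phase("tool_execution"):
    time.sleep(0.01)  # stand-in for a tool call
print(timer.report())
```

Logging these per-phase totals to your metrics system tells you which of the three categories below deserves attention first.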
Connection Reuse and Client Management
Creating a new HTTP client for every model call adds 50-200ms of overhead for TLS handshake and connection setup. Reuse clients across requests.
For context, the flow below shows everything a single agent run touches — every model call in this loop pays the connection overhead.
flowchart LR
INPUT(["User input"])
AGENT["Agent<br/>name plus instructions"]
HAND{"Handoff to<br/>another agent?"}
SUB["Sub-agent<br/>specialist"]
GUARD{"Guardrail<br/>passed?"}
TOOL["Tool call"]
SDK[("Tracing<br/>OpenAI dashboard")]
OUT(["Final output"])
INPUT --> AGENT --> HAND
HAND -->|Yes| SUB --> GUARD
HAND -->|No| GUARD
GUARD -->|Yes| TOOL --> AGENT
GUARD -->|Block| OUT
AGENT --> OUT
AGENT --> SDK
style AGENT fill:#4f46e5,stroke:#4338ca,color:#fff
style GUARD fill:#f59e0b,stroke:#d97706,color:#1f2937
style SDK fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
style OUT fill:#059669,stroke:#047857,color:#fff
from agents import Agent, Runner, set_default_openai_client
from openai import AsyncOpenAI
import httpx

# BAD: building a new client for every request — fresh TLS
# handshake and connection setup each time
async def handle_slow(message: str):
    client = AsyncOpenAI(http_client=httpx.AsyncClient())
    set_default_openai_client(client)
    result = await Runner.run(agent, input=message)
    return result.final_output

# GOOD: one shared client with connection pooling, created at module load
_shared_client = AsyncOpenAI(
    http_client=httpx.AsyncClient(
        limits=httpx.Limits(
            max_connections=50,
            max_keepalive_connections=20,
            keepalive_expiry=30,
        ),
        timeout=httpx.Timeout(30.0, connect=5.0),
    )
)
# Register the shared client once; every subsequent model call reuses it
set_default_openai_client(_shared_client)

agent = Agent(
    name="fast_agent",
    instructions="You are a helpful assistant.",
)
Prompt Optimization: Fewer Tokens, Same Quality
Every token in your agent's instructions costs money and adds latency. Compress your prompts without losing clarity.
# VERBOSE: 89 tokens
verbose_instructions = """
You are a customer support agent for our company. Your role is to help
customers with their questions and concerns. You should always be polite,
professional, and helpful. When you don't know the answer to a question,
you should let the customer know that you will escalate their issue to
a senior support agent who can help them further.
"""
# COMPRESSED: 42 tokens — same behavior
compressed_instructions = """Customer support agent. Be polite and professional.
If unsure, escalate to senior support. Use tools to look up account info."""
# STRUCTURED: Clear format reduces ambiguity, saving re-prompt tokens
structured_instructions = """Role: Customer support agent
Behavior: Polite, professional, concise
Tools: Use search_account before answering account questions
Escalation: Hand off to senior_agent if issue is unresolved after 2 attempts
Format: Reply in 1-3 sentences unless user asks for detail"""
optimized_agent = Agent(
    name="support",
    instructions=structured_instructions,
)
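To sanity-check a compression pass without calling the API, a useful rule of thumb is roughly four characters per token for English prose. The exact counts quoted above would come from a real tokenizer such as tiktoken; the sketch below is only that heuristic:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English prose."""
    return max(1, len(text.strip()) // 4)

verbose = """You are a customer support agent for our company. Your role is to help
customers with their questions and concerns. You should always be polite,
professional, and helpful."""

compressed = "Customer support agent. Be polite and professional."

print(estimate_tokens(verbose), estimate_tokens(compressed))
```

If the estimate doesn't drop meaningfully after a rewrite, the compression isn't buying you anything.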
Tool Result Caching
If a tool returns the same data for the same inputs, cache it. This saves both tool execution time and the tokens spent on redundant tool calls.
from agents import function_tool
import hashlib
import json
import time

class ToolCache:
    def __init__(self, ttl_seconds: int = 300):
        self._cache: dict[str, tuple[str, float]] = {}
        self.ttl = ttl_seconds

    def get(self, key: str) -> str | None:
        if key in self._cache:
            value, timestamp = self._cache[key]
            if time.monotonic() - timestamp < self.ttl:
                return value
            del self._cache[key]  # expired: evict and fall through
        return None

    def set(self, key: str, value: str):
        self._cache[key] = (value, time.monotonic())

    def make_key(self, tool_name: str, **kwargs) -> str:
        raw = json.dumps({"tool": tool_name, **kwargs}, sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

cache = ToolCache(ttl_seconds=600)

@function_tool
async def get_product_info(product_id: str) -> str:
    """Get product information by ID."""
    cache_key = cache.make_key("get_product_info", product_id=product_id)
    cached = cache.get(cache_key)
    if cached:
        return cached
    # Actual lookup (expensive)
    import httpx
    async with httpx.AsyncClient() as client:
        resp = await client.get(f"https://api.example.com/products/{product_id}")
        result = resp.text
    cache.set(cache_key, result)
    return result
Conversation History Trimming
Long conversations accumulate tokens fast. Trim history to keep costs under control.
from agents.items import TResponseInputItem

class ConversationTrimmer:
    def __init__(self, max_turns: int = 20, max_chars: int = 50000):
        self.max_turns = max_turns
        self.max_chars = max_chars

    def trim(self, history: list[TResponseInputItem]) -> list[TResponseInputItem]:
        # Keep system messages and the most recent turns
        system_msgs = [m for m in history if isinstance(m, dict) and m.get("role") == "system"]
        non_system = [m for m in history if not (isinstance(m, dict) and m.get("role") == "system")]
        # Keep the last N turns (roughly 2 items per turn: user + assistant)
        trimmed = non_system[-self.max_turns * 2:]
        # Drop the oldest non-system messages while over the character budget
        result = system_msgs + trimmed
        total_chars = sum(len(str(m)) for m in result)
        while total_chars > self.max_chars and len(result) > len(system_msgs) + 2:
            result.pop(len(system_msgs))  # remove oldest non-system message
            total_chars = sum(len(str(m)) for m in result)
        return result

trimmer = ConversationTrimmer(max_turns=15, max_chars=40000)
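A quick standalone check of the trimming logic — the trim function below is a condensed, dict-only version of the class above, so it runs without the SDK installed:

```python
def trim(history: list[dict], max_turns: int = 2, max_chars: int = 10_000) -> list[dict]:
    """Keep system messages plus the last max_turns user/assistant turns."""
    system = [m for m in history if m.get("role") == "system"]
    rest = [m for m in history if m.get("role") != "system"]
    result = system + rest[-max_turns * 2:]
    total = sum(len(str(m)) for m in result)
    while total > max_chars and len(result) > len(system) + 2:
        result.pop(len(system))  # drop oldest non-system message
        total = sum(len(str(m)) for m in result)
    return result

history = [{"role": "system", "content": "Support agent."}]
for i in range(5):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim(history, max_turns=2)
# System message survives; only the last two turns remain
print([m["content"] for m in trimmed])
```

Character counts are a proxy; if you need precise budgets, swap the len(str(m)) accounting for a tokenizer.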
Parallel Tool Execution
When the agent calls multiple tools that are independent, execute them concurrently.
import asyncio
from agents import Agent, function_tool

@function_tool
async def get_user_orders(user_id: str) -> str:
    """Fetch user order history."""
    await asyncio.sleep(0.5)  # Simulates API call
    return f"3 orders for user {user_id}"

@function_tool
async def get_user_profile(user_id: str) -> str:
    """Fetch user profile."""
    await asyncio.sleep(0.3)  # Simulates API call
    return f"Profile for user {user_id}: Premium tier"

@function_tool
async def get_user_tickets(user_id: str) -> str:
    """Fetch user support tickets."""
    await asyncio.sleep(0.4)  # Simulates API call
    return f"2 open tickets for user {user_id}"

# The SDK executes multiple tool calls from a single model response
# concurrently. To encourage the model to batch its calls, say so in
# the agent instructions:
parallel_agent = Agent(
    name="support",
    instructions="""Customer support agent.
When looking up user information, call get_user_profile,
get_user_orders, and get_user_tickets simultaneously.""",
    tools=[get_user_orders, get_user_profile, get_user_tickets],
)
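The latency math is worth seeing directly: three calls of 0.5s, 0.3s, and 0.4s take about 1.2s sequentially but only about 0.5s concurrently — bounded by the slowest call. A pure-asyncio sketch of that difference, with no SDK involved:

```python
import asyncio
import time

async def fake_tool(name: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stand-in for a network call
    return name

async def main() -> float:
    start = time.perf_counter()
    results = await asyncio.gather(
        fake_tool("orders", 0.5),
        fake_tool("profile", 0.3),
        fake_tool("tickets", 0.4),
    )
    elapsed = time.perf_counter() - start
    print(results, f"{elapsed:.2f}s")  # bounded by the slowest call, ~0.5s
    return elapsed

elapsed = asyncio.run(main())
```

This only helps when the tools are independent; if one tool's input depends on another's output, the model has to call them sequentially regardless.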
Token Budget Management
Set hard limits on token usage per agent run to prevent cost overruns.
from agents import Agent, ModelSettings

budget_agent = Agent(
    name="budget_agent",
    instructions="Be concise. Answer in 2-3 sentences maximum.",
    model_settings=ModelSettings(
        max_tokens=500,   # Limit output tokens per model call
        temperature=0.3,  # Lower temperature = more deterministic output
    ),
)
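Note that max_tokens caps a single model call, not the whole run. For a true per-run ceiling you need to track cumulative spend yourself — below is a sketch of such a tracker (illustrative, not an SDK feature) that you could consult after each turn and use to stop the loop once the budget is spent:

```python
class TokenBudget:
    """Tracks cumulative token spend for one agent run."""
    def __init__(self, limit: int):
        self.limit = limit
        self.spent = 0

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.spent += input_tokens + output_tokens

    @property
    def exhausted(self) -> bool:
        return self.spent >= self.limit

budget = TokenBudget(limit=4000)
budget.record(input_tokens=1200, output_tokens=300)
budget.record(input_tokens=1800, output_tokens=500)
print(budget.spent, budget.exhausted)  # 3800 False
```

Feed it the per-call usage numbers from the run result (see the FAQ below on measuring usage) and bail out — or hand off to a human — when exhausted flips to True.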
FAQ
What is the biggest performance win for most agent systems?
Connection reuse and prompt compression together typically cut latency by 30-50%. Connection reuse eliminates TLS overhead on every model call, and shorter prompts reduce both input token costs and time-to-first-token. Start with these two before investing in more complex optimizations.
How do I measure token usage per agent run?
The SDK returns usage information in the RunResult. Access result.raw_responses to get token counts from each model call. Sum up input_tokens and output_tokens across all responses to get total usage for the run. Log these to your metrics system to track trends.
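A sketch of that summation, assuming each raw response exposes a usage object with input_tokens and output_tokens fields as the SDK's response objects do — the stub classes here stand in for result.raw_responses so the example runs on its own:

```python
from dataclasses import dataclass

@dataclass
class Usage:
    input_tokens: int
    output_tokens: int

@dataclass
class StubResponse:
    """Stand-in for one entry of result.raw_responses."""
    usage: Usage

def total_usage(raw_responses) -> tuple[int, int]:
    """Sum input and output tokens across all model calls in a run."""
    total_in = sum(r.usage.input_tokens for r in raw_responses)
    total_out = sum(r.usage.output_tokens for r in raw_responses)
    return total_in, total_out

responses = [StubResponse(Usage(900, 150)), StubResponse(Usage(1100, 220))]
print(total_usage(responses))  # (2000, 370)
```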
Should I use a smaller model for simple tasks?
Yes. Route simple queries (greetings, FAQ answers, status checks) to faster, cheaper models like GPT-4o-mini while keeping complex reasoning on GPT-4o or Claude. Use the custom model provider pattern to dynamically select models based on task complexity detected by a lightweight classifier.
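A sketch of that routing heuristic — the model names are real OpenAI models, but the keyword classifier is deliberately crude and illustrative (a naive substring check will also match inside words); in practice you would pass the chosen name as the Agent's model parameter or use a small classifier model:

```python
SIMPLE_PATTERNS = ("hello", "thanks", "status", "hours", "faq")

def pick_model(user_message: str) -> str:
    """Route short, simple queries to a cheap model, the rest to a stronger one."""
    text = user_message.lower()
    if len(text) < 80 and any(p in text for p in SIMPLE_PATTERNS):
        return "gpt-4o-mini"
    return "gpt-4o"

print(pick_model("hello, what are your hours?"))
print(pick_model("Explain why my deployment pipeline fails intermittently under load"))
```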
#OpenAIAgentsSDK #Performance #Optimization #Latency #TokenUsage #Production #AgenticAI #LearnAI #AIEngineering