
Claude Prompt Caching for Agent Systems: Reducing Costs by 90% on Repeated Contexts

Learn how to use Claude's prompt caching to dramatically reduce costs in agent systems by caching system prompts, tool definitions, and reference documents across multiple requests.

The Cost Problem in Agent Systems

Agent systems are expensive because every turn in the agent loop resends the entire conversation context — system prompt, tool definitions, previous messages, and tool results. A 10-turn agent interaction with a 4,000-token system prompt and 10 tool definitions means sending those same tokens 10 times. For high-volume agent systems processing thousands of conversations daily, this repetition dominates your API bill.

Claude's prompt caching solves this by allowing you to mark content that should be cached on Anthropic's servers. Cached content is read at 90% lower cost than fresh input tokens, and once cached, it persists for 5 minutes (extended each time it is used).

How Prompt Caching Works

You mark content for caching by adding cache_control annotations to your message blocks. Anthropic caches everything up to the annotated block, and subsequent requests that match the cached prefix get the discount.

import anthropic

client = anthropic.Anthropic()

# System prompt with caching enabled
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a customer support agent for TechCorp. You handle billing inquiries, technical issues, and account management. Always verify customer identity before making account changes. Follow the escalation matrix for issues you cannot resolve...",
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "I need help with my billing"}
    ]
)

The cache_control: {"type": "ephemeral"} marker tells Anthropic to cache this content. The first request pays the cache write rate, 1.25x the base input token price for the 5-minute cache. Every subsequent request within the TTL whose prefix matches byte-for-byte pays only 10% of the base input cost for the cached portion.
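To see what this means for the bill, here is a back-of-envelope comparison of resending a static prefix with and without caching. The per-million-token prices below are illustrative assumptions for the sketch, not current list prices; check Anthropic's pricing page for real numbers.

```python
# Illustrative cost comparison: cached vs. uncached static prefix.
# Prices are ASSUMED (USD per million input tokens) for the sketch.
BASE_PRICE = 3.00          # fresh input tokens
CACHE_WRITE_PRICE = 3.75   # writes carry a surcharge over base input
CACHE_READ_PRICE = 0.30    # reads are ~10% of base input

def conversation_cost(prefix_tokens: int, turns: int, cached: bool) -> float:
    """Cost of resending a static prefix on every turn of a conversation."""
    per_million = 1_000_000
    if not cached:
        return turns * prefix_tokens * BASE_PRICE / per_million
    # First turn writes the cache; remaining turns read it.
    write = prefix_tokens * CACHE_WRITE_PRICE / per_million
    reads = (turns - 1) * prefix_tokens * CACHE_READ_PRICE / per_million
    return write + reads

# 8,000-token prefix (system prompt + tools), 10-turn conversation
uncached = conversation_cost(8_000, turns=10, cached=False)
cached = conversation_cost(8_000, turns=10, cached=True)
print(f"uncached: ${uncached:.4f}, cached: ${cached:.4f}")
```

Under these assumed prices the cached conversation costs roughly a fifth of the uncached one, and the gap widens with more turns because the one-time write fee amortizes away.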

Caching Tool Definitions

For agents with many tools, caching tool definitions provides the biggest savings because tool schemas are often large and identical across every request:

# Large tool definitions — perfect for caching
tools_with_cache = [
    {
        "name": "search_database",
        "description": "Search the product database by various criteria",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "category": {"type": "string"},
                "price_min": {"type": "number"},
                "price_max": {"type": "number"},
                "in_stock": {"type": "boolean"}
            },
            "required": ["query"]
        }
    },
    {
        "name": "create_ticket",
        "description": "Create a support ticket in the ticketing system",
        "input_schema": {
            "type": "object",
            "properties": {
                "subject": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high", "critical"]},
                "description": {"type": "string"},
                "customer_id": {"type": "string"}
            },
            "required": ["subject", "priority", "description"]
        },
        # Annotate the LAST tool in the list: everything up to and
        # including this block is cached as one prefix
        "cache_control": {"type": "ephemeral"}
    },
    # ... more tools (if you add tools, move cache_control to the final one)
]

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": long_system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=tools_with_cache,
    messages=messages,
)

When you send the same system prompt and tools across multiple conversations, the cached prefix is reused. The more tools and the longer the system prompt, the more you save.
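The same principle drives the agent loop itself: keep the prefix byte-identical across turns and only let the messages grow. A minimal sketch of that structure (the build_request helper and SYSTEM_BLOCKS constant are names invented for this example, not part of the SDK):

```python
# Sketch of a multi-turn agent loop that keeps the cached prefix
# byte-identical on every request. Only `conversation` grows; the
# system blocks and tools never change, so each turn after the first
# reads the prefix from cache.

def build_request(system_blocks: list, tools: list, conversation: list) -> dict:
    """Assemble kwargs for client.messages.create()."""
    return {
        "model": "claude-sonnet-4-20250514",
        "max_tokens": 1024,
        "system": system_blocks,   # identical every turn -> cache hit
        "tools": tools,            # identical every turn -> cache hit
        "messages": conversation,  # only this part changes
    }

SYSTEM_BLOCKS = [{
    "type": "text",
    "text": "You are a customer support agent for TechCorp...",
    "cache_control": {"type": "ephemeral"},
}]

conversation = [{"role": "user", "content": "I need help with my billing"}]
turn_1 = build_request(SYSTEM_BLOCKS, [], conversation)

# After the model replies, append and rebuild; the prefix is unchanged.
conversation.append({"role": "assistant", "content": "Sure, what is the issue?"})
conversation.append({"role": "user", "content": "I was double charged."})
turn_2 = build_request(SYSTEM_BLOCKS, [], conversation)
```

Reusing the same system-blocks object every turn makes it impossible to accidentally vary the prefix between requests.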

Caching Reference Documents

Agent systems that reference static documents — product catalogs, policy documents, knowledge bases — benefit enormously from caching:

# Load reference document once, cache it across all queries
with open("product_catalog.txt") as f:
    catalog_text = f.read()

def answer_product_question(question: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": "You are a product specialist. Answer questions using the product catalog below.",
            },
            {
                "type": "text",
                "text": catalog_text,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[
            {"role": "user", "content": question}
        ]
    )
    return response.content[0].text

A 50,000-token product catalog costs full price on the first call but only 10% on every subsequent call within the cache window. For a support system handling 100 queries per hour, this turns a substantial input cost into a rounding error.
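The arithmetic behind that claim, using an assumed $3 per million input tokens and a 1.25x cache-write surcharge (both placeholders; substitute current pricing):

```python
# Back-of-envelope hourly savings for the 50,000-token catalog above.
# The base price and write surcharge are ASSUMPTIONS for the sketch.
catalog_tokens = 50_000
queries_per_hour = 100
base = 3.00 / 1_000_000        # USD per fresh input token (assumed)

uncached_hourly = queries_per_hour * catalog_tokens * base
# One cache write (assumed 1.25x base), then reads at 10% of base
cached_hourly = (catalog_tokens * base * 1.25
                 + (queries_per_hour - 1) * catalog_tokens * base * 0.10)

print(f"uncached: ${uncached_hourly:.2f}/hr, cached: ${cached_hourly:.2f}/hr")
```

Under these assumptions the catalog drops from $15/hour to under $2/hour, and the write fee is paid once per cache window, not once per query.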

Cache-Friendly Architecture

Design your agent's message structure to maximize cache hit rates:

def build_agent_messages(system_prompt: str, tools: list,
                         reference_docs: list[str],
                         conversation_history: list) -> dict:
    """Structure messages for optimal caching.

    Order: system prompt -> reference docs -> tools -> conversation
    Static content comes first so the cached prefix is longest.
    """
    system_blocks = [
        {
            "type": "text",
            "text": system_prompt,
        }
    ]

    # Add reference documents
    for i, doc in enumerate(reference_docs):
        block = {"type": "text", "text": doc}
        # Cache after the last reference doc
        if i == len(reference_docs) - 1:
            block["cache_control"] = {"type": "ephemeral"}
        system_blocks.append(block)

    return {
        "system": system_blocks,
        "tools": tools,
        "messages": conversation_history,
    }

The key principle is prefix matching — caching works from the beginning of the content forward. Put static content (system prompt, reference docs) first, and dynamic content (conversation history) last.
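A corollary worth guarding against: any byte that changes before the cache_control marker invalidates the entire cached prefix. A hypothetical sketch of the mistake and the fix (both helper names are invented for illustration):

```python
# Cache-busting mistake: embedding per-request values in the static
# prefix. Any byte difference before the cache_control marker means
# a cache miss. Timestamps, request IDs, and user names belong in
# the messages, not the system blocks.
from datetime import datetime, timezone

def bad_system_blocks(prompt: str) -> list:
    # Changes every request -> the cache never hits
    stamped = f"{prompt}\nCurrent time: {datetime.now(timezone.utc)}"
    return [{"type": "text", "text": stamped,
             "cache_control": {"type": "ephemeral"}}]

def good_system_blocks(prompt: str) -> list:
    # Byte-identical every request -> cache hits after warm-up;
    # pass the timestamp inside the user message instead.
    return [{"type": "text", "text": prompt,
             "cache_control": {"type": "ephemeral"}}]
```

The good version produces the same blocks on every call, which is exactly the property the cache's prefix matching requires.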


Monitoring Cache Performance

Track cache hit rates to verify your caching strategy works:

def log_cache_metrics(response):
    """Report what fraction of input tokens came from cache.

    usage.input_tokens counts only uncached (fresh) input tokens;
    cache reads and cache writes are reported in separate fields.
    """
    usage = response.usage
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_created = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = usage.input_tokens

    total = cached + fresh
    if total > 0:
        cache_rate = cached / total * 100
        print(f"Cache hit rate: {cache_rate:.1f}%")
        print(f"Cached tokens: {cached}, Fresh tokens: {fresh}")
    if cache_created > 0:
        print(f"New cache created: {cache_created} tokens")
A healthy agent system should show 80-95% cache hit rates on the system prompt and tool definitions after the initial warm-up request.

FAQ

How long does the cache last?

Cached content has a 5-minute TTL that resets every time the cache is hit. In practice, any system handling more than one request per 5 minutes keeps the cache warm indefinitely. If your traffic is bursty with long gaps, consider sending a lightweight "keep-alive" request to prevent cache expiration before a burst.
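One hypothetical shape for such a keep-alive, run from a background thread; keep_cache_warm and KEEP_ALIVE_INTERVAL are names invented for this sketch, and the client is the Anthropic SDK client used throughout this article:

```python
# Hypothetical keep-alive for bursty traffic: a minimal request that
# touches the cached prefix before its 5-minute TTL lapses.
import threading

KEEP_ALIVE_INTERVAL = 240  # seconds, comfortably under the 5-min TTL

def keep_cache_warm(client, system_blocks, stop: threading.Event):
    """Ping the cached prefix every few minutes until stopped."""
    while not stop.wait(KEEP_ALIVE_INTERVAL):
        client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1,                      # minimal output cost
            system=system_blocks,              # identical cached prefix
            messages=[{"role": "user", "content": "ping"}],
        )
```

Each ping pays the 10% cache-read rate on the prefix, which is usually far cheaper than re-writing the cache from scratch after it expires. Weigh that against your actual gap lengths before adopting this pattern.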

Is there a minimum content size for caching?

Yes. The prompt must be at least 1,024 tokens for Claude Sonnet and Opus models (2,048 tokens for Claude Haiku models) to be eligible for caching. Shorter prompts are processed normally but silently skip caching, even with the cache_control annotation. Combine your system prompt with tool definitions or reference documents to meet the minimum.

Does caching work across different conversations?

Yes, as long as the cached prefix is identical. Two different users asking different questions but sharing the same system prompt and tools will share the cache. This makes caching especially powerful for multi-tenant agent systems where every conversation uses the same base configuration.


#Claude #PromptCaching #CostOptimization #Performance #Python #AgenticAI #LearnAI #AIEngineering
