
LLM Caching Strategies for Cost Optimization: Prompt, Semantic, and KV Caching

Practical techniques to reduce LLM inference costs by 40-80 percent through prompt caching, semantic caching, and KV cache optimization in production systems.

LLM Inference Costs Add Up Fast

At $3-15 per million input tokens for frontier models, LLM costs become significant at scale. A customer support agent handling 10,000 conversations per day with 2,000 tokens per conversation costs $60-300 daily on input tokens alone. Caching strategies can reduce these costs by 40-80 percent while simultaneously improving latency.
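The arithmetic behind that estimate, as a quick sanity check (the volumes and prices are the illustrative figures above):

```python
def daily_input_cost_usd(conversations: int, tokens_each: int, usd_per_mtok: float) -> float:
    """Daily input-token spend: total tokens divided by 1M, times the per-million price."""
    return conversations * tokens_each / 1_000_000 * usd_per_mtok

# 10,000 conversations x 2,000 tokens at $3-$15 per million input tokens
low = daily_input_cost_usd(10_000, 2_000, 3.0)    # $60/day
high = daily_input_cost_usd(10_000, 2_000, 15.0)  # $300/day
```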

Three caching approaches address different patterns: exact prompt caching, semantic caching, and KV cache optimization.

Exact Prompt Caching

The simplest approach: hash the full prompt and cache the response. If the same prompt appears again, return the cached response without calling the LLM. For orientation, the flowchart below traces where caches sit in a typical serving path (vLLM-style prefill, decode, and the paged KV cache, which the KV cache section returns to):

flowchart LR
    REQ(["Request"])
    BATCH["Continuous batching<br/>vLLM scheduler"]
    PREF{"Prefill or<br/>decode?"}
    PRE["Prefill phase<br/>parallel attention"]
    DEC["Decode phase<br/>token by token"]
    KV[("Paged KV cache")]
    SAMP["Sampling<br/>top-p, temp"]
    STREAM["Stream tokens<br/>to client"]
    REQ --> BATCH --> PREF
    PREF -->|First token| PRE --> KV
    PREF -->|Next token| DEC
    KV --> DEC --> SAMP --> STREAM
    SAMP -->|EOS| DONE(["Response complete"])
    style BATCH fill:#4f46e5,stroke:#4338ca,color:#fff
    style KV fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style STREAM fill:#0ea5e9,stroke:#0369a1,color:#fff
    style DONE fill:#059669,stroke:#047857,color:#fff
A minimal Redis-backed implementation:

import hashlib
import json

import redis
from openai import AsyncOpenAI

cache = redis.Redis(host="localhost", port=6379, db=0)
openai_client = AsyncOpenAI()

async def cached_llm_call(messages: list, model: str, ttl: int = 3600):
    # sort_keys makes the hash deterministic for equivalent payloads
    cache_key = hashlib.sha256(
        json.dumps({"messages": messages, "model": model}, sort_keys=True).encode()
    ).hexdigest()

    cached = cache.get(cache_key)
    if cached:
        return json.loads(cached)

    response = await openai_client.chat.completions.create(
        model=model, messages=messages
    )
    # Store and return the same dict shape on hits and misses
    payload = response.to_dict()
    cache.setex(cache_key, ttl, json.dumps(payload))
    return payload

When Exact Caching Works

  • Repeated system prompts: Many requests share identical system prompts
  • Structured queries: Classification tasks with a fixed set of inputs
  • Batch processing: Re-running analysis on unchanged data

When It Fails

Exact caching has a low hit rate for conversational applications where each message includes unique user input. Even one character difference produces a different hash.
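A two-line demonstration of why: hashing is all-or-nothing, so even a trailing space yields a completely unrelated key.

```python
import hashlib

key_a = hashlib.sha256(b"What's the weather in NYC?").hexdigest()
key_b = hashlib.sha256(b"What's the weather in NYC? ").hexdigest()  # one trailing space

print(key_a == key_b)  # False: the cache treats these as unrelated prompts
```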


Semantic Caching

Semantic caching matches queries by meaning rather than exact text. "What's the weather in NYC?" and "How's the weather in New York City?" should return the same cached response.

Implementation uses embedding models and vector similarity:

from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

async def embed(text: str) -> list[float]:
    result = await openai_client.embeddings.create(
        model="text-embedding-3-small", input=text
    )
    return result.data[0].embedding

async def semantic_cache_lookup(query: str, threshold: float = 0.95):
    query_embedding = await embed(query)

    # Search the vector store for similar previous queries newer than the TTL cutoff.
    # `vector_store`, `ttl_cutoff`, and `llm_call` are application-level helpers.
    results = vector_store.search(
        vector=query_embedding,
        limit=1,
        filter={"created_at": {"$gt": ttl_cutoff}}
    )

    if results and results[0].score > threshold:
        return results[0].metadata["response"]

    # Cache miss: call the LLM and store the new entry for future lookups
    response = await llm_call(query)
    vector_store.upsert({
        "vector": query_embedding,
        "metadata": {"query": query, "response": response}
    })
    return response

Tuning the Similarity Threshold

  • 0.98+: Nearly identical queries only. Low hit rate, very safe.
  • 0.95-0.98: Paraphrases and minor variations. Good balance.
  • 0.90-0.95: Loosely similar queries. Higher hit rate but risk of returning irrelevant cached responses.

Test with your actual query distribution to find the right threshold.
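One way to run that test, as a sketch: label pairs of historical queries as same-intent or not, score each pair with your embedding model, then sweep thresholds to trade hit rate against wrong-answer hits. The scores below are made-up stand-ins for real embedding similarities.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """scored_pairs: (similarity, same_intent) tuples from labeled historical queries."""
    report = {}
    for t in thresholds:
        hits = [(s, same) for s, same in scored_pairs if s >= t]
        false_hits = sum(1 for _, same in hits if not same)
        report[t] = {
            "hit_rate": len(hits) / len(scored_pairs),
            # Fraction of cache hits that would have served the wrong answer
            "false_hit_rate": false_hits / len(hits) if hits else 0.0,
        }
    return report

# Synthetic scores: paraphrases cluster high, unrelated queries score lower
pairs = [(0.99, True), (0.97, True), (0.96, True), (0.94, False), (0.91, False)]
report = sweep_thresholds(pairs, [0.90, 0.95, 0.98])
```

Plotting `hit_rate` against `false_hit_rate` across thresholds makes the safe operating point for your traffic obvious.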

Provider-Level Prompt Caching

Anthropic and OpenAI now offer server-side prompt caching that reduces costs for repeated prompt prefixes.

Anthropic Prompt Caching

Anthropic caches prompt prefixes marked with a cache_control parameter. Subsequent requests with the same prefix hit the cache, reducing input token costs by 90 percent for the cached portion. The cache has a 5-minute TTL that resets on each hit.

This is particularly effective for:


  • Long system prompts (1,000+ tokens)
  • RAG contexts where the retrieved documents are appended to a fixed instruction prefix
  • Multi-turn conversations where the history grows but the system prompt remains constant
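A sketch of what marking a prefix looks like, assuming the Messages API request shape from Anthropic's documentation (the model name is a placeholder; swap in whatever you deploy):

```python
def build_cached_request(system_prompt: str, question: str) -> dict:
    """Messages API body with the system prompt marked as a cacheable prefix."""
    return {
        "model": "claude-sonnet-4-20250514",  # placeholder model name
        "max_tokens": 1024,
        "system": [{
            "type": "text",
            "text": system_prompt,
            # Everything up to and including this block is cached for ~5 minutes;
            # cache reads are billed at roughly 10% of the base input price.
            "cache_control": {"type": "ephemeral"},
        }],
        "messages": [{"role": "user", "content": question}],
    }
```

Pass the dict to `client.messages.create(**body)`; the response's `usage` block reports `cache_creation_input_tokens` and `cache_read_input_tokens`, so you can verify hits in production.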

OpenAI Cached Tokens

OpenAI automatically caches prompt prefixes longer than 1,024 tokens and charges 50 percent less for cached tokens. Unlike Anthropic's approach, caching is automatic — no API changes required.

KV Cache Optimization

For self-hosted models, the key-value cache stored during autoregressive generation is a major memory and compute bottleneck.

Techniques

  • PagedAttention (vLLM): Manages KV cache memory like virtual memory pages, eliminating fragmentation and enabling higher batch sizes
  • Prefix caching: Shares KV cache entries across requests with identical prompt prefixes, avoiding redundant computation
  • Quantized KV cache: Storing cached keys and values in FP8 or INT8 precision reduces memory by 50 percent with minimal quality impact
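Back-of-envelope arithmetic shows why the quantization point matters. For a Llama-2-7B-shaped model (32 layers, 32 KV heads, head dimension 128; illustrative numbers, not a spec):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """2x for keys and values, times layers x heads x head_dim x element size."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

fp16 = kv_bytes_per_token(32, 32, 128, 2)  # 524,288 bytes, about 0.5 MB per token
fp8 = kv_bytes_per_token(32, 32, 128, 1)   # exactly half the FP16 footprint
```

At a 4,096-token context that is roughly 2 GB of cache per sequence in FP16, which is why halving it translates directly into larger batch sizes.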

Cost Savings Calculator

For a system processing 100,000 LLM calls per day:

Strategy | Typical Hit Rate | Cost Reduction
Exact prompt cache | 5-15% | 5-15%
Semantic cache | 15-40% | 15-40%
Provider prompt caching | 60-90% of tokens | 30-50%
Combined approach | n/a | 50-80%

The strategies are complementary. A production system should layer exact caching (cheapest to implement), semantic caching (catches paraphrases), and provider-level caching (reduces per-token cost for cache misses).
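A rough calculator for the layered setup, under the simplifying assumption that exact and semantic caching remove whole calls while provider caching discounts a prefix fraction of the remaining tokens (the example rates are picked from the table's ranges):

```python
def blended_daily_cost(calls_per_day: int, tokens_per_call: int, usd_per_mtok: float,
                       call_hit_rate: float = 0.0,
                       cached_prefix_frac: float = 0.0,
                       prefix_discount: float = 0.0) -> float:
    """Daily input spend after call-level caching and prefix-level discounts."""
    misses = calls_per_day * (1 - call_hit_rate)
    effective_price = usd_per_mtok * (1 - cached_prefix_frac * prefix_discount)
    return misses * tokens_per_call * effective_price / 1_000_000

baseline = blended_daily_cost(100_000, 2_000, 3.0)
layered = blended_daily_cost(100_000, 2_000, 3.0,
                             call_hit_rate=0.30,       # exact + semantic hits skip the call
                             cached_prefix_frac=0.70,  # 70% of tokens sit in a cached prefix
                             prefix_discount=0.90)     # Anthropic-style 90% read discount
# baseline: $600/day; layered: ~$155/day, roughly a 74% reduction
```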

Sources: Anthropic Prompt Caching Documentation | vLLM PagedAttention Paper | GPTCache GitHub
