RAG Caching Layers: Hit Rates and Cost Reduction Strategies
Cache the right RAG layers and you can cut costs by 60-80 percent. The 2026 multi-layer cache design and what to cache where.
Why Caching Matters in RAG
A RAG pipeline has multiple stages: query rewriting, embedding, retrieval, reranking, generation. Each stage is a potential caching point. Done well, caching cuts cost and latency by 60-80 percent in production. Done poorly, it introduces stale data or cache pollution.
This piece walks through the 2026 multi-layer cache design.
The Cache Layers
flowchart LR
L1[Query rewrite cache] --> L2[Embedding cache]
L2 --> L3[Retrieval result cache]
L3 --> L4[Rerank cache]
L4 --> L5[Prompt cache]
L5 --> L6[Response cache]
Six potential layers. Most production systems use 3-4 of them.
Query Rewrite Cache
The rewriter takes a user message + history and produces a standalone query. Cache by hash of input. Good for repeated questions in similar contexts.
Hit rate: low (each conversation is unique). Modest savings.
Embedding Cache
Embed the query, cache the embedding. Hit rate depends on query repetition.
For internal-tool RAG with repeated questions, hit rate can be 30-50 percent. For free-form chat, much lower.
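A minimal sketch of this layer, assuming an in-process dict as the store and a placeholder `embed_query` call standing in for whatever embedding provider the pipeline already uses:

```python
import hashlib

# In-memory stand-in for a shared cache (Redis, memcached, etc. in production).
_embedding_cache: dict[str, list[float]] = {}


def embed_query(text: str) -> list[float]:
    # Placeholder for the embedding provider call already in the pipeline.
    raise NotImplementedError("call your embedding model here")


def cached_embedding(text: str) -> list[float]:
    # Key on a hash of the exact input; rotate the "v1" prefix when the
    # embedding model changes so old vectors never mix with new ones.
    key = "emb:v1:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key in _embedding_cache:
        return _embedding_cache[key]
    vector = embed_query(text)
    _embedding_cache[key] = vector
    return vector
```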
Retrieval Result Cache
Given a query, the retrieval result is the list of top-k documents. Cache by query (or query embedding similarity). Hit rate: 20-40 percent for typical RAG workloads.
This is often the highest-value cache because retrieval is the most expensive step (vector search + reranking).
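A hedged sketch of a TTL'd retrieval cache; `retrieve_fn`, the key layout, and the one-hour TTL are illustrative choices, not a prescribed design:

```python
import hashlib
import time

# (timestamp, docs) per key; use a shared store such as Redis with EXPIRE in production.
_retrieval_cache: dict[str, tuple[float, list[dict]]] = {}
RETRIEVAL_TTL_SECONDS = 3600  # "medium": tune to how often the corpus changes


def cached_retrieve(query: str, corpus_version: str, top_k: int, retrieve_fn) -> list[dict]:
    # Normalize the query before hashing (see the Cache Keys section below).
    raw = f"{query}|{corpus_version}|{top_k}"
    key = "ret:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()

    entry = _retrieval_cache.get(key)
    if entry is not None and time.time() - entry[0] < RETRIEVAL_TTL_SECONDS:
        return entry[1]                      # cache hit: skip vector search + rerank

    docs = retrieve_fn(query, top_k)         # the expensive path
    _retrieval_cache[key] = (time.time(), docs)
    return docs
```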
Rerank Cache
Cache reranker outputs. Lower-value than retrieval cache because reranking is cheaper and more query-specific.
Prompt Cache
Provider-side cache (OpenAI, Anthropic, Google). The system prompt + tool definitions + retrieved docs (if stable) become a cached prefix. Subsequent calls with the same prefix pay 0.1-0.5x for the cached tokens.
For agentic RAG with stable system prompts, prompt caching is the biggest single cost lever.
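For illustration, Anthropic's Messages API marks the stable prefix with `cache_control`; the model name and prompt contents below are placeholders, and current pricing and cache TTLs are in the provider docs listed under Sources:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SYSTEM_PROMPT = "...long, stable RAG system prompt and tool instructions..."


def answer(user_query: str, retrieved_docs: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model name
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Everything up to and including this block becomes the cached
                # prefix; later calls with an identical prefix read it back at
                # the discounted cached-token rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": f"{retrieved_docs}\n\n{user_query}"}],
    )
    return response.content[0].text
```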
Response Cache
Full response to a (query, context) pair. Cache the entire LLM output.
Hit rate: low for chat (each conversation unique); high for FAQ-style RAG. For knowledge-base search, response caching saves the most.
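One way this layer can look, assuming Redis as the shared store; the key layout and 24-hour TTL are illustrative:

```python
import hashlib

import redis

r = redis.Redis()                      # assumes a reachable Redis instance
RESPONSE_TTL_SECONDS = 24 * 3600       # FAQ-style workload; use minutes for chat


def _response_key(query: str, corpus_version: str) -> str:
    norm = " ".join(query.lower().split())
    return f"resp:{corpus_version}:" + hashlib.sha256(norm.encode("utf-8")).hexdigest()


def get_cached_response(query: str, corpus_version: str) -> str | None:
    value = r.get(_response_key(query, corpus_version))
    return value.decode("utf-8") if value is not None else None


def store_response(query: str, corpus_version: str, answer: str) -> None:
    # SETEX gives us the TTL for free; the corpus version in the key handles invalidation.
    r.setex(_response_key(query, corpus_version), RESPONSE_TTL_SECONDS, answer)
```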
Layered Strategy
flowchart TD
Q[Query] --> Resp{Response cache hit?}
Resp -->|Yes| Done[Return cached response]
Resp -->|No| Ret{Retrieval cache hit?}
Ret -->|Yes| Gen[Generate with cached docs]
Ret -->|No| Run[Full retrieval]
Run --> Gen
Gen --> CacheR[Cache retrieval]
Gen --> CacheResp[Cache response]
Cascade through layers. Each hit short-circuits later layers.
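Roughly, the cascade reduces to a single function; the cache stores and the `retrieve`/`generate` callables here are stand-ins for whatever the pipeline already provides:

```python
from typing import Callable


def handle_query(
    query: str,
    corpus_version: str,
    response_cache: dict[str, str],
    retrieval_cache: dict[str, list[str]],
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str, list[str]], str],
) -> str:
    """Cascade: response cache -> retrieval cache -> full pipeline."""
    norm = " ".join(query.lower().split())

    resp_key = f"resp:{corpus_version}:{norm}"
    if resp_key in response_cache:
        return response_cache[resp_key]          # layer-6 hit: short-circuit everything

    ret_key = f"ret:{corpus_version}:{norm}"
    docs = retrieval_cache.get(ret_key)
    if docs is None:                             # layer-3 miss: pay for vector search + rerank
        docs = retrieve(query)
        retrieval_cache[ret_key] = docs          # cache retrieval on the way out

    answer = generate(query, docs)               # provider-side prompt cache applies in here
    response_cache[resp_key] = answer            # cache response on the way out
    return answer
```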
TTL and Invalidation
Each layer needs a TTL strategy:
- Query rewrite: short (minutes)
- Embedding: long (until embedding model upgrade)
- Retrieval result: medium (hours; until corpus update)
- Prompt cache: short (provider-defined; typically minutes)
- Response cache: per-use case (minutes for chat, longer for FAQ)
When the corpus changes, retrieval and response caches need invalidation. Patterns, with a version-bump sketch after the list:
- Tag caches with corpus version
- Bump version on update
- Lazy invalidation (let TTL expire) for non-critical staleness
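The version-tag pattern from the first two bullets, as a minimal sketch; where the version value lives (config, database, feature flag) is an assumption left open here:

```python
import time

# One shared corpus-version string, baked into retrieval and response cache keys.
_corpus_version = "2026-01-15"


def current_corpus_version() -> str:
    return _corpus_version


def bump_corpus_version() -> str:
    # Call after re-indexing: old keys become unreachable and simply age out via TTL.
    global _corpus_version
    _corpus_version = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
    return _corpus_version
```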
Cost Math
For a typical 2026 RAG system at moderate volume:
- No caching: $X cost
- Prompt caching only: $0.4X (60% savings)
- Prompt + retrieval caching: $0.3X (70% savings)
- All layers: $0.2X (80% savings)
The marginal value diminishes. Most teams should reach for prompt + retrieval first.
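To see how these rows compose, here is a back-of-the-envelope calculation with assumed hit rates and stage costs; none of these numbers are measurements:

```python
# Blended cost per request, as a fraction of the uncached cost X.
# All numbers below are illustrative assumptions.
stage_cost = {"retrieval": 0.25, "generation": 0.75}

prompt_cached_fraction = 0.9   # share of generation tokens covered by the stable prefix
cached_token_price = 0.1       # cached tokens at ~0.1x the normal rate
retrieval_hit_rate = 0.35
response_hit_rate = 0.10

generation = stage_cost["generation"] * (
    (1 - prompt_cached_fraction) + prompt_cached_fraction * cached_token_price
)
retrieval = stage_cost["retrieval"] * (1 - retrieval_hit_rate)
blended = (1 - response_hit_rate) * (generation + retrieval)

print(f"blended cost ~ {blended:.2f} X")   # ~0.27 X with these assumptions
```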
Tenant Isolation
Multi-tenant RAG: caches must not leak across tenants. Patterns:
- Cache keys include tenant ID
- Per-tenant cache namespaces
- No global cross-tenant caching of sensitive content
A leak via cache is a hard-to-debug security issue. Be strict.
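A small helper that enforces the first two patterns; the key layout is illustrative:

```python
def tenant_key(tenant_id: str, layer: str, raw_key: str) -> str:
    if not tenant_id:
        # Fail closed: never fall back to a shared or global namespace.
        raise ValueError("tenant_id is required for cache access")
    return f"{tenant_id}:{layer}:{raw_key}"
```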
Cache Keys
Cache key design matters; a normalization sketch follows the list:
- For query: hash of the normalized query (lowercase, collapsed whitespace)
- For embedding: hash of input text
- For retrieval: hash of (query, corpus_version, top_k)
- For prompt: provider-defined; structure prompts to maximize cache reuse
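The normalization step from the first bullet, as a one-liner:

```python
def normalize_query(query: str) -> str:
    # Lowercase and collapse whitespace so trivial variants share one cache key.
    return " ".join(query.lower().split())


assert normalize_query("  What is  Prompt Caching? ") == "what is prompt caching?"
```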
What Goes Wrong
flowchart TD
Bad[Bad caching] --> B1[Stale results from corpus updates]
Bad --> B2[Cache pollution from one-off queries]
Bad --> B3[Cross-tenant leak]
Bad --> B4[Hot key thrashing under load]
Bad --> B5[Cache that grows without bounds]
Each is well-studied; the fixes are standard distributed-cache patterns applied to RAG specifics.
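For the unbounded-growth failure specifically, a small LRU bound is the standard fix; this sketch uses an in-process OrderedDict and a made-up size limit:

```python
from collections import OrderedDict


class BoundedCache:
    """Simple LRU bound so the cache cannot grow without limit."""

    def __init__(self, max_entries: int = 10_000) -> None:
        self.max_entries = max_entries
        self._data: OrderedDict[str, object] = OrderedDict()

    def get(self, key: str):
        if key not in self._data:
            return None
        self._data.move_to_end(key)          # mark as most recently used
        return self._data[key]

    def put(self, key: str, value) -> None:
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)   # evict the least recently used entry
```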
Observability
Track per layer:
- Hit rate
- Lookup latency
- Eviction rate
- Cache size
Without these, optimizing caching is guesswork.
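A minimal way to get per-layer hit rates before wiring up a real metrics stack; the class and layer names are illustrative:

```python
from collections import defaultdict


class CacheStats:
    """Per-layer hit/miss counters; swap for Prometheus or StatsD in production."""

    def __init__(self) -> None:
        self.hits: dict[str, int] = defaultdict(int)
        self.misses: dict[str, int] = defaultdict(int)

    def record(self, layer: str, hit: bool) -> None:
        (self.hits if hit else self.misses)[layer] += 1

    def hit_rate(self, layer: str) -> float:
        total = self.hits[layer] + self.misses[layer]
        return self.hits[layer] / total if total else 0.0


stats = CacheStats()
stats.record("retrieval", hit=True)
stats.record("retrieval", hit=False)
print(stats.hit_rate("retrieval"))   # 0.5
```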
Sources
- Anthropic prompt caching — https://docs.anthropic.com
- OpenAI prompt caching — https://platform.openai.com/docs
- Redis cache patterns — https://redis.io/docs
- "Caching for LLM applications" Hamel Husain — https://hamel.dev
- "RAG cost optimization" 2025 — https://blog.langchain.dev