RAG Failure Mode Catalog: Why Pipelines Don't Find the Right Doc
Twelve recurring RAG failure modes from production deployments and the fixes for each in 2026.
Why a Catalog
Production RAG systems fail in repeating ways. Knowing the catalog lets you diagnose quickly. Most "the AI gave a wrong answer" reports trace back to one of twelve failure modes documented across 2025-2026 production systems.
This piece is the working catalog.
The Twelve
flowchart TB
F[Failure modes] --> F1[1. Wrong chunk]
F --> F2[2. Lost in middle]
F --> F3[3. Stale corpus]
F --> F4[4. Embedding model mismatch]
F --> F5[5. Chunk too small]
F --> F6[6. Chunk too large]
F --> F7[7. Vocabulary gap]
F --> F8[8. Reranker confused]
F --> F9[9. Cross-tenant leak]
F --> F10[10. Coverage gap]
F --> F11[11. Conflicting docs]
F --> F12[12. PII / privacy leak]
1. Wrong Chunk
The retriever returns a chunk that looks relevant but is actually wrong. Common with broad keyword queries, where many chunks score similarly.
Fix: stronger reranker; query rewriting; hybrid retrieval.
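Hybrid retrieval is often the cheapest of these fixes. A minimal sketch of Reciprocal Rank Fusion (RRF), merging a BM25-style keyword ranking with a vector ranking; the doc ids and hit lists are hypothetical:

```python
# Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc,
# so a doc ranked well by BOTH retrievers rises to the top.

def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids; best fused score first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_7", "doc_2", "doc_9"]    # exact-keyword ranking
vector_hits = ["doc_2", "doc_5", "doc_7"]  # embedding-similarity ranking
fused = rrf_fuse([bm25_hits, vector_hits])  # doc_2 wins: high in both lists
```

The constant `k=60` is the conventional default; it damps the advantage of rank-1 hits so one retriever cannot dominate.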
2. Lost in Middle
The right chunk was retrieved, but the LLM ignored it because it sat in the middle of a long context, where models attend least.
Fix: rerank to put best chunks first; use shorter context windows; structured separators.
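A sketch of the first two fixes together: sort chunks by rerank score so the best evidence leads, and label each with a structured separator. The chunk texts and scores are illustrative placeholders for a reranker's output:

```python
# Best-scored chunk goes first, where models attend most reliably;
# numbered "[source N]" separators keep chunks distinguishable.

def build_context(chunks_with_scores):
    ordered = sorted(chunks_with_scores, key=lambda c: c[1], reverse=True)
    return "\n\n".join(
        f"[source {i + 1}]\n{text}" for i, (text, _) in enumerate(ordered)
    )

context = build_context([
    ("Refunds take 5 business days.", 0.41),
    ("Refund policy: full refund within 30 days.", 0.93),
])
```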
3. Stale Corpus
The corpus has not been re-indexed since a relevant document was added or updated.
Fix: streaming index updates; corpus version tracking; freshness metrics.
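Corpus version tracking can be as simple as comparing two timestamps per document. A minimal sketch, with illustrative metadata and integer timestamps standing in for real ones:

```python
# A doc is stale if its source changed after the last index pass.
# Emitting this list on a schedule is a basic freshness metric.

def stale_docs(corpus):
    """Return ids of docs whose source changed after the last re-index."""
    return [
        doc_id
        for doc_id, meta in corpus.items()
        if meta["updated_at"] > meta["indexed_at"]
    ]

corpus = {
    "pricing.md": {"updated_at": 1700, "indexed_at": 1650},  # stale
    "faq.md":     {"updated_at": 1500, "indexed_at": 1650},  # fresh
}
needs_reindex = stale_docs(corpus)  # -> ["pricing.md"]
```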
4. Embedding Model Mismatch
Queries embedded with one model, corpus with another. Distance computations are nonsense.
Fix: re-embed corpus when embedding model changes; tag embeddings with model version.
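Tagging embeddings with a model version turns a silent mismatch into a loud error. A sketch under the assumption of an in-memory index; the class and version strings are hypothetical:

```python
# The index records which embedding model produced it and rejects
# writes or queries from any other model version.

class VersionedIndex:
    def __init__(self, model_version):
        self.model_version = model_version
        self.vectors = {}

    def add(self, doc_id, vector, model_version):
        if model_version != self.model_version:
            raise ValueError("model mismatch: re-embed corpus first")
        self.vectors[doc_id] = vector

    def query_ok(self, query_model_version):
        return query_model_version == self.model_version

index = VersionedIndex("embed-v2")
index.add("doc_1", [0.1, 0.2], "embed-v2")  # accepted
```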
5. Chunk Too Small
Chunks are 100 tokens; the relevant context is in the surrounding 500 tokens. Retrieval gets the chunk; the model lacks context to use it.
Fix: larger chunk sizes; chunk overlap; expanded context retrieval.
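Chunk overlap is the standard mitigation for context lost at boundaries. A sliding-window sketch over a token list; the size and overlap values are illustrative and are normally tuned per corpus:

```python
# Each chunk repeats the last `overlap` tokens of the previous chunk,
# so a fact near a boundary appears with context on at least one side.

def chunk_tokens(tokens, size=200, overlap=50):
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

tokens = [f"tok{i}" for i in range(500)]
chunks = chunk_tokens(tokens, size=200, overlap=50)
# Adjacent chunks share their boundary tokens; the final chunk is a
# shorter tail fragment.
```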
6. Chunk Too Large
Chunks are 2000 tokens; relevant facts dilute among irrelevant content. Embedding does not represent any single concept well.
Fix: smaller chunks; semantic chunking; multi-granularity indexing.
7. Vocabulary Gap
Domain terminology is poorly represented in the embedding model's training data, so codes, abbreviations, and technical terms fail to match.
Fix: domain-tuned embeddings; hybrid retrieval (BM25 catches exact matches); vocabulary expansion.
8. Reranker Confused
Cross-encoder reranker shifts the wrong chunk to the top.
Fix: use a stronger or domain-tuned reranker; combine reranker with RRF fallback; validate rerank improvements on your data.
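A sketch of the fallback idea: apply the reranker, but keep the original retrieval order when its top score is unconvincing. The `toy_score` function and threshold are hypothetical stand-ins for a real cross-encoder and a calibrated cutoff:

```python
# If the reranker's best score is below a confidence threshold, trust
# the upstream retrieval order instead of the rerank.

def rerank_with_fallback(query, hits, score_fn, min_top_score=0.5):
    scored = sorted(hits, key=lambda h: score_fn(query, h), reverse=True)
    if score_fn(query, scored[0]) < min_top_score:
        return hits  # reranker not confident: keep retrieval order
    return scored

def toy_score(query, hit):  # stand-in for a real cross-encoder
    return 1.0 if query in hit else 0.1

hits = ["shipping times", "refund policy details"]
result = rerank_with_fallback("refund", hits, toy_score)
```

Validating the threshold on held-out queries from your own data is the part that actually matters; a fallback tuned on someone else's corpus is just noise.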
9. Cross-Tenant Leak
Documents from tenant A retrieved for tenant B's query.
Fix: per-tenant indexes; per-tenant filters baked into every query; audit log of retrievals.
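The key design point is that the tenant filter lives inside the only retrieval entry point, so no caller can forget it. A minimal in-memory sketch; the store, tenant names, and audit format are illustrative:

```python
# Every retrieval goes through retrieve(), which applies the tenant
# filter unconditionally and appends to an audit log.

AUDIT_LOG = []

DOCS = [
    {"tenant": "acme", "text": "Acme pricing tiers"},
    {"tenant": "globex", "text": "Globex pricing tiers"},
]

def retrieve(tenant_id, query):
    hits = [
        d["text"] for d in DOCS
        if d["tenant"] == tenant_id and query in d["text"].lower()
    ]
    AUDIT_LOG.append({"tenant": tenant_id, "query": query, "hits": len(hits)})
    return hits

acme_hits = retrieve("acme", "pricing")  # never sees globex docs
```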
10. Coverage Gap
The right document is not in the corpus at all.
Fix: corpus auditing; coverage testing on known questions; expansion of source corpora.
11. Conflicting Docs
Two retrieved documents contradict each other; the LLM confidently picks one.
Fix: explicit conflict-resolution prompts ("if sources conflict, note the conflict"); date-aware ranking; provenance tracking.
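The three fixes compose naturally in the prompt builder: date each source, order newest first, and instruct the model to surface contradictions. A sketch; the instruction wording and source data are illustrative:

```python
# Newest source leads, every source carries its date, and the model is
# told to flag conflicts rather than silently pick a side.

def build_prompt(question, sources):
    ordered = sorted(sources, key=lambda s: s["date"], reverse=True)
    blocks = "\n\n".join(f"[{s['date']}] {s['text']}" for s in ordered)
    return (
        "If sources conflict, note the conflict and prefer the newer one.\n\n"
        f"{blocks}\n\nQuestion: {question}"
    )

prompt = build_prompt("What is the return window?", [
    {"date": "2024-01-10", "text": "Returns accepted within 14 days."},
    {"date": "2026-03-02", "text": "Returns accepted within 30 days."},
])
```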
12. PII / Privacy Leak
Sensitive data appears in retrieved chunks where it should not.
Fix: PII redaction at index time; access-control filtering at retrieval time; redaction at generation time.
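Index-time redaction is the layer that pays off most, because redacted data can never be retrieved. A regex sketch covering common email and US-phone shapes only; real deployments usually layer an NER-based detector on top:

```python
import re

# Applied to every chunk before it is embedded and indexed.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"), "[PHONE]"),
]

def redact(text):
    for pattern, token in PII_PATTERNS:
        text = pattern.sub(token, text)
    return text

clean = redact("Contact jane@example.com or 555-123-4567.")
# -> "Contact [EMAIL] or [PHONE]."
```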
Diagnosis Workflow
flowchart LR
Bad[Bad answer reported] --> Trace[Pull trace]
Trace --> Check[Check retrieved chunks]
Check --> Match{Right chunks?}
Match -->|No| RetFail[Retrieval failure: 1, 3, 4, 5, 6, 7, 8, 9, 10]
Match -->|Yes| GenFail[Generation failure: 2, 11, 12]
Was the retrieval wrong, or did the model fail to use a correct retrieval? Different failures, different fixes.
Test Cases for Each
A 2026 RAG eval suite should include tests targeting each failure mode:
- Wrong chunk: ambiguous queries
- Lost in middle: long contexts with answer late
- Stale corpus: queries about recent updates
- Cross-tenant: multi-tenant test data
- Coverage gap: known-not-in-corpus queries
If you do not test for them, you discover them in production.
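The test cases above can be organized as a small harness that maps each failure mode to a targeted probe. A sketch in which `run_pipeline` is a hypothetical, faked stand-in for a real RAG stack, included only so the harness shape is runnable:

```python
# One named probe per failure mode; each returns True when the pipeline
# handles that mode, so regressions show up by name in CI.

def run_pipeline(query, tenant="default"):
    # Stand-in: a real implementation retrieves and generates.
    return {"retrieved": ["refund-policy.md"], "tenant": tenant}

FAILURE_MODE_PROBES = {
    "wrong_chunk": lambda: "refund-policy.md"
        in run_pipeline("refund")["retrieved"],
    "cross_tenant": lambda: run_pipeline("refund", tenant="acme")["tenant"]
        == "acme",
}

results = {name: probe() for name, probe in FAILURE_MODE_PROBES.items()}
```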
Sources
- "RAG failure modes" Hamel Husain — https://hamel.dev
- "Analyzing RAG failures" research — https://arxiv.org
- LangSmith eval patterns — https://docs.smith.langchain.com
- "RAG production debugging" Anthropic — https://www.anthropic.com/engineering
- "Lost in the middle" Liu et al. — https://arxiv.org/abs/2307.03172