Conversational RAG: Maintaining Context Across Turns
Conversational RAG must blend the current question with conversation history. Here are the 2026 patterns for query rewriting, history compression, and context reuse.
What Conversational RAG Adds
Standard RAG: take the user's question, embed it, retrieve. Conversational RAG: take the user's current message + conversation history, derive a retrieval query, retrieve. The difference matters because users speak in fragments and references — "what about the second one?" makes no sense without prior context.
By 2026 the patterns are codified. This piece walks through them.
The Core Pattern
flowchart LR
User[Current msg + history] --> Rewrite[LLM rewrites as standalone query]
Rewrite --> Retrieve[Retrieve]
Retrieve --> Generate[Generate response with retrieval]
Generate --> Update[Update history]
The rewrite step is the key. Without it, fragmented messages produce poor retrieval.
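The loop above can be sketched in a few lines. This is a minimal illustration, not a specific framework's API: `llm` and `retriever` are hypothetical stand-ins for whatever model and vector store you use.

```python
# Minimal sketch of the rewrite -> retrieve -> generate -> update loop.
# `llm` and `retriever` are hypothetical callables standing in for a real
# model endpoint and a real vector-store search.

def conversational_turn(history, user_msg, llm, retriever):
    # 1. Rewrite the (possibly fragmentary) message into a standalone query.
    standalone = llm(
        "Rewrite the last user message as a standalone search query.\n"
        f"History: {history}\nMessage: {user_msg}"
    )
    # 2. Retrieve with the rewritten query, not the raw message.
    docs = retriever(standalone)
    # 3. Generate a reply grounded in the retrieved docs.
    reply = llm(f"Context: {docs}\nQuestion: {standalone}")
    # 4. Append the turn to history for the next round.
    history = history + [("user", user_msg), ("assistant", reply)]
    return reply, history
```

The important detail is step 2: retrieval always sees the rewritten query, never the raw fragment.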
Query Rewriting
The rewrite turns "what about the second one?" into "what are the features of the second product the user mentioned?"
Two approaches:
- LLM-driven rewrite: small model rewrites with conversation history as context
- Slot-filling: extract slots from history and substitute pronouns
LLM-driven is more flexible; slot-filling is cheaper. Most 2026 production systems use LLM-driven rewrite with cheap models.
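To make the cost difference concrete, here is a toy slot-filling rewriter. Real slot-filling systems track entities extracted per turn; the regex table and slot names here are purely illustrative assumptions.

```python
import re

# Toy slot-filling rewriter: substitute known antecedents for references.
# The slot names ("last_entity", "second_item") are hypothetical; a real
# system would populate slots from per-turn entity extraction.

def slot_fill(message, slots):
    """Replace references with the most recent matching slot value."""
    replacements = {
        r"\bit\b": slots.get("last_entity", "it"),
        r"\bthe second one\b": slots.get("second_item", "the second one"),
    }
    for pattern, value in replacements.items():
        message = re.sub(pattern, value, message, flags=re.IGNORECASE)
    return message
```

No LLM call at all, which is why slot-filling is cheap; the price is a fixed, brittle substitution table where an LLM rewrite generalizes.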
History Compression
Long histories bloat context. Patterns:
- Recent N turns full
- Older turns summarized
- Specific facts (names, IDs, preferences) extracted into structured form
- Total context budget enforced
The compaction is independent of the rewrite; both happen on the way to retrieval.
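A compressor combining those patterns can be sketched as below. The default `summarizer` here is a truncating stub standing in for a real summarization call; the turn format and budget numbers are assumptions.

```python
def compress_history(turns, keep_recent=4, max_chars=2000, summarizer=None):
    """Keep the last N turns verbatim, summarize older ones, enforce a budget.

    `turns` is a list of (role, message) pairs. `summarizer` stands in for
    an LLM summarization call; the default is a truncating stub.
    """
    summarizer = summarizer or (lambda text: text[:200] + "…")
    recent = turns[-keep_recent:]
    older = turns[:-keep_recent]
    parts = []
    if older:
        # Older turns collapse into a single summary line.
        parts.append("Summary: " + summarizer(" ".join(m for _, m in older)))
    parts.extend(f"{role}: {msg}" for role, msg in recent)
    blob = "\n".join(parts)
    # Hard budget as a last resort: trim from the front, keeping recent turns.
    return blob[-max_chars:]
```

Extracting structured facts (names, IDs, preferences) would be a third input alongside the summary; it is omitted here to keep the sketch short.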
When to Skip Retrieval
Some conversational turns do not need RAG:
- "Hi"
- "Thanks"
- "Can you summarize what we discussed?"
Detect these and skip retrieval. The retrieve-or-skip gate covered earlier applies here too.
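A minimal version of that gate layers cheap heuristics in front of an optional small-model classifier. The phrase lists are illustrative, and `classifier` is a hypothetical callable, e.g. a cheap LLM yes/no call.

```python
SMALL_TALK = {"hi", "hello", "thanks", "thank you", "bye"}

def needs_retrieval(message, classifier=None):
    """Retrieve-or-skip gate: heuristics first, optional classifier second."""
    normalized = message.strip().lower().rstrip("!.?")
    if normalized in SMALL_TALK:
        return False
    # Meta-questions about the conversation itself need history, not RAG.
    if "we discussed" in normalized or "you said" in normalized:
        return False
    if classifier is not None:
        return classifier(message)  # e.g. a cheap small-model yes/no call
    return True  # default to retrieving when unsure
```

Defaulting to retrieval when unsure trades a little cost for fewer ungrounded answers; the opposite default is defensible for latency-sensitive voice agents.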
A Production Architecture
flowchart TB
User[User msg] --> Skip{Need retrieval?}
Skip -->|Yes| Rewrite[Rewrite query]
Skip -->|No| Direct[Generate directly]
Rewrite --> Retrieve[Retrieve]
Retrieve --> Eval[Evaluate retrieval]
Eval -->|Bad| Refine[Refine + retry]
Eval -->|Good| Gen[Generate]
Direct --> Gen
Three gates: retrieve-or-skip, rewrite, retrieval evaluation. Each is a small LLM call; combined they make conversational RAG much more reliable.
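The three gates compose into one orchestration function. This is a sketch under assumptions: `llm`, `retriever`, and `judge` are hypothetical callables, and the prompt strings are placeholders for real gate prompts.

```python
def answer_turn(msg, history, llm, retriever, judge, max_retries=1):
    """Orchestrate the three gates: skip, rewrite, evaluate-and-retry.

    `llm`, `retriever`, and `judge` are hypothetical stand-ins; `judge`
    returns True when the retrieved docs look relevant to the query.
    """
    # Gate 1: retrieve-or-skip (a cheap yes/no call in production).
    if llm(f"Does this need retrieval? {msg}") == "no":
        return llm(f"History: {history}\nReply to: {msg}")
    # Gate 2: rewrite into a standalone query.
    query = llm(f"History: {history}\nRewrite as standalone query: {msg}")
    # Gate 3: retrieve, evaluate, refine once if the results look bad.
    for _ in range(max_retries + 1):
        docs = retriever(query)
        if judge(query, docs):
            break
        query = llm(f"Refine this query, results were poor: {query}")
    return llm(f"Context: {docs}\nQuestion: {query}")
```

Each gate call is small and cheap, so the added latency is modest relative to the main generation call.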
Reusing Retrieved Context
Across turns, the same documents may be relevant. Patterns:
- Cache retrieved docs at the conversation level (per-session)
- Reuse for follow-up questions referencing the same topic
- Re-retrieve when the topic clearly shifts
This cuts retrieval cost on multi-turn deep-dives.
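A per-session cache with topic-shift invalidation might look like this. `same_topic` is a hypothetical check; in production it would be an embedding-similarity threshold or a small-model call.

```python
class SessionRetrievalCache:
    """Per-conversation cache of retrieved docs.

    Reuses docs for same-topic follow-ups; re-retrieves on topic shift.
    `same_topic(prev_query, query)` is a hypothetical stand-in for an
    embedding-similarity or small-model topic check.
    """

    def __init__(self, retriever, same_topic):
        self.retriever = retriever
        self.same_topic = same_topic
        self.last_query = None
        self.docs = None

    def get(self, query):
        if self.docs is not None and self.same_topic(self.last_query, query):
            return self.docs                 # cache hit: skip retrieval cost
        self.docs = self.retriever(query)    # topic shift: re-retrieve
        self.last_query = query
        return self.docs
```

Note the cache is scoped to the session, not shared across users, so customer-specific retrievals never leak between conversations.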
Multi-Source Retrieval
For complex agents:
- Multiple corpora (KB, manuals, customer-specific docs)
- Different rewrites for different corpora
- Fused results
Different corpora often want different query forms. The rewriter can be corpus-aware.
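A corpus-aware fan-out with result fusion can be sketched with reciprocal rank fusion (RRF). The corpus names and the `rewriter(message, corpus_name)` signature are illustrative assumptions.

```python
def multi_source_retrieve(message, corpora, rewriter, k=5):
    """Fan out corpus-aware rewrites, then fuse with reciprocal rank fusion.

    `corpora` maps corpus name -> retriever callable; `rewriter` is a
    hypothetical per-corpus rewrite call (e.g. error-code phrasing for
    manuals, natural-language phrasing for the KB).
    """
    scores = {}
    for name, retrieve in corpora.items():
        query = rewriter(message, name)
        for rank, doc in enumerate(retrieve(query)):
            # Standard RRF with the conventional constant of 60.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (60 + rank)
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [doc for doc, _ in ranked][:k]
```

RRF needs no score calibration across corpora, which is why it is a common default for fusing heterogeneous retrievers.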
Common Failure Modes
- Lost antecedent: rewriter does not know what "it" refers to. Fix: longer history window or stronger model.
- Over-rewriting: rewriter adds context the user did not actually invoke. Fix: prompt the rewriter to be conservative.
- Stale retrieval: cached retrieval is no longer relevant. Fix: invalidate on topic shift signals.
Evaluation
Conversational RAG eval suites should include:
- Multi-turn questions with antecedents
- Topic-shift turns
- Pronoun-resolution turns
- Long-history coherence checks
Standard single-question RAG benchmarks miss these.
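One way to encode such cases is as rewrite-accuracy checks with must-mention and must-not-mention assertions. The case format, toy data, and the `rewrite(history, message)` signature are all assumptions for illustration.

```python
# Toy multi-turn eval cases for a rewriter. `rewrite(history, message)`
# is the system under test (hypothetical signature); the cases are toy data.

EVAL_CASES = [
    {   # pronoun resolution: the rewrite must carry the antecedent
        "history": ["User: Tell me about the Pro plan.",
                    "Bot: The Pro plan costs $49/month."],
        "message": "what does it include?",
        "must_mention": "Pro plan",
    },
    {   # topic shift: the rewrite must NOT drag in old context
        "history": ["User: Tell me about the Pro plan."],
        "message": "How do I reset my password?",
        "must_not_mention": "Pro plan",
    },
]

def score_rewrites(rewrite, cases=EVAL_CASES):
    """Fraction of cases where the rewrite satisfies its assertions."""
    passed = 0
    for case in cases:
        out = rewrite(case["history"], case["message"])
        ok = case.get("must_mention", "") in out
        ok = ok and case.get("must_not_mention", "\x00") not in out
        passed += ok
    return passed / len(cases)
```

The topic-shift case is the one single-question benchmarks never exercise, and it catches over-rewriting directly.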
A Concrete Example
For a CallSphere customer-support voice agent's conversational RAG:
History:
User: "I'm having trouble with my account."
Bot: "Sure, I see you have an account. What's the issue?"
User: "I can't log in."
Rewrite: "How does a user resolve login issues with their account?"
Retrieved: KB articles on login troubleshooting.
Generated reply incorporates retrieval.
The rewrite is what makes the retrieval clean.