
Conversational RAG: Maintaining Context Across Turns

Conversational RAG must blend the current question with conversation history. The 2026 patterns for query rewriting, history compression, and reuse.

What Conversational RAG Adds

Standard RAG: take the user's question, embed it, retrieve. Conversational RAG: take the user's current message + conversation history, derive a retrieval query, retrieve. The difference matters because users speak in fragments and references — "what about the second one?" makes no sense without prior context.

By 2026 the patterns are codified. This piece walks through them.

The Core Pattern

flowchart LR
    User[Current msg + history] --> Rewrite[LLM rewrites as standalone query]
    Rewrite --> Retrieve[Retrieve]
    Retrieve --> Generate[Generate response with retrieval]
    Generate --> Update[Update history]

The rewrite step is the key. Without it, fragmented messages produce poor retrieval.
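
Here is a minimal sketch of one such turn. `call_llm` and `vector_search` are hypothetical stand-ins for whatever LLM client and vector store your stack actually uses:

```python
# One conversational-RAG turn: rewrite -> retrieve -> generate -> update.
# call_llm() and vector_search() are assumed helpers, not a real library API.

def conversational_rag_turn(history: list[dict], user_msg: str) -> str:
    transcript = "\n".join(f"{t['role']}: {t['text']}" for t in history)

    # 1. Rewrite the (possibly fragmented) message into a standalone query.
    query = call_llm(
        f"Conversation so far:\n{transcript}\n\n"
        f"Rewrite the user's latest message as a standalone search query:\n{user_msg}"
    )

    # 2. Retrieve against the rewrite, not the raw message.
    docs = vector_search(query, top_k=5)

    # 3. Generate a reply grounded in the retrieved passages.
    context = "\n\n".join(d["text"] for d in docs)
    reply = call_llm(
        f"Context:\n{context}\n\nConversation:\n{transcript}\n"
        f"user: {user_msg}\n\nAnswer the latest message using the context."
    )

    # 4. Update history so the next turn has the full picture.
    history.append({"role": "user", "text": user_msg})
    history.append({"role": "assistant", "text": reply})
    return reply
```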

Query Rewriting

The rewrite turns "what about the second one?" into "what are the features of the second product the user mentioned?"

Two approaches:

  • LLM-driven rewrite: small model rewrites with conversation history as context
  • Slot-filling: extract slots from history and substitute pronouns

LLM-driven rewriting is more flexible; slot-filling is cheaper. Most 2026 production systems use LLM-driven rewriting with small, cheap models.
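
A sketch of the LLM-driven variant, reusing the hypothetical `call_llm` helper from above. The prompt is deliberately conservative, which also heads off the over-rewriting failure mode discussed later:

```python
REWRITE_PROMPT = """\
Given the conversation history and the user's latest message, rewrite the
latest message as one standalone search query. Resolve pronouns and
references ("it", "the second one") using the history. Do NOT add topics
the user never mentioned. If the message is already standalone, return it
unchanged.

History:
{history}

Latest message: {message}

Standalone query:"""

def rewrite_query(history: list[dict], message: str) -> str:
    transcript = "\n".join(f"{t['role']}: {t['text']}" for t in history)
    # A small, cheap model is usually sufficient for this step.
    return call_llm(REWRITE_PROMPT.format(history=transcript, message=message)).strip()
```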

History Compression

Long histories bloat context. Patterns:

  • Recent N turns full
  • Older turns summarized
  • Specific facts (names, IDs, preferences) extracted into structured form
  • Total context budget enforced

Compression is independent of the rewrite; both happen on the way to retrieval.
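
One way to combine those four patterns, again assuming the `call_llm` stand-in:

```python
def compress_history(history: list[dict], keep_recent: int = 4,
                     budget_chars: int = 4000) -> str:
    """Recent turns verbatim, older turns summarized, hard budget enforced."""
    recent, older = history[-keep_recent:], history[:-keep_recent]

    parts = []
    if older:
        older_text = "\n".join(f"{t['role']}: {t['text']}" for t in older)
        # One cheap summarization call; instruct it to preserve the
        # structured facts (names, IDs, preferences) verbatim.
        summary = call_llm(
            "Summarize this conversation in a few sentences. Preserve all "
            f"names, IDs, and stated preferences exactly:\n{older_text}"
        )
        parts.append(f"[Earlier conversation summary]\n{summary}")

    parts += [f"{t['role']}: {t['text']}" for t in recent]
    compressed = "\n".join(parts)

    # Enforce the budget by truncating the oldest material first.
    return compressed[-budget_chars:]
```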

When to Skip Retrieval

Some conversational turns do not need RAG:

  • "Hi"
  • "Thanks"
  • "Can you summarize what we discussed?"

Detect these and skip retrieval. The retrieve-or-skip gate covered earlier applies here too.
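
The gate can be a single cheap classification call. A sketch, same assumed helper:

```python
def needs_retrieval(history_summary: str, message: str) -> bool:
    """Skip RAG for greetings, thanks, and questions about the
    conversation itself; retrieve for everything else."""
    verdict = call_llm(
        "Does answering the user's message require looking up documents, "
        "or can it be answered from the conversation alone? "
        "Reply with exactly RETRIEVE or SKIP.\n\n"
        f"Conversation summary:\n{history_summary}\n\n"
        f"Message: {message}"
    )
    return verdict.strip().upper().startswith("RETRIEVE")
```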

A Production Architecture

flowchart TB
    User[User msg] --> Skip{Need retrieval?}
    Skip -->|Yes| Rewrite[Rewrite query]
    Skip -->|No| Direct[Generate directly]
    Rewrite --> Retrieve[Retrieve]
    Retrieve --> Eval[Evaluate retrieval]
    Eval -->|Bad| Refine[Refine + retry]
    Eval -->|Good| Gen[Generate]
    Direct --> Gen

Three gates: retrieve-or-skip, rewrite, retrieval evaluation. Each is a small LLM call; combined they make conversational RAG much more reliable.
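
Wired together, using the sketches from the previous sections, the turn handler looks roughly like this:

```python
def answer_turn(history: list[dict], user_msg: str, max_retries: int = 2) -> str:
    summary = compress_history(history)

    # Gate 1: retrieve-or-skip.
    if not needs_retrieval(summary, user_msg):
        return call_llm(f"{summary}\nuser: {user_msg}\nassistant:")

    # Gate 2: rewrite into a standalone query.
    query = rewrite_query(history, user_msg)

    # Gate 3: retrieve, grade, refine -- on a bounded loop, never open-ended.
    for attempt in range(max_retries + 1):
        docs = vector_search(query, top_k=5)
        grade = call_llm(
            f"Query: {query}\nPassages:\n"
            + "\n".join(d["text"][:300] for d in docs)
            + "\nDo these passages answer the query? Reply GOOD or BAD."
        )
        if grade.strip().upper().startswith("GOOD") or attempt == max_retries:
            break
        query = call_llm(
            f"The query '{query}' retrieved irrelevant passages. "
            "Rewrite it to be more specific:"
        )

    context = "\n\n".join(d["text"] for d in docs)
    return call_llm(f"Context:\n{context}\n\n{summary}\nuser: {user_msg}\nassistant:")
```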

Reusing Retrieved Context

Across turns, the same documents may be relevant. Patterns:


  • Cache retrieved docs at the conversation level (per-session)
  • Reuse for follow-up questions referencing the same topic
  • Re-retrieve when the topic clearly shifts

This cuts retrieval cost on multi-turn deep-dives.
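
A minimal per-session cache with topic-shift invalidation might look like this; the topic check here is itself a cheap LLM call, though an embedding-similarity threshold works too:

```python
class SessionRetrievalCache:
    """Per-conversation cache of retrieved docs, invalidated on topic shift."""

    def __init__(self) -> None:
        self.topic: str | None = None
        self.docs: list[dict] = []

    def get_or_retrieve(self, query: str) -> list[dict]:
        if self.topic is not None:
            same = call_llm(
                "Are these two queries about the same topic? Reply YES or NO.\n"
                f"A: {self.topic}\nB: {query}"
            )
            if same.strip().upper().startswith("YES"):
                return self.docs  # reuse: zero retrieval cost this turn

        # First turn, or the topic shifted: re-retrieve and reset.
        self.topic = query
        self.docs = vector_search(query, top_k=5)
        return self.docs
```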

Multi-Source Retrieval

For complex agents:

  • Multiple corpora (KB, manuals, customer-specific docs)
  • Different rewrites for different corpora
  • Fused results

Different corpora often want different query forms. The rewriter can be corpus-aware.
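
A sketch of corpus-aware rewriting with naive score-based fusion. The `corpus=` argument to `vector_search` is an assumption; substitute however your store namespaces collections:

```python
CORPORA = {
    "kb":       "general knowledge-base articles",
    "manuals":  "product manuals and spec sheets",
    "customer": "this customer's account documents",
}

def multi_source_retrieve(history: list[dict], message: str) -> list[dict]:
    transcript = "\n".join(f"{t['role']}: {t['text']}" for t in history[-4:])
    results = []
    for name, description in CORPORA.items():
        # Corpus-aware rewrite: the same turn becomes a different query
        # per corpus ("how to reset a password" vs "this account's login history").
        query = call_llm(
            f"Rewrite the user's message as a search query for {description}.\n"
            f"History:\n{transcript}\nMessage: {message}"
        )
        results.extend(vector_search(query, top_k=3, corpus=name))

    # Naive fusion: dedupe by id, keep the highest-scoring five.
    seen, fused = set(), []
    for doc in sorted(results, key=lambda d: d["score"], reverse=True):
        if doc["id"] not in seen:
            seen.add(doc["id"])
            fused.append(doc)
    return fused[:5]
```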

Common Failure Modes

  • Lost antecedent: rewriter does not know what "it" refers to. Fix: longer history window or stronger model.
  • Over-rewriting: rewriter adds context the user never actually referenced. Fix: prompt the rewriter to be conservative.
  • Stale retrieval: cached retrieval is no longer relevant. Fix: invalidate on topic shift signals.

Evaluation

Conversational RAG eval suites should include:

  • Multi-turn questions with antecedents
  • Topic-shift turns
  • Pronoun-resolution turns
  • Long-history coherence checks

Standard single-question RAG benchmarks miss these.
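
A few hand-written cases go a long way. Each case below encodes one of these failure modes; the expected terms are illustrative, and `rewrite_query` is the sketch from earlier:

```python
EVAL_CASES = [
    {   # pronoun resolution: "it" must resolve to the Pro plan
        "history": [("user", "Tell me about the Pro plan."),
                    ("assistant", "The Pro plan costs $49/month.")],
        "message": "does it include phone support?",
        "must_contain": ["pro plan", "phone support"],
        "must_not_contain": [],
    },
    {   # topic shift: the rewrite must not drag the old topic along
        "history": [("user", "How do I reset my password?"),
                    ("assistant", "Go to Settings > Security.")],
        "message": "actually, how do I cancel my subscription?",
        "must_contain": ["cancel", "subscription"],
        "must_not_contain": ["password"],
    },
]

def run_rewrite_evals() -> None:
    for case in EVAL_CASES:
        history = [{"role": r, "text": t} for r, t in case["history"]]
        query = rewrite_query(history, case["message"]).lower()
        for term in case["must_contain"]:
            assert term in query, f"missing {term!r} in: {query}"
        for term in case["must_not_contain"]:
            assert term not in query, f"stale topic {term!r} in: {query}"
```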

A Concrete Example

Here is how the pieces fit together in a CallSphere customer-support voice agent:

History:
  User: "I'm having trouble with my account."
  Bot: "Sure, I see you have an account. What's the issue?"
  User: "I can't log in."

Rewrite: "How does a user resolve login issues with their account?"

Retrieved: KB articles on login troubleshooting.

Generated reply incorporates retrieval.

The rewrite is what makes the retrieval clean.

