
Multi-Hop RAG: Designing Retrieval Pipelines for Complex Questions

Multi-hop questions break naive RAG. Here are the 2026 retrieval patterns that handle "who is the manager of the engineer who shipped feature X" style questions.

What Multi-Hop Means

Single-hop questions can be answered from one retrieved chunk. Multi-hop questions need multiple chunks chained: "Who is the manager of the engineer who shipped feature X?" requires finding the engineer first, then their manager.

Naive RAG retrieves k chunks for a single embedding query and feeds them to the model. Multi-hop questions confuse this pattern; the right chunks may not all match a single embedding query.
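For reference, a minimal sketch of that naive pattern; the embed, vector_search, and ask_llm callables are placeholders for whatever embedding model, vector store, and LLM client you actually use:

from typing import Callable

def naive_rag(
    question: str,
    embed: Callable[[str], list[float]],                     # placeholder: your embedding model
    vector_search: Callable[[list[float], int], list[str]],  # placeholder: your vector store
    ask_llm: Callable[[str], str],                           # placeholder: your LLM client
    k: int = 5,
) -> str:
    # One embedding query, k chunks, one generation -- no chaining across hops.
    chunks = vector_search(embed(question), k)
    context = "\n\n".join(chunks)
    return ask_llm(f"Answer from this context only:\n{context}\n\nQuestion: {question}")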

By 2026, multi-hop RAG has become its own design discipline. This piece walks through the three main patterns.

The Three Patterns

flowchart TB
    M[Multi-hop strategies] --> M1[Decompose-then-retrieve]
    M --> M2[Iterative-retrieve]
    M --> M3[Graph-traverse]

Decompose-then-Retrieve

Use an LLM to decompose the question into atomic sub-questions. Retrieve for each. Compose.

Q: Who is the manager of the engineer who shipped feature X?
Decompose:
  Q1: Who shipped feature X?
  Q2: Who is the manager of [answer to Q1]?

Each sub-question is single-hop and retrieves cleanly.
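A hedged sketch of the full loop, assuming an ask_llm client and a single-hop retrieve function (both placeholders) plus an [ANSWER n] convention for threading earlier answers into later sub-questions:

from typing import Callable

def decompose_then_retrieve(
    question: str,
    ask_llm: Callable[[str], str],         # placeholder: your LLM client
    retrieve: Callable[[str], list[str]],  # placeholder: single-hop retriever
) -> str:
    # 1. Decompose into atomic sub-questions, one per line.
    raw = ask_llm(
        "Split this question into atomic sub-questions, one per line. "
        "Refer to earlier answers as [ANSWER 1], [ANSWER 2], ...\n" + question
    )
    sub_qs = [line.strip() for line in raw.splitlines() if line.strip()]

    # 2. Answer each sub-question single-hop, substituting earlier answers.
    answers: list[str] = []
    for sub_q in sub_qs:
        for n, prev in enumerate(answers, start=1):
            sub_q = sub_q.replace(f"[ANSWER {n}]", prev)
        context = "\n".join(retrieve(sub_q))
        answers.append(ask_llm(f"Context:\n{context}\n\nQuestion: {sub_q}\nAnswer briefly:"))

    # 3. Compose the final answer from the sub-answers.
    trace = "\n".join(f"{q} -> {a}" for q, a in zip(sub_qs, answers))
    return ask_llm(f"Question: {question}\nSub-answers:\n{trace}\nFinal answer:")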

Iterative-Retrieve

Retrieve, look at the result, decide what's missing, retrieve again. Continue until the question is answered or budget is exhausted.

flowchart LR
    Q[Question] --> R1[Retrieve 1]
    R1 --> Eval1[LLM evaluates: enough?]
    Eval1 -->|No| Refine[Refine query]
    Refine --> R2[Retrieve 2]
    R2 --> Eval2[Eval]
    Eval2 -->|Yes| Answer[Answer]

This pattern is flexible: it can handle questions whose decomposition is not obvious upfront.
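A sketch of that loop with a hard iteration cap; ask_llm and retrieve are again placeholders for your LLM client and retriever:

from typing import Callable

def iterative_retrieve(
    question: str,
    ask_llm: Callable[[str], str],         # placeholder: your LLM client
    retrieve: Callable[[str], list[str]],  # placeholder: your retriever
    max_hops: int = 4,                     # budget cap so the loop always terminates
) -> str:
    gathered: list[str] = []
    query = question
    for _ in range(max_hops):
        gathered.extend(retrieve(query))
        verdict = ask_llm(
            "Context so far:\n" + "\n".join(gathered)
            + f"\n\nQuestion: {question}\n"
            + "Reply ENOUGH if the context answers the question, "
            + "otherwise reply with the next retrieval query."
        )
        if verdict.strip().upper().startswith("ENOUGH"):
            break
        query = verdict.strip()  # refine and retrieve again
    return ask_llm("Context:\n" + "\n".join(gathered) + f"\n\nQuestion: {question}\nAnswer:")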

Graph-Traverse

Use a knowledge graph for the structured part of the question. RAG for the unstructured. Blend results.


For "manager of the engineer who shipped X": graph stores the engineer-manager relationship; vector store stores feature-shipping records. Query both.

When Each One Wins

  • Decompose-then-retrieve: when the decomposition is obvious and bounded
  • Iterative-retrieve: when the decomposition emerges from intermediate results
  • Graph-traverse: when relationships are first-class and known

Most production multi-hop RAG in 2026 uses iterative-retrieve as the default with graph-traverse mixed in for relationship-heavy domains.

A Production Pattern

flowchart TB
    Q[Question] --> Class[Classify hop count]
    Class -->|1-hop| Simple[Standard RAG]
    Class -->|multi-hop| Iter[Iterative-retrieve]
    Iter --> R1[Retrieve]
    R1 --> Sub[LLM extracts sub-question]
    Sub --> R2[Retrieve sub-question]
    R2 --> Combine[LLM combines]
    Combine --> Done[Answer]

The classifier saves cost on simple questions. The iterative loop handles the hard ones with budget caps.
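In code, the routing step can be as small as this; the three callables are placeholders for your own classifier and pipelines:

from typing import Callable

def answer(
    question: str,
    classify_hops: Callable[[str], int],  # placeholder: cheap hop-count classifier
    standard_rag: Callable[[str], str],   # placeholder: single-hop pipeline
    iterative: Callable[[str], str],      # placeholder: iterative-retrieve with a budget cap
) -> str:
    # Simple questions take the cheap path; only multi-hop questions pay for the loop.
    return standard_rag(question) if classify_hops(question) <= 1 else iterative(question)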

Cost Control

Multi-hop RAG is expensive because it makes multiple LLM and retriever calls. Patterns:

  • Cap iterations (typically 3-5)
  • Use cheap LLMs for sub-question extraction
  • Cache intermediate results, as sketched below (the same sub-question may appear across user queries)
  • Pre-decompose common question shapes
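A minimal sketch of that caching layer, wrapping whatever sub-question answering function the pipeline already has; in production the dict would typically be Redis or similar with a TTL:

import hashlib
from typing import Callable

def with_cache(answer_sub_q: Callable[[str], str]) -> Callable[[str], str]:
    # Memoize sub-question answers so repeated hops across user queries cost nothing extra.
    cache: dict[str, str] = {}

    def cached(sub_q: str) -> str:
        key = hashlib.sha256(sub_q.strip().lower().encode()).hexdigest()
        if key not in cache:
            cache[key] = answer_sub_q(sub_q)
        return cache[key]

    return cached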

Evaluation

flowchart LR
    Test[Test questions] --> Multi[Multi-hop benchmark]
    Multi --> Recall[Recall: did we retrieve all needed chunks?]
    Multi --> Compose[Composition: did we combine correctly?]
    Multi --> Hop[Hop count: minimum needed]

Multi-hop benchmarks (HotpotQA, 2WikiMultihopQA) test retrieval over hops. Use them, but augment with your own multi-hop questions over your corpus.
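The retrieval-recall check is the easiest of these to automate. A sketch, using a hypothetical benchmark row that lists every chunk a question needs across all hops:

def retrieval_recall(retrieved_ids: set[str], gold_ids: set[str]) -> float:
    # Fraction of the chunks the question needs, across all hops, that were retrieved.
    return len(retrieved_ids & gold_ids) / len(gold_ids) if gold_ids else 1.0

# Hypothetical benchmark row: (question, ids of every chunk needed to answer it)
row = ("Who is the manager of the engineer who shipped feature X?", {"ship-log-17", "org-chart-3"})
print(retrieval_recall({"ship-log-17", "faq-2"}, row[1]))  # 0.5: the org-chart hop was missed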

Common Failure Modes

  • Decomposition gets the wrong sub-questions
  • Iterative loop never converges (cap iterations)
  • One bad sub-result poisons the chain (validate intermediate results; a sketch follows this list)
  • Composition step misses a key fact (use stronger model for composition)
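One hedged way to validate an intermediate result before it enters the chain, assuming the same placeholder ask_llm client as above:

from typing import Callable

def validated_answer(
    sub_q: str,
    context: str,
    ask_llm: Callable[[str], str],  # placeholder: your LLM client
) -> str | None:
    # Answer the sub-question, then check it against the retrieved context;
    # return None rather than pass a dubious answer to the next hop.
    answer = ask_llm(f"Context:\n{context}\n\nQuestion: {sub_q}\nAnswer briefly:")
    check = ask_llm(
        f"Context:\n{context}\n\nDoes the context support the answer '{answer}' "
        f"to the question '{sub_q}'? Reply YES or NO."
    )
    return answer if check.strip().upper().startswith("YES") else None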


An Operator's Perspective

There is a clean theory behind multi-hop RAG and there is a messier reality. The theory says agents reason, plan, and act. The reality is that agents stall on ambiguous tool outputs and double-spend tokens unless you put hard limits in place. The teams that ship fastest treat multi-hop RAG as an evals problem first and a modeling problem second. They write the failure cases into the regression set on day one, not after the first incident.

Why This Matters for AI Voice and Chat Agents

Agentic AI in a real call center is a different beast than a single-LLM chatbot. Instead of one model answering one prompt, you orchestrate a small team: a router that decides intent, specialists that own a vertical (booking, intake, billing, escalation), and tools that read and write to the same Postgres your CRM trusts. Hand-offs are where most production bugs hide: when Agent A passes context to Agent B, anything that isn't explicit in the message gets lost, and the user feels it as the agent "forgetting." That's why the systems that hold up under load are the ones with typed tool schemas, deterministic state stored outside the conversation, and a hard ceiling on tool calls per session.

The cost story is just as important: a multi-agent loop can quietly burn 10x the tokens of a single-LLM design if you let it think out loud at every step. The fix isn't a smarter model; it's smaller agents, shorter prompts, cached system messages, and evals that fail the build when p95 latency or per-session cost regresses. CallSphere runs this pattern across 6 verticals in production, and the rule has held every time: the agent you can debug in five minutes will out-survive the agent that's "smarter" on a benchmark.

FAQs

Q: How do you scale multi-hop RAG without blowing up token cost?
A: Scaling comes from constraint, not capability. The deployments that hold up keep each agent narrow, cap tool calls per turn, cache the system prompt, and pin a smaller model for routing while reserving the larger model for synthesis. CallSphere's stack (37 agents, 90+ tools, 115+ DB tables, 6 verticals live) is sized that way on purpose.

Q: What stops multi-hop RAG from looping forever on edge cases?
A: Hard ceilings beat heuristics. A maximum step count, an idempotency key on every tool call, and a fallback to a deterministic script when confidence drops below a threshold are what keep the loop bounded. Evals that simulate noisy inputs catch the rest before they reach a real caller.

Q: Where does CallSphere use multi-hop RAG in production today?
A: It's already in production. Today CallSphere runs this pattern in Real Estate, alongside the other live verticals (Healthcare, Salon, Sales, After-Hours Escalation, IT Helpdesk). The same orchestrator code path serves voice and chat; the difference is the tool set the router exposes.

See It Live

Want to see after-hours escalation agents handle real traffic? Spin up a walkthrough at https://escalation.callsphere.tech or grab 20 minutes on the calendar: https://calendly.com/sagar-callsphere/new-meeting.