AI Engineering

Query Rewriting and Multi-Query Expansion for AI Search in 2026

Roughly 60% of follow-up messages contain unresolved coreferences. Query rewriting fixes pronouns, expands recall with multi-query, and applies constraint filters before retrieval ever runs.

TL;DR — Raw user queries are noisy: "what about the second one?" tells the retriever nothing. The 2026 query-rewriting stack handles four jobs in parallel — coreference resolution, expansion (multi-query), step-back abstraction, and constraint extraction — before retrieval ever fires.

The technique

DMQR-RAG (Diverse Multi-Query Rewriting) and the Multi-Query Retriever pattern both rest on one idea: a single query is an under-specified probe. Generate N rewrites covering different angles, retrieve for each, and fuse the lists. Add a step-back rewrite that goes from specific to abstract ("what is the cancellation policy for premium plans on weekends in NYC?" -> "what is the cancellation policy?") to capture parent-context chunks.

For multi-turn voice/chat, the killer step is coreference resolution: replace pronouns and demonstratives with their referents from history. Without it, ~60% of follow-ups retrieve nothing useful.

flowchart LR
  H[Chat history] --> CR[Coreference resolver]
  Q[Raw query] --> CR
  CR --> EX[Multi-query expansion]
  CR --> SB[Step-back abstraction]
  CR --> CN[Constraint extractor]
  EX --> R[Retrieve x N]
  SB --> R
  CN --> FT[Metadata filter]
  R --> FU[RRF fuse]
  FT --> FU
  FU --> A[Agent]

How it works

A small LLM (Haiku 4.5 or Llama 3.1 8B, ~50–80 ms) ingests the last 6 turns plus the new utterance, then emits a JSON object with resolved_query, expansions (3 diverse paraphrases), stepback, and filters ({ date_range, status, vertical }). Each rewrite hits the retriever in parallel; results are fused via RRF; metadata filters are applied at the index level (cheap) rather than post-retrieval (expensive).
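The RRF fusion step is small enough to show inline. This is a minimal sketch of a hypothetical `rrf_fuse` helper over ranked lists of document IDs; `k=60` is the conventional constant from the original RRF formulation:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score each doc by the sum of 1/(k + rank)
    across every ranking it appears in, then sort by fused score."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse([["a", "b", "c"], ["b", "a", "d"]])
# "a" and "b" each appear near the top of both lists, so they outrank "c" and "d"
```

Docs that appear high in several rewrites' result lists win, which is exactly why diverse rewrites help: agreement across probes is the signal.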


The DMQR-RAG paper formalizes four expansion strategies at different information levels — equivalence, generalization, specialization, and adversarial — and shows that diversity matters more than count.

CallSphere implementation

Every CallSphere agent runs a query rewriter. The Healthcare agent resolves "her" -> "patient ID 4421"; UrackIT IT helpdesk resolves "the same error" by injecting the most recent ticket subject; OneRoof real estate resolves "that listing" by pulling the last MLS ID from session memory. The rewriter also extracts constraints — "this week," "under $500k," "in-network" — into structured metadata filters that hit Postgres indexes directly.
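Constraint extraction only pays off if the filters compile down to indexed predicates. A minimal sketch of that compilation step; the column names, the `"this_week"` convention, and the filter keys are assumptions for illustration:

```python
from datetime import date, timedelta

def filters_to_sql(filters):
    """Turn extracted constraints into a parameterized WHERE clause
    that can hit Postgres indexes directly (never filter in Python)."""
    clauses, params = [], []
    if filters.get("date_range") == "this_week":
        # Monday of the current week as the lower bound
        start = date.today() - timedelta(days=date.today().weekday())
        clauses.append("created_at >= %s")
        params.append(start)
    if filters.get("status"):
        clauses.append("status = %s")
        params.append(filters["status"])
    if filters.get("max_price"):
        clauses.append("price < %s")
        params.append(filters["max_price"])
    return " AND ".join(clauses), params

where, params = filters_to_sql({"status": "in-network", "max_price": 500_000})
# where == "status = %s AND price < %s"
```

Parameterized placeholders keep the query plan cacheable and the inputs safe, and each clause maps onto a plain B-tree index.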

37 agents · 90+ tools · 115+ DB tables · 6 verticals. $149 / $499 / $1499, 14-day trial, 22% affiliate. Try the multi-turn flow on /demo or compare verticals at /industries/it-services and /industries/real-estate.

Build steps with code

import json

# Literal JSON braces must be doubled ({{ }}) so str.format only fills
# {history} and {message}; the original single braces would raise at runtime.
REWRITE_PROMPT = """Given conversation history and a new user message, output JSON:
{{
  "resolved": "<query with all pronouns resolved>",
  "expansions": ["<3 diverse rewrites>"],
  "stepback": "<more abstract version>",
  "filters": {{"date_range": "...", "vertical": "...", "status": "..."}}
}}
History: {history}
New message: {message}"""

def rewrite_and_retrieve(history, msg):
    # One small-LLM call produces the full rewrite plan as JSON
    plan = json.loads(small_llm.complete(REWRITE_PROMPT.format(history=history, message=msg)))
    queries = [plan["resolved"], *plan["expansions"], plan["stepback"]]
    # Fan out one retrieval per rewrite (run these concurrently in production),
    # pushing extracted constraints down as index-level filters
    results = [hybrid_retrieve(q, filters=plan["filters"]) for q in queries]
    return rrf_fuse(results)

  1. Pin the rewriter model and prompt — version both as code.
  2. Cache rewrites by (last-3-turns, query) hash.
  3. Log every rewrite for offline eval; the rewriter is the silent ranker.
  4. Apply constraint filters at index level, never in Python.
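Step 2's cache key can be a stable hash over the last three turns plus the raw query. The serialization scheme below is one reasonable choice, not the only one:

```python
import hashlib
import json

def rewrite_cache_key(history, query, n_turns=3):
    """Stable key over (last-3-turns, query): identical inputs always
    hash the same, and turns older than the window never bust the cache."""
    payload = json.dumps({"turns": history[-n_turns:], "q": query},
                         ensure_ascii=False, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

k1 = rewrite_cache_key(["a", "b", "c", "d"], "what about the second one?")
k2 = rewrite_cache_key(["x", "b", "c", "d"], "what about the second one?")
# k1 == k2: only the last 3 turns participate in the key
```

Hashing only the recent window matters because coreference resolution rarely reaches further back, and keying on full history would make every cache entry single-use.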

Pitfalls

  • Over-expansion: 10 rewrites is noise, not signal. 3–4 is the sweet spot.
  • Stepback hallucination: small models invent constraints. Validate with a regex/JSON schema.
  • Latency tax: 80ms rewriter + 4 parallel retrieves can blow a voice budget. Run async and timeout aggressively.
  • Coreference loops: do not let the rewriter resolve a pronoun to itself. Detect and fall back to raw query.
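The schema-validation and coreference-loop pitfalls can share one post-check. The field names mirror the prompt above; the loop heuristic (resolved query still contains a pronoun when the raw query did) is an assumption, one cheap signal among several you could use:

```python
import re

REQUIRED_FIELDS = {"resolved", "expansions", "stepback", "filters"}
# Heuristic pronoun detector; extend for your domain
PRONOUNS = re.compile(r"\b(it|that|this|they|them|he|she|her|him)\b", re.IGNORECASE)

def validate_plan(plan, raw_query):
    """Fall back to the raw query if the rewrite plan is malformed,
    and distrust a resolver that left pronouns unresolved (loop)."""
    if not REQUIRED_FIELDS.issubset(plan):
        return {"resolved": raw_query, "expansions": [],
                "stepback": raw_query, "filters": {}}
    if PRONOUNS.search(plan["resolved"]) and PRONOUNS.search(raw_query):
        plan = dict(plan, resolved=raw_query)  # resolver looped; use raw query
    return plan
```

The fallback is deliberately conservative: a raw query that retrieves poorly is still better than a hallucinated rewrite that retrieves confidently wrong chunks.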

FAQ

Multi-query or HyDE? Multi-query for breadth; HyDE for depth on abstract queries. They compose.


Do I need a finetuned rewriter? No. A well-prompted Haiku 4.5 or Llama 3.1 8B is enough.

Voice or chat? Both. Voice has tighter latency; the rewriter must be sub-100ms.

Constraint extraction or post-filter? Always constraint extraction — index-side filtering is 10–100x cheaper.

Where can I see this on /demo? Toggle "show internals" to watch the rewriter JSON in real time.


