Learn Agentic AI

Debugging RAG Retrieval: When the Agent Retrieves Wrong or Irrelevant Documents

Learn systematic approaches to debugging RAG retrieval failures, including query analysis, embedding inspection, relevance-score evaluation, and chunk-quality review, for more accurate AI agent responses.

The Right Question, the Wrong Answer

Your RAG-powered agent has access to thousands of documents. A user asks a straightforward question. The agent retrieves three chunks, synthesizes a response, and delivers it confidently. The response is wrong — not because the model hallucinated, but because it was given the wrong documents to work with.

RAG retrieval failures are particularly dangerous because the agent has no way to know it retrieved bad chunks. It trusts what it receives and generates a plausible-sounding answer from irrelevant source material. Debugging this requires inspecting every stage of the retrieval pipeline.

The RAG Retrieval Pipeline

Every RAG query passes through four stages, and failures can occur at each one:

flowchart LR
    Q(["User query"])
    EMB["Embed query<br/>text-embedding-3"]
    VEC[("Vector DB<br/>pgvector or Pinecone")]
    RET["Top-k retrieval<br/>k = 8"]
    PROMPT["Augmented prompt<br/>system plus context"]
    LLM["LLM generation<br/>Claude or GPT"]
    CITE["Inline citations<br/>and page anchors"]
    OUT(["Grounded answer"])
    Q --> EMB --> VEC --> RET --> PROMPT --> LLM --> CITE --> OUT
    style EMB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style VEC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
  1. Query formation: The user question is transformed into a search query
  2. Embedding: The query is converted to a vector
  3. Vector search: The nearest neighbor chunks are retrieved
  4. Relevance filtering: Results below a threshold are discarded

Build a debugger that captures data at every stage:

import numpy as np
from dataclasses import dataclass, field

@dataclass
class RetrievalDebugInfo:
    original_query: str = ""
    search_query: str = ""
    query_embedding: list[float] = field(default_factory=list)
    raw_results: list[dict] = field(default_factory=list)
    filtered_results: list[dict] = field(default_factory=list)
    similarity_scores: list[float] = field(default_factory=list)

class RAGDebugger:
    def __init__(self, embedding_client, vector_store):
        self.embedding_client = embedding_client
        self.vector_store = vector_store

    async def debug_retrieve(
        self,
        query: str,
        top_k: int = 5,
        threshold: float = 0.7,
    ) -> RetrievalDebugInfo:
        info = RetrievalDebugInfo(original_query=query)

        # Stage 1: Query formation
        info.search_query = query  # or apply transformation
        print(f"[1] Query: {info.search_query}")

        # Stage 2: Embedding
        response = await self.embedding_client.embeddings.create(
            model="text-embedding-3-small",
            input=info.search_query,
        )
        info.query_embedding = response.data[0].embedding
        print(f"[2] Embedding dim: {len(info.query_embedding)}")

        # Stage 3: Vector search
        results = await self.vector_store.query(
            embedding=info.query_embedding,
            top_k=top_k,
        )
        info.raw_results = results
        info.similarity_scores = [r["score"] for r in results]
        print(f"[3] Raw results: {len(results)}")
        for i, r in enumerate(results):
            print(f"    [{i}] score={r['score']:.4f} | {r['text'][:80]}...")

        # Stage 4: Filtering
        info.filtered_results = [
            r for r in results if r["score"] >= threshold
        ]
        print(f"[4] After filter (>={threshold}): {len(info.filtered_results)}")

        return info

Diagnosing Query-Document Mismatch

The most common RAG failure is a semantic gap between the query and the stored chunks. The user asks one thing, but the embedding model interprets it differently:

async def diagnose_query_mismatch(
    debugger, query: str, expected_doc_ids: list[str]
):
    """Check if expected documents score higher than retrieved ones."""
    info = await debugger.debug_retrieve(query, top_k=20)

    retrieved_ids = {r["id"] for r in info.raw_results}
    expected_set = set(expected_doc_ids)

    found = expected_set & retrieved_ids
    missed = expected_set - retrieved_ids

    print(f"Expected docs found in top-20: {len(found)}/{len(expected_set)}")
    if missed:
        print(f"Missing doc IDs: {missed}")
        # Fetch embeddings for missing docs and compute similarity
        for doc_id in missed:
            doc = await debugger.vector_store.get_by_id(doc_id)
            if doc:
                doc_emb = doc["embedding"]
                query_emb = np.array(info.query_embedding)
                similarity = np.dot(query_emb, np.array(doc_emb)) / (
                    np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
                )
                print(f"  {doc_id}: similarity={similarity:.4f}")
                print(f"    Content: {doc['text'][:100]}...")

Inspecting Chunk Quality

Bad chunking is a silent killer of RAG accuracy. Chunks that split important information across boundaries lose semantic coherence:

class ChunkQualityAnalyzer:
    def __init__(self, embedding_client):
        self.client = embedding_client

    async def analyze_chunks(self, chunks: list[str], query: str):
        """Score each chunk for self-containedness and relevance."""
        # Embed query and all chunks
        all_texts = [query] + chunks
        response = await self.client.embeddings.create(
            model="text-embedding-3-small",
            input=all_texts,
        )
        embeddings = [d.embedding for d in response.data]
        query_emb = np.array(embeddings[0])

        print(f"Analyzing {len(chunks)} chunks against query")
        print("-" * 60)

        for i, chunk in enumerate(chunks):
            chunk_emb = np.array(embeddings[i + 1])
            similarity = float(np.dot(query_emb, chunk_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(chunk_emb)
            ))
            word_count = len(chunk.split())
            has_incomplete_sentence = (
                not chunk.strip().endswith((".", "!", "?", '."', ".'"))
            )

            print(f"Chunk {i}: similarity={similarity:.4f}, "
                  f"words={word_count}, "
                  f"incomplete={'YES' if has_incomplete_sentence else 'no'}")
            if has_incomplete_sentence:
                print(f"  Ends with: ...{chunk[-60:]}")

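When the analyzer above flags incomplete chunks, the fix is usually to re-chunk along sentence boundaries. Here is a minimal sketch of a sentence-aware chunker with overlap; the regex split and the `max_words`/`overlap_sentences` parameters are illustrative choices, not a production splitter:

```python
import re

def chunk_by_sentences(text: str, max_words: int = 120,
                       overlap_sentences: int = 1) -> list[str]:
    # Naive sentence split on terminal punctuation followed by whitespace --
    # good enough for a debugging pass, not a substitute for a real splitter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sent in sentences:
        words = len(sent.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) into the next chunk so facts that
            # straddle a boundary appear in both chunks.
            current = current[-overlap_sentences:]
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only ever close at sentence boundaries, the `has_incomplete_sentence` check above should never fire on their output.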
Testing with Known-Good Queries

Build a test suite of queries with expected document matches to catch retrieval regressions:


class RAGTestSuite:
    def __init__(self, debugger):
        self.debugger = debugger
        self.test_cases = []

    def add_case(self, query: str, expected_doc_ids: list[str], threshold=0.7):
        self.test_cases.append({
            "query": query,
            "expected": expected_doc_ids,
            "threshold": threshold,
        })

    async def run(self):
        results = []
        for case in self.test_cases:
            info = await self.debugger.debug_retrieve(
                case["query"], top_k=10, threshold=case["threshold"]
            )
            retrieved_ids = {r["id"] for r in info.filtered_results}
            expected = set(case["expected"])
            recall = len(expected & retrieved_ids) / len(expected) if expected else 1.0

            results.append({
                "query": case["query"],
                "recall": recall,
                "pass": recall >= 0.8,
            })
            status = "PASS" if recall >= 0.8 else "FAIL"
            print(f"[{status}] recall={recall:.0%} | {case['query'][:60]}")
        return results
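The suite's pass/fail decision hinges on a single metric, recall@k. Pulled out on its own, it is just:

```python
def recall_at_k(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Fraction of expected documents that appear anywhere in the retrieved
    list. Returns 1.0 when nothing is expected, matching the suite above."""
    expected = set(expected_ids)
    if not expected:
        return 1.0
    return len(expected & set(retrieved_ids)) / len(expected)
```

Tracking this number per test case over time is what turns the suite into a regression detector: a re-chunk or embedding-model swap that silently drops recall shows up immediately.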

FAQ

What should I do when retrieval returns chunks that are topically related but not actually relevant?

This is a precision problem. Increase your similarity threshold to filter out loosely related chunks. Also consider using a reranker model as a second-stage filter — cross-encoder rerankers like Cohere Rerank or BGE Reranker evaluate query-document pairs more accurately than cosine similarity on embeddings alone.
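The retrieve-then-rerank flow can be sketched as a generic second stage with a pluggable scorer. The `rerank` signature and the word-overlap stand-in scorer below are illustrative; in production the scorer would be a real cross-encoder call:

```python
def rerank(query: str, candidates: list[dict], score_fn,
           keep: int = 3) -> list[dict]:
    """Second-stage filter: score every (query, chunk) pair and keep the
    best `keep`. `score_fn` is where a real cross-encoder (Cohere Rerank,
    a BGE reranker, etc.) would plug in."""
    scored = [dict(c, rerank_score=score_fn(query, c["text"]))
              for c in candidates]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:keep]

def overlap_score(query: str, text: str) -> float:
    """Stand-in scorer using word overlap -- purely illustrative; a trained
    cross-encoder model replaces this in a real pipeline."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)
```

Retrieve a generous `top_k` (say 20) from the vector store, then let the reranker cut it down to the handful of chunks that actually enter the prompt.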

Should I embed the user question directly or rewrite it before searching?

Query rewriting often improves retrieval significantly. Use the LLM to expand abbreviations, resolve pronouns from conversation history, and rephrase colloquial language into terminology that matches your documents. A simple rewriting step can increase recall by 20 to 40 percent.

How do I decide the right chunk size for my documents?

There is no universal answer — it depends on your content. Start with 500 to 800 tokens with 100-token overlap. Test with your actual queries and measure recall. If chunks are too small, they lack context. If too large, they dilute relevance. Technical documentation often benefits from smaller chunks while narrative content works better with larger ones.
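A minimal sketch of the suggested starting point, fixed-size chunks with overlap. Whitespace-split words stand in for tokens here; a real tokenizer (e.g. tiktoken) would replace `split()` to count actual tokens:

```python
def chunk_with_overlap(text: str, size: int = 600, overlap: int = 100) -> list[str]:
    """Fixed-size chunking with overlap. Word counts approximate tokens;
    swap in a real tokenizer for production use."""
    assert 0 <= overlap < size
    words = text.split()
    step = size - overlap
    # Each chunk starts `step` words after the previous one, so consecutive
    # chunks share exactly `overlap` words.
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Sweep `size` over a few values, run the test suite from the previous section against each re-chunked index, and let measured recall pick the winner.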


#Debugging #RAG #Embeddings #VectorSearch #AIAgents #AgenticAI #LearnAI #AIEngineering
