Learn Agentic AI

Debugging RAG Retrieval: When the Agent Retrieves Wrong or Irrelevant Documents

Learn systematic approaches to debugging RAG retrieval failures, including query analysis, embedding inspection, relevance-score evaluation, and chunk-quality review, for more accurate AI agent responses.

The Right Question, the Wrong Answer

Your RAG-powered agent has access to thousands of documents. A user asks a straightforward question. The agent retrieves three chunks, synthesizes a response, and delivers it confidently. The response is wrong — not because the model hallucinated, but because it was given the wrong documents to work with.

RAG retrieval failures are particularly dangerous because the agent has no way to know it retrieved bad chunks. It trusts what it receives and generates a plausible-sounding answer from irrelevant source material. Debugging this requires inspecting every stage of the retrieval pipeline.

The RAG Retrieval Pipeline

Every RAG query passes through four stages, and failures can occur at each one:

flowchart LR
    Q(["User query"])
    EMB["Embed query<br/>text-embedding-3"]
    VEC[("Vector DB<br/>pgvector or Pinecone")]
    RET["Top-k retrieval<br/>k = 8"]
    PROMPT["Augmented prompt<br/>system plus context"]
    LLM["LLM generation<br/>Claude or GPT"]
    CITE["Inline citations<br/>and page anchors"]
    OUT(["Grounded answer"])
    Q --> EMB --> VEC --> RET --> PROMPT --> LLM --> CITE --> OUT
    style EMB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style VEC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
  1. Query formation: The user question is transformed into a search query
  2. Embedding: The query is converted to a vector
  3. Vector search: The nearest neighbor chunks are retrieved
  4. Relevance filtering: Results below a threshold are discarded

Build a debugger that captures data at every stage:

import numpy as np
from dataclasses import dataclass, field

@dataclass
class RetrievalDebugInfo:
    original_query: str = ""
    search_query: str = ""
    query_embedding: list[float] = field(default_factory=list)
    raw_results: list[dict] = field(default_factory=list)
    filtered_results: list[dict] = field(default_factory=list)
    similarity_scores: list[float] = field(default_factory=list)

class RAGDebugger:
    def __init__(self, embedding_client, vector_store):
        self.embedding_client = embedding_client
        self.vector_store = vector_store

    async def debug_retrieve(
        self,
        query: str,
        top_k: int = 5,
        threshold: float = 0.7,
    ) -> RetrievalDebugInfo:
        info = RetrievalDebugInfo(original_query=query)

        # Stage 1: Query formation
        info.search_query = query  # or apply transformation
        print(f"[1] Query: {info.search_query}")

        # Stage 2: Embedding
        response = await self.embedding_client.embeddings.create(
            model="text-embedding-3-small",
            input=info.search_query,
        )
        info.query_embedding = response.data[0].embedding
        print(f"[2] Embedding dim: {len(info.query_embedding)}")

        # Stage 3: Vector search
        results = await self.vector_store.query(
            embedding=info.query_embedding,
            top_k=top_k,
        )
        info.raw_results = results
        info.similarity_scores = [r["score"] for r in results]
        print(f"[3] Raw results: {len(results)}")
        for i, r in enumerate(results):
            print(f"    [{i}] score={r['score']:.4f} | {r['text'][:80]}...")

        # Stage 4: Filtering
        info.filtered_results = [
            r for r in results if r["score"] >= threshold
        ]
        print(f"[4] After filter (>={threshold}): {len(info.filtered_results)}")

        return info

Diagnosing Query-Document Mismatch

The most common RAG failure is a semantic gap between the query and the stored chunks. The user asks one thing, but the embedding model interprets it differently:

async def diagnose_query_mismatch(
    debugger, query: str, expected_doc_ids: list[str]
):
    """Check if expected documents score higher than retrieved ones."""
    info = await debugger.debug_retrieve(query, top_k=20)

    retrieved_ids = {r["id"] for r in info.raw_results}
    expected_set = set(expected_doc_ids)

    found = expected_set & retrieved_ids
    missed = expected_set - retrieved_ids

    print(f"Expected docs found in top-20: {len(found)}/{len(expected_set)}")
    if missed:
        print(f"Missing doc IDs: {missed}")
        # Fetch embeddings for missing docs and compute similarity
        for doc_id in missed:
            doc = await debugger.vector_store.get_by_id(doc_id)
            if doc:
                doc_emb = doc["embedding"]
                query_emb = np.array(info.query_embedding)
                similarity = np.dot(query_emb, np.array(doc_emb)) / (
                    np.linalg.norm(query_emb) * np.linalg.norm(doc_emb)
                )
                print(f"  {doc_id}: similarity={similarity:.4f}")
                print(f"    Content: {doc['text'][:100]}...")

Inspecting Chunk Quality

Bad chunking is a silent killer of RAG accuracy. Chunks that split important information across boundaries lose semantic coherence:

class ChunkQualityAnalyzer:
    def __init__(self, embedding_client):
        self.client = embedding_client

    async def analyze_chunks(self, chunks: list[str], query: str):
        """Score each chunk for self-containedness and relevance."""
        # Embed query and all chunks
        all_texts = [query] + chunks
        response = await self.client.embeddings.create(
            model="text-embedding-3-small",
            input=all_texts,
        )
        embeddings = [d.embedding for d in response.data]
        query_emb = np.array(embeddings[0])

        print(f"Analyzing {len(chunks)} chunks against query")
        print("-" * 60)

        for i, chunk in enumerate(chunks):
            chunk_emb = np.array(embeddings[i + 1])
            similarity = float(np.dot(query_emb, chunk_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(chunk_emb)
            ))
            word_count = len(chunk.split())
            has_incomplete_sentence = (
                not chunk.strip().endswith((".", "!", "?", '."', ".'"))
            )

            print(f"Chunk {i}: similarity={similarity:.4f}, "
                  f"words={word_count}, "
                  f"incomplete={'YES' if has_incomplete_sentence else 'no'}")
            if has_incomplete_sentence:
                print(f"  Ends with: ...{chunk[-60:]}")

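When the analyzer above flags incomplete chunks, the fix is usually to re-chunk along sentence boundaries. Here is a minimal sketch of a sentence-aware chunker with overlap; the regex split and the `max_words`/`overlap_sentences` parameters are illustrative choices, not a production splitter:

```python
import re

def chunk_by_sentences(text: str, max_words: int = 120,
                       overlap_sentences: int = 1) -> list[str]:
    # Naive sentence split on terminal punctuation followed by whitespace --
    # good enough for a debugging pass, not a substitute for a real splitter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip())
                 if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for sent in sentences:
        words = len(sent.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) into the next chunk so facts that
            # straddle a boundary appear in both chunks.
            current = current[-overlap_sentences:]
            count = sum(len(s.split()) for s in current)
        current.append(sent)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because chunks only ever close at sentence boundaries, the `has_incomplete_sentence` check above should never fire on their output.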
Testing with Known-Good Queries

Build a test suite of queries with expected document matches to catch retrieval regressions:


class RAGTestSuite:
    def __init__(self, debugger):
        self.debugger = debugger
        self.test_cases = []

    def add_case(self, query: str, expected_doc_ids: list[str], threshold=0.7):
        self.test_cases.append({
            "query": query,
            "expected": expected_doc_ids,
            "threshold": threshold,
        })

    async def run(self):
        results = []
        for case in self.test_cases:
            info = await self.debugger.debug_retrieve(
                case["query"], top_k=10, threshold=case["threshold"]
            )
            retrieved_ids = {r["id"] for r in info.filtered_results}
            expected = set(case["expected"])
            recall = len(expected & retrieved_ids) / len(expected) if expected else 1.0

            results.append({
                "query": case["query"],
                "recall": recall,
                "pass": recall >= 0.8,
            })
            status = "PASS" if recall >= 0.8 else "FAIL"
            print(f"[{status}] recall={recall:.0%} | {case['query'][:60]}")
        return results
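The suite's pass/fail decision hinges on a single metric, recall@k. Pulled out on its own, it is just:

```python
def recall_at_k(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    """Fraction of expected documents that appear anywhere in the retrieved
    list. Returns 1.0 when nothing is expected, matching the suite above."""
    expected = set(expected_ids)
    if not expected:
        return 1.0
    return len(expected & set(retrieved_ids)) / len(expected)
```

Tracking this number per test case over time is what turns the suite into a regression detector: a re-chunk or embedding-model swap that silently drops recall shows up immediately.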

FAQ

What should I do when retrieval returns chunks that are topically related but not actually relevant?

This is a precision problem. Increase your similarity threshold to filter out loosely related chunks. Also consider using a reranker model as a second-stage filter — cross-encoder rerankers like Cohere Rerank or BGE Reranker evaluate query-document pairs more accurately than cosine similarity on embeddings alone.
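The retrieve-then-rerank flow can be sketched as a generic second stage with a pluggable scorer. The `rerank` signature and the word-overlap stand-in scorer below are illustrative; in production the scorer would be a real cross-encoder call:

```python
def rerank(query: str, candidates: list[dict], score_fn,
           keep: int = 3) -> list[dict]:
    """Second-stage filter: score every (query, chunk) pair and keep the
    best `keep`. `score_fn` is where a real cross-encoder (Cohere Rerank,
    a BGE reranker, etc.) would plug in."""
    scored = [dict(c, rerank_score=score_fn(query, c["text"]))
              for c in candidates]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:keep]

def overlap_score(query: str, text: str) -> float:
    """Stand-in scorer using word overlap -- purely illustrative; a trained
    cross-encoder model replaces this in a real pipeline."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)
```

Retrieve a generous `top_k` (say 20) from the vector store, then let the reranker cut it down to the handful of chunks that actually enter the prompt.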

Should I embed the user question directly or rewrite it before searching?

Query rewriting often improves retrieval significantly. Use the LLM to expand abbreviations, resolve pronouns from conversation history, and rephrase colloquial language into terminology that matches your documents. A simple rewriting step can increase recall by 20 to 40 percent.

How do I decide the right chunk size for my documents?

There is no universal answer — it depends on your content. Start with 500 to 800 tokens with 100-token overlap. Test with your actual queries and measure recall. If chunks are too small, they lack context. If too large, they dilute relevance. Technical documentation often benefits from smaller chunks while narrative content works better with larger ones.
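A minimal sketch of the suggested starting point, fixed-size chunks with overlap. Whitespace-split words stand in for tokens here; a real tokenizer (e.g. tiktoken) would replace `split()` to count actual tokens:

```python
def chunk_with_overlap(text: str, size: int = 600, overlap: int = 100) -> list[str]:
    """Fixed-size chunking with overlap. Word counts approximate tokens;
    swap in a real tokenizer for production use."""
    assert 0 <= overlap < size
    words = text.split()
    step = size - overlap
    # Each chunk starts `step` words after the previous one, so consecutive
    # chunks share exactly `overlap` words.
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]
```

Sweep `size` over a few values, run the test suite from the previous section against each re-chunked index, and let measured recall pick the winner.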


#Debugging #RAG #Embeddings #VectorSearch #AIAgents #AgenticAI #LearnAI #AIEngineering
