Learn Agentic AI

Corrective RAG: Self-Correcting Retrieval with Relevance Checking and Web Fallback

Learn how Corrective RAG (CRAG) adds relevance scoring, re-retrieval, and web search fallback to catch and fix bad retrievals before they reach the user. Full Python implementation included.

The Problem CRAG Solves

Standard RAG has a silent failure mode: when the retriever returns irrelevant documents, the LLM either hallucinates an answer based on unrelated context or produces a vague response. The user has no way to know the retrieval failed because the system confidently presents whatever it generates.

Corrective RAG (CRAG) adds a quality gate between retrieval and generation. After retrieving documents, a relevance evaluator scores each result. If scores are high, generation proceeds normally. If scores are low, the system triggers corrective actions — rewriting the query, searching alternative sources, or falling back to web search.

This simple addition dramatically improves answer quality because most RAG failures originate in the retrieval step, not the generation step. Fix retrieval, and generation quality follows.


The CRAG Pipeline

The corrective RAG pipeline has four stages:

flowchart LR
    Q(["User query"])
    RET["Top-k retrieval<br/>k = 5"]
    EVAL["Relevance evaluation<br/>per document"]
    DEC{"Score check"}
    GEN["LLM generation<br/>verified context"]
    REW["Rewrite query<br/>and re-retrieve"]
    WEB["Web search<br/>fallback"]
    OUT(["Grounded answer"])
    Q --> RET --> EVAL --> DEC
    DEC -->|"correct"| GEN
    DEC -->|"ambiguous"| REW --> GEN
    DEC -->|"incorrect"| WEB --> GEN
    GEN --> OUT
    style EVAL fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style DEC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style GEN fill:#4f46e5,stroke:#4338ca,color:#fff
    style OUT fill:#059669,stroke:#047857,color:#fff
  1. Initial retrieval — Standard vector search returns top-K documents
  2. Relevance evaluation — Each document is scored for relevance to the query
  3. Corrective action — Based on scores, the system decides: proceed, refine, or fall back
  4. Generation — Only verified-relevant context reaches the LLM

Full Implementation

import json

from openai import OpenAI
from dataclasses import dataclass
from enum import Enum

client = OpenAI()

class RelevanceLevel(Enum):
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"

@dataclass
class ScoredDocument:
    content: str
    relevance: RelevanceLevel
    score: float

def evaluate_relevance(
    query: str, document: str
) -> tuple[RelevanceLevel, float]:
    """Score a retrieved document for relevance to the query."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": """Rate the relevance of the document
            to the query. Return JSON:
            {"relevance": "correct|ambiguous|incorrect",
             "score": 0.0-1.0,
             "reasoning": "brief explanation"}"""
        }, {
            "role": "user",
            "content": f"Query: {query}\nDocument: {document}"
        }],
        response_format={"type": "json_object"}
    )
    result = json.loads(response.choices[0].message.content)
    return (
        RelevanceLevel(result["relevance"]),
        result["score"],
    )

def rewrite_query(original_query: str) -> str:
    """Rewrite the query for better retrieval results."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "system",
            "content": "Rewrite this search query to be more "
                       "specific and likely to retrieve relevant "
                       "documents. Return only the rewritten query."
        }, {
            "role": "user",
            "content": original_query
        }],
    )
    return response.choices[0].message.content
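The evaluator's output is model-generated JSON, which can occasionally be malformed or out of range. A defensive parser keeps one bad response from crashing the whole pipeline. This is a sketch: `parse_relevance` and its ambiguous-by-default fallback are illustrative additions, not part of the implementation above.

```python
import json
from enum import Enum

class RelevanceLevel(Enum):  # redefined so this sketch runs standalone
    CORRECT = "correct"
    AMBIGUOUS = "ambiguous"
    INCORRECT = "incorrect"

def parse_relevance(raw: str) -> tuple[RelevanceLevel, float]:
    """Parse the evaluator's JSON response, clamping the score
    to [0, 1] and falling back to AMBIGUOUS on malformed output."""
    try:
        data = json.loads(raw)
        level = RelevanceLevel(data["relevance"])
        score = min(1.0, max(0.0, float(data["score"])))
        return level, score
    except (KeyError, TypeError, ValueError):
        # Treat unparseable output as ambiguous: the pipeline will
        # rewrite and retry rather than trust or discard blindly.
        return RelevanceLevel.AMBIGUOUS, 0.5
```

Defaulting to ambiguous rather than incorrect means a flaky evaluator response triggers a query rewrite, not an immediate web search.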

Adding Web Search Fallback

When internal documents are insufficient, CRAG falls back to web search:

import os

import requests

def web_search_fallback(query: str) -> list[str]:
    """Search the web when internal retrieval fails."""
    # Using a search API (Tavily, Serper, or similar)
    response = requests.post(
        "https://api.tavily.com/search",
        json={
            "api_key": os.environ["TAVILY_API_KEY"],
            "query": query,
            "max_results": 5,
            "include_raw_content": True,
        },
        timeout=15,
    )
    results = response.json().get("results", [])
    return [
        r["raw_content"][:2000]
        for r in results
        if r.get("raw_content")
    ]

def corrective_rag(
    query: str,
    retriever,
    relevance_threshold: float = 0.5,
) -> str:
    """Full CRAG pipeline with relevance checking
    and web fallback."""
    # Step 1: Initial retrieval
    raw_docs = retriever.search(query, k=5)

    # Step 2: Evaluate relevance of each document
    scored_docs = []
    for doc in raw_docs:
        level, score = evaluate_relevance(query, doc)
        scored_docs.append(ScoredDocument(doc, level, score))

    # Step 3: Determine corrective action
    relevant = [
        d for d in scored_docs
        if d.relevance == RelevanceLevel.CORRECT
        and d.score >= relevance_threshold
    ]
    ambiguous = [
        d for d in scored_docs
        if d.relevance == RelevanceLevel.AMBIGUOUS
    ]

    if relevant:
        # Enough good context — proceed with relevant docs
        context_docs = [d.content for d in relevant]
    elif ambiguous:
        # Rewrite query and try again (retried docs are used
        # as-is here; re-scoring them is a natural extension)
        new_query = rewrite_query(query)
        retry_docs = retriever.search(new_query, k=5)
        context_docs = retry_docs
    else:
        # All irrelevant — fall back to web search
        context_docs = web_search_fallback(query)

    # Step 4: Generate with verified context
    context = "\n\n".join(context_docs)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "system",
            "content": "Answer the question using only the "
                       "provided context. If the context is "
                       "insufficient, say so clearly."
        }, {
            "role": "user",
            "content": f"Context:\n{context}\n\nQuestion: {query}"
        }],
    )
    return response.choices[0].message.content

Tuning Relevance Thresholds

The relevance evaluator is the heart of CRAG. Set the threshold too high and you trigger unnecessary web searches; set it too low and irrelevant documents slip through. Start with a threshold of 0.5 and calibrate against a labeled dataset of query-document pairs. Use GPT-4o-mini for evaluation to keep costs low — it is accurate enough for coarse relevance judgments and roughly an order of magnitude cheaper than GPT-4o.
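A minimal calibration loop, assuming you have labeled pairs as (evaluator score, human relevance label) tuples, simply sweeps candidate thresholds and keeps the one with the best F1. The `best_threshold` helper below is a sketch, not part of the pipeline code above:

```python
def best_threshold(labeled: list[tuple[float, bool]]) -> float:
    """Sweep thresholds over labeled (score, is_relevant) pairs
    and return the one that maximizes F1."""
    best_t, best_f1 = 0.5, -1.0
    for i in range(1, 20):
        t = i / 20  # candidate thresholds 0.05 .. 0.95
        tp = sum(1 for s, rel in labeled if s >= t and rel)
        fp = sum(1 for s, rel in labeled if s >= t and not rel)
        fn = sum(1 for s, rel in labeled if s < t and rel)
        if tp == 0:
            continue  # no true positives at this threshold
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

A few hundred labeled pairs is usually enough to see where the evaluator's scores separate relevant from irrelevant documents.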

Production Considerations

In production, log every relevance evaluation with the query, document, and score. This creates a dataset for fine-tuning a smaller, faster relevance model. Track your fallback rate — if more than 20% of queries trigger web search, your knowledge base likely has coverage gaps that should be addressed at the indexing level.
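A sketch of that bookkeeping follows; the JSONL format and the `CragMetrics` name are illustrative, not a prescribed schema:

```python
import json
import time

class CragMetrics:
    """Log relevance evaluations as JSONL and track the share of
    queries that fall back to web search."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.total_queries = 0
        self.fallback_queries = 0

    def log_evaluation(self, query: str, document: str, score: float) -> None:
        record = {
            "ts": time.time(),
            "query": query,
            "document": document[:500],  # truncate to keep logs small
            "score": score,
        }
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def record_query(self, used_fallback: bool) -> None:
        self.total_queries += 1
        if used_fallback:
            self.fallback_queries += 1

    @property
    def fallback_rate(self) -> float:
        """Fraction of queries that triggered web search."""
        if self.total_queries == 0:
            return 0.0
        return self.fallback_queries / self.total_queries
```

An alert when `fallback_rate` crosses your 20% line is a cheap early warning that the knowledge base has drifted behind what users are asking.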


FAQ

Does the relevance evaluation step add significant latency?

Each evaluation takes 200-400ms with GPT-4o-mini. Since you can evaluate all documents in parallel, the total added latency is roughly one LLM call regardless of how many documents you retrieved. This 300ms investment prevents far costlier failures from irrelevant context.
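Since the evaluation calls are I/O-bound, a thread pool delivers that parallelism. This is a sketch; `evaluate_fn` stands in for the `evaluate_relevance` function from the implementation section (stubbed here as any callable taking a query and a document):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_all(query, documents, evaluate_fn, max_workers=8):
    """Score every document concurrently. Wall-clock time is
    roughly one evaluation call instead of len(documents) calls,
    and results come back in the original document order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda doc: evaluate_fn(query, doc), documents))
```

`pool.map` preserves input order, so the scored results line up with the retrieved documents without extra bookkeeping.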

Can I use a local model for relevance scoring instead of an API?

Yes. A fine-tuned BERT or DeBERTa classifier trained on query-document relevance pairs can score documents in under 10ms each. Start with an LLM-based evaluator to collect training data, then distill it into a local model for production speed.
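Assuming evaluations were logged as JSONL records with query, document, and score fields (as suggested under Production Considerations), turning the log into classifier training data takes only a few lines. The `to_training_pairs` helper and the 0.5 cutoff are illustrative:

```python
import json

def to_training_pairs(
    jsonl_path: str, threshold: float = 0.5
) -> list[tuple[str, str, int]]:
    """Convert logged evaluator judgments into (query, document,
    label) triples for training a small relevance classifier."""
    pairs = []
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            label = 1 if rec["score"] >= threshold else 0
            pairs.append((rec["query"], rec["document"], label))
    return pairs
```

These triples plug directly into standard sequence-pair classification fine-tuning for a BERT-style model.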

How does CRAG compare to simply retrieving more documents?

Retrieving more documents increases the chance of finding relevant content but also increases noise. CRAG is more surgical — it retrieves a focused set, evaluates quality, and only expands the search when necessary. This keeps context windows clean and generation quality high.


#CorrectiveRAG #CRAG #RAG #RelevanceScoring #WebSearchFallback #AgenticAI #LearnAI #AIEngineering
