Learn Agentic AI

Re-Ranking Search Results with Cross-Encoders: Improving Retrieval Precision

Understand the difference between bi-encoders and cross-encoders, then build a re-ranking pipeline that dramatically improves search precision by scoring query-document pairs jointly rather than independently.

The Precision Problem in First-Stage Retrieval

Bi-encoder models (like sentence-transformers) embed queries and documents independently, then compare them with cosine similarity. This independence is what makes them fast — you can pre-compute document embeddings — but it also limits their accuracy. A bi-encoder cannot model fine-grained interactions between specific query terms and specific document phrases.

Cross-encoders solve this by processing the query and document together as a single input pair, allowing the transformer's attention layers to directly compare every query token against every document token. The result is significantly higher precision, at the cost of speed.
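The difference is visible in what each model actually receives as input. A bi-encoder runs two independent forward passes; a cross-encoder runs one pass over the joined pair, so every attention layer can relate query tokens to document tokens directly. A schematic sketch using BERT-style special tokens (in practice the tokenizer inserts these for you):

```python
query = "capital of france"
doc = "Paris is the capital and largest city of France."

# Bi-encoder: two separate inputs, two forward passes; the outputs are
# pooled into vectors, and only those vectors ever meet (via cosine/dot).
bi_inputs = [f"[CLS] {query} [SEP]", f"[CLS] {doc} [SEP]"]

# Cross-encoder: one input, one forward pass; attention sees query and
# document tokens side by side and the head outputs a relevance score.
cross_input = f"[CLS] {query} [SEP] {doc} [SEP]"

print(cross_input)
```

Nothing in the bi-encoder's forward pass for the document ever sees the query, which is exactly why it can be pre-computed, and exactly why it loses precision.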

Bi-Encoder vs Cross-Encoder

The key architectural difference:

flowchart LR
    subgraph BI["Bi-encoder"]
        Q1(["Query"]) --> EQ["Encoder"]
        D1(["Document"]) --> ED["Encoder"]
        EQ --> QV[("Query vector")]
        ED --> DV[("Doc vector")]
        QV --> SIM["Cosine similarity"]
        DV --> SIM
    end
    subgraph CE["Cross-encoder"]
        PAIR(["Query + Document<br/>one concatenated input"]) --> TR["Transformer<br/>full cross-attention"]
        TR --> SCORE["Relevance score"]
    end
    style SIM fill:#4f46e5,stroke:#4338ca,color:#fff
    style SCORE fill:#059669,stroke:#047857,color:#fff
  • Bi-encoder: Embeds query and document separately, compares with dot product. Fast (pre-compute docs), but lower precision.
  • Cross-encoder: Concatenates query + document, passes through transformer together, outputs a single relevance score. Slow (must run for each pair), but much higher precision.

The standard pattern is a two-stage pipeline: use a bi-encoder to retrieve the top 50-100 candidates quickly, then re-rank those candidates with a cross-encoder.
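The arithmetic behind the two-stage pattern is stark. Using a rough 4 ms/pair CPU figure for a small cross-encoder (an assumed number, consistent with the model comparison later in this article), scoring a million-document corpus directly takes minutes, while re-ranking 50 bi-encoder candidates takes milliseconds:

```python
# Why two stages: cost of cross-encoding everything vs. candidates only.
N_DOCS = 1_000_000
MS_PER_PAIR = 4.0   # assumed cross-encoder CPU latency per (query, doc) pair
CANDIDATES = 50

cross_only_ms = N_DOCS * MS_PER_PAIR      # cross-encoder over the full corpus
two_stage_ms = CANDIDATES * MS_PER_PAIR   # cross-encoder over candidates only
# (the bi-encoder stage adds a few ms; document embeddings are pre-computed)

print(f"full corpus: {cross_only_ms / 60_000:.1f} min, two-stage: {two_stage_ms:.0f} ms")
```

The bi-encoder buys recall cheaply; the cross-encoder spends its budget only where precision matters.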

Building the Re-Ranking Pipeline

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from typing import List, Dict, Tuple

class TwoStageSearchPipeline:
    def __init__(
        self,
        bi_encoder_name: str = "all-MiniLM-L6-v2",
        cross_encoder_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_name)
        self.cross_encoder = CrossEncoder(cross_encoder_name)
        self.doc_embeddings = None
        self.documents = []

    def index_documents(self, documents: List[Dict]):
        """Pre-compute bi-encoder embeddings for all documents."""
        self.documents = documents
        texts = [f"{d['title']}. {d['body']}" for d in documents]
        self.doc_embeddings = self.bi_encoder.encode(
            texts, normalize_embeddings=True, show_progress_bar=True
        )

    def first_stage_retrieve(
        self, query: str, top_k: int = 50
    ) -> List[Tuple[int, float]]:
        """Fast retrieval using bi-encoder similarity."""
        query_emb = self.bi_encoder.encode(
            [query], normalize_embeddings=True
        )
        scores = np.dot(self.doc_embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(idx, scores[idx]) for idx in top_indices]

    def re_rank(
        self, query: str, candidates: List[Tuple[int, float]], top_k: int = 10
    ) -> List[Dict]:
        """Re-rank candidates using cross-encoder."""
        pairs = []
        for idx, _ in candidates:
            doc = self.documents[idx]
            text = f"{doc['title']}. {doc['body']}"
            pairs.append((query, text))

        # Cross-encoder scores all pairs jointly
        ce_scores = self.cross_encoder.predict(pairs)

        # Sort by cross-encoder score
        scored = list(zip(candidates, ce_scores))
        scored.sort(key=lambda x: x[1], reverse=True)

        results = []
        for (idx, bi_score), ce_score in scored[:top_k]:
            doc = self.documents[idx].copy()
            doc["bi_encoder_score"] = float(bi_score)
            doc["cross_encoder_score"] = float(ce_score)
            results.append(doc)
        return results

    def search(self, query: str, retrieve_k: int = 50, final_k: int = 10):
        candidates = self.first_stage_retrieve(query, top_k=retrieve_k)
        return self.re_rank(query, candidates, top_k=final_k)
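One detail of first_stage_retrieve worth calling out: because index_documents passes normalize_embeddings=True, the plain dot product is exactly cosine similarity. A toy check with hand-made vectors, no model involved:

```python
import numpy as np

# Three toy "document embeddings" and one "query embedding" (stand-ins
# for bi-encoder output; any non-zero vectors work).
docs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([[0.8, 0.6]])

# Normalize rows, as normalize_embeddings=True does.
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query = query / np.linalg.norm(query, axis=1, keepdims=True)

scores = np.dot(docs, query.T).flatten()   # dot product == cosine here
top = np.argsort(scores)[::-1]             # best match first
print(top.tolist())  # → [1, 0, 2]: doc 1 ([0.6, 0.8]) is closest to the query
```

For large corpora you would swap the exhaustive np.dot for an ANN index (FAISS, HNSW), but the scoring math is the same.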

Choosing the Right Cross-Encoder Model

Model selection depends on your latency budget:

# Model comparison (approximate, on CPU)
CROSS_ENCODER_MODELS = {
    # Model name: (params, ms/pair, nDCG@10 on MS MARCO)
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": ("4.4M", 1.5, 0.325),
    "cross-encoder/ms-marco-MiniLM-L-6-v2": ("22.7M", 4.0, 0.349),
    "cross-encoder/ms-marco-MiniLM-L-12-v2": ("33.4M", 8.0, 0.357),
    "cross-encoder/ms-marco-electra-base": ("109M", 12.0, 0.365),
}

def select_model(latency_budget_ms: float, num_candidates: int) -> str:
    """Select the best model that fits within the latency budget."""
    for name, (_params, ms_per_pair, _quality) in sorted(
        CROSS_ENCODER_MODELS.items(),
        key=lambda x: x[1][2],  # sort by nDCG@10
        reverse=True,  # prefer higher quality
    ):
        total_latency = ms_per_pair * num_candidates
        if total_latency <= latency_budget_ms:
            return name
    # Nothing fits: fall back to the smallest model even if it is over budget
    return "cross-encoder/ms-marco-TinyBERT-L-2-v2"
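Worked through by hand for a 300 ms budget over 50 candidates, using the approximate figures from the table above (inlined here so the snippet runs on its own): electra would need 600 ms, MiniLM-L-12 400 ms, so the highest-quality model that fits is MiniLM-L-6 at 200 ms.

```python
# Self-contained rerun of the selection logic with the table's figures.
MODELS = {  # name: (ms per pair, approx quality score)
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": (1.5, 0.325),
    "cross-encoder/ms-marco-MiniLM-L-6-v2": (4.0, 0.349),
    "cross-encoder/ms-marco-MiniLM-L-12-v2": (8.0, 0.357),
    "cross-encoder/ms-marco-electra-base": (12.0, 0.365),
}

def pick(budget_ms: float, n_candidates: int) -> str:
    # Try models best-quality-first; take the first that fits the budget.
    for name, (ms_per_pair, _q) in sorted(
        MODELS.items(), key=lambda kv: kv[1][1], reverse=True
    ):
        if ms_per_pair * n_candidates <= budget_ms:
            return name
    return "cross-encoder/ms-marco-TinyBERT-L-2-v2"  # over budget regardless

print(pick(300, 50))  # → cross-encoder/ms-marco-MiniLM-L-6-v2 (200 ms total)
print(pick(100, 50))  # → cross-encoder/ms-marco-TinyBERT-L-2-v2 (75 ms total)
```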

Managing Latency

Cross-encoders are expensive. Re-ranking 100 candidates with a 12-layer model at 8ms per pair takes 800ms. Strategies to reduce this:

  1. Reduce candidate count — retrieve 30-50 instead of 100. Diminishing returns beyond the top 50.
  2. Use smaller models — TinyBERT at 1.5ms/pair re-ranks 50 candidates in 75ms.
  3. Batch on GPU — GPU batching drops per-pair time by 10x.
  4. Cache re-ranked results — popular queries hit the same documents repeatedly.

A simple score cache implementing strategy 4:

import hashlib

class CachedReRanker:
    def __init__(self, cross_encoder: CrossEncoder, cache_size: int = 1024):
        self.cross_encoder = cross_encoder
        self._cache = {}
        self.cache_size = cache_size

    def _cache_key(self, query: str, doc_text: str) -> str:
        combined = f"{query}|||{doc_text}"
        return hashlib.md5(combined.encode()).hexdigest()

    def predict(self, pairs: list) -> list:
        scores = []
        uncached_pairs = []
        uncached_indices = []
        for i, (query, doc) in enumerate(pairs):
            key = self._cache_key(query, doc)
            if key in self._cache:
                scores.append(self._cache[key])
            else:
                scores.append(None)
                uncached_pairs.append((query, doc))
                uncached_indices.append(i)

        if uncached_pairs:
            new_scores = self.cross_encoder.predict(uncached_pairs)
            for idx, score in zip(uncached_indices, new_scores):
                key = self._cache_key(*pairs[idx])
                if len(self._cache) >= self.cache_size:
                    # Honor cache_size: evict the oldest entry
                    # (Python dicts preserve insertion order)
                    self._cache.pop(next(iter(self._cache)))
                self._cache[key] = float(score)
                scores[idx] = float(score)

        return scores

Measuring the Impact

Re-ranking typically improves nDCG@10 by 15-30% over bi-encoder-only retrieval. The improvement is most pronounced for ambiguous or complex queries where surface-level similarity is misleading.
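To verify a claim like this on your own data, nDCG@10 takes only a few lines of numpy, given graded relevance labels in ranked order (a minimal sketch; production evaluations usually lean on a library such as pytrec_eval):

```python
import numpy as np

def ndcg_at_k(relevances: list, k: int = 10) -> float:
    """nDCG@k for graded relevance labels listed in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    # Ideal ordering: the same labels sorted best-first
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts[: ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

# Same five documents, labeled 0-3; only the ordering differs.
bi_ranking = [0, 2, 1, 0, 3]  # bi-encoder buried the best doc at rank 5
ce_ranking = [3, 2, 1, 0, 0]  # re-ranking moved it to the top

print(round(ndcg_at_k(bi_ranking), 3), ndcg_at_k(ce_ranking))  # → 0.614 1.0
```

The rankings here are illustrative, not measured; the point is that a single promoted high-relevance document moves the metric substantially.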


FAQ

When should I skip re-ranking and use only a bi-encoder?

Skip re-ranking when latency is critical (under 50ms), when your corpus is small enough that a flat exact search is already precise, or when queries are simple keyword lookups. Re-ranking shines on natural language questions and long-form queries where nuance matters.

Can I fine-tune a cross-encoder on my own data?

Yes, and it is one of the highest-impact improvements you can make. Collect query-document relevance pairs from click logs or manual annotations. Even 1,000-2,000 labeled pairs can significantly boost domain-specific precision. Use the sentence-transformers training API with CrossEncoder.fit().

How many candidates should the first stage retrieve for re-ranking?

Start with 50 candidates. Going beyond 100 rarely improves final results because relevant documents almost always appear in the top 50 of a decent bi-encoder. Profile your pipeline to find the sweet spot between recall and re-ranking latency.


#CrossEncoder #ReRanking #SemanticSearch #InformationRetrieval #NLP #AgenticAI #LearnAI #AIEngineering
