Learn Agentic AI

Re-Ranking Search Results with Cross-Encoders: Improving Retrieval Precision

Understand the difference between bi-encoders and cross-encoders, then build a re-ranking pipeline that dramatically improves search precision by scoring query-document pairs jointly rather than independently.

The Precision Problem in First-Stage Retrieval

Bi-encoder models (like sentence-transformers) embed queries and documents independently, then compare them with cosine similarity. This independence is what makes them fast — you can pre-compute document embeddings — but it also limits their accuracy. A bi-encoder cannot model fine-grained interactions between specific query terms and specific document phrases.

Cross-encoders solve this by processing the query and document together as a single input pair, allowing the transformer's attention layers to directly compare every query token against every document token. The result is significantly higher precision, at the cost of speed.
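The difference is visible in what each model actually receives as input. A bi-encoder runs two independent forward passes; a cross-encoder runs one pass over the joined pair, so every attention layer can relate query tokens to document tokens directly. A schematic sketch using BERT-style special tokens (in practice the tokenizer inserts these for you):

```python
query = "capital of france"
doc = "Paris is the capital and largest city of France."

# Bi-encoder: two separate inputs, two forward passes; the outputs are
# pooled into vectors, and only those vectors ever meet (via cosine/dot).
bi_inputs = [f"[CLS] {query} [SEP]", f"[CLS] {doc} [SEP]"]

# Cross-encoder: one input, one forward pass; attention sees query and
# document tokens side by side and the head outputs a relevance score.
cross_input = f"[CLS] {query} [SEP] {doc} [SEP]"

print(cross_input)
```

Nothing in the bi-encoder's forward pass for the document ever sees the query, which is exactly why it can be pre-computed, and exactly why it loses precision.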

Bi-Encoder vs Cross-Encoder

The key architectural difference:

flowchart LR
    subgraph BI["Bi-encoder"]
        Q1(["Query"]) --> EQ["Encoder"]
        D1(["Document"]) --> ED["Encoder"]
        EQ --> QV[("Query vector")]
        ED --> DV[("Doc vector")]
        QV --> SIM["Cosine similarity"]
        DV --> SIM
    end
    subgraph CE["Cross-encoder"]
        PAIR(["Query + Document<br/>one concatenated input"]) --> TR["Transformer<br/>full cross-attention"]
        TR --> SCORE["Relevance score"]
    end
    style SIM fill:#4f46e5,stroke:#4338ca,color:#fff
    style SCORE fill:#059669,stroke:#047857,color:#fff
  • Bi-encoder: Embeds query and document separately, compares with dot product. Fast (pre-compute docs), but lower precision.
  • Cross-encoder: Concatenates query + document, passes through transformer together, outputs a single relevance score. Slow (must run for each pair), but much higher precision.

The standard pattern is a two-stage pipeline: use a bi-encoder to retrieve the top 50-100 candidates quickly, then re-rank those candidates with a cross-encoder.
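The arithmetic behind the two-stage pattern is stark. Using a rough 4 ms/pair CPU figure for a small cross-encoder (an assumed number, consistent with the model comparison later in this article), scoring a million-document corpus directly takes minutes, while re-ranking 50 bi-encoder candidates takes milliseconds:

```python
# Why two stages: cost of cross-encoding everything vs. candidates only.
N_DOCS = 1_000_000
MS_PER_PAIR = 4.0   # assumed cross-encoder CPU latency per (query, doc) pair
CANDIDATES = 50

cross_only_ms = N_DOCS * MS_PER_PAIR      # cross-encoder over the full corpus
two_stage_ms = CANDIDATES * MS_PER_PAIR   # cross-encoder over candidates only
# (the bi-encoder stage adds a few ms; document embeddings are pre-computed)

print(f"full corpus: {cross_only_ms / 60_000:.1f} min, two-stage: {two_stage_ms:.0f} ms")
```

The bi-encoder buys recall cheaply; the cross-encoder spends its budget only where precision matters.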

Building the Re-Ranking Pipeline

from sentence_transformers import SentenceTransformer, CrossEncoder
import numpy as np
from typing import List, Dict, Tuple

class TwoStageSearchPipeline:
    def __init__(
        self,
        bi_encoder_name: str = "all-MiniLM-L6-v2",
        cross_encoder_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2",
    ):
        self.bi_encoder = SentenceTransformer(bi_encoder_name)
        self.cross_encoder = CrossEncoder(cross_encoder_name)
        self.doc_embeddings = None
        self.documents = []

    def index_documents(self, documents: List[Dict]):
        """Pre-compute bi-encoder embeddings for all documents."""
        self.documents = documents
        texts = [f"{d['title']}. {d['body']}" for d in documents]
        self.doc_embeddings = self.bi_encoder.encode(
            texts, normalize_embeddings=True, show_progress_bar=True
        )

    def first_stage_retrieve(
        self, query: str, top_k: int = 50
    ) -> List[Tuple[int, float]]:
        """Fast retrieval using bi-encoder similarity."""
        query_emb = self.bi_encoder.encode(
            [query], normalize_embeddings=True
        )
        scores = np.dot(self.doc_embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1][:top_k]
        return [(idx, scores[idx]) for idx in top_indices]

    def re_rank(
        self, query: str, candidates: List[Tuple[int, float]], top_k: int = 10
    ) -> List[Dict]:
        """Re-rank candidates using cross-encoder."""
        pairs = []
        for idx, _ in candidates:
            doc = self.documents[idx]
            text = f"{doc['title']}. {doc['body']}"
            pairs.append((query, text))

        # Cross-encoder scores all pairs jointly
        ce_scores = self.cross_encoder.predict(pairs)

        # Sort by cross-encoder score
        scored = list(zip(candidates, ce_scores))
        scored.sort(key=lambda x: x[1], reverse=True)

        results = []
        for (idx, bi_score), ce_score in scored[:top_k]:
            doc = self.documents[idx].copy()
            doc["bi_encoder_score"] = float(bi_score)
            doc["cross_encoder_score"] = float(ce_score)
            results.append(doc)
        return results

    def search(self, query: str, retrieve_k: int = 50, final_k: int = 10):
        candidates = self.first_stage_retrieve(query, top_k=retrieve_k)
        return self.re_rank(query, candidates, top_k=final_k)
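One detail of first_stage_retrieve worth calling out: because index_documents passes normalize_embeddings=True, the plain dot product is exactly cosine similarity. A toy check with hand-made vectors, no model involved:

```python
import numpy as np

# Three toy "document embeddings" and one "query embedding" (stand-ins
# for bi-encoder output; any non-zero vectors work).
docs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
query = np.array([[0.8, 0.6]])

# Normalize rows, as normalize_embeddings=True does.
docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
query = query / np.linalg.norm(query, axis=1, keepdims=True)

scores = np.dot(docs, query.T).flatten()   # dot product == cosine here
top = np.argsort(scores)[::-1]             # best match first
print(top.tolist())  # → [1, 0, 2]: doc 1 ([0.6, 0.8]) is closest to the query
```

For large corpora you would swap the exhaustive np.dot for an ANN index (FAISS, HNSW), but the scoring math is the same.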

Choosing the Right Cross-Encoder Model

Model selection depends on your latency budget:

# Model comparison (approximate, on CPU)
CROSS_ENCODER_MODELS = {
    # Model name: (params, ms/pair, nDCG@10 on MS MARCO)
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": ("4.4M", 1.5, 0.325),
    "cross-encoder/ms-marco-MiniLM-L-6-v2": ("22.7M", 4.0, 0.349),
    "cross-encoder/ms-marco-MiniLM-L-12-v2": ("33.4M", 8.0, 0.357),
    "cross-encoder/ms-marco-electra-base": ("109M", 12.0, 0.365),
}

def select_model(latency_budget_ms: float, num_candidates: int) -> str:
    """Select the best model that fits within the latency budget."""
    for name, (_params, ms_per_pair, _quality) in sorted(
        CROSS_ENCODER_MODELS.items(),
        key=lambda x: x[1][2],  # sort by nDCG@10
        reverse=True,  # prefer higher quality
    ):
        total_latency = ms_per_pair * num_candidates
        if total_latency <= latency_budget_ms:
            return name
    # Nothing fits: fall back to the smallest model even if it is over budget
    return "cross-encoder/ms-marco-TinyBERT-L-2-v2"
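Worked through by hand for a 300 ms budget over 50 candidates, using the approximate figures from the table above (inlined here so the snippet runs on its own): electra would need 600 ms, MiniLM-L-12 400 ms, so the highest-quality model that fits is MiniLM-L-6 at 200 ms.

```python
# Self-contained rerun of the selection logic with the table's figures.
MODELS = {  # name: (ms per pair, approx quality score)
    "cross-encoder/ms-marco-TinyBERT-L-2-v2": (1.5, 0.325),
    "cross-encoder/ms-marco-MiniLM-L-6-v2": (4.0, 0.349),
    "cross-encoder/ms-marco-MiniLM-L-12-v2": (8.0, 0.357),
    "cross-encoder/ms-marco-electra-base": (12.0, 0.365),
}

def pick(budget_ms: float, n_candidates: int) -> str:
    # Try models best-quality-first; take the first that fits the budget.
    for name, (ms_per_pair, _q) in sorted(
        MODELS.items(), key=lambda kv: kv[1][1], reverse=True
    ):
        if ms_per_pair * n_candidates <= budget_ms:
            return name
    return "cross-encoder/ms-marco-TinyBERT-L-2-v2"  # over budget regardless

print(pick(300, 50))  # → cross-encoder/ms-marco-MiniLM-L-6-v2 (200 ms total)
print(pick(100, 50))  # → cross-encoder/ms-marco-TinyBERT-L-2-v2 (75 ms total)
```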

Managing Latency

Cross-encoders are expensive. Re-ranking 100 candidates with a 12-layer model at 8ms per pair takes 800ms. Strategies to reduce this:

  1. Reduce candidate count — retrieve 30-50 instead of 100. Diminishing returns beyond the top 50.
  2. Use smaller models — TinyBERT at 1.5ms/pair re-ranks 50 candidates in 75ms.
  3. Batch on GPU — GPU batching drops per-pair time by 10x.
  4. Cache re-ranked results — popular queries hit the same documents repeatedly.

A simple score cache implementing strategy 4:

import hashlib

class CachedReRanker:
    def __init__(self, cross_encoder: CrossEncoder, cache_size: int = 1024):
        self.cross_encoder = cross_encoder
        self._cache = {}
        self.cache_size = cache_size

    def _cache_key(self, query: str, doc_text: str) -> str:
        combined = f"{query}|||{doc_text}"
        return hashlib.md5(combined.encode()).hexdigest()

    def predict(self, pairs: list) -> list:
        scores = []
        uncached_pairs = []
        uncached_indices = []
        for i, (query, doc) in enumerate(pairs):
            key = self._cache_key(query, doc)
            if key in self._cache:
                scores.append(self._cache[key])
            else:
                scores.append(None)
                uncached_pairs.append((query, doc))
                uncached_indices.append(i)

        if uncached_pairs:
            new_scores = self.cross_encoder.predict(uncached_pairs)
            for idx, score in zip(uncached_indices, new_scores):
                key = self._cache_key(*pairs[idx])
                if len(self._cache) >= self.cache_size:
                    # Honor cache_size: evict the oldest entry
                    # (Python dicts preserve insertion order)
                    self._cache.pop(next(iter(self._cache)))
                self._cache[key] = float(score)
                scores[idx] = float(score)

        return scores

Measuring the Impact

Re-ranking typically improves nDCG@10 by 15-30% over bi-encoder-only retrieval. The improvement is most pronounced for ambiguous or complex queries where surface-level similarity is misleading.
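To verify a claim like this on your own data, nDCG@10 takes only a few lines of numpy, given graded relevance labels in ranked order (a minimal sketch; production evaluations usually lean on a library such as pytrec_eval):

```python
import numpy as np

def ndcg_at_k(relevances: list, k: int = 10) -> float:
    """nDCG@k for graded relevance labels listed in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    # Ideal ordering: the same labels sorted best-first
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts[: ideal.size]))
    return dcg / idcg if idcg > 0 else 0.0

# Same five documents, labeled 0-3; only the ordering differs.
bi_ranking = [0, 2, 1, 0, 3]  # bi-encoder buried the best doc at rank 5
ce_ranking = [3, 2, 1, 0, 0]  # re-ranking moved it to the top

print(round(ndcg_at_k(bi_ranking), 3), ndcg_at_k(ce_ranking))  # → 0.614 1.0
```

The rankings here are illustrative, not measured; the point is that a single promoted high-relevance document moves the metric substantially.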


FAQ

When should I skip re-ranking and use only a bi-encoder?

Skip re-ranking when latency is critical (under 50ms), when your corpus is small enough that a flat exact search is already precise, or when queries are simple keyword lookups. Re-ranking shines on natural language questions and long-form queries where nuance matters.

Can I fine-tune a cross-encoder on my own data?

Yes, and it is one of the highest-impact improvements you can make. Collect query-document relevance pairs from click logs or manual annotations. Even 1,000-2,000 labeled pairs can significantly boost domain-specific precision. Use the sentence-transformers training API with CrossEncoder.fit().

How many candidates should the first stage retrieve for re-ranking?

Start with 50 candidates. Going beyond 100 rarely improves final results because relevant documents almost always appear in the top 50 of a decent bi-encoder. Profile your pipeline to find the sweet spot between recall and re-ranking latency.


#CrossEncoder #ReRanking #SemanticSearch #InformationRetrieval #NLP #AgenticAI #LearnAI #AIEngineering
