
Benchmarking Vector Databases: Latency, Throughput, and Recall at Scale

Learn how to rigorously benchmark vector databases with proper methodology — measuring latency, throughput, and recall under realistic conditions to make informed infrastructure decisions.

Why Benchmark Your Own Workload

Vendor benchmarks are marketing. They show optimal configurations on favorable datasets under ideal conditions. Your application has specific embedding dimensions, query patterns, filter complexity, and concurrency levels that no generic benchmark captures.

The only benchmark that matters is one that simulates your actual workload. This guide covers the methodology, metrics, and tooling to run rigorous vector database benchmarks that inform real infrastructure decisions.

The Three Metrics That Matter

1. Recall at K — What fraction of the true K nearest neighbors does the system return? Recall of 0.95 at K=10 means that, on average, 9.5 of the 10 true nearest neighbors appear in the results.

flowchart TD
    DOC(["Document"])
    CHUNK["Chunker<br/>recursive plus overlap"]
    EMB["Embedding model"]
    META["Attach metadata<br/>source, page, tenant"]
    INDEX[("HNSW or IVF index<br/>in vector store")]
    Q(["Query"])
    QEMB["Embed query"]
    SEARCH["ANN search<br/>cosine similarity"]
    FILTER["Metadata filter<br/>tenant or date"]
    HITS(["Top-k chunks"])
    DOC --> CHUNK --> EMB --> META --> INDEX
    Q --> QEMB --> SEARCH
    INDEX --> SEARCH --> FILTER --> HITS
    style INDEX fill:#4f46e5,stroke:#4338ca,color:#fff
    style HITS fill:#059669,stroke:#047857,color:#fff

2. Query Latency — How long does a single query take? Measure P50, P95, and P99 — averages hide tail latency that affects user experience.
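A quick NumPy sketch (the latency numbers are hypothetical) shows why averages hide the tail:

```python
import numpy as np

# 980 fast queries at 5 ms plus 20 slow outliers at 500 ms:
# the mean looks acceptable while the tail is terrible.
latencies = np.concatenate([np.full(980, 5.0), np.full(20, 500.0)])

mean_ms = latencies.mean()             # ~14.9 ms -- looks fine
p50_ms = np.percentile(latencies, 50)  # 5.0 ms
p99_ms = np.percentile(latencies, 99)  # 500.0 ms -- 1 in 100 users waits 100x longer
```

A dashboard showing only the mean would report 15 ms while one percent of users wait half a second.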

3. Queries Per Second (QPS) — How many concurrent queries can the system handle before latency degrades? This determines how many users your system can serve.

These three metrics are in tension. Higher recall requires searching more candidates, which increases latency and reduces throughput. Every index configuration is a point on this three-way tradeoff surface.


Building a Benchmark Suite

Start with a reproducible benchmark framework:

import time
import numpy as np
from dataclasses import dataclass, field

@dataclass
class BenchmarkResult:
    recall_at_k: float
    latencies_ms: list[float] = field(default_factory=list)

    @property
    def p50_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 50))

    @property
    def p95_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 95))

    @property
    def p99_ms(self) -> float:
        return float(np.percentile(self.latencies_ms, 99))

    @property
    def qps(self) -> float:
        # Sequential throughput: queries divided by total measured query time.
        total_seconds = sum(self.latencies_ms) / 1000.0
        return len(self.latencies_ms) / total_seconds if total_seconds > 0 else 0.0

Computing Ground Truth

To measure recall, you need exact nearest neighbors as ground truth. Generate these with brute-force search. Note that this example uses L2 distance; if your database ranks by cosine similarity (as the pgvector query below does with <=>), normalize your vectors first so the two rankings agree, or use faiss.IndexFlatIP on normalized vectors:

import faiss

def compute_ground_truth(
    vectors: np.ndarray,
    queries: np.ndarray,
    k: int = 10
) -> np.ndarray:
    """Compute exact nearest neighbors using brute-force search."""
    dimension = vectors.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(vectors)
    distances, indices = index.search(queries, k)
    return indices  # shape: (num_queries, k)
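If FAISS is not available, the same ground truth can be computed in pure NumPy. A dependency-free sketch — memory scales as queries × vectors × dimensions, so chunk the queries for large sets:

```python
import numpy as np

def ground_truth_numpy(vectors: np.ndarray, queries: np.ndarray, k: int = 10) -> np.ndarray:
    """Exact k-NN by brute force; fine for modest corpus sizes."""
    # Squared L2 distance between every query and every corpus vector.
    d2 = ((queries[:, None, :] - vectors[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k smallest distances per query, nearest first.
    return np.argsort(d2, axis=1)[:, :k]
```

For production-scale ground truth, chunk the query set and process a few hundred queries at a time.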

Measuring Recall

Compare ANN results against ground truth:

def compute_recall(
    ann_results: list[list[int]],
    ground_truth: np.ndarray,
    k: int = 10
) -> float:
    """Compute recall@k: fraction of true neighbors found."""
    total_recall = 0.0
    for i, ann_ids in enumerate(ann_results):
        true_ids = set(ground_truth[i][:k])
        found = len(set(ann_ids[:k]) & true_ids)
        total_recall += found / k
    return total_recall / len(ann_results)
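A quick sanity check of the recall@k arithmetic by hand, using made-up ids:

```python
# One query at k=3: the ANN index returned ids [0, 2, 5],
# while the true nearest neighbors are [0, 1, 2].
ground_truth = [[0, 1, 2]]
ann_results = [[0, 2, 5]]
k = 3

# 2 of the 3 true neighbors were found (ids 0 and 2).
found = len(set(ann_results[0][:k]) & set(ground_truth[0][:k]))
recall = found / k  # 0.666...
```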

Benchmarking pgvector

import psycopg
from pgvector.psycopg import register_vector

def benchmark_pgvector(
    conn,
    queries: np.ndarray,
    ground_truth: np.ndarray,
    k: int = 10,
    ef_search: int = 40
) -> BenchmarkResult:
    register_vector(conn)
    conn.execute(f"SET hnsw.ef_search = {ef_search}")

    latencies = []
    all_results = []

    for query_vec in queries:
        start = time.perf_counter()
        rows = conn.execute(
            "SELECT id FROM documents ORDER BY embedding <=> %s LIMIT %s",
            (query_vec, k)  # register_vector adapts NumPy arrays to the vector type
        ).fetchall()
        elapsed_ms = (time.perf_counter() - start) * 1000

        latencies.append(elapsed_ms)
        all_results.append([row[0] for row in rows])

    recall = compute_recall(all_results, ground_truth, k)
    return BenchmarkResult(recall_at_k=recall, latencies_ms=latencies)
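This benchmark assumes an HNSW index already exists on documents.embedding. A hypothetical helper that generates the pgvector DDL — the table and column names are placeholders, and m and ef_construction are the build-time knobs (higher values mean better recall but slower, larger index builds):

```python
def hnsw_index_ddl(
    table: str = "documents",
    column: str = "embedding",
    m: int = 16,
    ef_construction: int = 64,
) -> str:
    """Build pgvector HNSW DDL for cosine distance (matches the <=> operator)."""
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction})"
    )
```

Execute the resulting statement once with conn.execute() before running the benchmark; rebuild the index when you change m or ef_construction, since those are fixed at build time (unlike ef_search, which is per-session).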

Benchmarking Pinecone

from pinecone import Pinecone

def benchmark_pinecone(
    index,
    queries: np.ndarray,
    ground_truth: np.ndarray,
    k: int = 10
) -> BenchmarkResult:
    latencies = []
    all_results = []

    for query_vec in queries:
        start = time.perf_counter()
        response = index.query(
            vector=query_vec.tolist(),
            top_k=k
        )
        elapsed_ms = (time.perf_counter() - start) * 1000

        latencies.append(elapsed_ms)
        # Pinecone stores ids as strings; cast back to int to match ground truth.
        result_ids = [int(m["id"]) for m in response["matches"]]
        all_results.append(result_ids)

    recall = compute_recall(all_results, ground_truth, k)
    return BenchmarkResult(recall_at_k=recall, latencies_ms=latencies)

Concurrent Load Testing

Single-query latency tells only part of the story. Test under concurrent load to find throughput limits:

import concurrent.futures

def concurrent_benchmark(
    search_fn,
    queries: np.ndarray,
    concurrency: int = 10
) -> dict:
    latencies = []

    def run_query(query_vec):
        start = time.perf_counter()
        search_fn(query_vec)
        return (time.perf_counter() - start) * 1000

    start_all = time.perf_counter()

    with concurrent.futures.ThreadPoolExecutor(
        max_workers=concurrency
    ) as executor:
        futures = [
            executor.submit(run_query, q)
            for q in queries
        ]
        for future in concurrent.futures.as_completed(futures):
            latencies.append(future.result())

    total_time = time.perf_counter() - start_all
    return {
        "concurrency": concurrency,
        "total_queries": len(queries),
        "total_time_s": total_time,
        "qps": len(queries) / total_time,
        "p50_ms": float(np.percentile(latencies, 50)),
        "p95_ms": float(np.percentile(latencies, 95)),
        "p99_ms": float(np.percentile(latencies, 99)),
    }
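To see where throughput saturates, sweep concurrency levels. A self-contained sketch using time.sleep as a stand-in backend (the 10 ms latency is hypothetical):

```python
import time
import concurrent.futures

def simulated_search(_query, latency_s: float = 0.01) -> None:
    time.sleep(latency_s)  # stand-in for a real vector query

def qps_at(concurrency: int, n_queries: int = 50) -> float:
    """Measure throughput at a given concurrency level."""
    start = time.perf_counter()
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as ex:
        list(ex.map(simulated_search, range(n_queries)))
    return n_queries / (time.perf_counter() - start)

for c in [1, 2, 4, 8]:
    print(f"concurrency={c}: {qps_at(c):.0f} QPS")
```

Against a real database, QPS climbs with concurrency until the server's cores or connection pool saturate, after which latency degrades instead — that knee is your capacity limit.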

Running a Sweep

Test multiple configurations to find the optimal recall-latency tradeoff:

def parameter_sweep_pgvector(conn, queries, ground_truth):
    results = []
    for ef_search in [10, 20, 40, 80, 160, 320]:
        result = benchmark_pgvector(
            conn, queries, ground_truth,
            k=10, ef_search=ef_search
        )
        results.append({
            "ef_search": ef_search,
            "recall": result.recall_at_k,
            "p50_ms": result.p50_ms,
            "p95_ms": result.p95_ms,
            "qps": result.qps,
        })
        print(
            f"ef_search={ef_search}: "
            f"recall={result.recall_at_k:.3f}, "
            f"p50={result.p50_ms:.1f}ms, "
            f"p95={result.p95_ms:.1f}ms"
        )
    return results

Benchmarking Best Practices

Use realistic data. Random vectors behave differently from real embeddings. Use a subset of your actual production embeddings or a standard dataset from ANN-Benchmarks (sift-128-euclidean, gist-960-euclidean, or deep-image-96-angular).

Warm up before measuring. Run 100-200 throwaway queries to fill caches and warm JIT-compiled code paths. Only measure after warmup.
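A small wrapper makes the warmup discipline hard to forget — a sketch that discards the first N queries before timing:

```python
import time

def measure_with_warmup(search_fn, queries, warmup: int = 200) -> list[float]:
    """Run `warmup` throwaway queries, then time the rest in milliseconds."""
    for q in queries[:warmup]:
        search_fn(q)  # throwaway: fills caches and warms code paths
    latencies = []
    for q in queries[warmup:]:
        start = time.perf_counter()
        search_fn(q)
        latencies.append((time.perf_counter() - start) * 1000)
    return latencies
```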

Test with filters. If your application uses metadata filtering, include filters in your benchmark. Filtered search performance can differ dramatically from unfiltered.


Measure at your target scale. Performance at 100K vectors does not predict performance at 10M vectors. Load your benchmark with the volume you expect in production.

Run multiple trials. Network variability (especially for cloud databases) can skew individual measurements. Run each configuration 3-5 times and report the median.
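Taking the median over trials is a one-liner worth standardizing — a sketch where run_trial is any callable returning the metric of interest (for example, one full P95 measurement):

```python
import statistics

def median_over_trials(run_trial, n_trials: int = 5) -> float:
    """Run one configuration several times; report the median to damp outliers."""
    return statistics.median(run_trial() for _ in range(n_trials))
```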

Real-World Performance Expectations

Based on publicly available benchmarks and community reports for 1M vectors at 1536 dimensions with HNSW:

| Database | P50 Latency | Recall@10 | QPS (single client) |
| --- | --- | --- | --- |
| pgvector (PostgreSQL 16) | 3-8 ms | 0.95-0.99 | 200-500 |
| Pinecone (serverless) | 10-30 ms | 0.95+ | 100-300 |
| Weaviate (self-hosted) | 2-5 ms | 0.95-0.99 | 300-800 |
| Chroma (self-hosted) | 5-15 ms | 0.95+ | 100-400 |

These numbers vary significantly based on hardware, index configuration, and query complexity. Always benchmark your own workload.

FAQ

How many queries should I run to get statistically meaningful benchmark results?

At minimum, run 1,000 queries per configuration. For latency percentiles (P95, P99), you need at least 10,000 queries to get stable measurements. Use different query vectors for each run — repeating the same queries can bias results due to caching effects.

Should I benchmark with or without metadata filters?

Both. Run a baseline without filters to understand raw vector search performance, then add filters that match your production query patterns. The performance gap between filtered and unfiltered search reveals how much overhead your filter strategy adds, which helps you design better metadata schemas.

How do I compare self-hosted vs managed vector databases fairly?

Match the compute resources. If your self-hosted pgvector runs on a 4-core, 16GB machine, compare it against a similarly sized managed instance, not the vendor's top-tier offering. Also account for operational costs — the managed service includes monitoring, backups, and scaling that you would need to build yourself.

