pgvector HNSW Index Tuning at Scale: m, ef_construction, ef_search (2026)
A measured guide to tuning pgvector HNSW indexes for AI agent workloads — what m, ef_construction, and ef_search actually do, how to size them at 1M, 10M, and 50M rows, and how to monitor recall in production.
TL;DR — Default HNSW params (m = 16, ef_construction = 64, ef_search = 40) are optimized for 100k-row demos, not 10M-row production. Bumping ef_construction to 200 and ef_search to 100–200 typically lifts recall@10 from ~0.85 to ~0.97 at a manageable latency cost.
What you'll build
A reproducible benchmark loop that measures recall and p95 latency across HNSW parameter sets, plus a production tuning playbook for 1M, 10M, and 50M-row pgvector tables.
Schema
CREATE TABLE rag_chunks (
id BIGSERIAL PRIMARY KEY,
doc_id UUID NOT NULL,
chunk_text TEXT NOT NULL,
embedding vector(1536) NOT NULL
);
-- Build index AFTER bulk load
CREATE INDEX rag_chunks_hnsw ON rag_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 200);
Architecture
flowchart TD
LOAD[Bulk load 10M chunks] --> IDX[Build HNSW with m=32, ef_construction=200]
IDX --> BENCH[Benchmark loop]
BENCH --> RECALL[Measure recall@10]
BENCH --> P95[Measure p95 latency]
RECALL --> TUNE{Recall > 0.95?}
P95 --> TUNE
TUNE -->|No| EFUP[Raise ef_search]
TUNE -->|Yes| SHIP[Ship config]
Step 1 — Understand the three knobs
- m — neighbors per node. Default 16. Higher m means better recall, a larger index, and a slower build. For 10M+ vectors, set m = 24–32.
- ef_construction — candidate list size during the build. Default 64. Production: 128–200. Affects build time and graph quality, not per-query cost.
- ef_search — candidate list size during a query. Default 40. Production: 80–200. This is the linear knob: latency vs. recall.
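Before choosing m, it helps to sanity-check the memory footprint. A minimal sketch, assuming float4 vectors and folding graph edges and page overhead into a fixed 1.5–2x factor (the ballpark observed later in this guide, not pgvector's exact on-disk layout):

```python
def estimate_hnsw_size_gb(rows: int, dims: int, overhead: float = 1.75) -> float:
    """Ballpark HNSW index size in GiB.

    Assumes 4-byte float components; `overhead` approximates graph edges
    plus page overhead (observed ~1.5-2x raw vector size at m=32).
    Treat this as a capacity-planning estimate, not an exact figure.
    """
    raw_bytes = rows * dims * 4  # raw vector payload only
    return raw_bytes * overhead / 1024**3

# 10M x 1536-dim chunks at the default overhead factor:
print(f"{estimate_hnsw_size_gb(10_000_000, 1536):.0f} GB")  # ≈ 100 GB
```

If the estimate does not fit in RAM, expect query latency to degrade sharply as graph traversal hits disk.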
Step 2 — Build with parallel workers
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 7;
CREATE INDEX CONCURRENTLY rag_chunks_hnsw ON rag_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 200);
pgvector 0.7+ supports parallel HNSW builds — 4-8x faster on 8-core machines.
Step 3 — Generate a recall ground truth
import time

import numpy as np
import psycopg

conn = psycopg.connect(...)

def brute_force_topk(q: list[float], k: int = 10) -> list[int]:
    """Exact top-k via sequential scan -- the recall ground truth."""
    with conn.cursor() as cur:
        # Force a sequential scan so the HNSW index is bypassed.
        cur.execute("SET LOCAL enable_indexscan = off")
        cur.execute(
            """
            SELECT id FROM rag_chunks
            ORDER BY embedding <=> %s::vector LIMIT %s
            """,
            # Pass the vector in its text form ('[0.1, 0.2, ...]') so the
            # ::vector cast works without registering a pgvector adapter.
            (str(q), k),
        )
        return [r[0] for r in cur.fetchall()]
Run brute-force on 200 sampled queries, store as ground truth.
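With the ground truth stored, recall@k is just set overlap between what the index returned and what the exact scan returned. A small helper (the function name is ours, not pgvector's):

```python
def recall_at_k(retrieved: list[int], ground_truth: list[int], k: int = 10) -> float:
    """Fraction of the true top-k neighbors that the index actually returned."""
    return len(set(retrieved[:k]) & set(ground_truth[:k])) / k

# The index found 9 of the 10 true neighbors:
print(recall_at_k(list(range(9)) + [99], list(range(10))))  # 0.9
```

Averaging this over the 200 sampled queries gives the recall numbers used in the sweep below.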
Step 4 — Sweep ef_search
def hnsw_topk(q: list[float], k: int = 10, ef: int = 100) -> list[int]:
    with conn.cursor() as cur:
        # SET doesn't accept bind parameters; ef is an int we control,
        # so f-string interpolation is safe here.
        cur.execute(f"SET LOCAL hnsw.ef_search = {ef}")
        cur.execute(
            "SELECT id FROM rag_chunks ORDER BY embedding <=> %s::vector LIMIT %s",
            (str(q), k),
        )
        return [r[0] for r in cur.fetchall()]

# samples: list of (query_vector, ground_truth_ids) pairs from Step 3
for ef in [40, 80, 120, 160, 200, 300]:
    hits, lat = [], []
    for q, gt in samples:
        t0 = time.perf_counter()
        ids = hnsw_topk(q, ef=ef)
        lat.append(time.perf_counter() - t0)
        hits.append(len(set(ids) & set(gt)) / 10)
    print(f"ef={ef} recall={np.mean(hits):.3f} p95={np.percentile(lat, 95)*1000:.1f}ms")
Step 5 — Read the curve, pick a point
Typical 10M-row result on a 16-vCPU Postgres:
| ef_search | recall@10 | p95 latency |
|---|---|---|
| 40 | 0.86 | 8 ms |
| 100 | 0.94 | 14 ms |
| 200 | 0.98 | 26 ms |
| 400 | 0.99 | 51 ms |
For an agent that hits memory once per turn, 200 is the sweet spot.
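Picking the operating point can be automated instead of eyeballed. A sketch, assuming the sweep is collected as (ef, recall, p95_ms) tuples like the table above:

```python
def pick_ef(sweep, min_recall: float = 0.95, max_p95_ms: float = 50.0):
    """Return the smallest ef_search meeting both targets, or None."""
    for ef, recall, p95_ms in sorted(sweep):
        if recall >= min_recall and p95_ms <= max_p95_ms:
            return ef
    return None

# The 10M-row numbers from the table above:
sweep = [(40, 0.86, 8), (100, 0.94, 14), (200, 0.98, 26), (400, 0.99, 51)]
print(pick_ef(sweep))  # 200
```

Re-run the sweep and this selection after large imports, since the recall curve shifts as the graph grows.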
Step 6 — Production monitoring
SELECT relname, idx_scan, idx_tup_read, idx_tup_fetch,
pg_size_pretty(pg_relation_size(indexrelid)) AS idx_size
FROM pg_stat_user_indexes
WHERE indexrelname = 'rag_chunks_hnsw';
Track index size weekly — HNSW grows ~1.5–2x the raw vector size at m=32.
Pitfalls
- Building before load — wastes hours and produces worse graphs. Always load first, then index.
- maintenance_work_mem too small — the index spills to disk and the build slows 10x. Set it to 25-50% of RAM for the build.
- Filtering on un-indexed columns — WHERE tenant_id = $1 ORDER BY embedding <=> $2 is post-filtered. Use a partial HNSW index or pgvectorscale's StreamingDiskANN.
- Ignoring write amplification — every UPDATE to embedding rebuilds graph edges. Batch updates.
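For the post-filtering pitfall, one option is a partial HNSW index per high-traffic tenant. A hedged sketch that only generates the DDL: the tenant_id column is hypothetical (the schema above doesn't include one), per-tenant indexes only scale to a modest tenant count, and the planner uses a partial index only when the query's literal predicate matches.

```python
def partial_hnsw_ddl(tenant_id: int, m: int = 24, ef_construction: int = 128) -> str:
    """DDL for a tenant-scoped partial HNSW index (hypothetical tenant_id column)."""
    return (
        f"CREATE INDEX rag_chunks_hnsw_t{tenant_id} ON rag_chunks "
        f"USING hnsw (embedding vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction}) "
        f"WHERE tenant_id = {tenant_id}"
    )

print(partial_hnsw_ddl(42))
```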
CallSphere production note
CallSphere's RAG layer indexes 8M+ chunks across 115+ DB tables with m=24, ef_construction=128, ef_search=160. Healthcare and Behavioral Health verticals run on a HIPAA-isolated healthcare_voice Prisma schema; OneRoof uses RLS-scoped HNSW indexes per landlord; UrackIT keeps its non-HIPAA RAG on Supabase + ChromaDB. 37 agents · 90+ tools · 6 verticals. Plans: $149 / $499 / $1,499, 14-day trial, 22% affiliate.
Still reading? Stop comparing — try CallSphere live.
CallSphere ships complete AI voice agents per industry — 14 tools for healthcare, 10 agents for real estate, 4 specialists for salons. See how it actually handles a call before you book a demo.
FAQ
Q: Does SET hnsw.ef_search need to be SESSION-scoped?
SET LOCAL inside the transaction is safest — avoids leaking to pooled connections.
Q: When is IVFFlat actually better than HNSW?
On memory-constrained boxes (<8 GB) and at >100M vectors with low QPS.
Q: Should I rebuild the index after bulk imports?
Only if you imported >20% of total rows. HNSW handles incremental inserts well.
Q: Can I use halfvec to halve memory?
Yes — pgvector 0.7+ ships halfvec(n). The recall drop is usually <1%, the memory saving 50%.
Q: What about pgvectorscale?
StreamingDiskANN beats HNSW past ~50M vectors. Worth evaluating if you outgrow pgvector.
Try CallSphere AI Voice Agents
See how AI voice agents work for your industry. Live demo available -- no signup required.