
Semantic Search for AI Agents: Embedding Models, Chunking Strategies, and Retrieval Optimization

Comprehensive guide to semantic search for AI agents covering embedding model selection, document chunking strategies, and retrieval optimization techniques for production systems.

Semantic Search Is the Foundation of Agent Intelligence

Every AI agent that accesses external knowledge relies on semantic search. When an agent needs to find relevant context — whether from a company knowledge base, product documentation, or historical conversation logs — it translates the query into a vector, searches for similar vectors, and retrieves the matching content. The quality of this retrieval directly determines the quality of the agent's response.

Three technical decisions control retrieval quality: the embedding model that converts text to vectors, the chunking strategy that splits documents into searchable units, and the retrieval pipeline that finds and ranks results. Getting any one of these wrong degrades the entire system. This guide provides the technical depth needed to make each decision correctly.

Embedding Model Selection

Embedding models are the neural networks that convert text into fixed-dimensional vectors. The choice of model affects semantic accuracy, supported languages, vector dimensionality (which affects storage cost and search speed), and maximum input length.
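To make the storage and speed trade-off concrete, here is a back-of-the-envelope calculation (the 1-million-chunk corpus is a hypothetical example):

```python
def index_size_gb(num_chunks: int, dimensions: int) -> float:
    """Approximate raw vector storage for float32 embeddings
    (4 bytes per dimension), excluding index overhead."""
    return num_chunks * dimensions * 4 / 1e9

# A hypothetical corpus of 1 million chunks at common dimensionalities:
for dims in (3072, 1536, 1024, 256):
    print(f"{dims:>5} dims -> {index_size_gb(1_000_000, dims):.2f} GB")
# 3072 dims -> ~12.3 GB of raw vectors; 256 dims -> ~1 GB
```

Brute-force search cost scales the same way, so halving dimensions roughly halves both storage and per-query compute.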


Leading Models in 2026

OpenAI text-embedding-3-large (3072 dimensions, 8191 token max input). The current quality leader for English text. Supports dimension reduction via the dimensions parameter — you can request 1536 or even 256 dimensions for faster search with a modest quality drop. Pricing: $0.13 per million tokens.
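OpenAI documents that the shortened embeddings returned by the dimensions parameter match truncating the full vector and re-normalizing, so you can also shorten already-stored embeddings after the fact; a numpy sketch:

```python
import numpy as np

def shorten_embedding(vec, target_dims: int) -> np.ndarray:
    """Truncate an embedding to target_dims and re-normalize to unit
    length; per OpenAI's docs this is equivalent to requesting
    dimensions=target_dims from the text-embedding-3 models."""
    v = np.asarray(vec[:target_dims], dtype=np.float32)
    return v / np.linalg.norm(v)

full = np.random.default_rng(0).normal(size=3072)
short = shorten_embedding(full, 256)
print(short.shape)  # (256,)
```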

Cohere embed-v4 (1024 dimensions, 512 token max input). Excels at multilingual retrieval and has a unique search-document / search-query input type parameter that optimizes embeddings for asymmetric search. Best price-performance ratio for multilingual use cases.

Voyage AI voyage-3 (1024 dimensions, 16000 token max input). The long-context specialist. If your documents are long and you want to embed large chunks without splitting, Voyage is the strongest option. Also supports code embedding with a dedicated code model.


BGE-M3 (open source, 1024 dimensions, 8192 token max input). The best self-hosted option. Supports dense, sparse, and multi-vector retrieval in a single model. Run it on your own GPU with no API dependency.

from openai import OpenAI
import cohere

class EmbeddingService:
    """Unified interface for multiple embedding providers."""

    def __init__(self, provider: str = "openai"):
        self.provider = provider
        if provider == "openai":
            self.client = OpenAI()
            self.model = "text-embedding-3-large"
            self.dimensions = 3072
        elif provider == "cohere":
            self.client = cohere.Client()
            self.model = "embed-v4"
            self.dimensions = 1024
        else:
            raise ValueError(f"Unsupported provider: {provider}")

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        if self.provider == "openai":
            response = self.client.embeddings.create(
                input=texts,
                model=self.model,
                dimensions=self.dimensions,
            )
            return [item.embedding for item in response.data]

        elif self.provider == "cohere":
            response = self.client.embed(
                texts=texts,
                model=self.model,
                input_type="search_document",
            )
            return response.embeddings

    def embed_query(self, text: str) -> list[float]:
        if self.provider == "openai":
            response = self.client.embeddings.create(
                input=[text],
                model=self.model,
                dimensions=self.dimensions,
            )
            return response.data[0].embedding

        elif self.provider == "cohere":
            response = self.client.embed(
                texts=[text],
                model=self.model,
                input_type="search_query",
            )
            return response.embeddings[0]

How to Benchmark for Your Domain

Do not trust generic benchmarks like MTEB. Embedding model performance varies dramatically by domain. A model that ranks first on general web text may rank third on legal documents or medical notes. Build a domain-specific evaluation set.

import numpy as np
from dataclasses import dataclass

@dataclass
class RetrievalTestCase:
    query: str
    relevant_doc_ids: list[str]

def evaluate_retrieval(
    embedding_service: EmbeddingService,
    test_cases: list[RetrievalTestCase],
    documents: dict[str, str],
    k: int = 5,
) -> dict:
    # Embed all documents
    doc_ids = list(documents.keys())
    doc_texts = list(documents.values())
    doc_embeddings = embedding_service.embed_documents(doc_texts)

    doc_matrix = np.array(doc_embeddings)
    doc_norms = np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    doc_matrix_normed = doc_matrix / doc_norms

    recall_at_k = []
    mrr_scores = []

    for tc in test_cases:
        query_vec = np.array(embedding_service.embed_query(tc.query))
        query_normed = query_vec / np.linalg.norm(query_vec)

        scores = doc_matrix_normed @ query_normed
        top_k_indices = np.argsort(scores)[-k:][::-1]
        top_k_ids = [doc_ids[i] for i in top_k_indices]

        # Recall@k
        relevant_found = len(
            set(top_k_ids) & set(tc.relevant_doc_ids)
        )
        recall_at_k.append(relevant_found / len(tc.relevant_doc_ids))

        # MRR
        for rank, doc_id in enumerate(top_k_ids, 1):
            if doc_id in tc.relevant_doc_ids:
                mrr_scores.append(1.0 / rank)
                break
        else:
            mrr_scores.append(0.0)

    return {
        "recall_at_k": np.mean(recall_at_k),
        "mrr": np.mean(mrr_scores),
    }
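To sanity-check the two metrics, here is a hand-computed example with hypothetical document IDs:

```python
def recall_at_k(top_k_ids: list[str], relevant_ids: list[str]) -> float:
    """Fraction of relevant docs that appear in the top-k results."""
    return len(set(top_k_ids) & set(relevant_ids)) / len(relevant_ids)

def reciprocal_rank(top_k_ids: list[str], relevant_ids: list[str]) -> float:
    """1/rank of the first relevant doc, or 0.0 if none retrieved."""
    for rank, doc_id in enumerate(top_k_ids, 1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Relevant docs are "a" and "b"; the system retrieved this top 5:
top5 = ["x", "a", "y", "b", "z"]
print(recall_at_k(top5, ["a", "b"]))      # 1.0: both found in the top 5
print(reciprocal_rank(top5, ["a", "b"]))  # 0.5: first hit at rank 2
```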

Chunking Strategies

Chunking is how you split documents into searchable units. Get it wrong and your retrieval system either finds irrelevant fragments (chunks too small) or buries the answer in noise (chunks too large). There is no universal best chunk size — it depends on your document types, query patterns, and embedding model.

Fixed-Size Chunking with Overlap

The simplest strategy: split text into fixed-size chunks with a fixed overlap. Overlap ensures that information at chunk boundaries is not lost. Note that the splitter below measures size in characters (length_function=len); pass a token-counting function to measure in tokens instead.

from langchain.text_splitter import RecursiveCharacterTextSplitter

def fixed_size_chunking(
    text: str, chunk_size: int = 512, chunk_overlap: int = 50
) -> list[str]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )
    return splitter.split_text(text)

Good defaults: 400-600 characters for Q&A retrieval, 800-1200 characters for summarization retrieval. Overlap should be 10-15% of chunk size.
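For intuition, the same idea can be sketched without any library as a sliding character window. This is a simplification of the recursive splitter above, which additionally prefers paragraph and sentence boundaries:

```python
def window_chunks(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Slide a fixed window over the text, stepping size - overlap
    characters so that neighboring chunks share `overlap` characters."""
    step = size - overlap
    return [
        text[i : i + size]
        for i in range(0, max(len(text) - overlap, 1), step)
    ]

chunks = window_chunks("x" * 1000, size=400, overlap=40)
print([len(c) for c in chunks])  # [400, 400, 280]
```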

Semantic Chunking

Instead of splitting at arbitrary token boundaries, semantic chunking splits where the topic changes. It measures embedding similarity between consecutive sentences and splits where similarity drops below a threshold.

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

def semantic_chunking(text: str) -> list[str]:
    embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
    chunker = SemanticChunker(
        embeddings,
        breakpoint_threshold_type="percentile",
        breakpoint_threshold_amount=85,
    )
    docs = chunker.create_documents([text])
    return [doc.page_content for doc in docs]

Semantic chunking produces chunks of variable size that align with topic boundaries. This improves retrieval precision because each chunk is topically coherent — you rarely get a chunk that starts talking about one thing and ends talking about another.
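The breakpoint logic itself is simple to sketch. Below, a stand-in embedding lookup replaces the per-sentence model call a real implementation would make:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_on_similarity_drop(sentences, embed, threshold: float = 0.5):
    """Group consecutive sentences; start a new chunk whenever the
    similarity between neighbors falls below the threshold."""
    vecs = [np.asarray(embed(s), dtype=float) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, cur, sent in zip(vecs, vecs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks

# Stand-in embedder: topic-A sentences map near [1, 0], topic-B near [0, 1].
fake = {"A1": [1, 0.1], "A2": [0.9, 0.2], "B1": [0.1, 1], "B2": [0.2, 0.9]}
result = split_on_similarity_drop(["A1", "A2", "B1", "B2"], fake.__getitem__)
print(result)  # splits exactly at the A -> B topic shift
```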

Hierarchical Chunking

For long documents, use a two-level hierarchy: large parent chunks (1500-2000 tokens) contain small child chunks (300-500 tokens). Search is performed against child chunks for precision, but the parent chunk is returned for context. This gives you the best of both worlds.


from dataclasses import dataclass

@dataclass
class HierarchicalChunk:
    parent_id: str
    child_id: str
    parent_content: str
    child_content: str

def hierarchical_chunking(
    text: str,
    parent_size: int = 1500,
    child_size: int = 400,
    child_overlap: int = 50,
) -> list[HierarchicalChunk]:
    # Split into parent chunks
    parent_splitter = RecursiveCharacterTextSplitter(
        chunk_size=parent_size, chunk_overlap=0
    )
    parents = parent_splitter.split_text(text)

    # Split each parent into children
    child_splitter = RecursiveCharacterTextSplitter(
        chunk_size=child_size, chunk_overlap=child_overlap
    )

    chunks = []
    for p_idx, parent in enumerate(parents):
        children = child_splitter.split_text(parent)
        for c_idx, child in enumerate(children):
            chunks.append(
                HierarchicalChunk(
                    parent_id=f"parent-{p_idx}",
                    child_id=f"parent-{p_idx}-child-{c_idx}",
                    parent_content=parent,
                    child_content=child,
                )
            )
    return chunks
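The ingestion code above needs a query-time counterpart: search the child chunks, then deduplicate up to their parents. A sketch with toy similarity scores standing in for real vector search:

```python
from dataclasses import dataclass

@dataclass
class Chunk:  # mirrors HierarchicalChunk from the article
    parent_id: str
    child_id: str
    parent_content: str
    child_content: str

def parents_for_query(query_scores: dict, chunks: list, top_k: int = 2) -> list[str]:
    """Rank children by similarity score (stand-in for vector search),
    then return each hit's parent content, deduplicated in rank order."""
    ranked = sorted(
        chunks, key=lambda c: query_scores.get(c.child_id, 0.0), reverse=True
    )
    seen, parents = set(), []
    for c in ranked:
        if c.parent_id not in seen:
            seen.add(c.parent_id)
            parents.append(c.parent_content)
        if len(parents) == top_k:
            break
    return parents

chunks = [
    Chunk("p0", "p0-c0", "PARENT-0", "..."),
    Chunk("p0", "p0-c1", "PARENT-0", "..."),
    Chunk("p1", "p1-c0", "PARENT-1", "..."),
]
scores = {"p0-c1": 0.9, "p0-c0": 0.8, "p1-c0": 0.7}
print(parents_for_query(scores, chunks))  # two p0 children collapse to one parent
```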

Retrieval Optimization Techniques

Contextual Retrieval

Anthropic's contextual retrieval technique prepends a short context summary to each chunk before embedding. This dramatically improves retrieval because the chunk now carries context that would otherwise be lost during splitting.

async def add_context_to_chunks(
    chunks: list[str], full_document: str, llm
) -> list[str]:
    contextualized = []
    for chunk in chunks:
        prompt = f"""Given this document:
{full_document[:3000]}

And this specific chunk from it:
{chunk}

Write a 1-2 sentence context that explains where this chunk fits
in the overall document. Start with 'This chunk is about...'"""

        response = await llm.ainvoke(prompt)
        contextualized.append(
            f"{response.content}\n\n{chunk}"
        )
    return contextualized

Query Expansion

Expand a single query into multiple formulations to improve recall. This is especially effective for short or ambiguous queries.

async def expand_query(query: str, llm, n_expansions: int = 3) -> list[str]:
    prompt = f"""Generate {n_expansions} alternative phrasings of this
search query. Each should capture the same intent but use different words.

Original query: {query}

Return only the alternative queries, one per line."""

    response = await llm.ainvoke(prompt)
    lines = response.content.strip().split("\n")
    expansions = [q.strip() for q in lines if q.strip()]
    return [query] + expansions[:n_expansions]

async def expanded_search(
    query: str, vector_store, llm, top_k: int = 5
) -> list:
    queries = await expand_query(query, llm)
    all_results = []
    seen_ids = set()

    for q in queries:
        results = vector_store.similarity_search(q, k=top_k)
        for r in results:
            # Crude dedup key: first 100 chars of content. Use a real
            # document ID from metadata in production.
            doc_id = r.page_content[:100]
            if doc_id not in seen_ids:
                all_results.append(r)
                seen_ids.add(doc_id)

    return all_results[:top_k]

Hypothetical Document Embeddings (HyDE)

Instead of embedding the query directly, generate a hypothetical answer and embed that. The hypothesis is closer in embedding space to actual documents than the question is.

async def hyde_search(
    query: str, vector_store, llm, embedding_service, top_k: int = 5
) -> list:
    # Generate hypothetical answer
    prompt = f"""Write a detailed paragraph that would answer this question.
Write as if it is a passage from a reference document.

Question: {query}"""

    response = await llm.ainvoke(prompt)
    hypothesis = response.content

    # Embed the hypothesis instead of the query
    hyp_vector = embedding_service.embed_query(hypothesis)

    # Search with hypothesis embedding
    results = vector_store.similarity_search_by_vector(
        hyp_vector, k=top_k
    )
    return results

Putting It All Together: Production Pipeline

class ProductionRetrievalPipeline:
    def __init__(self, config: dict):
        self.embedding = EmbeddingService(config["embedding_provider"])
        self.vector_store = config["vector_store"]
        self.llm = config["llm"]
        self.use_hyde = config.get("use_hyde", False)
        self.use_expansion = config.get("use_expansion", True)
        self.use_reranking = config.get("use_reranking", True)

    async def ingest(self, documents: list[dict]):
        for doc in documents:
            # Step 1: Chunk
            chunks = semantic_chunking(doc["content"])

            # Step 2: Add context
            chunks = await add_context_to_chunks(
                chunks, doc["content"], self.llm
            )

            # Step 3: Embed and store
            vectors = self.embedding.embed_documents(chunks)
            self.vector_store.add(
                vectors=vectors,
                documents=chunks,
                metadatas=[doc["metadata"]] * len(chunks),
            )

    async def search(self, query: str, top_k: int = 5) -> list[str]:
        # Step 1: First-pass retrieval via HyDE, query expansion, or plain search
        if self.use_hyde:
            results = await hyde_search(
                query, self.vector_store, self.llm, self.embedding, top_k=20
            )
        elif self.use_expansion:
            results = await expanded_search(
                query, self.vector_store, self.llm, top_k=20
            )
        else:
            results = self.vector_store.similarity_search(query, k=20)

        # Step 2: Optional re-ranking (ReRanker and SearchResult are assumed
        # to be defined elsewhere, e.g. a cross-encoder or rerank-API wrapper)
        if self.use_reranking:
            reranker = ReRanker()
            results = reranker.rerank(
                query,
                [SearchResult(content=r.page_content, metadata=r.metadata, score=0)
                 for r in results],
                top_k=top_k,
            )
            return [r.content for r in results]

        return [r.page_content for r in results[:top_k]]
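The pipeline references a ReRanker and SearchResult that are assumed to be defined elsewhere in your codebase (e.g. a cross-encoder or a hosted rerank API wrapper). As a placeholder with the same interface, here is a trivial lexical-overlap reranker; swap the scoring for a real model in production:

```python
from dataclasses import dataclass

@dataclass
class SearchResult:
    content: str
    metadata: dict
    score: float

class ReRanker:
    """Toy reranker: scores results by word overlap with the query.
    Same interface the pipeline expects; replace the scoring with a
    cross-encoder or rerank API call in production."""

    def rerank(
        self, query: str, results: list[SearchResult], top_k: int
    ) -> list[SearchResult]:
        q_words = set(query.lower().split())
        for r in results:
            overlap = q_words & set(r.content.lower().split())
            r.score = len(overlap) / max(len(q_words), 1)
        return sorted(results, key=lambda r: r.score, reverse=True)[:top_k]

hits = [
    SearchResult("refund policy for orders", {}, 0),
    SearchResult("shipping times", {}, 0),
]
top = ReRanker().rerank("what is the refund policy", hits, top_k=1)
print(top[0].content)  # "refund policy for orders"
```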

FAQ

What chunk size should I use for my specific use case?

Start with 500 characters and test. For factual Q&A (customer support, documentation), smaller chunks (300-500 characters) work best because answers are typically contained in a single paragraph. For analytical queries (research, summarization), larger chunks (800-1500 characters) provide more context. The most reliable approach is to build a test set of 50 queries with known answers, then benchmark different chunk sizes against recall at k=5. Most teams find their optimal size between 400 and 800 characters.

How much does embedding model quality actually affect retrieval?

Significantly. In controlled benchmarks, the gap between the best and worst mainstream embedding models is 15-20% recall at k=5. However, the gap between the top 3 models is only 2-4%. This means the choice between OpenAI, Cohere, and Voyage matters much less than the choice between any of these and a cheap or outdated model. Where embedding model choice matters most is multilingual retrieval (Cohere leads) and long-document retrieval (Voyage leads).

Should I use semantic chunking or fixed-size chunking?

Semantic chunking produces higher-quality chunks but is slower (requires embedding every sentence to find breakpoints) and non-deterministic (different runs may produce different splits). Use semantic chunking when document quality varies and topics shift frequently within documents. Use fixed-size chunking for homogeneous documents (product specs, legal clauses, API documentation) where the structure is already consistent. For most production systems, fixed-size chunking with a well-tuned size and 10% overlap provides 90% of the quality at 10% of the cost.

How do I evaluate whether my retrieval pipeline is actually good enough?

Build a golden test set: 100 queries paired with the document chunks that contain the correct answer. Measure recall at k=5 (what percentage of queries have the answer in the top 5 results) and MRR (mean reciprocal rank — how high the first correct result appears). Target recall at k=5 above 85% and MRR above 0.6. If you fall short, the improvement priority is: (1) fix chunking, (2) add re-ranking, (3) try query expansion, (4) switch embedding models. Most retrieval failures are caused by bad chunking, not bad embeddings.


#SemanticSearch #Embeddings #Chunking #RetrievalOptimization #RAG #VectorSearch #AIAgents #LLM
