
Semantic Search Autocomplete: AI-Powered Query Suggestions and Completion

Build an intelligent autocomplete system that suggests semantically relevant queries as users type, combining query embeddings with popularity weighting and user personalization for a superior search experience.

Beyond Prefix Matching

Traditional autocomplete systems use prefix matching: type "mach" and get "machine learning," "machine vision," "machining." This works for exact prefixes but fails when users phrase things differently. Typing "how to train" will never suggest "fine-tuning a neural network" with prefix matching, even though they express the same intent.

Semantic autocomplete uses embeddings to suggest queries that are semantically related to what the user has typed so far, regardless of prefix overlap. Combined with popularity signals and personalization, this creates an autocomplete experience that genuinely anticipates what users are looking for.
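A toy sketch makes the limitation concrete. The query log and partial queries below are invented for illustration; a prefix matcher structurally cannot surface "fine-tuning a neural network" for "how to train", no matter how related the intents are:

```python
# Toy illustration: why prefix matching misses related queries.
# The query log here is invented for the example.
query_log = [
    "machine learning",
    "machine vision",
    "fine-tuning a neural network",
    "how to train a model",
]

def prefix_suggest(partial: str, log: list[str]) -> list[str]:
    """Classic autocomplete: only queries sharing the typed prefix qualify."""
    p = partial.lower()
    return [q for q in log if q.lower().startswith(p)]

print(prefix_suggest("mach", query_log))
# "fine-tuning a neural network" shares no prefix with "how to train",
# so prefix matching can never surface it; semantic similarity can.
print(prefix_suggest("how to train", query_log))
```

Embedding-based scoring replaces the `startswith` test with a similarity over meaning, which is exactly what the index below enables.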

Building the Suggestion Index

The suggestion index stores previously successful queries along with their embeddings and popularity scores.

from dataclasses import dataclass, field
from typing import List, Optional, Dict
from sentence_transformers import SentenceTransformer
import numpy as np
import time

@dataclass
class QuerySuggestion:
    text: str
    count: int = 0          # how many times this query was searched
    click_rate: float = 0.0  # fraction of searches that led to a click
    last_used: float = 0.0
    categories: List[str] = field(default_factory=list)

class SuggestionIndex:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.suggestions: List[QuerySuggestion] = []
        self.embeddings: Optional[np.ndarray] = None
        self.text_to_idx: Dict[str, int] = {}

    def build(self, suggestions: List[QuerySuggestion]):
        """Embed and index all suggestions."""
        self.suggestions = suggestions
        self.text_to_idx = {
            s.text.lower(): i for i, s in enumerate(suggestions)
        }
        texts = [s.text for s in suggestions]
        self.embeddings = self.model.encode(
            texts,
            normalize_embeddings=True,
            batch_size=128,
            show_progress_bar=True,
        )
        print(f"Indexed {len(suggestions)} suggestions")

    def add_suggestion(self, suggestion: QuerySuggestion):
        """Add a single new suggestion to the index."""
        embedding = self.model.encode(
            [suggestion.text], normalize_embeddings=True
        )
        idx = len(self.suggestions)
        self.suggestions.append(suggestion)
        self.text_to_idx[suggestion.text.lower()] = idx

        if self.embeddings is None:
            self.embeddings = embedding
        else:
            self.embeddings = np.vstack([self.embeddings, embedding])

    def record_search(self, query_text: str, had_click: bool):
        """Update statistics when a query is executed."""
        key = query_text.lower().strip()
        if key in self.text_to_idx:
            idx = self.text_to_idx[key]
            s = self.suggestions[idx]
            s.count += 1
            total = s.count
            s.click_rate = (
                (s.click_rate * (total - 1) + (1.0 if had_click else 0.0))
                / total
            )
            s.last_used = time.time()
        else:
            self.add_suggestion(QuerySuggestion(
                text=query_text.strip(),
                count=1,
                click_rate=1.0 if had_click else 0.0,
                last_used=time.time(),
            ))

The Autocomplete Engine

The engine combines semantic similarity with popularity and recency signals to rank suggestions.

class SemanticAutocomplete:
    def __init__(
        self,
        index: SuggestionIndex,
        semantic_weight: float = 0.5,
        popularity_weight: float = 0.3,
        recency_weight: float = 0.1,
        click_rate_weight: float = 0.1,
    ):
        self.index = index
        self.semantic_weight = semantic_weight
        self.popularity_weight = popularity_weight
        self.recency_weight = recency_weight
        self.click_rate_weight = click_rate_weight

    def suggest(
        self,
        partial_query: str,
        top_k: int = 8,
        prefix_boost: float = 0.2,
    ) -> List[Dict]:
        """Generate autocomplete suggestions for a partial query."""
        if len(partial_query.strip()) < 2:
            return self._popular_suggestions(top_k)
        # Guard: nothing has been indexed yet
        if self.index.embeddings is None or not self.index.suggestions:
            return []

        query_emb = self.index.model.encode(
            [partial_query], normalize_embeddings=True
        )
        semantic_scores = np.dot(
            self.index.embeddings, query_emb.T
        ).flatten()

        # Normalize popularity scores
        counts = np.array([
            s.count for s in self.index.suggestions
        ], dtype=float)
        max_count = max(counts.max(), 1)
        popularity_scores = counts / max_count

        # Recency: exponential decay with a true 7-day half-life
        now = time.time()
        recency_scores = np.array([
            0.5 ** ((now - s.last_used) / (7 * 86400))
            if s.last_used > 0 else 0.0
            for s in self.index.suggestions
        ])

        click_scores = np.array([
            s.click_rate for s in self.index.suggestions
        ])

        # Combined score
        combined = (
            self.semantic_weight * semantic_scores
            + self.popularity_weight * popularity_scores
            + self.recency_weight * recency_scores
            + self.click_rate_weight * click_scores
        )

        # Prefix boost for suggestions that start with the partial query
        partial_lower = partial_query.lower().strip()
        for i, s in enumerate(self.index.suggestions):
            if s.text.lower().startswith(partial_lower):
                combined[i] += prefix_boost

        top_indices = np.argsort(combined)[::-1][:top_k]

        results = []
        for idx in top_indices:
            s = self.index.suggestions[idx]
            results.append({
                "text": s.text,
                "score": float(combined[idx]),
                "semantic_score": float(semantic_scores[idx]),
                "popularity": int(s.count),
                "categories": s.categories,
            })
        return results

    def _popular_suggestions(self, top_k: int) -> List[Dict]:
        """Return most popular suggestions when query is too short."""
        sorted_suggestions = sorted(
            enumerate(self.index.suggestions),
            key=lambda x: x[1].count,
            reverse=True,
        )
        return [
            {
                "text": s.text,
                "score": 0.0,
                "popularity": s.count,
                "categories": s.categories,
            }
            for _, s in sorted_suggestions[:top_k]
        ]

Personalized Suggestions

Personalization uses the user's search history to boost suggestions that align with their interests.

class PersonalizedAutocomplete:
    def __init__(self, base_engine: SemanticAutocomplete):
        self.base = base_engine
        self.user_profiles: Dict[str, np.ndarray] = {}

    def update_profile(self, user_id: str, query: str):
        """Update user profile with their latest query."""
        query_emb = self.base.index.model.encode(
            [query], normalize_embeddings=True
        )[0]

        if user_id in self.user_profiles:
            # Exponential moving average
            alpha = 0.3
            self.user_profiles[user_id] = (
                alpha * query_emb
                + (1 - alpha) * self.user_profiles[user_id]
            )
            # Re-normalize
            norm = np.linalg.norm(self.user_profiles[user_id])
            self.user_profiles[user_id] /= norm
        else:
            self.user_profiles[user_id] = query_emb

    def suggest(
        self,
        partial_query: str,
        user_id: Optional[str] = None,
        top_k: int = 8,
        personalization_weight: float = 0.15,
    ) -> List[Dict]:
        """Suggest with optional personalization."""
        base_results = self.base.suggest(partial_query, top_k=top_k * 2)

        if user_id and user_id in self.user_profiles:
            profile = self.user_profiles[user_id]
            for result in base_results:
                # Reuse the precomputed suggestion embedding rather than
                # re-encoding every candidate on each keystroke
                idx = self.base.index.text_to_idx[result["text"].lower()]
                sugg_emb = self.base.index.embeddings[idx]
                personal_score = float(np.dot(profile, sugg_emb))
                result["score"] += personalization_weight * personal_score
                result["personalized"] = True

            base_results.sort(key=lambda r: r["score"], reverse=True)

        return base_results[:top_k]
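The exponential moving average in `update_profile` blends each new query into the profile and renormalizes so dot products stay comparable cosine similarities. A dependency-free sketch with a hypothetical 2-dimensional "embedding" (real profiles use the model's full dimensionality):

```python
import math

def ema_update(profile: list[float], query_emb: list[float],
               alpha: float = 0.3) -> list[float]:
    """Blend the new query into the profile, then renormalize to unit length."""
    blended = [alpha * q + (1 - alpha) * p for p, q in zip(profile, query_emb)]
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended]

profile = [1.0, 0.0]   # user history so far: all interest on axis 0
query = [0.0, 1.0]     # new query points entirely at axis 1
profile = ema_update(profile, query)
print(round(math.sqrt(sum(x * x for x in profile)), 6))  # 1.0 (unit length)
```

With `alpha = 0.3` the profile drifts toward new interests while older behavior still dominates; higher alpha adapts faster but forgets faster.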

Building the FastAPI Endpoint

Autocomplete must be fast — users expect suggestions within 50-100ms. Here is a FastAPI endpoint that serves suggestions efficiently.

from fastapi import FastAPI, Query
from fastapi.responses import JSONResponse
from typing import Optional

app = FastAPI()

# Initialize at startup
suggestion_index = SuggestionIndex()
autocomplete = SemanticAutocomplete(suggestion_index)
personalized = PersonalizedAutocomplete(autocomplete)

@app.get("/api/suggest")
async def get_suggestions(
    q: str = Query(..., min_length=1, max_length=200),
    user_id: Optional[str] = Query(None),
    limit: int = Query(8, ge=1, le=20),
):
    suggestions = personalized.suggest(
        partial_query=q,
        user_id=user_id,
        top_k=limit,
    )
    return JSONResponse(
        content={"suggestions": suggestions},
        headers={"Cache-Control": "public, max-age=60"},
    )

@app.post("/api/search-event")
async def record_search(query: str, user_id: Optional[str] = None, clicked: bool = False):
    """Record search execution for popularity tracking."""
    suggestion_index.record_search(query, clicked)
    if user_id:
        personalized.update_profile(user_id, query)
    return {"status": "recorded"}

Performance Optimizations

For sub-50ms response times:


  1. Cache embeddings — cache the partial query embedding for debounced requests where the user is still typing.
  2. Quantize the index — use int8 quantization for suggestion embeddings to reduce memory and speed up dot products.
  3. Limit candidate pool — only score the top 1000 suggestions by a cheap pre-filter (prefix match + popularity), then apply semantic scoring.
  4. Precompute popular — cache the top-10 popular suggestions so empty-query requests are instant.
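Optimization 3 can be sketched without the embedding model. A cheap first pass (prefix matches, then most-popular fill) picks a candidate pool, and only that pool would be semantically scored. The function and field names here are illustrative, using plain dicts in place of `QuerySuggestion`:

```python
def prefilter_candidates(partial: str, suggestions: list[dict],
                         pool_size: int = 1000) -> list[int]:
    """Cheap first pass: prefix matches first, then fill with most-popular.
    Returns indices into `suggestions`; only these get embedded and scored."""
    p = partial.lower().strip()
    prefix_idx = [i for i, s in enumerate(suggestions)
                  if s["text"].lower().startswith(p)]
    taken = set(prefix_idx)
    rest = sorted(
        (i for i in range(len(suggestions)) if i not in taken),
        key=lambda i: suggestions[i]["count"],
        reverse=True,
    )
    return (prefix_idx + rest)[:pool_size]

suggestions = [
    {"text": "machine learning", "count": 50},
    {"text": "deep learning", "count": 90},
    {"text": "machine vision", "count": 10},
]
print(prefilter_candidates("mach", suggestions, pool_size=2))  # [0, 2]
```

With a pool of 1000, the expensive dot-product scoring touches a fixed-size slice of the index regardless of how large the suggestion corpus grows.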

FAQ

How do I prevent low-quality or offensive suggestions from appearing?

Maintain a blocklist of terms and patterns that should never appear in suggestions. Before adding any new query to the suggestion index, run it through a content filter. Additionally, set a minimum search count threshold (e.g., 3 searches) before a query becomes eligible for suggestions. This prevents one-off typos or adversarial queries from polluting the suggestion pool.
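The two gates described above (blocklist plus minimum search count) can be combined into a single eligibility check run before any query enters the visible suggestion pool. The blocklist contents here are placeholders; production systems would use a real content filter:

```python
BLOCKLIST = {"badword"}   # illustrative; use a real content filter in production
MIN_SEARCHES = 3          # threshold from the answer above

def is_eligible(text: str, count: int) -> bool:
    """A query only becomes a visible suggestion once it clears both gates."""
    tokens = set(text.lower().split())
    if tokens & BLOCKLIST:
        return False
    return count >= MIN_SEARCHES

print(is_eligible("machine learning", 5))  # True
print(is_eligible("machine learning", 1))  # False (one-off / typo territory)
print(is_eligible("badword query", 100))   # False (blocked term)
```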

How often should I rebuild the suggestion index vs updating it incrementally?

Use incremental updates (add_suggestion and record_search) for real-time responsiveness, and schedule a full rebuild weekly. The rebuild recalculates all embeddings (catching model improvements), prunes suggestions with zero searches in the last 30 days, and recomputes normalized popularity scores. This keeps the index clean and the scores well-calibrated without disrupting service.

How do I handle misspelled partial queries?

Combine semantic autocomplete with a lightweight spell-correction layer. Before embedding the partial query, check if it has a close match in your suggestion vocabulary using edit distance. If the corrected form has significantly higher popularity, use the corrected embedding. Libraries like symspellpy provide fast spell correction that adds under 1ms of latency. The semantic embedding itself is somewhat robust to minor typos since transformer tokenizers handle subword variations.
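The answer above names symspellpy; as a dependency-free sketch of the same idea, Python's standard-library difflib can snap a partial query to a close vocabulary entry before embedding. The vocabulary and cutoff are illustrative:

```python
import difflib

vocab = ["machine learning", "neural networks", "fine-tuning"]

def correct_partial(partial: str, vocab: list[str], cutoff: float = 0.8) -> str:
    """Return the closest known query if one is similar enough, else the input."""
    matches = difflib.get_close_matches(partial.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else partial

print(correct_partial("machne learning", vocab))  # 'machine learning'
print(correct_partial("quantum chips", vocab))    # 'quantum chips' (no close match)
```

difflib is slower than symspellpy at scale, but the control flow is the same: correct first, embed second, and only substitute when the match clears the similarity threshold.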


#Autocomplete #QuerySuggestions #SearchUX #SemanticSearch #Personalization #AgenticAI #LearnAI #AIEngineering
