
Semantic Search Autocomplete: AI-Powered Query Suggestions and Completion

Build an intelligent autocomplete system that suggests semantically relevant queries as users type, combining query embeddings with popularity weighting and user personalization for a superior search experience.

Beyond Prefix Matching

Traditional autocomplete systems use prefix matching: type "mach" and get "machine learning," "machine vision," "machining." This works for exact prefixes but fails when users phrase things differently. Typing "how to train" will never suggest "fine-tuning a neural network" with prefix matching, even though they express the same intent.

Semantic autocomplete uses embeddings to suggest queries that are semantically related to what the user has typed so far, regardless of prefix overlap. Combined with popularity signals and personalization, this creates an autocomplete experience that genuinely anticipates what users are looking for.
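A toy sketch makes the limitation concrete. The query log and partial queries below are invented for illustration; a prefix matcher structurally cannot surface "fine-tuning a neural network" for "how to train", no matter how related the intents are:

```python
# Toy illustration: why prefix matching misses related queries.
# The query log here is invented for the example.
query_log = [
    "machine learning",
    "machine vision",
    "fine-tuning a neural network",
    "how to train a model",
]

def prefix_suggest(partial: str, log: list[str]) -> list[str]:
    """Classic autocomplete: only queries sharing the typed prefix qualify."""
    p = partial.lower()
    return [q for q in log if q.lower().startswith(p)]

print(prefix_suggest("mach", query_log))
# "fine-tuning a neural network" shares no prefix with "how to train",
# so prefix matching can never surface it; semantic similarity can.
print(prefix_suggest("how to train", query_log))
```

Embedding-based scoring replaces the `startswith` test with a similarity over meaning, which is exactly what the index below enables.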

Building the Suggestion Index

The suggestion index stores previously successful queries along with their embeddings and popularity scores.

from dataclasses import dataclass, field
from typing import List, Optional, Dict
from sentence_transformers import SentenceTransformer
import numpy as np
import time

@dataclass
class QuerySuggestion:
    text: str
    count: int = 0          # how many times this query was searched
    click_rate: float = 0.0  # fraction of searches that led to a click
    last_used: float = 0.0
    categories: List[str] = field(default_factory=list)

class SuggestionIndex:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.suggestions: List[QuerySuggestion] = []
        self.embeddings: Optional[np.ndarray] = None
        self.text_to_idx: Dict[str, int] = {}

    def build(self, suggestions: List[QuerySuggestion]):
        """Embed and index all suggestions."""
        self.suggestions = suggestions
        self.text_to_idx = {
            s.text.lower(): i for i, s in enumerate(suggestions)
        }
        texts = [s.text for s in suggestions]
        self.embeddings = self.model.encode(
            texts,
            normalize_embeddings=True,
            batch_size=128,
            show_progress_bar=True,
        )
        print(f"Indexed {len(suggestions)} suggestions")

    def add_suggestion(self, suggestion: QuerySuggestion):
        """Add a single new suggestion to the index."""
        embedding = self.model.encode(
            [suggestion.text], normalize_embeddings=True
        )
        idx = len(self.suggestions)
        self.suggestions.append(suggestion)
        self.text_to_idx[suggestion.text.lower()] = idx

        if self.embeddings is None:
            self.embeddings = embedding
        else:
            self.embeddings = np.vstack([self.embeddings, embedding])

    def record_search(self, query_text: str, had_click: bool):
        """Update statistics when a query is executed."""
        key = query_text.lower().strip()
        if key in self.text_to_idx:
            idx = self.text_to_idx[key]
            s = self.suggestions[idx]
            s.count += 1
            total = s.count
            s.click_rate = (
                (s.click_rate * (total - 1) + (1.0 if had_click else 0.0))
                / total
            )
            s.last_used = time.time()
        else:
            self.add_suggestion(QuerySuggestion(
                text=query_text.strip(),
                count=1,
                click_rate=1.0 if had_click else 0.0,
                last_used=time.time(),
            ))

The Autocomplete Engine

The engine combines semantic similarity with popularity and recency signals to rank suggestions.

class SemanticAutocomplete:
    def __init__(
        self,
        index: SuggestionIndex,
        semantic_weight: float = 0.5,
        popularity_weight: float = 0.3,
        recency_weight: float = 0.1,
        click_rate_weight: float = 0.1,
    ):
        self.index = index
        self.semantic_weight = semantic_weight
        self.popularity_weight = popularity_weight
        self.recency_weight = recency_weight
        self.click_rate_weight = click_rate_weight

    def suggest(
        self,
        partial_query: str,
        top_k: int = 8,
        prefix_boost: float = 0.2,
    ) -> List[Dict]:
        """Generate autocomplete suggestions for a partial query."""
        if len(partial_query.strip()) < 2:
            return self._popular_suggestions(top_k)
        # Guard: nothing has been indexed yet
        if self.index.embeddings is None or not self.index.suggestions:
            return []

        query_emb = self.index.model.encode(
            [partial_query], normalize_embeddings=True
        )
        semantic_scores = np.dot(
            self.index.embeddings, query_emb.T
        ).flatten()

        # Normalize popularity scores
        counts = np.array([
            s.count for s in self.index.suggestions
        ], dtype=float)
        max_count = max(counts.max(), 1)
        popularity_scores = counts / max_count

        # Recency: exponential decay with a true 7-day half-life
        now = time.time()
        recency_scores = np.array([
            0.5 ** ((now - s.last_used) / (7 * 86400))
            if s.last_used > 0 else 0.0
            for s in self.index.suggestions
        ])

        click_scores = np.array([
            s.click_rate for s in self.index.suggestions
        ])

        # Combined score
        combined = (
            self.semantic_weight * semantic_scores
            + self.popularity_weight * popularity_scores
            + self.recency_weight * recency_scores
            + self.click_rate_weight * click_scores
        )

        # Prefix boost for suggestions that start with the partial query
        partial_lower = partial_query.lower().strip()
        for i, s in enumerate(self.index.suggestions):
            if s.text.lower().startswith(partial_lower):
                combined[i] += prefix_boost

        top_indices = np.argsort(combined)[::-1][:top_k]

        results = []
        for idx in top_indices:
            s = self.index.suggestions[idx]
            results.append({
                "text": s.text,
                "score": float(combined[idx]),
                "semantic_score": float(semantic_scores[idx]),
                "popularity": int(s.count),
                "categories": s.categories,
            })
        return results

    def _popular_suggestions(self, top_k: int) -> List[Dict]:
        """Return most popular suggestions when query is too short."""
        sorted_suggestions = sorted(
            enumerate(self.index.suggestions),
            key=lambda x: x[1].count,
            reverse=True,
        )
        return [
            {
                "text": s.text,
                "score": 0.0,
                "popularity": s.count,
                "categories": s.categories,
            }
            for _, s in sorted_suggestions[:top_k]
        ]

Personalized Suggestions

Personalization uses the user's search history to boost suggestions that align with their interests.

class PersonalizedAutocomplete:
    def __init__(self, base_engine: SemanticAutocomplete):
        self.base = base_engine
        self.user_profiles: Dict[str, np.ndarray] = {}

    def update_profile(self, user_id: str, query: str):
        """Update user profile with their latest query."""
        query_emb = self.base.index.model.encode(
            [query], normalize_embeddings=True
        )[0]

        if user_id in self.user_profiles:
            # Exponential moving average
            alpha = 0.3
            self.user_profiles[user_id] = (
                alpha * query_emb
                + (1 - alpha) * self.user_profiles[user_id]
            )
            # Re-normalize
            norm = np.linalg.norm(self.user_profiles[user_id])
            self.user_profiles[user_id] /= norm
        else:
            self.user_profiles[user_id] = query_emb

    def suggest(
        self,
        partial_query: str,
        user_id: Optional[str] = None,
        top_k: int = 8,
        personalization_weight: float = 0.15,
    ) -> List[Dict]:
        """Suggest with optional personalization."""
        base_results = self.base.suggest(partial_query, top_k=top_k * 2)

        if user_id and user_id in self.user_profiles:
            profile = self.user_profiles[user_id]
            for result in base_results:
                # Reuse the precomputed suggestion embedding rather than
                # re-encoding every candidate on each keystroke
                idx = self.base.index.text_to_idx[result["text"].lower()]
                sugg_emb = self.base.index.embeddings[idx]
                personal_score = float(np.dot(profile, sugg_emb))
                result["score"] += personalization_weight * personal_score
                result["personalized"] = True

            base_results.sort(key=lambda r: r["score"], reverse=True)

        return base_results[:top_k]
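The exponential moving average in `update_profile` blends each new query into the profile and renormalizes so dot products stay comparable cosine similarities. A dependency-free sketch with a hypothetical 2-dimensional "embedding" (real profiles use the model's full dimensionality):

```python
import math

def ema_update(profile: list[float], query_emb: list[float],
               alpha: float = 0.3) -> list[float]:
    """Blend the new query into the profile, then renormalize to unit length."""
    blended = [alpha * q + (1 - alpha) * p for p, q in zip(profile, query_emb)]
    norm = math.sqrt(sum(x * x for x in blended))
    return [x / norm for x in blended]

profile = [1.0, 0.0]   # user history so far: all interest on axis 0
query = [0.0, 1.0]     # new query points entirely at axis 1
profile = ema_update(profile, query)
print(round(math.sqrt(sum(x * x for x in profile)), 6))  # 1.0 (unit length)
```

With `alpha = 0.3` the profile drifts toward new interests while older behavior still dominates; higher alpha adapts faster but forgets faster.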

Building the FastAPI Endpoint

Autocomplete must be fast — users expect suggestions within 50-100ms. Here is a FastAPI endpoint that serves suggestions efficiently.

from fastapi import FastAPI, Query
from fastapi.responses import JSONResponse
from typing import Optional

app = FastAPI()

# Initialize at startup
suggestion_index = SuggestionIndex()
autocomplete = SemanticAutocomplete(suggestion_index)
personalized = PersonalizedAutocomplete(autocomplete)

@app.get("/api/suggest")
async def get_suggestions(
    q: str = Query(..., min_length=1, max_length=200),
    user_id: Optional[str] = Query(None),
    limit: int = Query(8, ge=1, le=20),
):
    suggestions = personalized.suggest(
        partial_query=q,
        user_id=user_id,
        top_k=limit,
    )
    return JSONResponse(
        content={"suggestions": suggestions},
        headers={"Cache-Control": "public, max-age=60"},
    )

@app.post("/api/search-event")
async def record_search(query: str, user_id: Optional[str] = None, clicked: bool = False):
    """Record search execution for popularity tracking."""
    suggestion_index.record_search(query, clicked)
    if user_id:
        personalized.update_profile(user_id, query)
    return {"status": "recorded"}

Performance Optimizations

For sub-50ms response times:


  1. Cache embeddings — cache the partial query embedding for debounced requests where the user is still typing.
  2. Quantize the index — use int8 quantization for suggestion embeddings to reduce memory and speed up dot products.
  3. Limit candidate pool — only score the top 1000 suggestions by a cheap pre-filter (prefix match + popularity), then apply semantic scoring.
  4. Precompute popular — cache the top-10 popular suggestions so empty-query requests are instant.
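Optimization 3 can be sketched without the embedding model. A cheap first pass (prefix matches, then most-popular fill) picks a candidate pool, and only that pool would be semantically scored. The function and field names here are illustrative, using plain dicts in place of `QuerySuggestion`:

```python
def prefilter_candidates(partial: str, suggestions: list[dict],
                         pool_size: int = 1000) -> list[int]:
    """Cheap first pass: prefix matches first, then fill with most-popular.
    Returns indices into `suggestions`; only these get embedded and scored."""
    p = partial.lower().strip()
    prefix_idx = [i for i, s in enumerate(suggestions)
                  if s["text"].lower().startswith(p)]
    taken = set(prefix_idx)
    rest = sorted(
        (i for i in range(len(suggestions)) if i not in taken),
        key=lambda i: suggestions[i]["count"],
        reverse=True,
    )
    return (prefix_idx + rest)[:pool_size]

suggestions = [
    {"text": "machine learning", "count": 50},
    {"text": "deep learning", "count": 90},
    {"text": "machine vision", "count": 10},
]
print(prefilter_candidates("mach", suggestions, pool_size=2))  # [0, 2]
```

With a pool of 1000, the expensive dot-product scoring touches a fixed-size slice of the index regardless of how large the suggestion corpus grows.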

FAQ

How do I prevent low-quality or offensive suggestions from appearing?

Maintain a blocklist of terms and patterns that should never appear in suggestions. Before adding any new query to the suggestion index, run it through a content filter. Additionally, set a minimum search count threshold (e.g., 3 searches) before a query becomes eligible for suggestions. This prevents one-off typos or adversarial queries from polluting the suggestion pool.
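The two gates described above (blocklist plus minimum search count) can be combined into a single eligibility check run before any query enters the visible suggestion pool. The blocklist contents here are placeholders; production systems would use a real content filter:

```python
BLOCKLIST = {"badword"}   # illustrative; use a real content filter in production
MIN_SEARCHES = 3          # threshold from the answer above

def is_eligible(text: str, count: int) -> bool:
    """A query only becomes a visible suggestion once it clears both gates."""
    tokens = set(text.lower().split())
    if tokens & BLOCKLIST:
        return False
    return count >= MIN_SEARCHES

print(is_eligible("machine learning", 5))  # True
print(is_eligible("machine learning", 1))  # False (one-off / typo territory)
print(is_eligible("badword query", 100))   # False (blocked term)
```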

How often should I rebuild the suggestion index vs updating it incrementally?

Use incremental updates (add_suggestion and record_search) for real-time responsiveness, and schedule a full rebuild weekly. The rebuild recalculates all embeddings (catching model improvements), prunes suggestions with zero searches in the last 30 days, and recomputes normalized popularity scores. This keeps the index clean and the scores well-calibrated without disrupting service.

How do I handle misspelled partial queries?

Combine semantic autocomplete with a lightweight spell-correction layer. Before embedding the partial query, check if it has a close match in your suggestion vocabulary using edit distance. If the corrected form has significantly higher popularity, use the corrected embedding. Libraries like symspellpy provide fast spell correction that adds under 1ms of latency. The semantic embedding itself is somewhat robust to minor typos since transformer tokenizers handle subword variations.
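The answer above names symspellpy; as a dependency-free sketch of the same idea, Python's standard-library difflib can snap a partial query to a close vocabulary entry before embedding. The vocabulary and cutoff are illustrative:

```python
import difflib

vocab = ["machine learning", "neural networks", "fine-tuning"]

def correct_partial(partial: str, vocab: list[str], cutoff: float = 0.8) -> str:
    """Return the closest known query if one is similar enough, else the input."""
    matches = difflib.get_close_matches(partial.lower(), vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else partial

print(correct_partial("machne learning", vocab))  # 'machine learning'
print(correct_partial("quantum chips", vocab))    # 'quantum chips' (no close match)
```

difflib is slower than symspellpy at scale, but the control flow is the same: correct first, embed second, and only substitute when the match clears the similarity threshold.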


#Autocomplete #QuerySuggestions #SearchUX #SemanticSearch #Personalization #AgenticAI #LearnAI #AIEngineering
