
Semantic Search for Code: Finding Functions, Classes, and Documentation

Build a semantic code search engine that finds relevant functions and classes by intent rather than identifier names, using code-specific embeddings from CodeBERT and AST-aware parsing to understand code structure.

Why Code Search Needs Semantics

Standard text search tools like grep or IDE find-in-files match literal strings. When you search for "validate email address," grep will only find functions that contain those exact words. But your codebase might have a function called check_email_format or is_valid_email that does exactly what you need. Semantic code search bridges this gap by understanding the intent behind code, matching natural language queries to code by meaning.
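The mismatch is easy to demonstrate with a literal string search. A minimal sketch, using hypothetical function names standing in for a real codebase:

```python
# Hypothetical function names standing in for a real codebase.
function_names = ["is_valid_email", "check_email_format", "parse_request_body"]

query = "validate email address"

# grep-style search: match only if the literal query (as an identifier)
# appears inside the function name.
literal_hits = [
    name for name in function_names
    if query.replace(" ", "_") in name
]
print(literal_hits)  # [] -- no name contains "validate_email_address"
```

Both `is_valid_email` and `check_email_format` express the query's intent, yet the literal match returns nothing. Semantic search closes this gap by comparing meanings rather than characters.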

Extracting Code Units with AST Parsing

Before embedding code, we need to extract meaningful units — functions, classes, and their docstrings — using Abstract Syntax Tree (AST) parsing.

flowchart LR
    SRC(["Python source<br/>files"])
    PARSE["AST parsing<br/>functions, classes"]
    UNITS["Code units plus<br/>docstrings"]
    EMB["Code-specific<br/>embeddings"]
    IDX[("Vector index")]
    QUERY(["Natural language<br/>query"])
    QEMB["Query<br/>embedding"]
    RANK["Cosine similarity<br/>ranking"]
    OUT(["Ranked code<br/>results"])
    SRC --> PARSE --> UNITS --> EMB --> IDX
    QUERY --> QEMB --> RANK
    IDX --> RANK --> OUT
    style EMB fill:#4f46e5,stroke:#4338ca,color:#fff
    style IDX fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style OUT fill:#059669,stroke:#047857,color:#fff
import ast
from dataclasses import dataclass
from typing import List, Optional
from pathlib import Path

@dataclass
class CodeUnit:
    name: str
    type: str  # "function", "class", "method"
    docstring: Optional[str]
    signature: str
    body: str
    file_path: str
    line_number: int

    @property
    def search_text(self) -> str:
        """Combine all textual signals for embedding."""
        parts = [self.name.replace("_", " ")]
        if self.docstring:
            parts.append(self.docstring)
        parts.append(self.signature)
        return " . ".join(parts)

class PythonCodeParser:
    def parse_file(self, file_path: str) -> List[CodeUnit]:
        """Extract functions, classes, and methods from a Python file."""
        source = Path(file_path).read_text()
        tree = ast.parse(source, filename=file_path)
        units = []
        method_nodes = set()  # methods already captured via their class

        # ast.walk visits parents before children, so a class is always seen
        # before its methods; tracking method nodes prevents indexing each
        # method a second time as a standalone function.
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                units.append(self._extract_class(node, file_path))
                for item in node.body:
                    if isinstance(item, ast.FunctionDef):
                        method_nodes.add(item)
                        method = self._extract_function(item, file_path, source)
                        method.type = "method"
                        method.name = f"{node.name}.{item.name}"
                        units.append(method)
            elif isinstance(node, ast.FunctionDef) and node not in method_nodes:
                units.append(self._extract_function(node, file_path, source))

        return units

    def _extract_function(
        self, node: ast.FunctionDef, file_path: str, source: str
    ) -> CodeUnit:
        args = [arg.arg for arg in node.args.args if arg.arg != "self"]
        signature = f"def {node.name}({', '.join(args)})"
        # Reuse the source read in parse_file instead of re-reading per node.
        body = ast.get_source_segment(source, node) or ""

        return CodeUnit(
            name=node.name,
            type="function",
            docstring=ast.get_docstring(node),
            signature=signature,
            body=body[:500],
            file_path=file_path,
            line_number=node.lineno,
        )

    def _extract_class(
        self, node: ast.ClassDef, file_path: str
    ) -> CodeUnit:
        # ast.unparse (Python 3.9+) recovers dotted bases like "abc.ABC",
        # not just bare names.
        bases = [ast.unparse(b) for b in node.bases]
        signature = (
            f"class {node.name}({', '.join(bases)})"
            if bases
            else f"class {node.name}"
        )

        return CodeUnit(
            name=node.name,
            type="class",
            docstring=ast.get_docstring(node),
            signature=signature,
            body="",
            file_path=file_path,
            line_number=node.lineno,
        )

    def parse_directory(self, directory: str) -> List[CodeUnit]:
        """Recursively parse all Python files in a directory."""
        units = []
        for py_file in Path(directory).rglob("*.py"):
            try:
                units.extend(self.parse_file(str(py_file)))
            except SyntaxError:
                continue
        return units
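As a quick sanity check of the extraction logic, the same walk can be run directly against a small source string. A self-contained sketch, with a hypothetical module as input:

```python
import ast

# Hypothetical module source used to sanity-check AST extraction.
sample = '''
def is_valid_email(address):
    """Return True if address looks like an email."""
    return "@" in address

class Mailer:
    def send(self, to):
        """Deliver a message."""
'''

tree = ast.parse(sample)
extracted = []
seen_methods = set()

for node in ast.walk(tree):
    if isinstance(node, ast.ClassDef):
        extracted.append(("class", node.name))
        for item in node.body:
            if isinstance(item, ast.FunctionDef):
                seen_methods.add(item)  # skip when walk reaches it again
                extracted.append(("method", f"{node.name}.{item.name}"))
    elif isinstance(node, ast.FunctionDef) and node not in seen_methods:
        extracted.append(("function", node.name))

print(extracted)
# [('function', 'is_valid_email'), ('class', 'Mailer'), ('method', 'Mailer.send')]
```

Note that without the `seen_methods` guard, `Mailer.send` would appear twice: once as a method and once more when `ast.walk` reaches the nested `FunctionDef` node on its own.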

Code-Specific Embedding Models

General-purpose text models work reasonably for code search, but code-specific models like CodeBERT or UniXcoder understand programming concepts better.

from sentence_transformers import SentenceTransformer
import numpy as np

class CodeSearchEngine:
    def __init__(self):
        # UniXcoder handles both natural language and code well
        self.model = SentenceTransformer(
            "microsoft/unixcoder-base"
        )
        self.parser = PythonCodeParser()
        self.code_units: List[CodeUnit] = []
        self.embeddings: Optional[np.ndarray] = None

    def index_directory(self, directory: str):
        """Parse and embed all code in a directory."""
        self.code_units = self.parser.parse_directory(directory)

        search_texts = [unit.search_text for unit in self.code_units]
        self.embeddings = self.model.encode(
            search_texts,
            normalize_embeddings=True,
            batch_size=32,
            show_progress_bar=True,
        )
        print(f"Indexed {len(self.code_units)} code units")

    def search(
        self, query: str, top_k: int = 10, type_filter: Optional[str] = None
    ) -> List[dict]:
        """Search code using natural language query."""
        query_emb = self.model.encode(
            [query], normalize_embeddings=True
        )
        scores = np.dot(self.embeddings, query_emb.T).flatten()
        top_indices = np.argsort(scores)[::-1]

        results = []
        for idx in top_indices:
            if len(results) >= top_k:
                break
            unit = self.code_units[idx]
            if type_filter and unit.type != type_filter:
                continue
            results.append({
                "name": unit.name,
                "type": unit.type,
                "signature": unit.signature,
                "docstring": unit.docstring or "No docstring",
                "file": unit.file_path,
                "line": unit.line_number,
                "score": float(scores[idx]),
            })
        return results
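Because encode is called with normalize_embeddings=True, the dot product in search is exactly cosine similarity: for unit vectors, the denominator of the cosine formula is 1. A small numpy sketch with made-up vectors:

```python
import numpy as np

# Three made-up, unnormalized "embedding" vectors standing in for code units.
raw = np.array([
    [3.0, 4.0, 0.0],
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
])
# Normalizing rows to unit length mirrors normalize_embeddings=True.
embeddings = raw / np.linalg.norm(raw, axis=1, keepdims=True)

query = np.array([[3.0, 4.0, 0.0]])
query = query / np.linalg.norm(query)

# With unit vectors, the dot product equals cosine similarity.
scores = np.dot(embeddings, query.T).flatten()
ranked = np.argsort(scores)[::-1]
print(ranked)  # [0 2 1] -- index 0 points the same direction, cosine = 1.0
```

This is why the index stores normalized embeddings: ranking reduces to a single matrix-vector product, with no per-query normalization of the corpus.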

Combining Docstring and Code Body Embeddings

For higher quality results, embed the docstring and the code body separately, then combine their similarity scores.

class DualEmbeddingCodeSearch:
    def __init__(self):
        self.nl_model = SentenceTransformer("all-MiniLM-L6-v2")
        self.code_model = SentenceTransformer("microsoft/unixcoder-base")
        self.code_units: List[CodeUnit] = []
        self.doc_embeddings: Optional[np.ndarray] = None
        self.code_embeddings: Optional[np.ndarray] = None

    def index(self, code_units: List[CodeUnit]):
        self.code_units = code_units

        doc_texts = [
            unit.docstring or unit.name.replace("_", " ")
            for unit in code_units
        ]
        self.doc_embeddings = self.nl_model.encode(
            doc_texts, normalize_embeddings=True
        )

        code_texts = [unit.body[:300] or unit.signature for unit in code_units]
        self.code_embeddings = self.code_model.encode(
            code_texts, normalize_embeddings=True
        )

    def search(
        self,
        query: str,
        top_k: int = 10,
        doc_weight: float = 0.6,
        code_weight: float = 0.4,
    ) -> List[dict]:
        """Hybrid search using both docstring and code embeddings."""
        nl_query = self.nl_model.encode(
            [query], normalize_embeddings=True
        )
        code_query = self.code_model.encode(
            [query], normalize_embeddings=True
        )

        doc_scores = np.dot(self.doc_embeddings, nl_query.T).flatten()
        code_scores = np.dot(self.code_embeddings, code_query.T).flatten()

        combined = doc_weight * doc_scores + code_weight * code_scores
        top_indices = np.argsort(combined)[::-1][:top_k]

        return [
            {
                "name": self.code_units[i].name,
                "score": float(combined[i]),
                "doc_score": float(doc_scores[i]),
                "code_score": float(code_scores[i]),
                "file": self.code_units[i].file_path,
                "line": self.code_units[i].line_number,
            }
            for i in top_indices
        ]
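The effect of the weights is easiest to see in isolation. A minimal sketch with made-up similarity scores, showing how shifting weight between the two signals reorders results:

```python
import numpy as np

# Made-up per-unit similarity scores from the two embedding spaces.
doc_scores = np.array([0.2, 0.9, 0.5])   # docstring similarity
code_scores = np.array([0.9, 0.1, 0.6])  # code-body similarity

# Docstring-heavy weighting (the defaults above) ranks unit 1 first.
combined = 0.6 * doc_scores + 0.4 * code_scores
ranking = np.argsort(combined)[::-1]
print(ranking)  # [1 2 0]

# Flipping the weights favors code-body matches instead.
combined_code_heavy = 0.4 * doc_scores + 0.6 * code_scores
ranking_code_heavy = np.argsort(combined_code_heavy)[::-1]
print(ranking_code_heavy)  # [0 2 1]
```

A docstring-heavy default (0.6/0.4) is a reasonable starting point when queries are natural language; tune the split on a sample of real queries from your codebase.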

FAQ

Which embedding model works best for code search?

UniXcoder generally provides the best results for code search because it was pre-trained on both natural language and six programming languages with a unified cross-modal architecture. CodeBERT is a strong alternative. General-purpose models like all-MiniLM-L6-v2 work surprisingly well for docstring matching but struggle with raw code bodies. If your queries are natural language descriptions, a general model with docstring embeddings is often sufficient.


How do I handle code that has no docstrings?

For undocumented code, construct a synthetic description from the function name (split on underscores and camelCase), parameter names, and return type annotations. For example, def calculate_monthly_payment(principal, rate, term) yields "calculate monthly payment with parameters principal, rate, term." This synthetic description is usually enough for basic semantic matching.
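The identifier-to-description step can be sketched in a few lines. The regex below and the helper names are illustrative, not from the article's pipeline:

```python
import re
from typing import List

def identifier_to_words(name: str) -> str:
    """Split snake_case and camelCase identifiers into lowercase words."""
    # Insert a space at lower-to-upper boundaries ("parseHTTP" -> "parse HTTP")
    # and before the last capital of an acronym ("HTTPResponse" -> "HTTP Response"),
    # then split the remaining snake_case on underscores.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])", " ", name)
    return " ".join(spaced.replace("_", " ").lower().split())

def synthetic_description(name: str, params: List[str]) -> str:
    """Build a searchable description for an undocumented function."""
    desc = identifier_to_words(name)
    if params:
        desc += " with parameters " + ", ".join(params)
    return desc

print(synthetic_description("calculate_monthly_payment", ["principal", "rate", "term"]))
# calculate monthly payment with parameters principal, rate, term
print(identifier_to_words("parseHTTPResponse"))
# parse http response
```

Feeding these synthetic descriptions through the same embedding path as real docstrings keeps undocumented code searchable without any special-casing at query time.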

Can this approach work for languages other than Python?

Yes. The AST parsing layer needs to be language-specific — use tree-sitter for a universal parser that supports 40+ languages. The embedding and search layers remain identical. Tree-sitter provides consistent node types across languages, so you can extract functions, classes, and docstrings from JavaScript, Go, Rust, or Java with the same pipeline structure.


#CodeSearch #CodeBERT #ASTParsing #SemanticSearch #DeveloperTools #AgenticAI #LearnAI #AIEngineering
