
Semantic Memory for AI Agents: Using Embeddings to Remember Relevant Facts

Learn how to build a semantic memory system for AI agents using text embeddings, similarity thresholds, and memory consolidation to retrieve the most relevant facts from past interactions.

What Is Semantic Memory?

In cognitive science, semantic memory is the store of general knowledge and facts — distinct from episodic memory (specific events) and procedural memory (how to do things). For AI agents, semantic memory is a retrieval system that finds stored information based on meaning rather than exact keywords.

The core idea is simple: convert text into numerical vectors (embeddings) that capture semantic meaning, then use vector similarity to find the most relevant stored facts when the agent needs them. A query about "monthly subscription cost" should retrieve a memory stored as "The plan is priced at $49/month" even though the words barely overlap.

Generating Embeddings

Embeddings are produced by specialized models that map text to high-dimensional vectors. Similar meanings produce vectors that are close together in this space.

The ingestion and retrieval pipeline, as a Mermaid flowchart:

flowchart TD
    DOC(["Document"])
    CHUNK["Chunker<br/>recursive plus overlap"]
    EMB["Embedding model"]
    META["Attach metadata<br/>source, page, tenant"]
    INDEX[("HNSW or IVF index<br/>in vector store")]
    Q(["Query"])
    QEMB["Embed query"]
    SEARCH["ANN search<br/>cosine similarity"]
    FILTER["Metadata filter<br/>tenant or date"]
    HITS(["Top-k chunks"])
    DOC --> CHUNK --> EMB --> META --> INDEX
    Q --> QEMB --> SEARCH
    INDEX --> SEARCH --> FILTER --> HITS
    style INDEX fill:#4f46e5,stroke:#4338ca,color:#fff
    style HITS fill:#059669,stroke:#047857,color:#fff

The helpers below wrap the OpenAI embeddings API; install the openai package and set OPENAI_API_KEY before running.

import openai
import numpy as np
from typing import List

client = openai.OpenAI()

def embed_text(text: str) -> List[float]:
    """Generate an embedding vector for a single text string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def embed_batch(texts: List[str]) -> List[List[float]]:
    """Generate embeddings for multiple texts in one API call."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

The text-embedding-3-small model produces 1536-dimensional vectors and costs fractions of a cent per thousand tokens. For higher accuracy, text-embedding-3-large produces 3072 dimensions.
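As a quick sanity check of the cosine_similarity helper, toy vectors make the geometry concrete (these are synthetic stand-ins, not real embedding outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    a_arr, b_arr = np.array(a, dtype=float), np.array(b, dtype=float)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Parallel vectors point in the same "meaning direction": similarity ≈ 1.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
# Orthogonal vectors share no direction: similarity ≈ 0.0
print(cosine_similarity([1, 0], [0, 1]))
# Opposite vectors: similarity ≈ -1.0
print(cosine_similarity([1, 0], [-1, 0]))
```

Because cosine similarity measures angle rather than magnitude, scaling a vector does not change its score — which is why it is the standard choice for comparing embeddings of different text lengths.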


Building a Semantic Memory Store

Here is a complete semantic memory implementation that stores facts with their embeddings and retrieves them by similarity.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class SemanticMemory:
    content: str
    embedding: List[float]
    category: str
    importance: float = 0.5  # 0.0 to 1.0
    access_count: int = 0
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_accessed: datetime = field(default_factory=datetime.utcnow)

class SemanticMemoryStore:
    def __init__(self, similarity_threshold: float = 0.7):
        self.memories: List[SemanticMemory] = []
        self.threshold = similarity_threshold

    def add(self, content: str, category: str, importance: float = 0.5):
        embedding = embed_text(content)

        # Check for duplicates before adding
        similar = self._find_similar(embedding, threshold=0.92)
        if similar:
            # Update existing memory instead of creating duplicate
            existing = similar[0][0]
            existing.content = content
            existing.embedding = embedding
            existing.last_accessed = datetime.utcnow()
            return existing

        memory = SemanticMemory(
            content=content,
            embedding=embedding,
            category=category,
            importance=importance,
        )
        self.memories.append(memory)
        return memory

    def recall(
        self,
        query: str,
        top_k: int = 5,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        """Retrieve the most relevant memories for a query."""
        query_embedding = embed_text(query)
        results = self._find_similar(
            query_embedding, threshold=self.threshold, category=category
        )

        # Update access metadata
        for memory, score in results[:top_k]:
            memory.access_count += 1
            memory.last_accessed = datetime.utcnow()

        return results[:top_k]

    def _find_similar(
        self,
        embedding: List[float],
        threshold: float = 0.7,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        scored = []
        for mem in self.memories:
            if category and mem.category != category:
                continue
            score = cosine_similarity(embedding, mem.embedding)
            if score >= threshold:
                scored.append((mem, score))
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored
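
To see the dedup path in add without calling the embeddings API, here is a stripped-down sketch with a hand-rolled toy embedder — the TOY vectors and the simplified list-based store are illustrative assumptions, not the class above:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embedder: fixed 2-D vectors keyed by text, standing in for embed_text
# so this runs offline. Near-duplicate facts get near-parallel vectors.
TOY = {
    "The plan costs $49/month": [1.0, 0.05],
    "Pricing is $49 per month": [1.0, 0.08],   # near-duplicate direction
    "Support hours are 9am-5pm": [0.0, 1.0],   # unrelated direction
}

memories = []  # list of (content, embedding) pairs

def add(content, dedup_threshold=0.92):
    emb = TOY[content]
    for i, (_, existing) in enumerate(memories):
        if cosine_similarity(emb, existing) >= dedup_threshold:
            memories[i] = (content, emb)  # overwrite instead of duplicating
            return
    memories.append((content, emb))

add("The plan costs $49/month")
add("Support hours are 9am-5pm")
add("Pricing is $49 per month")  # collapses into the first entry
print(len(memories))  # → 2
```

The third add overwrites the first entry rather than appending, which is exactly what the 0.92 threshold in SemanticMemoryStore.add is for: restating a fact should update it, not fork it.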

Relevance-Weighted Retrieval

Raw cosine similarity is a good start, but production systems often combine similarity with recency and importance for a composite relevance score.

import math

def compute_relevance(
    similarity: float,
    memory: SemanticMemory,
    recency_weight: float = 0.2,
    importance_weight: float = 0.15,
) -> float:
    """Combine similarity, recency, and importance into a single score."""
    hours_ago = (datetime.utcnow() - memory.last_accessed).total_seconds() / 3600
    recency_score = math.exp(-0.01 * hours_ago)  # exponential decay

    return (
        (1 - recency_weight - importance_weight) * similarity
        + recency_weight * recency_score
        + importance_weight * memory.importance
    )

This formula ensures that recent, important memories rank higher when similarity scores are close.
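
A self-contained check of that claim, re-declaring only the fields the formula reads — the Mem stand-in is illustrative; in the store above these fields live on SemanticMemory:

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Mem:
    importance: float
    last_accessed: datetime = field(default_factory=datetime.utcnow)

def compute_relevance(similarity, memory, recency_weight=0.2, importance_weight=0.15):
    hours_ago = (datetime.utcnow() - memory.last_accessed).total_seconds() / 3600
    recency_score = math.exp(-0.01 * hours_ago)
    return ((1 - recency_weight - importance_weight) * similarity
            + recency_weight * recency_score
            + importance_weight * memory.importance)

# Equal similarity: the more important memory outranks the less important one.
low, high = Mem(importance=0.2), Mem(importance=0.9)
assert compute_relevance(0.80, high) > compute_relevance(0.80, low)

# A memory untouched for 30 days has lost almost all of its recency bonus.
stale = Mem(importance=0.2, last_accessed=datetime.utcnow() - timedelta(days=30))
assert compute_relevance(0.80, low) > compute_relevance(0.80, stale)
```

With the default weights, similarity still carries 65% of the score, so relevance only reorders results whose similarities are close — a deliberately conservative blend.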

Memory Consolidation

Over time, a semantic memory store accumulates redundant or overlapping entries. Consolidation merges similar memories to keep the store efficient.


def consolidate_memories(
    store: SemanticMemoryStore,
    merge_threshold: float = 0.88,
) -> int:
    """Merge highly similar memories to reduce redundancy."""
    merged_count = 0
    skip_indices = set()

    for i, mem_a in enumerate(store.memories):
        if i in skip_indices:
            continue
        for j, mem_b in enumerate(store.memories[i + 1:], start=i + 1):
            if j in skip_indices:
                continue
            sim = cosine_similarity(mem_a.embedding, mem_b.embedding)
            if sim >= merge_threshold:
                # Keep the more important memory's content and embedding
                if mem_b.importance > mem_a.importance:
                    mem_a.content = mem_b.content
                    mem_a.embedding = mem_b.embedding
                mem_a.importance = max(mem_a.importance, mem_b.importance)
                mem_a.access_count += mem_b.access_count
                skip_indices.add(j)
                merged_count += 1

    store.memories = [
        m for i, m in enumerate(store.memories) if i not in skip_indices
    ]
    return merged_count
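
The pairwise scan above is quadratic, which is fine for a few thousand memories. The merge logic itself can be exercised on toy vectors (synthetic embeddings, not output from the store above):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = [
    [1.0, 0.0],
    [0.99, 0.1],   # near-duplicate of the first (similarity ≈ 0.995)
    [0.0, 1.0],    # unrelated
]

def consolidate(vectors, merge_threshold=0.88):
    keep, skip = [], set()
    for i, a in enumerate(vectors):
        if i in skip:
            continue
        for j in range(i + 1, len(vectors)):
            if j not in skip and cosine_similarity(a, vectors[j]) >= merge_threshold:
                skip.add(j)  # absorbed into entry i
        keep.append(a)
    return keep

print(len(consolidate(embeddings)))  # → 2
```

Note that the merge threshold (0.88) sits below the dedup threshold used at write time (0.92): consolidation is a background sweep, so it can afford to merge more aggressively than the hot path.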

FAQ

How do I choose the right similarity threshold?

Start with 0.7 for general retrieval and tune based on your data. Lower thresholds (0.5-0.6) cast a wider net but include more noise. Higher thresholds (0.8+) are more precise but may miss relevant matches. Test with real queries from your domain and adjust.

Are there alternatives to OpenAI embeddings?

Yes. Open-source models like sentence-transformers/all-MiniLM-L6-v2 run locally with no API costs. Cohere and Voyage AI also offer embedding APIs. The choice depends on your latency, cost, and accuracy requirements.

How do I handle memory that becomes outdated?

Attach a timestamp and optionally a TTL (time-to-live) to each memory. Periodically sweep for expired entries. For facts that change — like a user's address — use the duplicate detection logic to overwrite the old entry rather than creating a conflicting one.
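
A minimal TTL sweep along those lines — the dict layout and expires_at field are illustrative, not part of the store above:

```python
from datetime import datetime, timedelta

# Each memory carries an optional expiry timestamp (None = never expires).
memories = [
    {"content": "Promo code SAVE20 is active",
     "expires_at": datetime.utcnow() - timedelta(days=1)},   # already expired
    {"content": "The plan is priced at $49/month",
     "expires_at": None},
]

def sweep_expired(memories, now=None):
    now = now or datetime.utcnow()
    return [m for m in memories if m["expires_at"] is None or m["expires_at"] > now]

memories = sweep_expired(memories)
print(len(memories))  # → 1
```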


#SemanticMemory #Embeddings #VectorSearch #AIAgents #AgenticAI #LearnAI #AIEngineering
