
Semantic Memory for AI Agents: Using Embeddings to Remember Relevant Facts

Learn how to build a semantic memory system for AI agents using text embeddings, similarity thresholds, and memory consolidation to retrieve the most relevant facts from past interactions.

What Is Semantic Memory?

In cognitive science, semantic memory is the store of general knowledge and facts — distinct from episodic memory (specific events) and procedural memory (how to do things). For AI agents, semantic memory is a retrieval system that finds stored information based on meaning rather than exact keywords.

The core idea is simple: convert text into numerical vectors (embeddings) that capture semantic meaning, then use vector similarity to find the most relevant stored facts when the agent needs them. A query about "monthly subscription cost" should retrieve a memory stored as "The plan is priced at $49/month" even though the words barely overlap.

Generating Embeddings

Embeddings are produced by specialized models that map text to high-dimensional vectors. Similar meanings produce vectors that are close together in this space.

The ingestion and retrieval pipeline, as a Mermaid flowchart:

flowchart TD
    DOC(["Document"])
    CHUNK["Chunker<br/>recursive plus overlap"]
    EMB["Embedding model"]
    META["Attach metadata<br/>source, page, tenant"]
    INDEX[("HNSW or IVF index<br/>in vector store")]
    Q(["Query"])
    QEMB["Embed query"]
    SEARCH["ANN search<br/>cosine similarity"]
    FILTER["Metadata filter<br/>tenant or date"]
    HITS(["Top-k chunks"])
    DOC --> CHUNK --> EMB --> META --> INDEX
    Q --> QEMB --> SEARCH
    INDEX --> SEARCH --> FILTER --> HITS
    style INDEX fill:#4f46e5,stroke:#4338ca,color:#fff
    style HITS fill:#059669,stroke:#047857,color:#fff

The helpers below wrap the OpenAI embeddings API; install the openai package and set OPENAI_API_KEY before running.

import openai
import numpy as np
from typing import List

client = openai.OpenAI()

def embed_text(text: str) -> List[float]:
    """Generate an embedding vector for a single text string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return response.data[0].embedding

def embed_batch(texts: List[str]) -> List[List[float]]:
    """Generate embeddings for multiple texts in one API call."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=texts,
    )
    return [item.embedding for item in response.data]

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Compute cosine similarity between two vectors."""
    a_arr, b_arr = np.array(a), np.array(b)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

The text-embedding-3-small model produces 1536-dimensional vectors and costs fractions of a cent per thousand tokens. For higher accuracy, text-embedding-3-large produces 3072 dimensions.
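As a quick sanity check of the cosine_similarity helper, toy vectors make the geometry concrete (these are synthetic stand-ins, not real embedding outputs):

```python
import numpy as np

def cosine_similarity(a, b):
    a_arr, b_arr = np.array(a, dtype=float), np.array(b, dtype=float)
    return float(np.dot(a_arr, b_arr) / (np.linalg.norm(a_arr) * np.linalg.norm(b_arr)))

# Parallel vectors point in the same "meaning direction": similarity ≈ 1.0
print(cosine_similarity([1, 2, 3], [2, 4, 6]))
# Orthogonal vectors share no direction: similarity ≈ 0.0
print(cosine_similarity([1, 0], [0, 1]))
# Opposite vectors: similarity ≈ -1.0
print(cosine_similarity([1, 0], [-1, 0]))
```

Because cosine similarity measures angle rather than magnitude, scaling a vector does not change its score — which is why it is the standard choice for comparing embeddings of different text lengths.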


Building a Semantic Memory Store

Here is a complete semantic memory implementation that stores facts with their embeddings and retrieves them by similarity.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Tuple

@dataclass
class SemanticMemory:
    content: str
    embedding: List[float]
    category: str
    importance: float = 0.5  # 0.0 to 1.0
    access_count: int = 0
    created_at: datetime = field(default_factory=datetime.utcnow)
    last_accessed: datetime = field(default_factory=datetime.utcnow)

class SemanticMemoryStore:
    def __init__(self, similarity_threshold: float = 0.7):
        self.memories: List[SemanticMemory] = []
        self.threshold = similarity_threshold

    def add(self, content: str, category: str, importance: float = 0.5):
        embedding = embed_text(content)

        # Check for duplicates before adding
        similar = self._find_similar(embedding, threshold=0.92)
        if similar:
            # Update existing memory instead of creating duplicate
            existing = similar[0][0]
            existing.content = content
            existing.embedding = embedding
            existing.last_accessed = datetime.utcnow()
            return existing

        memory = SemanticMemory(
            content=content,
            embedding=embedding,
            category=category,
            importance=importance,
        )
        self.memories.append(memory)
        return memory

    def recall(
        self,
        query: str,
        top_k: int = 5,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        """Retrieve the most relevant memories for a query."""
        query_embedding = embed_text(query)
        results = self._find_similar(
            query_embedding, threshold=self.threshold, category=category
        )

        # Update access metadata
        for memory, score in results[:top_k]:
            memory.access_count += 1
            memory.last_accessed = datetime.utcnow()

        return results[:top_k]

    def _find_similar(
        self,
        embedding: List[float],
        threshold: float = 0.7,
        category: Optional[str] = None,
    ) -> List[Tuple[SemanticMemory, float]]:
        scored = []
        for mem in self.memories:
            if category and mem.category != category:
                continue
            score = cosine_similarity(embedding, mem.embedding)
            if score >= threshold:
                scored.append((mem, score))
        scored.sort(key=lambda x: x[1], reverse=True)
        return scored
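
To see the dedup path in add without calling the embeddings API, here is a stripped-down sketch with a hand-rolled toy embedder — the TOY vectors and the simplified list-based store are illustrative assumptions, not the class above:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embedder: fixed 2-D vectors keyed by text, standing in for embed_text
# so this runs offline. Near-duplicate facts get near-parallel vectors.
TOY = {
    "The plan costs $49/month": [1.0, 0.05],
    "Pricing is $49 per month": [1.0, 0.08],   # near-duplicate direction
    "Support hours are 9am-5pm": [0.0, 1.0],   # unrelated direction
}

memories = []  # list of (content, embedding) pairs

def add(content, dedup_threshold=0.92):
    emb = TOY[content]
    for i, (_, existing) in enumerate(memories):
        if cosine_similarity(emb, existing) >= dedup_threshold:
            memories[i] = (content, emb)  # overwrite instead of duplicating
            return
    memories.append((content, emb))

add("The plan costs $49/month")
add("Support hours are 9am-5pm")
add("Pricing is $49 per month")  # collapses into the first entry
print(len(memories))  # → 2
```

The third add overwrites the first entry rather than appending, which is exactly what the 0.92 threshold in SemanticMemoryStore.add is for: restating a fact should update it, not fork it.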

Relevance-Weighted Retrieval

Raw cosine similarity is a good start, but production systems often combine similarity with recency and importance for a composite relevance score.

import math

def compute_relevance(
    similarity: float,
    memory: SemanticMemory,
    recency_weight: float = 0.2,
    importance_weight: float = 0.15,
) -> float:
    """Combine similarity, recency, and importance into a single score."""
    hours_ago = (datetime.utcnow() - memory.last_accessed).total_seconds() / 3600
    recency_score = math.exp(-0.01 * hours_ago)  # exponential decay

    return (
        (1 - recency_weight - importance_weight) * similarity
        + recency_weight * recency_score
        + importance_weight * memory.importance
    )

This formula ensures that recent, important memories rank higher when similarity scores are close.
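
A self-contained check of that claim, re-declaring only the fields the formula reads — the Mem stand-in is illustrative; in the store above these fields live on SemanticMemory:

```python
import math
from dataclasses import dataclass, field
from datetime import datetime, timedelta

@dataclass
class Mem:
    importance: float
    last_accessed: datetime = field(default_factory=datetime.utcnow)

def compute_relevance(similarity, memory, recency_weight=0.2, importance_weight=0.15):
    hours_ago = (datetime.utcnow() - memory.last_accessed).total_seconds() / 3600
    recency_score = math.exp(-0.01 * hours_ago)
    return ((1 - recency_weight - importance_weight) * similarity
            + recency_weight * recency_score
            + importance_weight * memory.importance)

# Equal similarity: the more important memory outranks the less important one.
low, high = Mem(importance=0.2), Mem(importance=0.9)
assert compute_relevance(0.80, high) > compute_relevance(0.80, low)

# A memory untouched for 30 days has lost almost all of its recency bonus.
stale = Mem(importance=0.2, last_accessed=datetime.utcnow() - timedelta(days=30))
assert compute_relevance(0.80, low) > compute_relevance(0.80, stale)
```

With the default weights, similarity still carries 65% of the score, so relevance only reorders results whose similarities are close — a deliberately conservative blend.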

Memory Consolidation

Over time, a semantic memory store accumulates redundant or overlapping entries. Consolidation merges similar memories to keep the store efficient.


def consolidate_memories(
    store: SemanticMemoryStore,
    merge_threshold: float = 0.88,
) -> int:
    """Merge highly similar memories to reduce redundancy."""
    merged_count = 0
    skip_indices = set()

    for i, mem_a in enumerate(store.memories):
        if i in skip_indices:
            continue
        for j, mem_b in enumerate(store.memories[i + 1:], start=i + 1):
            if j in skip_indices:
                continue
            sim = cosine_similarity(mem_a.embedding, mem_b.embedding)
            if sim >= merge_threshold:
                # Keep the more important memory's content and embedding
                if mem_b.importance > mem_a.importance:
                    mem_a.content = mem_b.content
                    mem_a.embedding = mem_b.embedding
                mem_a.importance = max(mem_a.importance, mem_b.importance)
                mem_a.access_count += mem_b.access_count
                skip_indices.add(j)
                merged_count += 1

    store.memories = [
        m for i, m in enumerate(store.memories) if i not in skip_indices
    ]
    return merged_count
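
The pairwise scan above is quadratic, which is fine for a few thousand memories. The merge logic itself can be exercised on toy vectors (synthetic embeddings, not output from the store above):

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

embeddings = [
    [1.0, 0.0],
    [0.99, 0.1],   # near-duplicate of the first (similarity ≈ 0.995)
    [0.0, 1.0],    # unrelated
]

def consolidate(vectors, merge_threshold=0.88):
    keep, skip = [], set()
    for i, a in enumerate(vectors):
        if i in skip:
            continue
        for j in range(i + 1, len(vectors)):
            if j not in skip and cosine_similarity(a, vectors[j]) >= merge_threshold:
                skip.add(j)  # absorbed into entry i
        keep.append(a)
    return keep

print(len(consolidate(embeddings)))  # → 2
```

Note that the merge threshold (0.88) sits below the dedup threshold used at write time (0.92): consolidation is a background sweep, so it can afford to merge more aggressively than the hot path.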

FAQ

How do I choose the right similarity threshold?

Start with 0.7 for general retrieval and tune based on your data. Lower thresholds (0.5-0.6) cast a wider net but include more noise. Higher thresholds (0.8+) are more precise but may miss relevant matches. Test with real queries from your domain and adjust.

Are there alternatives to OpenAI embeddings?

Yes. Open-source models like sentence-transformers/all-MiniLM-L6-v2 run locally with no API costs. Cohere and Voyage AI also offer embedding APIs. The choice depends on your latency, cost, and accuracy requirements.

How do I handle memory that becomes outdated?

Attach a timestamp and optionally a TTL (time-to-live) to each memory. Periodically sweep for expired entries. For facts that change — like a user's address — use the duplicate detection logic to overwrite the old entry rather than creating a conflicting one.
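
A minimal TTL sweep along those lines — the dict layout and expires_at field are illustrative, not part of the store above:

```python
from datetime import datetime, timedelta

# Each memory carries an optional expiry timestamp (None = never expires).
memories = [
    {"content": "Promo code SAVE20 is active",
     "expires_at": datetime.utcnow() - timedelta(days=1)},   # already expired
    {"content": "The plan is priced at $49/month",
     "expires_at": None},
]

def sweep_expired(memories, now=None):
    now = now or datetime.utcnow()
    return [m for m in memories if m["expires_at"] is None or m["expires_at"] > now]

memories = sweep_expired(memories)
print(len(memories))  # → 1
```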


#SemanticMemory #Embeddings #VectorSearch #AIAgents #AgenticAI #LearnAI #AIEngineering
