
Agent Memory Systems: Short-Term, Long-Term, and Episodic Memory for AI Agents

Technical deep dive into agent memory architectures covering conversation context, vector DB persistence, and experience replay with implementation code for production systems.

Why Memory Transforms Agents from Stateless to Intelligent

A stateless AI agent answers each question in isolation. It cannot remember your name, your preferences, what you discussed yesterday, or the lessons it learned from past mistakes. This is the difference between a search engine and a colleague.

Memory is the architectural component that bridges this gap. By implementing structured memory systems, agents accumulate knowledge across conversations, learn from interactions, and provide increasingly personalized and accurate responses over time.

The human brain uses distinct memory systems — working memory for immediate context, long-term memory for persistent knowledge, and episodic memory for specific experiences. Production AI agents benefit from the same separation. Each type serves a different purpose, has different storage characteristics, and requires different retrieval strategies.

Short-Term Memory: The Conversation Context

Short-term memory is the simplest form: it is the conversation history passed to the LLM with each request. Every message, tool call, and response in the current session forms the agent's immediate context.

The diagram below previews the indexing and retrieval pipeline behind the vector-backed long-term memory covered later; short-term memory itself needs no index, only an ordered message buffer.

```mermaid
flowchart TD
    DOC(["Document"])
    CHUNK["Chunker<br/>recursive plus overlap"]
    EMB["Embedding model"]
    META["Attach metadata<br/>source, page, tenant"]
    INDEX[("HNSW or IVF index<br/>in vector store")]
    Q(["Query"])
    QEMB["Embed query"]
    SEARCH["ANN search<br/>cosine similarity"]
    FILTER["Metadata filter<br/>tenant or date"]
    HITS(["Top-k chunks"])
    DOC --> CHUNK --> EMB --> META --> INDEX
    Q --> QEMB --> SEARCH
    INDEX --> SEARCH --> FILTER --> HITS
    style INDEX fill:#4f46e5,stroke:#4338ca,color:#fff
    style HITS fill:#059669,stroke:#047857,color:#fff
```

A minimal conversation buffer looks like this:
```python
from dataclasses import dataclass, field
from typing import Any
import time

@dataclass
class Message:
    role: str  # "user", "assistant", "tool"
    content: str
    timestamp: float = field(default_factory=time.time)
    metadata: dict[str, Any] = field(default_factory=dict)

class ShortTermMemory:
    def __init__(self, max_tokens: int = 120_000):
        self.messages: list[Message] = []
        self.max_tokens = max_tokens

    def add(self, role: str, content: str, **metadata):
        self.messages.append(
            Message(role=role, content=content, metadata=metadata)
        )
        self._enforce_limit()

    def get_context(self) -> list[dict]:
        return [
            {"role": m.role, "content": m.content}
            for m in self.messages
        ]

    def _enforce_limit(self):
        """Sliding window: remove oldest messages when over limit."""
        total_tokens = sum(
            self._estimate_tokens(m.content) for m in self.messages
        )
        while total_tokens > self.max_tokens and len(self.messages) > 1:
            removed = self.messages.pop(0)
            total_tokens -= self._estimate_tokens(removed.content)

    def _estimate_tokens(self, text: str) -> int:
        # Rough estimate: 1 token per 4 characters
        return len(text) // 4

    def summarize_and_compress(self, summarizer_fn) -> str:
        """Compress older messages into a summary to save tokens."""
        if len(self.messages) < 10:
            return ""
        old_messages = self.messages[:len(self.messages) // 2]
        text = "\n".join(f"{m.role}: {m.content}" for m in old_messages)
        summary = summarizer_fn(text)
        # Replace old messages with summary
        self.messages = [
            Message(role="system", content=f"Previous context: {summary}")
        ] + self.messages[len(self.messages) // 2:]
        return summary
```

Short-Term Memory Strategies

Sliding window is the simplest approach: keep the most recent N messages or N tokens. Old messages are dropped. This works for task-oriented agents where historical context fades in relevance.

Summarization compresses older parts of the conversation into a summary that takes fewer tokens. The summary is prepended as a system message. This preserves key decisions and context while saving token budget.


Selective retention keeps all messages that contain tool calls, decisions, or user preferences, while summarizing or dropping purely conversational messages. This preserves actionable context.
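A minimal sketch of selective retention, assuming a simple keyword heuristic; a production system would more likely use an LLM classifier to decide what counts as a decision or preference:

```python
# Keep messages that carry tool calls or stated preferences/decisions;
# everything else is a candidate for summarization or dropping.
KEEP_KEYWORDS = ("prefer", "decide", "decided", "always", "never")

def is_worth_keeping(message: dict) -> bool:
    """Heuristic filter for actionable context (assumption: messages are
    dicts with optional 'tool_calls' and a 'content' string)."""
    if message.get("tool_calls"):  # tool usage is actionable context
        return True
    content = message.get("content", "").lower()
    return any(kw in content for kw in KEEP_KEYWORDS)

history = [
    {"role": "user", "content": "Thanks!"},
    {"role": "user", "content": "I prefer Python over JavaScript."},
    {"role": "assistant", "content": "Saving that.",
     "tool_calls": [{"name": "save_preference"}]},
]
retained = [m for m in history if is_worth_keeping(m)]
# retained holds only the preference and the tool-call message
```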

Long-Term Memory: Persistent Knowledge with Vector Databases

Long-term memory persists across conversations. When a user returns days later, the agent should remember their preferences, past interactions, and accumulated knowledge. Vector databases are the standard storage mechanism.

```python
import hashlib
from datetime import datetime, timezone

class LongTermMemory:
    def __init__(self, vector_store, embedding_fn, namespace: str):
        self.vector_store = vector_store  # Pinecone, Chroma, Qdrant
        self.embedding_fn = embedding_fn
        self.namespace = namespace

    async def store(self, content: str, metadata: dict | None = None):
        """Store a memory with its embedding."""
        memory_id = hashlib.sha256(content.encode()).hexdigest()[:16]
        embedding = await self.embedding_fn(content)
        record = {
            "id": memory_id,
            "values": embedding,
            "metadata": {
                "content": content,
                "timestamp": datetime.now(timezone.utc).isoformat(),
                "namespace": self.namespace,
                **(metadata or {}),
            },
        }
        await self.vector_store.upsert([record])
        return memory_id

    async def recall(self, query: str, top_k: int = 5,
                     min_score: float = 0.7) -> list[dict]:
        """Retrieve relevant memories for a query."""
        query_embedding = await self.embedding_fn(query)
        results = await self.vector_store.query(
            vector=query_embedding,
            top_k=top_k,
            filter={"namespace": self.namespace},
            include_metadata=True,
        )
        return [
            {
                "content": r["metadata"]["content"],
                "score": r["score"],
                "timestamp": r["metadata"]["timestamp"],
            }
            for r in results
            if r["score"] >= min_score
        ]

    async def forget(self, memory_id: str):
        """Delete a specific memory (GDPR compliance)."""
        await self.vector_store.delete(ids=[memory_id])
```

What to Store in Long-Term Memory

Not every message belongs in long-term memory. Store:

  • User preferences: "I prefer Python over JavaScript", "My timezone is PST"
  • Key decisions: "We decided to use PostgreSQL for the user service"
  • Learned facts: "The company's fiscal year starts in April"
  • Interaction outcomes: "The refund was processed successfully on 2026-03-15"

Do not store: casual acknowledgments, error messages, routine confirmations, or verbatim conversation logs.

Retrieval Strategies

Semantic search retrieves memories whose embeddings are closest to the current query. This is the default and handles most cases well.

Temporal weighting boosts recent memories and decays older ones. Multiply the similarity score by a time decay factor: score * decay_factor^(days_since_stored).
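The decay formula above can be sketched as a re-ranker; the `decay_factor` value and the epoch-seconds `timestamp` field are illustrative assumptions:

```python
import time

def temporally_weighted(results: list[dict],
                        decay_factor: float = 0.98) -> list[dict]:
    """Re-rank retrieval hits by score * decay_factor ** days_since_stored."""
    now = time.time()
    reranked = []
    for r in results:
        days = (now - r["timestamp"]) / 86_400  # seconds per day
        reranked.append({**r, "weighted": r["score"] * decay_factor ** days})
    return sorted(reranked, key=lambda r: r["weighted"], reverse=True)

now = time.time()
results = [
    {"content": "old decision", "score": 0.90, "timestamp": now - 90 * 86_400},
    {"content": "recent preference", "score": 0.80, "timestamp": now - 86_400},
]
ranked = temporally_weighted(results)
# the 90-day-old memory decays below the fresher, lower-scored one
```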

Categorical filtering uses metadata tags to narrow the search space. When the agent is handling a billing question, filter memories to the "billing" category before running semantic search.

Episodic Memory: Learning from Experience

Episodic memory stores complete interaction episodes — the full sequence of events from initial request to resolution. Unlike long-term memory which stores atomic facts, episodic memory preserves the narrative structure of past experiences.


```python
from dataclasses import dataclass, field
from typing import Any
import time

@dataclass
class Episode:
    episode_id: str
    trigger: str  # What initiated this episode
    steps: list[dict] = field(default_factory=list)
    outcome: str = ""  # "success", "failure", "escalation"
    lessons: list[str] = field(default_factory=list)
    duration_seconds: float = 0.0

class EpisodicMemory:
    def __init__(self, storage, embedding_fn):
        self.storage = storage
        self.embedding_fn = embedding_fn
        self.current_episode: Episode | None = None

    def start_episode(self, episode_id: str, trigger: str):
        self.current_episode = Episode(
            episode_id=episode_id, trigger=trigger
        )

    def record_step(self, action: str, result: Any,
                    reasoning: str = ""):
        if self.current_episode:
            self.current_episode.steps.append({
                "action": action,
                "result": str(result),
                "reasoning": reasoning,
                "timestamp": time.time(),
            })

    async def end_episode(self, outcome: str,
                          lessons: list[str] | None = None):
        if not self.current_episode:
            return
        self.current_episode.outcome = outcome
        self.current_episode.lessons = lessons or []
        if self.current_episode.steps:
            self.current_episode.duration_seconds = (
                self.current_episode.steps[-1]["timestamp"]
                - self.current_episode.steps[0]["timestamp"]
            )
        # Store episode for future retrieval
        episode_text = self._serialize_episode(self.current_episode)
        embedding = await self.embedding_fn(episode_text)
        await self.storage.store(
            id=self.current_episode.episode_id,
            embedding=embedding,
            data=self.current_episode.__dict__,
        )
        self.current_episode = None

    async def recall_similar_episodes(self, situation: str,
                                       top_k: int = 3) -> list[dict]:
        """Find past episodes similar to the current situation."""
        query_embedding = await self.embedding_fn(situation)
        return await self.storage.query(
            vector=query_embedding, top_k=top_k
        )

    def _serialize_episode(self, episode: Episode) -> str:
        steps_text = " -> ".join(
            s["action"] for s in episode.steps
        )
        return (
            f"Trigger: {episode.trigger}. "
            f"Steps: {steps_text}. "
            f"Outcome: {episode.outcome}. "
            f"Lessons: {'; '.join(episode.lessons)}"
        )
```

Experience Replay

The most powerful use of episodic memory is experience replay: when the agent encounters a new situation, it retrieves similar past episodes and uses them as few-shot examples in its prompt.

```python
async def handle_with_experience(agent, user_message: str,
                                  episodic_memory: EpisodicMemory):
    similar = await episodic_memory.recall_similar_episodes(
        user_message, top_k=2
    )
    experience_context = ""
    if similar:
        experience_context = "\nRelevant past experiences:\n"
        for ep in similar:
            experience_context += (
                f"- Situation: {ep['trigger']}\n"
                f"  Approach: {' -> '.join(s['action'] for s in ep['steps'])}\n"
                f"  Outcome: {ep['outcome']}\n"
                f"  Lessons: {'; '.join(ep.get('lessons', []))}\n"
            )

    enhanced_prompt = f"{agent.instructions}\n{experience_context}"
    # Run agent with enhanced context
    return await agent.run(user_message, instructions=enhanced_prompt)
```

This pattern allows agents to improve over time without retraining. Failed episodes teach the agent to avoid certain approaches. Successful episodes reinforce effective strategies.

Combining All Three Memory Types

A production agent uses all three memory types together:

  1. Short-term memory holds the current conversation — the user's messages, tool results, and the agent's responses
  2. Long-term memory is queried at the start of each conversation to inject relevant user preferences and past knowledge
  3. Episodic memory is queried when the agent encounters a problem, providing past experiences as guidance

The memory orchestration layer decides which memories to inject and in what priority. A common pattern is to allocate token budgets: 60% for the current conversation (short-term), 25% for long-term knowledge, and 15% for episodic examples.
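A minimal sketch of that budget allocator, reusing the 4-characters-per-token estimate from the short-term memory example; the function name and the assumption that each source list is pre-sorted by priority are illustrative:

```python
def estimate_tokens(text: str) -> int:
    # Rough estimate: 1 token per 4 characters, as in ShortTermMemory
    return len(text) // 4

def allocate_context(short_term: list[str], long_term: list[str],
                     episodic: list[str], budget: int = 100_000) -> dict:
    """Fill each memory slot up to its share of the token budget
    (60/25/15 split), taking items in priority order."""
    budgets = {
        "short_term": int(budget * 0.60),
        "long_term": int(budget * 0.25),
        "episodic": int(budget * 0.15),
    }
    sources = {"short_term": short_term, "long_term": long_term,
               "episodic": episodic}
    selected: dict[str, list[str]] = {}
    for name, items in sources.items():
        used, kept = 0, []
        for item in items:  # assumed pre-sorted, highest priority first
            cost = estimate_tokens(item)
            if used + cost > budgets[name]:
                break
            kept.append(item)
            used += cost
        selected[name] = kept
    return selected
```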

FAQ

How do you handle memory conflicts between short-term and long-term?

Short-term memory always takes precedence. If the user said "I now prefer TypeScript" in the current conversation, that overrides a long-term memory saying "User prefers Python." After the conversation ends, the new preference should be stored in long-term memory, replacing or annotating the old one.
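One way to sketch this precedence rule is recency-based resolution, where the current conversation always carries the newest timestamp; the `resolve_preference` helper is a hypothetical name for illustration:

```python
# ISO-8601 timestamps compare correctly as strings, which keeps the
# example dependency-free.
def resolve_preference(candidates: list[dict]) -> dict:
    """Return the most recently asserted value for a preference."""
    return max(candidates, key=lambda c: c["timestamp"])

language_pref = resolve_preference([
    {"value": "Python", "source": "long_term",
     "timestamp": "2026-01-10T09:00:00"},
    {"value": "TypeScript", "source": "short_term",
     "timestamp": "2026-03-15T14:30:00"},
])
# the current conversation's entry wins
```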

What embedding model should you use for agent memory?

For most use cases, OpenAI's text-embedding-3-large or Cohere's embed-v4 provide the best balance of quality and cost. For high-throughput systems processing millions of memories, smaller models like text-embedding-3-small reduce latency and cost with minimal quality loss for retrieval tasks.

How do you handle GDPR and data deletion for agent memories?

Every memory must be tagged with a user identifier. Implement a forget_user(user_id) function that deletes all memories associated with that user from both the vector store and any backing storage. This must include short-term conversation logs, long-term memory entries, and episodic records. Audit this functionality regularly.
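A sketch of what that sweep could look like, using an in-memory stand-in for the backing stores; `forget_user` and `delete_where` are hypothetical names, though most vector stores expose a comparable filtered delete:

```python
import asyncio

class FakeStore:
    """Stand-in for a vector store or episode log tagged with user_id."""
    def __init__(self, records: list[dict]):
        self.records = records

    async def delete_where(self, flt: dict) -> int:
        # Keep a record only if it fails to match every filter key
        before = len(self.records)
        self.records = [
            r for r in self.records
            if any(r.get(k) != v for k, v in flt.items())
        ]
        return before - len(self.records)

async def forget_user(user_id: str, stores: list) -> int:
    """Delete every memory tagged with user_id across all backing stores."""
    deleted = 0
    for store in stores:
        deleted += await store.delete_where({"user_id": user_id})
    return deleted

vectors = FakeStore([{"user_id": "u1"}, {"user_id": "u2"}])
episodes = FakeStore([{"user_id": "u1"}])
removed = asyncio.run(forget_user("u1", [vectors, episodes]))
# every "u1" record is gone from both stores
```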

Does episodic memory actually improve agent performance?

Yes, measurably. In A/B tests across customer support and coding assistant use cases, agents with episodic memory show 15-25% higher task completion rates and 30% fewer repeated errors compared to agents with only short-term and long-term memory. The key is curating high-quality episodes — storing every interaction degrades retrieval quality.
