Learn Agentic AI

LangChain Memory: ConversationBufferMemory, Summary, and Vector Store Memory

Explore LangChain's memory types for building conversational AI — from simple buffer memory to summarization and vector-store-backed long-term memory with persistence strategies.

Why Agents Need Memory

Large language models are stateless. Each API call starts fresh with no knowledge of previous interactions. For multi-turn conversations or agents that need to reference past information, you must explicitly manage state. LangChain provides memory abstractions that handle this — storing conversation history, summarizing it, or persisting it in a vector store for semantic retrieval.

Understanding the tradeoffs between memory types is essential. Too much context fills your token window and increases costs. Too little context makes the assistant forget important details mid-conversation.
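One way to feel this tradeoff is to estimate the token cost of resending a growing transcript on every turn. The sketch below uses the rough "about 4 characters per token" heuristic (an assumption, not a real tokenizer) to show that verbatim history grows linearly with every exchange:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token (heuristic, not a tokenizer)."""
    return max(1, len(text) // 4)

def history_cost(messages: list[str]) -> int:
    """Estimated tokens needed to resend the full history on one API call."""
    return sum(approx_tokens(m) for m in messages)

history = []
for turn in range(1, 6):
    history.append(f"User message number {turn}, about forty characters long.")
    print(f"turn {turn}: ~{history_cost(history)} tokens of history")
# Cost grows with every turn when history is stored verbatim.
```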

ConversationBufferMemory

The simplest memory type stores every message verbatim.

[Diagram: vector-store indexing and retrieval. Documents are chunked (recursively, with overlap), embedded, tagged with metadata (source, page, tenant), and written to an HNSW or IVF index; at query time, the query is embedded, an ANN cosine-similarity search runs against the index, metadata filters (tenant or date) are applied, and the top-k chunks are returned. This is the pipeline behind the vector store memory covered below.]
from langchain.memory import ConversationBufferMemory
from langchain_openai import ChatOpenAI
from langchain.chains import ConversationChain

memory = ConversationBufferMemory(return_messages=True)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

chain = ConversationChain(llm=llm, memory=memory, verbose=True)

chain.invoke({"input": "My name is Alice."})
chain.invoke({"input": "What is my name?"})
# The model correctly responds "Alice" because it sees the full history

return_messages=True stores history as message objects rather than a single string, which is preferred for chat models. The downside is obvious: as the conversation grows, you eventually exceed the model's context window.
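Conceptually this memory is just an append-only list that gets replayed into every prompt. A minimal stand-in (not LangChain's actual implementation) makes the failure mode visible:

```python
class BufferMemory:
    """Minimal stand-in for verbatim buffer memory: store every turn, replay all of it."""
    def __init__(self):
        self.messages = []  # (role, content) pairs, oldest first

    def save_context(self, user_input: str, ai_output: str) -> None:
        self.messages.append(("human", user_input))
        self.messages.append(("ai", ai_output))

    def load(self) -> list:
        # Everything is returned on every call -- this is what eventually
        # overflows the model's context window.
        return list(self.messages)

memory = BufferMemory()
memory.save_context("My name is Alice.", "Nice to meet you, Alice!")
memory.save_context("What is my name?", "Your name is Alice.")
print(len(memory.load()))  # 4 messages: two turns, each a human/ai pair
```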

ConversationBufferWindowMemory

This variant keeps only the last k turns, discarding older messages.

from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(k=5, return_messages=True)

Setting k=5 retains the most recent 5 exchanges. This bounds token usage but means the agent will forget information from earlier in the conversation.
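The windowing behavior is easy to replicate with a bounded deque, which drops the oldest entry automatically once full. This sketch (my own stand-in, not LangChain's code) shows why early turns vanish:

```python
from collections import deque

class WindowMemory:
    """Minimal stand-in for windowed memory: keep only the last k exchanges."""
    def __init__(self, k: int):
        # Each exchange is a (human, ai) pair, so cap the deque at k pairs.
        self.turns = deque(maxlen=k)

    def save_context(self, user_input: str, ai_output: str) -> None:
        self.turns.append((user_input, ai_output))  # oldest pair drops automatically

    def load(self) -> list:
        return list(self.turns)

memory = WindowMemory(k=2)
for i in range(5):
    memory.save_context(f"question {i}", f"answer {i}")
print(memory.load())  # only the last two turns survive
```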

ConversationSummaryMemory

Instead of dropping old messages, this memory type summarizes the conversation history using an LLM. The summary is updated after each turn.

from langchain.memory import ConversationSummaryMemory
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

memory = ConversationSummaryMemory(
    llm=llm,
    return_messages=True,
)

# After many turns, instead of storing all messages,
# the memory holds a running summary like:
# "The user's name is Alice. She asked about Python decorators
#  and was interested in async patterns."

The tradeoff is that summarization costs extra LLM calls and may lose nuance. It works well for long conversations where the gist matters more than exact wording.
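The update loop itself is simple: after every turn, fold the new exchange into one running summary string. In the sketch below, `summarize` is a placeholder for the LLM call that ConversationSummaryMemory actually makes (the real version prompts the model to rewrite the summary):

```python
def summarize(previous_summary: str, user_input: str, ai_output: str) -> str:
    """Placeholder for the LLM call that folds a new turn into the summary."""
    addition = f"User said: {user_input!r}; assistant said: {ai_output!r}."
    return (previous_summary + " " + addition).strip()

class SummaryMemory:
    """Minimal stand-in: one running summary string instead of a transcript."""
    def __init__(self):
        self.summary = ""

    def save_context(self, user_input: str, ai_output: str) -> None:
        # One extra 'LLM call' (here, the placeholder) per turn -- the cost tradeoff.
        self.summary = summarize(self.summary, user_input, ai_output)

memory = SummaryMemory()
memory.save_context("My name is Alice.", "Hi Alice!")
memory.save_context("I like async Python.", "Noted.")
print(memory.summary)  # one compact string, regardless of conversation length
```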

ConversationSummaryBufferMemory

This hybrid keeps recent messages in full while summarizing older ones. You set a max_token_limit — once the buffer exceeds that limit, the oldest messages are summarized.

from langchain.memory import ConversationSummaryBufferMemory

memory = ConversationSummaryBufferMemory(
    llm=llm,
    max_token_limit=500,
    return_messages=True,
)

This gives you the best of both worlds: precise recent context and compressed long-term context.
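The eviction logic can be sketched in a few lines: keep recent turns verbatim, and when their estimated token count exceeds the limit, fold the oldest turn into a summary. Both `approx_tokens` and `fold_into_summary` below are placeholders (a character-count heuristic and a string join) standing in for a real tokenizer and a real summarization call:

```python
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic, stands in for a real tokenizer

def fold_into_summary(summary: str, turns: list) -> str:
    """Placeholder for the summarization LLM call."""
    folded = " ".join(f"{h} -> {a}" for h, a in turns)
    return (summary + " " + folded).strip()

class SummaryBufferMemory:
    """Minimal stand-in: recent turns verbatim, evicted turns compressed."""
    def __init__(self, max_token_limit: int):
        self.max_token_limit = max_token_limit
        self.summary = ""
        self.buffer = []  # recent (human, ai) turns, kept in full

    def save_context(self, user_input: str, ai_output: str) -> None:
        self.buffer.append((user_input, ai_output))
        # Evict oldest turns into the summary until the buffer fits the budget.
        while self._buffer_tokens() > self.max_token_limit and len(self.buffer) > 1:
            oldest = self.buffer.pop(0)
            self.summary = fold_into_summary(self.summary, [oldest])

    def _buffer_tokens(self) -> int:
        return sum(approx_tokens(h) + approx_tokens(a) for h, a in self.buffer)

memory = SummaryBufferMemory(max_token_limit=10)
memory.save_context("My name is Alice.", "Hi Alice!")
memory.save_context("I work on a LangChain project.", "Cool.")
memory.save_context("Deadline is March 30th.", "Got it.")
print(memory.buffer)   # only the most recent turn, verbatim
print(memory.summary)  # older turns, compressed
```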

Vector Store Memory

For agents that need to recall specific facts from potentially thousands of past interactions, vector store memory embeds conversation snippets and retrieves them via semantic search.


from langchain.memory import VectorStoreRetrieverMemory
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Create or load a vector store
embeddings = OpenAIEmbeddings()
# FAISS.from_texts cannot build an index from an empty list, so seed it
# with a placeholder text
vectorstore = FAISS.from_texts(["(session started)"], embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})

memory = VectorStoreRetrieverMemory(retriever=retriever)

# Save facts
memory.save_context(
    {"input": "I prefer Python over JavaScript"},
    {"output": "Noted, you prefer Python."},
)
memory.save_context(
    {"input": "My project deadline is March 30th"},
    {"output": "Got it, your deadline is March 30th."},
)

# Later, only semantically relevant memories are retrieved
relevant = memory.load_memory_variables(
    {"input": "What programming language should we use?"}
)
print(relevant)
# The Python preference memory is the closest match; the deadline memory ranks lower

Vector store memory scales to thousands of interactions because retrieval is based on relevance, not recency.
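The mechanism behind that relevance ranking is cosine similarity between embedding vectors. The toy sketch below (my own illustration, using a bag-of-words count in place of a neural embedding model) shows the ranking step in isolation:

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count. Real systems use a neural model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

memories = [
    "I prefer Python over JavaScript",
    "My project deadline is March 30th",
]
query = "which language do i prefer"

# Rank stored memories by similarity to the query -- relevance, not recency.
ranked = sorted(memories, key=lambda m: cosine(embed(query), embed(m)), reverse=True)
print(ranked[0])  # the Python preference outranks the deadline memory
```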

Memory with LCEL Chains

In modern LCEL-based chains, you typically manage history explicitly using RunnableWithMessageHistory.

from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import ChatMessageHistory
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

store = {}

def get_session_history(session_id: str):
    if session_id not in store:
        store[session_id] = ChatMessageHistory()
    return store[session_id]

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder("history"),
    ("human", "{input}"),
])

chain = prompt | ChatOpenAI(model="gpt-4o-mini")

with_history = RunnableWithMessageHistory(
    chain,
    get_session_history,
    input_messages_key="input",
    history_messages_key="history",
)

# Each session maintains its own history
response = with_history.invoke(
    {"input": "My name is Bob"},
    config={"configurable": {"session_id": "user-123"}},
)

This approach gives you full control over where history is stored — in memory, Redis, a database, or any custom backend.

FAQ

Which memory type should I use for a production chatbot?

For most production chatbots, start with ConversationSummaryBufferMemory or the LCEL RunnableWithMessageHistory with a persistent backend like Redis or PostgreSQL. The summary buffer approach balances cost, context window usage, and information retention. For applications that need to recall specific facts across many sessions, add vector store memory.

Can I combine multiple memory types?

Yes. A common pattern is to use buffer memory for the current conversation and vector store memory for cross-session recall. You can inject both into the prompt — recent messages from the buffer and relevant past facts from the vector store.
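That injection step can be sketched as a prompt builder that merges both sources. Here `retrieve_relevant` is a hypothetical stand-in for vector retrieval (simple word overlap instead of embeddings), and the prompt layout is illustrative, not a LangChain API:

```python
def retrieve_relevant(facts: list, query: str, k: int = 1) -> list:
    """Stand-in for vector retrieval: score facts by shared words with the query."""
    def overlap(fact: str) -> int:
        return len(set(fact.lower().split()) & set(query.lower().split()))
    return sorted(facts, key=overlap, reverse=True)[:k]

def build_prompt(recent_turns: list, long_term_facts: list, user_input: str) -> str:
    recalled = retrieve_relevant(long_term_facts, user_input)
    lines = ["Relevant facts from past sessions:"]
    lines += [f"- {f}" for f in recalled]          # cross-session vector recall
    lines.append("Recent conversation:")
    lines += [f"{role}: {text}" for role, text in recent_turns]  # buffer memory
    lines.append(f"human: {user_input}")
    return "\n".join(lines)

facts = ["User prefers Python.", "User deadline is March 30th."]
turns = [("human", "Hi again"), ("ai", "Welcome back!")]
prompt = build_prompt(turns, facts, "is the deadline close")
print(prompt)
```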

How do I persist memory across server restarts?

In-memory stores like ChatMessageHistory are lost on restart. Use persistent backends: RedisChatMessageHistory, SQLChatMessageHistory, or implement a custom BaseChatMessageHistory class that reads from and writes to your database.
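The storage pattern behind those backends is straightforward. The sketch below uses stdlib sqlite3 to show it; a real LangChain integration would wrap this in a BaseChatMessageHistory subclass rather than returning raw tuples, and `SQLiteMessageHistory` is my own illustrative name:

```python
import sqlite3

class SQLiteMessageHistory:
    """Sketch of a persistent per-session message store."""
    def __init__(self, path: str, session_id: str):
        self.conn = sqlite3.connect(path)
        self.session_id = session_id
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS messages "
            "(session_id TEXT, role TEXT, content TEXT)"
        )

    def add_message(self, role: str, content: str) -> None:
        self.conn.execute(
            "INSERT INTO messages VALUES (?, ?, ?)",
            (self.session_id, role, content),
        )
        self.conn.commit()  # flushed to disk, so it survives a restart

    def messages(self) -> list:
        rows = self.conn.execute(
            "SELECT role, content FROM messages WHERE session_id = ?",
            (self.session_id,),
        )
        return list(rows)

# ":memory:" keeps the demo self-contained; use a file path in production
history = SQLiteMessageHistory(":memory:", "user-123")
history.add_message("human", "My name is Bob")
history.add_message("ai", "Hello Bob!")
print(history.messages())
```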


#LangChain #Memory #ConversationalAI #VectorStore #Python #AgenticAI #LearnAI #AIEngineering

