Learn Agentic AI

Chat Agent Context Management: Maintaining Coherent Multi-Turn Conversations

Master the techniques for managing conversation context in chat agents, including context window optimization, message pruning strategies, summarization, and topic tracking for coherent multi-turn interactions.

The Context Window Problem

Every LLM has a finite context window. GPT-4o supports 128K tokens, Claude supports up to 200K, but even these generous limits get consumed quickly in production chat agents. A busy customer support conversation with tool calls, system prompts, and previous messages can easily hit 50K tokens within 20 turns. Without active context management, your agent either crashes with a token limit error or starts losing track of earlier conversation details.

Context management is the discipline of deciding what information the model sees at each turn. Get it right, and your agent maintains coherent conversations across dozens of turns. Get it wrong, and users experience an agent that forgets what they said three messages ago.

Strategy 1: Sliding Window with Priority

The simplest approach is a sliding window — keep the last N messages and drop everything else. But naive truncation drops important context. A better approach assigns priority levels. The diagram below situates this priority-managed working window within a broader agent memory architecture; the code after it implements the priority layer itself:

flowchart TD
    MSG(["New message"])
    WORKING["Working memory<br/>rolling window"]
    EPISODIC[("Episodic memory<br/>past sessions")]
    SEMANTIC[("Semantic memory<br/>facts and preferences")]
    SUM["Summarizer<br/>compresses old turns"]
    ROUTER{"Retrieve<br/>needed memories"}
    PROMPT["Assembled context"]
    LLM["LLM"]
    UPD["Memory updater<br/>writes new facts"]
    MSG --> WORKING --> ROUTER
    ROUTER -->|Past sessions| EPISODIC
    ROUTER -->|User facts| SEMANTIC
    EPISODIC --> SUM --> PROMPT
    SEMANTIC --> PROMPT
    WORKING --> PROMPT --> LLM --> UPD
    UPD --> EPISODIC
    UPD --> SEMANTIC
    style ROUTER fill:#4f46e5,stroke:#4338ca,color:#fff
    style LLM fill:#f59e0b,stroke:#d97706,color:#1f2937
    style EPISODIC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style SEMANTIC fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b

from dataclasses import dataclass
from enum import IntEnum

class Priority(IntEnum):
    SYSTEM = 0      # Always keep
    PINNED = 1      # User-critical context
    RECENT = 2      # Last N messages
    HISTORICAL = 3  # Older messages, drop first

@dataclass
class ContextMessage:
    role: str
    content: str
    priority: Priority
    token_count: int

class ContextManager:
    def __init__(self, max_tokens: int = 8000):
        self.max_tokens = max_tokens
        self.messages: list[ContextMessage] = []

    def add_message(self, role: str, content: str, priority: Priority = Priority.RECENT):
        tokens = len(content.split()) * 1.3  # Rough estimate
        self.messages.append(ContextMessage(role, content, priority, int(tokens)))

    def build_context(self) -> list[dict]:
        # Greedily keep messages in priority order (system first,
        # historical last), remembering each message's original position.
        by_priority = sorted(enumerate(self.messages), key=lambda p: p[1].priority)
        kept: list[tuple[int, ContextMessage]] = []
        used_tokens = 0

        for idx, msg in by_priority:
            if used_tokens + msg.token_count <= self.max_tokens:
                kept.append((idx, msg))
                used_tokens += msg.token_count

        # Restore chronological order for the LLM. Sorting by the original
        # index handles duplicate message contents correctly.
        kept.sort(key=lambda p: p[0])
        return [{"role": m.role, "content": m.content} for _, m in kept]

The system prompt always stays. Pinned messages — things like the user's name, account number, or current issue — survive pruning. Recent messages form the active conversation. Historical messages get dropped first when space runs low.
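To see the pruning behavior concretely, here is a stripped-down, standalone sketch of the same priority logic, using plain `(priority, role, content, tokens)` tuples instead of the dataclass above; the messages and budget are invented for illustration:

```python
# Minimal sketch of priority-based pruning: lower priority value = more
# important. Messages are selected by priority, then re-sorted into
# chronological order before being handed to the LLM.
def prune(messages, max_tokens):
    by_priority = sorted(enumerate(messages), key=lambda p: p[1][0])
    kept, used = [], 0
    for idx, (prio, role, content, tokens) in by_priority:
        if used + tokens <= max_tokens:
            kept.append((idx, role, content))
            used += tokens
    kept.sort()  # restore chronological order
    return [(role, content) for _, role, content in kept]

conversation = [
    (0, "system", "You are a support agent.", 10),    # SYSTEM
    (3, "user", "Old small talk...", 40),             # HISTORICAL
    (1, "user", "Account #4521, billing issue", 10),  # PINNED
    (2, "user", "Why was I charged twice?", 10),      # RECENT
]
print(prune(conversation, max_tokens=30))
# The historical message is dropped; system, pinned, and recent survive.
```

With a 30-token budget, the 40-token historical message is the only casualty, even though it arrived before the pinned and recent ones.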


Strategy 2: Conversation Summarization

When a conversation grows long, summarize older turns instead of dropping them entirely. This preserves context at a fraction of the token cost:

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def summarize_conversation(messages: list[dict]) -> str:
    summary_prompt = (
        "Summarize the following conversation history in 2-3 sentences. "
        "Focus on: the user's main issue, any decisions made, "
        "and any pending actions. Be factual and concise."
    )
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": summary_prompt},
            *messages,
        ],
        max_tokens=200,
    )
    return response.choices[0].message.content

class SummarizingContextManager:
    def __init__(self, max_tokens: int = 8000, summarize_threshold: int = 6000):
        self.max_tokens = max_tokens
        self.summarize_threshold = summarize_threshold
        self.messages: list[dict] = []
        self.summary: str | None = None

    async def add_and_manage(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        word_count = sum(len(m["content"].split()) for m in self.messages)

        if word_count * 1.3 > self.summarize_threshold:  # rough token estimate
            # Summarize older messages, keep last 4
            old_messages = self.messages[:-4]
            self.summary = await summarize_conversation(old_messages)
            self.messages = self.messages[-4:]

    def build_context(self, system_prompt: str) -> list[dict]:
        context = [{"role": "system", "content": system_prompt}]
        if self.summary:
            context.append({
                "role": "system",
                "content": f"Previous conversation summary: {self.summary}",
            })
        context.extend(self.messages)
        return context

The trick is choosing when to summarize. Set a threshold at roughly 75% of your token budget. When the conversation crosses that line, summarize everything except the last few messages.
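The threshold check itself is small enough to factor out. A minimal standalone sketch, using the same rough words-times-1.3 token estimate as the managers above:

```python
# Decide whether it's time to summarize, given a token budget and a
# trigger ratio (75% of the budget by default).
def should_summarize(messages: list[dict], max_tokens: int, ratio: float = 0.75) -> bool:
    estimate = sum(len(m["content"].split()) for m in messages) * 1.3
    return estimate > max_tokens * ratio

short_msgs = [{"role": "user", "content": "hello there"}]
long_msgs = [{"role": "user", "content": "word " * 500}]
print(should_summarize(short_msgs, max_tokens=100))  # False
print(should_summarize(long_msgs, max_tokens=100))   # True
```

Keeping the ratio configurable lets you tune how much headroom remains for the model's response and any tool-call results.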

Strategy 3: Topic Tracking

Track what topics have been discussed so the agent can reference earlier context without keeping every message:

from collections import defaultdict

from openai import AsyncOpenAI

client = AsyncOpenAI()

class TopicTracker:
    def __init__(self):
        self.topics: dict[str, list[str]] = defaultdict(list)
        self.current_topic: str | None = None

    async def classify_topic(self, message: str) -> str:
        response = await client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{
                "role": "system",
                "content": (
                    "Classify this message into one topic category. "
                    "Return only the category name. Examples: "
                    "billing, technical_support, account, shipping, general"
                ),
            }, {
                "role": "user",
                "content": message,
            }],
            max_tokens=20,
        )
        return response.choices[0].message.content.strip().lower()

    async def track(self, role: str, content: str):
        topic = await self.classify_topic(content)
        self.topics[topic].append(f"{role}: {content}")
        self.current_topic = topic

    def get_relevant_context(self) -> str:
        if not self.current_topic:
            return ""
        relevant = self.topics[self.current_topic][-6:]
        return "\n".join(relevant)

Topic tracking is especially powerful for support agents where users switch between issues mid-conversation. The agent can pull in context about billing when the user returns to a billing question, even if several technical support messages intervened.
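To see that retrieval behavior without an LLM call, here is a simplified, synchronous sketch where a keyword stub stands in for the classifier (the keyword rule and messages are invented for illustration):

```python
from collections import defaultdict

# Simplified topic tracker: classify() is a stub standing in for the
# LLM-based classifier used in production.
class SimpleTopicTracker:
    def __init__(self, classify):
        self.topics = defaultdict(list)
        self.current_topic = None
        self.classify = classify

    def track(self, role, content):
        topic = self.classify(content)
        self.topics[topic].append(f"{role}: {content}")
        self.current_topic = topic

    def get_relevant_context(self):
        if not self.current_topic:
            return ""
        return "\n".join(self.topics[self.current_topic][-6:])

def keyword_classify(message):
    return "billing" if "charge" in message.lower() else "technical_support"

tracker = SimpleTopicTracker(keyword_classify)
tracker.track("user", "Why was I charged twice?")
tracker.track("user", "Also, the app crashes on login.")
tracker.track("user", "Back to the double charge - can you refund it?")
print(tracker.get_relevant_context())
# Returns only the two billing messages, skipping the technical one.
```

When the user returns to the billing topic, the tracker surfaces both billing messages and omits the intervening technical-support turn.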


Combining Strategies in TypeScript

Here is a TypeScript implementation that combines sliding window with summarization:

interface ManagedMessage {
  role: "user" | "assistant" | "system";
  content: string;
  timestamp: number;
  pinned: boolean;
}

class ConversationContext {
  private messages: ManagedMessage[] = [];
  private summary: string | null = null;
  private readonly maxTokens = 8000;

  addMessage(role: ManagedMessage["role"], content: string, pinned = false) {
    this.messages.push({
      role, content, timestamp: Date.now(), pinned,
    });
  }

  async compact(summarizer: (msgs: ManagedMessage[]) => Promise<string>) {
    const tokenEstimate = this.messages
      .reduce((sum, m) => sum + m.content.split(" ").length * 1.3, 0);

    if (tokenEstimate > this.maxTokens * 0.75) {
      const pinned = this.messages.filter((m) => m.pinned);
      const recent = this.messages.filter((m) => !m.pinned).slice(-4);
      const old = this.messages.filter(
        (m) => !m.pinned && !recent.includes(m)
      );
      this.summary = await summarizer(old);
      this.messages = [...pinned, ...recent]
        .sort((a, b) => a.timestamp - b.timestamp);
    }
  }

  build(systemPrompt: string): Array<{ role: string; content: string }> {
    const ctx: Array<{ role: string; content: string }> = [
      { role: "system", content: systemPrompt },
    ];
    if (this.summary) {
      ctx.push({ role: "system", content: `Prior context: ${this.summary}` });
    }
    this.messages.forEach((m) => ctx.push({ role: m.role, content: m.content }));
    return ctx;
  }
}

FAQ

How do I count tokens accurately instead of estimating?

Use the tiktoken library for OpenAI models. Call tiktoken.encoding_for_model("gpt-4o") to get the tokenizer, then len(encoding.encode(text)) for exact counts. For Claude, use Anthropic's token counting API endpoint. Accurate counting prevents both wasted context space and unexpected truncation errors.

When should I summarize versus just truncate old messages?

Summarize when the conversation involves ongoing state — like a support ticket where the user described their problem early on and is now troubleshooting. Truncate when messages are mostly independent exchanges, like a FAQ bot where each question stands alone. The cost of a summarization call (latency and tokens) only pays off when the summary carries information the agent genuinely needs.

How do I handle tool call results in context management?

Tool call results can be verbose. Store the full result in your database but inject only a condensed version into the context. For example, if a database query returns 50 rows, summarize it as "Query returned 50 orders, most recent from March 15, total value $4,230." This preserves the key facts while saving thousands of tokens.
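As a concrete sketch of that condensation step, here is a hypothetical condenser for the order-query example (the row schema is invented for illustration):

```python
# Condense a verbose tool result into one context-friendly sentence.
# Store the full rows in your database; inject only this summary.
def condense_order_query(rows: list[dict]) -> str:
    if not rows:
        return "Query returned no orders."
    total = sum(r["amount"] for r in rows)
    latest = max(r["date"] for r in rows)  # ISO dates sort lexicographically
    return (f"Query returned {len(rows)} orders, "
            f"most recent from {latest}, total value ${total:,.2f}.")

rows = [
    {"date": "2025-03-15", "amount": 2100.00},
    {"date": "2025-02-28", "amount": 2130.00},
]
print(condense_order_query(rows))
# "Query returned 2 orders, most recent from 2025-03-15, total value $4,230.00."
```

A few dozen characters in context stand in for what might be kilobytes of raw query output.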


#ContextManagement #ConversationMemory #MultiTurn #LLM #ChatAgent #AgenticAI #LearnAI #AIEngineering
