
UFO Memory and Learning: How the Agent Remembers Successful Task Patterns

Learn how Microsoft UFO's experience learning system stores successful task executions, retrieves relevant past patterns for new tasks, and optimizes performance through memory-based action prediction.

Why Agent Memory Matters

Without memory, every UFO task starts from scratch. The agent has no recollection of successfully completing the same task yesterday or of discovering that a particular sequence of clicks is the fastest way to apply a filter in Excel. Every execution involves the same number of LLM calls, the same trial-and-error, and the same cost.

UFO addresses this with an experience learning system that records successful task executions and retrieves relevant experiences when handling new tasks. Functionally, it is a Retrieval-Augmented Generation (RAG) system in which past executions form the retrieval corpus.

How Experience Learning Works

UFO's memory system operates in three phases: record, index, and retrieve.

flowchart LR
    EXEC(["Successful execution"])
    REC["Record<br/>serialize trace to JSON"]
    EMB["Index<br/>text-embedding-3-small"]
    DB[("Experience DB<br/>JSON records + vectors")]
    NEW(["New task"])
    RET["Retrieve<br/>top-k cosine similarity"]
    PROMPT["Augmented prompt<br/>few-shot experiences"]
    LLM["GPT-4V<br/>action selection"]
    EXEC --> REC --> EMB --> DB
    NEW --> RET
    DB --> RET --> PROMPT --> LLM
    style EMB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style DB fill:#ede9fe,stroke:#7c3aed,color:#1e1b4b
    style LLM fill:#4f46e5,stroke:#4338ca,color:#fff
    style PROMPT fill:#059669,stroke:#047857,color:#fff

Phase 1: Recording Experiences

After a task completes successfully, UFO serializes the entire execution trace — every observation, action, and outcome — into a structured experience record:

from dataclasses import dataclass, field
from datetime import datetime
from pathlib import Path
import json
import uuid

@dataclass
class TaskExperience:
    """A recorded successful task execution."""
    task_id: str
    task_description: str
    application: str
    steps: list[dict]
    total_steps: int
    start_time: datetime
    end_time: datetime
    success: bool
    metadata: dict = field(default_factory=dict)

    def to_dict(self) -> dict:
        return {
            "task_id": self.task_id,
            "task_description": self.task_description,
            "application": self.application,
            "steps": self.steps,
            "total_steps": self.total_steps,
            "duration_seconds": (self.end_time - self.start_time).total_seconds(),
            "success": self.success,
            "metadata": self.metadata,
        }

def record_experience(task: str, execution_trace: list[dict]) -> TaskExperience:
    """Record a successful task execution for future reference."""
    experience = TaskExperience(
        task_id=str(uuid.uuid4()),
        task_description=task,
        application=execution_trace[0].get("application", "Unknown"),
        steps=[
            {
                "step_number": step["step"],
                "observation": step["thought"],
                "action_type": step["action_type"],
                "target_control": step.get("control_text", ""),
                "parameters": step.get("parameters", {}),
                "result": step.get("result", "success"),
            }
            for step in execution_trace
        ],
        total_steps=len(execution_trace),
        start_time=execution_trace[0]["timestamp"],
        end_time=execution_trace[-1]["timestamp"],
        success=True,
    )

    # Save to disk (create the experience directory on first use)
    Path("experience_db").mkdir(exist_ok=True)
    save_path = f"experience_db/{experience.task_id}.json"
    with open(save_path, "w") as f:
        json.dump(experience.to_dict(), f, indent=2, default=str)

    return experience
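
As a usage sketch, here is a hypothetical two-step Excel trace, shaped to match the keys record_experience reads:

trace = [
    {
        "application": "Excel",
        "step": 1,
        "thought": "The Data tab must be active before filtering",
        "action_type": "click",
        "control_text": "Data",
        "parameters": {},
        "result": "success",
        "timestamp": datetime(2025, 1, 6, 9, 0, 0),
    },
    {
        "application": "Excel",
        "step": 2,
        "thought": "Apply the filter to the selected range",
        "action_type": "click",
        "control_text": "Filter",
        "parameters": {},
        "result": "success",
        "timestamp": datetime(2025, 1, 6, 9, 0, 4),
    },
]

experience = record_experience("Apply a filter in Excel", trace)
print(experience.total_steps)  # 2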

Phase 2: Indexing With Embeddings

Stored experiences are indexed using text embeddings so they can be retrieved by semantic similarity:

from openai import OpenAI
from pathlib import Path
import json
import numpy as np

client = OpenAI()

def create_experience_index(experiences_dir: str) -> dict:
    """Build a vector index of task experiences."""
    index = {"embeddings": [], "task_ids": [], "descriptions": []}

    for exp_file in Path(experiences_dir).glob("*.json"):
        with open(exp_file) as f:
            exp = json.load(f)

        # Create embedding from task description + key actions
        summary = build_experience_summary(exp)

        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=summary
        )

        index["embeddings"].append(response.data[0].embedding)
        index["task_ids"].append(exp["task_id"])
        index["descriptions"].append(summary)

    # Convert to numpy for efficient similarity search
    index["embeddings"] = np.array(index["embeddings"])
    return index

def build_experience_summary(experience: dict) -> str:
    """Create a searchable summary of an experience."""
    steps_summary = " -> ".join(
        f"{s['action_type']}({s['target_control']})"
        for s in experience["steps"][:10]
    )
    return (
        f"Task: {experience['task_description']} "
        f"App: {experience['application']} "
        f"Steps: {steps_summary}"
    )
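
The index above lives only in memory. One possible persistence scheme (file paths here are illustrative) saves the embeddings as a .npy array and the metadata as JSON:

def save_index(index: dict, path: str = "experience_db/index") -> None:
    """Persist embeddings as .npy and the metadata as JSON."""
    np.save(f"{path}.npy", index["embeddings"])
    with open(f"{path}.json", "w") as f:
        json.dump(
            {"task_ids": index["task_ids"], "descriptions": index["descriptions"]}, f
        )

def load_index(path: str = "experience_db/index") -> dict:
    """Reload a saved index into the in-memory format used above."""
    with open(f"{path}.json") as f:
        meta = json.load(f)
    return {"embeddings": np.load(f"{path}.npy"), **meta}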

Phase 3: Retrieving Relevant Experiences

When a new task arrives, UFO searches the index for similar past experiences:

def retrieve_relevant_experiences(
    new_task: str,
    index: dict,
    top_k: int = 3,
    similarity_threshold: float = 0.75,
) -> list[dict]:
    """Find past experiences relevant to the new task."""
    # Embed the new task
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=new_task
    )
    query_embedding = np.array(response.data[0].embedding)

    # Cosine similarity search
    similarities = np.dot(index["embeddings"], query_embedding) / (
        np.linalg.norm(index["embeddings"], axis=1)
        * np.linalg.norm(query_embedding)
    )

    # Filter by threshold and get top-k
    candidates = [
        (i, sim) for i, sim in enumerate(similarities)
        if sim >= similarity_threshold
    ]
    candidates.sort(key=lambda x: x[1], reverse=True)
    top_candidates = candidates[:top_k]

    # Load full experience records
    results = []
    for idx, score in top_candidates:
        task_id = index["task_ids"][idx]
        with open(f"experience_db/{task_id}.json") as f:
            exp = json.load(f)
        exp["similarity_score"] = float(score)
        results.append(exp)

    return results

Injecting Memory Into the Prompt

Retrieved experiences are included in the GPT-4V prompt as few-shot examples, giving the model a proven action sequence to follow:

def build_prompt_with_memory(
    task: str,
    screenshot: str,
    controls: list[dict],
    relevant_experiences: list[dict],
) -> str:
    """Build the action prompt enriched with past experiences."""
    experience_text = ""
    if relevant_experiences:
        experience_text = "\n\nRelevant past experiences:\n"
        for exp in relevant_experiences:
            experience_text += f"\nTask: {exp['task_description']}\n"
            experience_text += f"Similarity: {exp['similarity_score']:.2f}\n"
            experience_text += "Successful steps:\n"
            for step in exp["steps"]:
                experience_text += (
                    f"  {step['step_number']}. {step['action_type']}"
                    f"({step['target_control']}) - {step['observation']}\n"
                )

    return f"""Task: {task}
{experience_text}

Based on the annotated screenshot and any relevant past experience,
select the next action. Past experiences are suggestions — adapt them
to the current UI state if controls have changed."""
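
Wiring the three phases together, with placeholders standing in for UFO's screenshot annotation step:

# Build (or load) the index once, then reuse it across tasks
index = create_experience_index("experience_db")

task = "Apply a filter to the sales column in Excel"
matches = retrieve_relevant_experiences(task, index, top_k=3)

screenshot_b64 = "..."  # base64 annotated screenshot (placeholder)
controls = []           # annotated control dicts for the current window (placeholder)

prompt = build_prompt_with_memory(task, screenshot_b64, controls, matches)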

Performance Impact

Memory reduces both cost and execution time:


  • Fewer exploratory actions — the agent follows proven paths instead of experimenting
  • Lower token usage — successful patterns provide shorter reasoning chains
  • Better first-attempt accuracy — relevant examples guide the model toward correct actions
# Measuring memory impact (run_ufo_task stands in for your UFO task runner)
def compare_with_without_memory(task: str):
    """Run the same task with and without memory retrieval."""
    # Without memory
    result_no_mem = run_ufo_task(task, use_memory=False)

    # With memory
    result_with_mem = run_ufo_task(task, use_memory=True)

    print(f"Without memory: {result_no_mem['steps']} steps, "
          f"${result_no_mem['cost']:.3f}")
    print(f"With memory: {result_with_mem['steps']} steps, "
          f"${result_with_mem['cost']:.3f}")
    print(f"Step reduction: "
          f"{(1 - result_with_mem['steps']/result_no_mem['steps'])*100:.0f}%")

In practice, memory-augmented execution typically reduces step count by 20-40% for tasks similar to previously recorded experiences.

FAQ

How much storage does the experience database require?

Each experience record is a JSON file of 5-50 KB depending on task complexity. The embeddings index adds roughly 6 KB per experience (1536-dimensional float32 vector). A database of 1,000 experiences takes approximately 50-60 MB total — negligible on modern systems.
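
The arithmetic behind those numbers, taking the upper end of the per-record range:

embedding_kb = 1536 * 4 / 1024          # float32 vector: 6,144 B, about 6 KB
record_kb = 50                          # worst-case JSON record
total_mb = 1000 * (record_kb + embedding_kb) / 1024
print(f"{total_mb:.0f} MB for 1,000 experiences")  # ~55 MB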

Does UFO learn from failed tasks?

By default, UFO only records successful completions. However, you can configure it to also record failures and use them as negative examples in the prompt — telling the model "this approach was tried and failed" to steer it toward alternative strategies.
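
A minimal sketch of that negative-example injection; the wording and the format_negative_examples helper are illustrative, not UFO's built-in API:

def format_negative_examples(failures: list[dict]) -> str:
    """Render failed experiences as 'do not repeat' guidance for the prompt."""
    text = "\n\nApproaches that previously failed (do not repeat them):\n"
    for exp in failures:
        steps = " -> ".join(
            f"{s['action_type']}({s['target_control']})" for s in exp["steps"]
        )
        text += f"- Task: {exp['task_description']}\n  Failed path: {steps}\n"
    return text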

Can experiences transfer between machines with different screen resolutions?

Experiences are stored as abstract action sequences (click control type X, type text Y) rather than pixel coordinates, so they transfer well between machines. The vision model adapts to different layouts and resolutions when following experience-suggested action sequences.
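
Concretely, an abstract step record versus a coordinate-based one (both shapes illustrative):

# Transfers across machines: resolved against the live control tree at run time
abstract_step = {"action_type": "click", "target_control": "Filter"}

# Tied to one machine's pixels: breaks when resolution or layout changes
coordinate_step = {"action_type": "click", "position": (812, 344)}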


