Learn Agentic AI

Self-Reflection in AI Agents: Building Systems That Learn from Mistakes

Explore how self-reflection transforms AI agents from one-shot executors into iterative improvers — covering critique loops, retry-with-feedback, score-and-improve patterns, and practical Python implementations.

The Problem with One-Shot Execution

Most AI agents generate a response and move on. If the output is wrong, incomplete, or poorly formatted, the user has to notice the problem and ask for a correction. This is fragile. Humans miss errors, and the feedback loop is slow.

Self-reflection changes this by adding an internal quality check. Before returning a result to the user, the agent evaluates its own output, identifies weaknesses, and improves it — all within the same execution loop. The result is higher quality output with fewer round trips.

The Basic Critique Loop

The simplest self-reflection pattern uses two LLM calls: one to generate, one to critique.

flowchart LR
    TASK(["Task"])
    GEN["Generate<br/>draft"]
    CRIT{"Critique:<br/>approved?"}
    REV["Revise with<br/>feedback"]
    OUT(["Final output"])
    TASK --> GEN --> CRIT
    CRIT -->|APPROVED| OUT
    CRIT -->|Issues found| REV --> CRIT
    style GEN fill:#4f46e5,stroke:#4338ca,color:#fff
    style CRIT fill:#f59e0b,stroke:#d97706,color:#1f2937
    style OUT fill:#059669,stroke:#047857,color:#fff
from openai import OpenAI

client = OpenAI()

def generate_with_reflection(task: str, max_reflections: int = 3) -> str:
    # Step 1: Generate initial output
    draft = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a technical writer."},
            {"role": "user", "content": task},
        ],
    ).choices[0].message.content

    for _ in range(max_reflections):
        # Step 2: Critique the output
        critique = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": (
                    "You are a critical reviewer. Evaluate the following output for:"
                    "\n1. Factual accuracy"
                    "\n2. Completeness (does it address all aspects of the task?)"
                    "\n3. Clarity and structure"
                    "\n4. Any errors or inconsistencies"
                    "\nIf the output is satisfactory, respond with exactly: APPROVED"
                    "\nOtherwise, list specific improvements needed."
                )},
                {"role": "user", "content": f"Task: {task}\n\nOutput:\n{draft}"},
            ],
        ).choices[0].message.content

        # If approved, return the draft. An exact-prefix check avoids false
        # positives such as "NOT APPROVED" slipping past a substring test.
        if critique.strip().upper().startswith("APPROVED"):
            return draft

        # Step 3: Improve based on critique
        draft = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "You are a technical writer. "
                 "Revise your output based on the feedback provided."},
                {"role": "user", "content": (
                    f"Original task: {task}\n\n"
                    f"Your previous draft:\n{draft}\n\n"
                    f"Reviewer feedback:\n{critique}\n\n"
                    "Please produce an improved version addressing all feedback."
                )},
            ],
        ).choices[0].message.content

    return draft  # Return best attempt after max reflections

Each iteration tends to improve the output because the critique names specific issues for the revision to address. In practice, most outputs reach "APPROVED" within one or two reflection cycles.

Score-and-Improve Pattern

For more structured reflection, assign numerical scores to specific quality dimensions. This gives you quantifiable improvement tracking and clearer termination criteria.

import json

def score_and_improve(task: str, output: str, threshold: float = 8.0) -> dict:
    """Score output on multiple dimensions, improve if below threshold."""

    # Score the output
    scoring_response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "Score the following output on a 1-10 scale for each of these "
                "dimensions: accuracy, completeness, clarity, actionability.\n"
                "Return JSON with one key per dimension, shaped exactly as: "
                '{"accuracy": {"score": <int>, "justification": "<brief>"}, ...}'
            )},
            {"role": "user", "content": f"Task: {task}\nOutput: {output}"},
        ],
        response_format={"type": "json_object"},
    )

    scores = json.loads(scoring_response.choices[0].message.content)

    # Calculate average score
    dimensions = ["accuracy", "completeness", "clarity", "actionability"]
    avg_score = sum(scores.get(d, {}).get("score", 0) for d in dimensions) / len(dimensions)

    if avg_score >= threshold:
        return {"output": output, "scores": scores, "improved": False}

    # Identify weak dimensions for targeted improvement
    weak_dims = [d for d in dimensions if scores.get(d, {}).get("score", 0) < threshold]
    feedback = "\n".join(
        f"- {d}: {scores.get(d, {}).get('justification', 'Needs improvement')}"
        for d in weak_dims
    )

    # Generate improved output focusing on weak areas
    improved = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Improve the output, focusing on the weak areas."},
            {"role": "user", "content": (
                f"Task: {task}\nCurrent output: {output}\n\n"
                f"Areas needing improvement:\n{feedback}"
            )},
        ],
    ).choices[0].message.content

    return {"output": improved, "scores": scores, "improved": True}
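A single score-and-improve pass can be driven in a loop with an explicit stopping rule: stop when the threshold is met, when scores plateau, or when rounds run out. The driver below is a sketch that assumes a `step` callable with `score_and_improve`'s signature and return shape; `fake_step` is a deterministic stub used purely for demonstration, not a real scorer:

```python
def improve_until_threshold(task, output, step, threshold=8.0, max_rounds=3):
    """Drive a score-and-improve step until it passes, plateaus, or runs out."""
    best_avg = float("-inf")
    for _ in range(max_rounds):
        result = step(task, output, threshold)
        avg = sum(d["score"] for d in result["scores"].values()) / len(result["scores"])
        if not result["improved"]:
            return result["output"], avg   # already at or above threshold
        if avg <= best_avg:
            break                          # plateau: further reflection unlikely to help
        best_avg, output = avg, result["output"]
    return output, best_avg

# Deterministic stand-in for score_and_improve, for illustration only:
def fake_step(task, output, threshold):
    score = 5.0 + len(output)              # toy scoring: longer output scores higher
    improved = score < threshold
    return {
        "output": output + "!" if improved else output,
        "scores": {"accuracy": {"score": score}, "clarity": {"score": score}},
        "improved": improved,
    }

final, avg = improve_until_threshold("demo task", "", fake_step, threshold=8.0, max_rounds=5)
```

The plateau check is the objective stopping criterion discussed in the FAQ: if the average score stops rising between iterations, the loop exits rather than burning more tokens.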

Retry-with-Feedback for Tool Failures

Self-reflection is not just for text generation. It is equally powerful for recovering from tool execution failures. Instead of blindly retrying, the agent reflects on why the tool call failed and adjusts its approach.

def reflective_tool_execution(agent_messages, tool_name, tool_args, max_retries=3):
    """Execute a tool with reflective retry on failure."""

    for attempt in range(max_retries):
        result = execute_tool(tool_name, tool_args)

        if "error" not in result:
            return result  # Success

        if attempt == max_retries - 1:
            break  # final attempt failed; skip the now-pointless reflection

        # Reflect on the failure and propose corrected arguments
        reflection = client.chat.completions.create(
            model="gpt-4o",
            messages=agent_messages + [
                {"role": "system", "content": (
                    f"Your tool call to '{tool_name}' with args {json.dumps(tool_args)} "
                    f"failed with error: {result['error']}\n\n"
                    "Analyze why this failed and suggest corrected arguments. "
                    "Return JSON with 'analysis' and 'corrected_args' fields."
                )},
            ],
            response_format={"type": "json_object"},
        )

        reflection_data = json.loads(reflection.choices[0].message.content)
        tool_args = reflection_data.get("corrected_args", tool_args)

    return {"error": f"Failed after {max_retries} reflective retries"}
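The snippets in this article call an `execute_tool` helper that is never defined. A minimal sketch might dispatch over a registry of plain Python functions and normalize exceptions into error dicts, so the reflection step always has structured failure data to analyze. The registry contents and tool names here are illustrative, not part of any SDK:

```python
# Hypothetical tool registry: each tool is a plain function taking an args dict.
TOOL_REGISTRY = {
    "get_weather": lambda args: {"temp_c": 21, "city": args["city"]},
}

def execute_tool(name: str, args: dict) -> dict:
    """Run a registered tool, returning either its result or an error dict."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return {"error": f"Unknown tool: {name}"}
    try:
        return fn(args)
    except Exception as exc:
        # Surface failures as data rather than raising, so the agent can reflect
        return {"error": str(exc)}
```

Returning errors as values (instead of raising) is what makes the `"error" not in result` check in `reflective_tool_execution` work.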

Building a Self-Improving Agent Loop

Combining reflection with the standard agent loop creates an agent that continuously improves within a single task execution:


def self_improving_agent(goal: str, tools: list, max_steps: int = 15) -> str:
    messages = [
        {"role": "system", "content": (
            "You are a careful agent. After completing a task, evaluate "
            "your own work before presenting it to the user. If your output "
            "has gaps or errors, fix them before responding."
        )},
        {"role": "user", "content": goal},
    ]

    for step in range(max_steps):
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=messages,
            tools=tools,
        )
        msg = response.choices[0].message
        messages.append(msg)

        if not msg.tool_calls:
            # Before returning, add a self-check
            check = client.chat.completions.create(
                model="gpt-4o",
                messages=messages + [{
                    "role": "user",
                    "content": (
                        "Review your response. Does it completely and accurately "
                        "address the original goal? If yes, reply with exactly: FINAL. "
                        "If not, explain what needs fixing."
                    ),
                }],
            ).choices[0].message.content

            if check.strip().upper().startswith("FINAL"):  # avoids matching "not FINAL"
                return msg.content

            # Continue improving
            messages.append({"role": "user", "content": f"Self-review: {check}. Please improve."})
            continue

        # Execute tool calls
        for tc in msg.tool_calls:
            args = json.loads(tc.function.arguments)
            result = execute_tool(tc.function.name, args)
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })

    # Out of steps: return the last assistant text available. The list mixes
    # SDK message objects and plain dicts, so handle both.
    for m in reversed(messages):
        role = m.get("role") if isinstance(m, dict) else m.role
        content = m.get("content") if isinstance(m, dict) else m.content
        if role == "assistant" and content:
            return content
    return "Task incomplete."

FAQ

Does self-reflection double the cost of every agent call?

Not quite double, because critique prompts are typically shorter than generation prompts. Expect 40-70% additional token cost per reflection cycle. The tradeoff is worth it for high-stakes outputs (reports, code, customer communications) where quality matters more than cost. Skip reflection for low-stakes tasks like simple lookups.
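As a rough back-of-envelope (the 40-70% overhead figure above is an estimate, not a measurement), the total cost multiplier can be sketched as:

```python
def reflection_cost_multiplier(cycles: int, overhead: float = 0.55) -> float:
    """Total token cost relative to a single generation call.

    Each reflection cycle adds a critique call plus a revision call; `overhead`
    is their combined cost as a fraction of the initial generation (assumed
    0.40-0.70 per the discussion above; 0.55 is the midpoint).
    """
    return 1.0 + cycles * overhead

# Two cycles at the midpoint estimate roughly doubles the cost:
# reflection_cost_multiplier(2) -> 2.1
```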

Can the same model effectively critique its own output?

Yes, with caveats. The same model can catch structural issues, missing information, and formatting problems reliably. It is less effective at catching its own factual hallucinations because the same knowledge gaps that caused the error also affect the critique. For critical accuracy requirements, use a separate verification step with tool-based fact checking.

How do I prevent reflection loops that never converge?

Set a strict maximum on reflection cycles (2-3 is usually sufficient). Use the score-and-improve pattern with a numerical threshold so you have an objective stopping criterion. If scores are not improving between iterations, break the loop — further reflection is unlikely to help, and the issue may require a fundamentally different approach.


#SelfReflection #AIAgents #CritiqueLoops #QualityAssurance #Python #AgenticAI #LearnAI #AIEngineering

