The Reflection Pattern: How AI Agents Self-Correct


March 18, 2026 · 8 min read · 1,637 words

The first answer an LLM gives is rarely its best. Ask a developer to write code and they'll write a draft, test it, find bugs, fix them, and iterate. AI agents can do the same thing — if you give them a reflection loop.

The reflection pattern is simple: generate → evaluate → revise. The agent produces an output, evaluates it against some criteria, and revises it if the evaluation finds problems. This loop continues until the output passes evaluation or a retry limit is reached.
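
In sketch form, the whole pattern fits in a few lines. Here `generate`, `evaluate`, and `revise_prompt` are hypothetical stand-ins for your agent's actual calls:


async def reflection_loop(task: str, max_rounds: int = 3) -> str:
    output = await generate(task)                    # generate
    for _ in range(max_rounds):
        verdict, feedback = await evaluate(output)   # evaluate (tool or critic)
        if verdict == "accept":
            break
        output = await generate(                     # revise
            revise_prompt(task, output, feedback)
        )
    return output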

This one pattern — in various forms — is behind the quality improvements in Claude Code's iterative bug fixing, Codex CLI's test-and-retry loop, and the research results showing reflection boosting HumanEval coding accuracy from 80% to 91%.

But reflection isn't magic. Research also shows that naive self-correction can make things worse. On tasks like GSM8K math, self-correction consistently decreased performance — the model was more likely to "fix" a correct answer into a wrong one than to repair an actual error. Knowing when and how to use reflection is what separates a robust agent from an expensive token burner.

The Three Types of Reflection

Not all reflection loops are equal. They differ in who provides the evaluation signal and how the revision happens.

Type 1: Simple Retry (External Signal)

The simplest form: the agent tries something, gets an external signal (test failure, API error, validation rejection), and retries.


async def reflect_with_tests(agent, task: str, max_attempts: int = 3) -> dict:
    """Generate code, run tests, retry on failure.

    Assumes two helpers: `extract_code` (pulls the code block out of a
    model response) and `run_tests` (returns a result with .passed and
    .error attributes).
    """
    history = []
    last_code, last_error = "", ""

    for attempt in range(max_attempts):
        # Generate
        if attempt == 0:
            prompt = task
        else:
            prompt = (
                f"Your previous attempt failed.\n\n"
                f"Code:\n```\n{last_code}\n```\n\n"
                f"Error:\n```\n{last_error}\n```\n\n"
                f"Fix the code to pass the tests."
            )

        response = await agent.generate(prompt)
        last_code = extract_code(response)

        # Evaluate (external signal)
        test_result = run_tests(last_code)

        if test_result.passed:
            return {"code": last_code, "attempts": attempt + 1, "passed": True}

        last_error = test_result.error
        history.append({"attempt": attempt + 1, "error": last_error})

    return {"code": last_code, "attempts": max_attempts, "passed": False,
            "history": history}

This is the reflection type that always works — because the evaluation signal is objective. The test either passes or it doesn't. The compiler either accepts the code or it doesn't. There's no ambiguity and no risk of the model "correcting" a right answer into a wrong one.

This is why coding agents are so effective. Every generation cycle gets concrete feedback from test execution, not from the model evaluating itself. It's the same pattern behind Reflexion's 91% HumanEval result — the paper by Shinn et al. (NeurIPS 2023) showed that language agents with verbal reinforcement learning (storing reflections on test failures as memory) dramatically outperformed single-attempt generation.
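
A simplified sketch of that memory mechanism, reusing the `agent`, `extract_code`, and `run_tests` helpers from the snippet above: each failed attempt produces a short verbal reflection that is prepended to the next prompt.


async def reflexion_style_loop(agent, task: str, max_attempts: int = 3) -> str:
    reflections: list[str] = []  # verbal memory, carried across attempts

    for _ in range(max_attempts):
        prompt = task
        if reflections:
            memory = "\n".join(f"- {r}" for r in reflections)
            prompt = f"{task}\n\nLessons from previous failed attempts:\n{memory}"

        code = extract_code(await agent.generate(prompt))
        result = run_tests(code)  # external signal, as above
        if result.passed:
            return code

        # Ask the model to verbalize what went wrong and store it as memory
        reflection = await agent.generate(
            f"This code failed with:\n{result.error}\n\nCode:\n{code}\n\n"
            f"In one or two sentences, what should the next attempt do differently?"
        )
        reflections.append(reflection.strip())

    return code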

Type 2: Critic-Based Reflection (LLM Evaluator)

When you don't have an external signal (no tests, no compiler, no validator), you can use a second LLM call as the evaluator. The agent generates output, a "critic" evaluates it, and the agent revises based on the critique.


CRITIC_PROMPT = """Evaluate this {output_type} on the following criteria:
1. Accuracy: Are all claims factually correct?
2. Completeness: Does it address all parts of the request?
3. Clarity: Is it well-structured and easy to follow?
4. Specificity: Does it use concrete examples, not vague generalities?

Rate each criterion: PASS or FAIL with specific feedback.
Overall verdict: ACCEPT or REVISE.

Output to evaluate:
{output}

Original request:
{request}"""

async def reflect_with_critic(
    generator_model: str,
    critic_model: str,
    request: str,
    max_rounds: int = 2
) -> dict:
    """Generate, critique, revise loop."""

    # Initial generation
    output = await generate(generator_model, request)

    for round_num in range(max_rounds):
        # Critique
        critique = await generate(
            critic_model,
            CRITIC_PROMPT.format(
                output_type="response",
                output=output,
                request=request
            )
        )

        # Parse verdict. Also check for REVISE: a critique that repeats the
        # prompt's "ACCEPT or REVISE" wording would otherwise always match.
        if "ACCEPT" in critique and "REVISE" not in critique:
            return {"output": output, "rounds": round_num + 1,
                    "status": "accepted"}

        # Revise
        output = await generate(
            generator_model,
            f"Revise your response based on this feedback:\n\n"
            f"Critique:\n{critique}\n\n"
            f"Original request:\n{request}\n\n"
            f"Your previous response:\n{output}\n\n"
            f"Provide an improved version."
        )

    return {"output": output, "rounds": max_rounds, "status": "max_rounds"}

Key design choice: use a different model for the critic. If the same model generates and evaluates, it tends to be blind to its own failure patterns. Use GPT-5.1 to generate and Claude Sonnet 4 to critique, or vice versa. Cross-model reflection catches errors that self-evaluation misses.
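
For example, pairing the models named above (the request string is illustrative):


result = await reflect_with_critic(
    generator_model="gpt-5.1",
    critic_model="claude-sonnet-4-20250514",
    request="Explain the trade-offs between SQL and NoSQL for a small startup.",
    max_rounds=2,
)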

The Self-Refine paper (Madaan et al., 2023) showed this approach improves output quality by ~20% on average across tasks including code generation, math reasoning, and response optimization — even when the same model acts as both generator and critic. With cross-model critics, improvements are even larger.

Type 3: Multi-Perspective Reflection

The most sophisticated form: multiple evaluators assess the output from different angles, and the agent synthesizes their feedback into a revision.


PERSPECTIVES = {
    "technical_reviewer": """Evaluate technical accuracy and implementation
    correctness. Flag any bugs, edge cases, or security issues.""",

    "user_advocate": """Evaluate from the end user's perspective. Is it
    clear? Does it solve their actual problem? Are there confusing parts?""",

    "editor": """Evaluate writing quality. Is it concise? Are there
    redundant sections? Is the structure logical?"""
}

async def multi_perspective_reflect(
    output: str,
    request: str,
    max_rounds: int = 2
) -> dict:
    for round_num in range(max_rounds):
        # Gather critiques from all perspectives
        critiques = {}
        for name, instructions in PERSPECTIVES.items():
            critique = await generate(
                "gpt-5-nano",  # Cheap model for each critic
                f"{instructions}\n\n"
                f"Reply 'ACCEPT' if no changes are needed; otherwise list "
                f"the required changes.\n\n"
                f"Request: {request}\n\nOutput: {output}"
            )
            critiques[name] = critique

        # Check if all perspectives accept
        all_pass = all("ACCEPT" in c or "no issues" in c.lower()
                       for c in critiques.values())
        if all_pass:
            return {"output": output, "rounds": round_num + 1,
                    "status": "accepted"}

        # Synthesize feedback and revise
        combined_feedback = "\n\n".join(
            f"**{name}:** {critique}"
            for name, critique in critiques.items()
        )
        output = await generate(
            "gpt-5.1",
            f"Revise based on these reviews:\n{combined_feedback}\n\n"
            f"Original request: {request}\n"
            f"Current draft:\n{output}"
        )

    return {"output": output, "rounds": max_rounds, "status": "max_rounds"}

Multi-perspective reflection is most valuable for content generation, documentation, and user-facing output — anywhere multiple quality dimensions matter simultaneously. For pure code generation, simple retry with test execution (Type 1) is more effective and cheaper.
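
A typical call looks like this; the request string is illustrative:


request = "Write onboarding documentation for our CLI tool."
draft = await generate("gpt-5.1", request)

result = await multi_perspective_reflect(output=draft, request=request, max_rounds=2)
print(result["status"], result["rounds"])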

Real Benchmarks: When Reflection Works

The research gives us clear numbers:

| Method | Task | Improvement | Source |
| --- | --- | --- | --- |
| Reflexion | HumanEval coding | 80% → 91% pass@1 | Shinn et al., NeurIPS 2023 |
| Self-Refine | Code, math, reasoning | ~20% avg improvement | Madaan et al., NeurIPS 2023 |
| LATS (Language Agent Tree Search) | HumanEval coding | 92.7% pass@1 (SOTA at time) | Zhou et al., 2023 |
| Naive self-correction | GSM8K math | Performance decreased | Huang et al., 2024 |

The pattern is clear: reflection works when feedback is reliable. Test execution, compiler errors, and structured validators provide reliable feedback. The model asking itself "is this correct?" is unreliable — it often can't detect its own errors, especially in reasoning tasks. For more on building reliable validation layers, see our AI agent guardrails guide and hallucination prevention strategies.

A critical survey by Kamoi et al. (TACL, 2024) concluded:

1. No study demonstrates successful self-correction using only prompted LLM feedback (except in tasks naturally suited for it)

2. Self-correction works well with reliable external feedback (tests, tools, validators)

3. Large-scale fine-tuning enables self-correction where prompting alone fails

The takeaway: don't build reflection loops where the model judges its own reasoning. Build them where external tools provide the feedback.
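
A schema validator gives the same kind of objective signal as a test suite. Here is a minimal sketch using Pydantic; the `Invoice` model and its fields are illustrative, and `generate` is the same hypothetical single-prompt helper as in the earlier snippets:


from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):  # illustrative schema
    invoice_id: str
    total: float
    currency: str

async def generate_validated(task: str, max_attempts: int = 3) -> Invoice:
    prompt = task
    for _ in range(max_attempts):
        raw = await generate(prompt)
        try:
            return Invoice.model_validate_json(raw)
        except ValidationError as e:
            # The validator's error message is the external feedback signal
            prompt = (
                f"{task}\n\nYour previous output failed validation:\n{e}\n\n"
                f"Return only JSON matching the Invoice schema."
            )
    raise ValueError("No schema-valid output within the attempt budget")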

When Reflection Hurts

Reflection isn't free. Here's when it actively damages your system:

1. The "correct → wrong" problem

On math and reasoning tasks, models frequently change correct answers to incorrect ones during self-correction. The model doesn't actually know whether its answer is right — it just knows it was asked to reconsider. Without external verification, "reconsider" often means "second-guess."


# BAD: Naive self-correction without external signal
response = await generate("Solve: what is 247 × 38?")
# Response: "9,386" (correct)

revised = await generate(
    f"Double-check your answer: {response}\n"
    f"Is this correct? If not, fix it."
)
# Revised: "9,346" (WRONG — model "corrected" a correct answer)

Fix: Never ask the model to self-correct without giving it an external signal to correct *against*. If you can't verify the answer externally, don't use reflection — use majority voting (generate 5 answers, take the most common one) instead.
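
A minimal majority-voting sketch, again using a hypothetical `generate` helper:


import asyncio
from collections import Counter

async def majority_vote(question: str, n: int = 5) -> str:
    # Sample n independent answers, then take the most common one
    answers = await asyncio.gather(*(generate(question) for _ in range(n)))
    normalized = [a.strip().replace(",", "") for a in answers]  # "9,386" == "9386"
    return Counter(normalized).most_common(1)[0][0]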

2. Infinite loops and cost explosion

Without strict limits, a reflection loop can cycle indefinitely — generating, critiquing, revising, and never converging.


# ALWAYS set hard limits
MAX_REFLECTION_ROUNDS = 3  # 2-3 is the sweet spot
MAX_TOTAL_TOKENS = 50_000  # Budget cap

# Track costs ("round" would shadow the builtin, so use round_num)
token_budget = MAX_TOTAL_TOKENS
for round_num in range(MAX_REFLECTION_ROUNDS):
    if token_budget <= 0:
        break
    # ... generate and critique, measuring tokens_used for the round ...
    token_budget -= tokens_used

Research shows diminishing returns after 2-3 rounds of reflection. The first revision captures most of the improvement. Rounds 4+ rarely add value and frequently introduce new errors.

3. Critic-generator collusion

When the same model (or two similar models) handles both generation and critique, the critic shares the generator's blind spots. It may consistently miss exactly the type of error the generator makes, creating a false sense of quality.

Fix: Use models from different families for generation and critique. Or better: use deterministic validators (tests, schemas, guardrails) instead of LLM critics wherever possible.

Production Implementation: The Reflection Middleware

Here's a production-ready reflection wrapper that works with any agent:


from dataclasses import dataclass, field
from typing import Callable, Optional
from enum import Enum

class ReflectionVerdict(Enum):
    ACCEPT = "accept"
    REVISE = "revise"
    FAIL = "fail"

@dataclass
class ReflectionConfig:
    max_rounds: int = 3
    max_tokens: int = 50_000
    evaluator: Optional[Callable] = None  # External evaluator
    critic_model: Optional[str] = None    # LLM critic (fallback)
    stop_on_first_accept: bool = True

@dataclass
class ReflectionResult:
    output: str
    rounds: int
    verdict: ReflectionVerdict
    history: list = field(default_factory=list)
    total_tokens: int = 0

async def reflect(
    generator: Callable,
    task: str,
    config: ReflectionConfig
) -> ReflectionResult:
    """Universal reflection loop with external or LLM-based evaluation."""

    output = await generator(task)
    history = []
    total_tokens = estimate_tokens(output)

    for round_num in range(config.max_rounds):
        # Evaluate — prefer external signal over LLM critic
        if config.evaluator:
            eval_result = await config.evaluator(output)
            verdict = eval_result["verdict"]
            feedback = eval_result.get("feedback", "")
        elif config.critic_model:
            critique = await llm_critique(config.critic_model, task, output)
            verdict = "accept" if critique["pass"] else "revise"
            feedback = critique["feedback"]
        else:
            break  # No evaluator configured

        history.append({
            "round": round_num + 1,
            "verdict": verdict,
            "feedback": feedback[:500]
        })

        if verdict == "accept":
            return ReflectionResult(
                output=output, rounds=round_num + 1,
                verdict=ReflectionVerdict.ACCEPT,
                history=history, total_tokens=total_tokens
            )

        # Budget check
        if total_tokens >= config.max_tokens:
            return ReflectionResult(
                output=output, rounds=round_num + 1,
                verdict=ReflectionVerdict.FAIL,
                history=history, total_tokens=total_tokens
            )

        # Revise
        revision_prompt = (
            f"Revise based on this feedback:\n{feedback}\n\n"
            f"Original task: {task}\n\nCurrent output:\n{output}"
        )
        output = await generator(revision_prompt)
        total_tokens += estimate_tokens(output)

    return ReflectionResult(
        output=output, rounds=config.max_rounds,
        verdict=ReflectionVerdict.FAIL,
        history=history, total_tokens=total_tokens
    )

Usage with external evaluation (preferred):


# Coding agent: reflect against test execution
config = ReflectionConfig(
    max_rounds=3,
    evaluator=run_tests_and_evaluate  # async callable; see the sketch below
)
result = await reflect(coding_agent.generate, "Write a URL parser", config)

Usage with LLM critic (when no external signal exists):


# Content agent: reflect with cross-model critic
config = ReflectionConfig(
    max_rounds=2,
    critic_model="claude-sonnet-4-20250514"
)
result = await reflect(content_agent.generate, "Write a blog post about...", config)
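
The `run_tests_and_evaluate` evaluator above is whatever your project provides. As a sketch of the contract `reflect()` expects, reusing the `extract_code` and `run_tests` helpers from the Type 1 example:


# An external evaluator is an async callable returning
# {"verdict": "accept" | "revise", "feedback": str}
async def run_tests_and_evaluate(response: str) -> dict:
    code = extract_code(response)
    result = run_tests(code)
    if result.passed:
        return {"verdict": "accept"}
    return {"verdict": "revise", "feedback": f"Tests failed:\n{result.error}"}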

Decision Framework: Should You Add Reflection?

| Situation | Use reflection? | Type |
| --- | --- | --- |
| Code generation with test suite | ✅ Yes | Type 1 (test-driven retry) |
| Structured output with schema | ✅ Yes | Type 1 (schema validation retry) |
| Content/writing quality | ⚠️ Maybe | Type 2 (critic, max 2 rounds) |
| Math/reasoning without verifier | ❌ No | Use majority voting instead |
| Low-latency requirements | ❌ No | Single-pass with guardrails |
| High-stakes, multi-dimensional output | ✅ Yes | Type 3 (multi-perspective, 1-2 rounds) |

The golden rule: if you can verify the output with a deterministic tool, use reflection. If you can only verify with another LLM call, be cautious and limit to 2 rounds.

For how reflection fits into broader agent orchestration, see our multi-agent orchestration guide. For the guardrails that complement reflection loops, see I/O validation guardrails.

FAQ

What's the difference between reflection and retry?

Retry is re-executing the same operation hoping for a different result (useful for transient errors like network timeouts). Reflection is informed revision — the agent receives specific feedback about what went wrong and uses it to improve the output. A retry loop sends the same prompt again. A reflection loop sends the error analysis and asks for a targeted fix.

How many reflection rounds should I allow?

Two to three. Research consistently shows diminishing returns after round 2-3. The first revision captures 70-80% of achievable improvement. Beyond round 3, you're burning tokens for marginal gains — or introducing new errors. Set a hard limit and a token budget.

Does the same model work as both generator and critic?

It can, but it's suboptimal. The Self-Refine paper showed improvements even with self-critique, but the model shares its own blind spots. Cross-model reflection (GPT-5.1 generates, Claude critiques) catches more errors because different model families have different failure modes. If you must use one model, at least use a different temperature or system prompt for the critic.

What is Reflexion and how does it differ from Self-Refine?

Both are reflection frameworks, but they differ in feedback mechanism. Self-Refine (Madaan et al.) uses the model's own critique as feedback — generate, self-critique, revise. Reflexion (Shinn et al.) uses external environment feedback (test results, task completion signals) and stores verbal reflections in a memory buffer across attempts. Reflexion is more powerful because external feedback is more reliable than self-evaluation. Reflexion achieved 91% on HumanEval; Self-Refine shows ~20% improvement on average across mixed tasks.

Can reflection make outputs worse?

Yes. A critical survey by Kamoi et al. (TACL, 2024) found that on math benchmarks like GSM8K, self-correction consistently decreased performance. The model changes correct answers to incorrect ones because it can't actually verify mathematical correctness through self-evaluation. The fix: only use reflection when you have a reliable evaluation signal (tests, validators, external tools). For tasks without verifiable feedback, use majority voting or confidence scoring instead.

How much does reflection cost compared to single-pass generation?

Each reflection round adds one generation call (revision) plus one evaluation call (critic or test execution). With a 2-round limit: 3x generation cost + 2x evaluation cost. If your critic is a cheap model (GPT-5-nano), total cost is roughly 3.5x single-pass. With test execution as the evaluator, the LLM cost is 3x (tests are free). Use prompt caching to reduce incremental costs — the task description and system prompt are identical across rounds.

How does reflection relate to chain-of-thought and reasoning models?

Chain-of-thought (CoT) happens within a single generation — the model "thinks step by step" before answering. Reflection happens across multiple generations — the model produces a complete output, then evaluates and revises it in a new call. Reasoning models (like o3 or Claude with extended thinking) build reflection-like processes into the model itself. For production agents, explicit reflection loops give you more control over the evaluation criteria, cost limits, and stopping conditions than relying on internal model reasoning alone.

