Reflection in AI Agents: How Self-Critique Actually Works
Engineering

By Oxagen Team

  • AI Agents
  • Agent Reflection
  • Self-Improvement
  • Architecture

Reflection is the most invoked and least understood mechanism in agent self-improvement. Every conference talk shows the diagram: agent produces output, agent critiques output, agent improves. The diagram is correct and insufficient. Whether reflection actually improves outcomes depends on a single factor the diagram leaves out — the verifier.

This piece walks through the three reflection architectures that matter (single-pass critique, iterative refinement, and actor-critic), documents where each measurably improves performance with published benchmark numbers, identifies where each silently regresses it, and provides a cost analysis in tokens per accuracy point. The goal is to help you decide whether to add reflection to your agent and, if so, which pattern fits your task.

Precise definitions

Three terms get used interchangeably in the literature. They mean different things.

Reflection is the broad category: any mechanism where an agent evaluates its own prior output and uses the evaluation to modify future behavior. The evaluation can happen within a single run or across runs via persistent memory.

Self-correction is a narrower claim: the agent detects an error in its output and fixes it without external signal. This is the variant that marketing materials imply and that benchmarks frequently fail to support. Self-correction requires the agent to identify something wrong, which presupposes the agent has access to ground truth or a reliable proxy.

Critique is the mechanism: a dedicated evaluation step — either a separate LLM call, a programmatic check, or a structured rubric — that produces a judgment about output quality. Critique is the implementation detail. Reflection is the architecture. Self-correction is the (sometimes false) promise.

The distinction matters because many reflection implementations run a critique step that says "this looks good" and the agent moves on unchanged. The critique ran. Reflection occurred. Performance did not improve. Whether the system self-corrects depends entirely on whether the critique has access to a signal that can actually distinguish good output from bad.

The verifier is the hard part

Every reflection architecture reduces to the same question: how good is the verifier?

A verifier is whatever grades the agent's output. On code tasks, the verifier is a test suite — the code runs or it does not. On math tasks, the verifier checks the answer against ground truth. On structured output, the verifier validates against a schema.

These are cheap, deterministic verifiers. On tasks with cheap verifiers, reflection consistently improves performance. This is documented.

The problem is that most real-world agent tasks do not have cheap verifiers. "Write a helpful email." "Summarize this document." "Answer the user's question." For these, the verifier is another LLM call — the same model or a judge model evaluating quality. LLM-as-judge verifiers have documented reliability problems: they favor longer outputs, outputs that match their own phrasing patterns, and outputs that sound confident regardless of accuracy. When the verifier is unreliable, reflection amplifies the unreliability.

This is not a theoretical concern. Multiple papers have shown that on open-ended tasks, reflection loops with LLM-as-judge verifiers degrade performance relative to a single well-prompted generation. The agent "improves" its output in a direction the judge prefers, which is not the same direction as the user prefers.

Rule of thumb: if you cannot write a verifier that runs without an LLM and returns a boolean, treat reflection with skepticism on that task class.
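As a concrete illustration of what "a verifier that runs without an LLM and returns a boolean" means, here is a minimal sketch for the structured-output case — a deterministic check that parses a candidate output and validates it against a small set of required keys. The function name and signature are illustrative, not from any library:

```python
import json

def verify_json_output(output: str, required_keys: set) -> bool:
    """Deterministic verifier: no LLM call, just parse and check.
    Returns a plain boolean the reflection loop can trust."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Valid only if it parsed to an object containing every required key.
    return isinstance(data, dict) and required_keys <= data.keys()
```

A call like `verify_json_output('{"name": "x", "age": 3}', {"name", "age"})` passes, while malformed JSON or a missing key fails. If you cannot write something of this shape for your task, the rule of thumb above applies.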

Three reflection patterns

Pattern 1: Single-pass critique

The simplest architecture. Generate an output. Run a critique over it. Optionally revise. One cycle, no loop.

How it works:

  1. The agent produces output on a task.
  2. A critique prompt evaluates the output against explicit criteria.
  3. If the critique identifies issues, the agent revises. If not, it returns the original.

Reference implementations: Constitutional AI's self-critique step, basic chain-of-thought self-review.

Strengths: Low token cost. Easy to implement. Adds genuine value when the critique prompt surfaces issues the generation prompt missed — formatting errors, constraint violations, factual inconsistencies the model "knows" are wrong.

Weaknesses: One shot at revision. If the critique misses the real problem, there is no second chance. If the critique hallucinates a problem, the revision introduces one.

Token overhead: 1.3–1.8x the base generation cost, depending on critique prompt length and whether a revision fires.

Pattern 2: Iterative refinement (Self-Refine)

Generate, critique, revise, critique again, revise again — loop until the critique is satisfied or a budget is exhausted.

How it works:

  1. The agent produces an initial output.
  2. A critique step evaluates the output and produces specific feedback.
  3. A refinement step takes the original output plus the critique and produces a revised version.
  4. Steps 2–3 repeat for N iterations or until the critique reports no further issues.
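The loop structure can be sketched as follows — again with hypothetical `generate`/`critique`/`refine` callables standing in for LLM calls, and an explicit iteration budget as the backstop when the critique never reports "done":

```python
def self_refine(task, generate, critique, refine, max_iters=3):
    """Iterative refinement: critique and revise until the critique
    reports no remaining issues or the budget is exhausted."""
    output = generate(task)
    for _ in range(max_iters):
        feedback = critique(task, output)  # [] means "nothing left to fix"
        if not feedback:
            break  # natural termination: critique is satisfied
        # Each revision conditions on the prior output plus its feedback.
        output = refine(task, output, feedback)
    return output
```

The `max_iters` cap matters in practice: given the diminishing returns noted below, a budget of 2–3 is usually where you want it.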

Reference implementation: Self-Refine (Madaan et al., 2023). The paper demonstrates the loop on code optimization, math reasoning, sentiment reversal, dialogue response generation, code readability, and acronym generation.

Strengths: Genuinely iterative. Each revision conditions on specific feedback from the prior round. On tasks where the critique can identify concrete, verifiable issues, each iteration measurably improves quality. The loop terminates naturally when there is nothing left to fix.

Weaknesses: On tasks with vague quality criteria, the loop does not converge — it oscillates. The agent fixes one thing and breaks another, or "refines" toward the style the LLM-as-judge prefers rather than toward correctness. Token cost scales linearly with iteration count, and the marginal improvement per iteration drops steeply after iteration 2.

Token overhead: 2.5–5x base generation cost for 2–4 iterations. Diminishing returns after iteration 2 on most benchmarks.

Pattern 3: Actor-critic (Reflexion)

Two separate components: an actor that takes actions and a critic that evaluates trajectories. The critic's evaluations persist in memory across episodes.

How it works:

  1. The actor attempts a task and produces a trajectory (sequence of actions and observations).
  2. The critic evaluates the trajectory against the task outcome — success or failure, plus a natural-language reflection on what went wrong.
  3. The reflection is stored in an episodic memory buffer.
  4. On the next attempt, the actor retrieves prior reflections as part of its prompt context.
  5. Repeat until success or attempt budget exhausted.
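The episode loop, reduced to its skeleton: an `actor` that conditions on the memory buffer, a deterministic `evaluate` (e.g. a test suite), and a `reflect` step that turns a failed trajectory into a stored lesson. All callable names are illustrative stand-ins, not the Reflexion authors' API:

```python
def reflexion_loop(task, actor, evaluate, reflect, max_episodes=5):
    """Actor-critic with episodic memory: failed attempts produce
    reflections that condition the next attempt's prompt context."""
    memory = []  # natural-language reflections, persisted across episodes
    for _ in range(max_episodes):
        trajectory = actor(task, memory)  # actor sees all prior reflections
        if evaluate(task, trajectory):    # deterministic verifier, e.g. tests
            return trajectory, memory
        memory.append(reflect(task, trajectory))  # "what went wrong and why"
    return None, memory  # budget exhausted without success
```

The structure makes the dependency explicit: everything hinges on `evaluate` being trustworthy and `reflect` producing feedback specific enough to change the next attempt.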

Reference implementation: Reflexion (Shinn et al., 2023). Tested on HumanEval (code), HotPotQA (multi-hop reasoning), and AlfWorld (embodied decision-making).

Strengths: Cross-episode learning. The actor genuinely improves across attempts because it has access to specific failure analyses from prior runs. On code generation with test-based verification, Reflexion improved HumanEval pass@1 from 80.1% to 91.0% — a meaningful jump achieved by storing reflections like "forgot edge case for empty input" and conditioning on them.

Weaknesses: Requires multiple attempts at the same task, which is natural in coding (retry until tests pass) and unnatural in most production settings (users expect one answer). The memory buffer grows and eventually needs pruning or summarization. And the entire mechanism depends on the critic producing accurate, actionable reflections — which circles back to the verifier problem.

Token overhead: 3–8x base generation cost across 2–5 episodes. Justified when the task has a deterministic verifier and multiple attempts are acceptable.

Where reflection measurably helps

Published benchmark results with honest numbers:

| Task domain | Verifier type | Architecture | Baseline | With reflection | Source |
|---|---|---|---|---|---|
| Code generation (HumanEval) | Test suite | Reflexion | 80.1% pass@1 | 91.0% pass@1 | Shinn et al. 2023 |
| Code optimization | Execution time | Self-Refine | Baseline generation | 8–14% improvement | Madaan et al. 2023 |
| Math (GSM8K) | Answer check | Self-Refine | 81.2% accuracy | 84.3% accuracy | Madaan et al. 2023 |
| Structured output (JSON) | Schema validation | Single-pass critique | ~85% valid | ~96% valid | Internal benchmarks, various |
| Multi-hop QA (HotPotQA) | Answer match | Reflexion | 34% exact match | 51% exact match | Shinn et al. 2023 |

The pattern: reflection helps most on tasks where the verifier is cheap and deterministic, where errors are discrete and identifiable, and where the space of correct outputs is constrained. Code, math, structured output, and factual QA with ground truth all fit.

Where reflection silently regresses performance

The less-cited results:

Open-ended generation. On tasks like "write an engaging product description" or "summarize this article," reflection with LLM-as-judge verifiers frequently regresses quality. Huang et al. (2024) demonstrated that large language models cannot reliably self-correct reasoning without external feedback. The agent confidently "improves" in directions that score higher on the judge's rubric but lower on human preference.

Tasks with no natural stopping criterion. When the critique cannot say "done," the loop runs to budget. Each iteration adds cost without adding accuracy. On creative and open-ended tasks, iteration 3+ outputs are consistently rated lower than iteration 1 by human evaluators in multiple studies.

Factual questions where the model is already wrong. If the agent generates an incorrect fact and the verifier is another LLM call, the critique step typically reinforces the error. The model assigns high confidence to its own outputs. Reflection amplifies this — the agent "verifies" its answer, finds it consistent with its own knowledge, and returns it with higher confidence. This is the reflection collapse failure mode: self-consistency masquerading as self-correction.

Low-ambiguity tasks where the model is already right. Reflection adds cost without changing the output. On tasks where the base model achieves >95% accuracy, reflection rarely improves and occasionally introduces regression by "fixing" things that were not broken.

Cost analysis: tokens per accuracy point

The honest accounting most reflection proposals skip:

| Pattern | Token multiplier | Accuracy lift (best case) | Accuracy lift (worst case) | Tokens per accuracy point |
|---|---|---|---|---|
| Single-pass critique | 1.3–1.8x | +2–5 pp | 0 to −1 pp | 0.3–0.9x base cost per point |
| Self-Refine (2 iterations) | 2.5–3.5x | +3–8 pp | −2 to −5 pp | 0.4–1.2x base cost per point |
| Self-Refine (4 iterations) | 4–5x | +4–10 pp | −3 to −7 pp | 0.5–1.3x base cost per point |
| Reflexion (3 episodes) | 4–8x | +5–17 pp | 0 to −2 pp | 0.3–1.6x base cost per point |

("pp" = percentage points of accuracy. "Tokens per accuracy point" is total additional tokens divided by accuracy points gained.)

The key insight: on tasks with deterministic verifiers, the cost per accuracy point is reasonable and predictable. On tasks without them, the cost is unbounded because the denominator can be zero or negative.
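The metric is simple to compute, and writing it down makes the unbounded case explicit. A sketch, using illustrative numbers (a hypothetical 1,000-token base generation at a 5x multiplier, with the HumanEval lift from the table above):

```python
def tokens_per_accuracy_point(base_tokens, total_tokens,
                              baseline_acc, reflected_acc):
    """Additional tokens spent per percentage point of accuracy gained.
    Returns infinity when reflection adds cost with zero or negative lift."""
    extra_tokens = total_tokens - base_tokens
    lift_pp = (reflected_acc - baseline_acc) * 100  # percentage points
    if lift_pp <= 0:
        return float("inf")  # the denominator problem: unbounded cost
    return extra_tokens / lift_pp
```

For the Reflexion HumanEval row, `tokens_per_accuracy_point(1000, 5000, 0.801, 0.910)` works out to roughly 367 extra tokens per point; on an open-ended task with no lift, the same function returns infinity, which is the honest answer.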

A practical budget rule: if reflection cannot demonstrate at least 2 percentage points of improvement on a held-out eval set within 2x the base generation cost, remove it. You are paying for theater.

When to add reflection to your agent

A decision framework in four questions:

  1. Does the task have a verifier that runs without an LLM? If yes, reflection is likely worth the cost. If no, proceed with extreme caution.
  2. Is the base model already above 95% on the task? If yes, reflection adds cost without meaningful improvement. Ship the base model.
  3. Can the agent attempt the task multiple times? If yes, actor-critic (Reflexion) is the strongest pattern. If no, single-pass critique is the ceiling.
  4. Is the marginal cost of extra tokens acceptable? Reflection is a 2–8x token multiplier. For latency-sensitive or cost-sensitive applications, that multiplier may kill the economics.

If the answers are yes, no, yes, yes — add Reflexion with a real verifier and an eval harness to measure the lift. If any answer disqualifies, skip reflection and invest in better prompting, better retrieval, or better tools instead.
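The four questions encode cleanly as a gate. A minimal sketch (function name and return values are illustrative, not a prescribed API):

```python
def should_add_reflection(has_cheap_verifier: bool, base_accuracy: float,
                          allows_retries: bool, cost_budget_ok: bool):
    """Four-question decision gate. Returns the recommended pattern,
    or None when reflection is not worth adding."""
    if not has_cheap_verifier:   # Q1: no deterministic verifier -> skip
        return None
    if base_accuracy > 0.95:     # Q2: already near-perfect -> ship base model
        return None
    if not cost_budget_ok:       # Q4: economics can't absorb the multiplier
        return None
    # Q3: multiple attempts allowed -> Reflexion; otherwise the ceiling
    # is a single critique pass.
    return "reflexion" if allows_retries else "single-pass critique"
```

For example, a code-generation task with tests, 80% baseline, retries allowed, and budget headroom yields `"reflexion"`; flip any disqualifying answer and the function returns `None`.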

The memory connection

Reflection without memory is a single-session trick. The agent critiques its output, maybe revises it, and forgets everything by the next request.

Reflection with persistent memory is a different mechanism entirely. The agent stores its critiques, retrieves relevant prior reflections on future tasks, and accumulates a model of its own failure patterns. This is where reflection connects to the broader self-improvement architecture: reflection generates the signal, memory stores it, and retrieval surfaces it when it matters.

The quality of that memory — whether it can retrieve the right reflection for the right task, avoid surfacing stale or contradictory critiques, and handle the growing volume of stored reflections — determines whether cross-session reflection works or degrades. A typed memory store with entity resolution and temporal correctness handles this; a flat vector index over reflection strings does not. The architecture considerations in memory design apply directly.

FAQ

What is agent reflection?

Agent reflection is a mechanism where an AI agent evaluates its own prior output — through a critique step, a verifier, or a judge — and uses the evaluation to improve its current or future outputs. It is one of four mechanisms behind self-improving agents.

Does reflection always improve agent performance?

No. Reflection improves performance on tasks with cheap, deterministic verifiers — code, math, structured output, factual QA. On open-ended tasks with LLM-as-judge verifiers, reflection frequently regresses performance or adds cost without improvement.

What is the difference between Reflexion and Self-Refine?

Self-Refine is an iterative loop within a single attempt: generate, critique, revise, repeat. Reflexion is an actor-critic architecture that stores reflections across multiple attempts at the same task. Reflexion learns across episodes; Self-Refine improves within one.

How much does reflection cost in tokens?

Reflection adds 1.3–8x the base generation cost depending on the pattern and number of iterations. Single-pass critique is cheapest (1.3–1.8x). Reflexion across multiple episodes is most expensive (4–8x) but produces the largest accuracy gains on verifiable tasks.

When should I not use reflection?

Skip reflection when: the base model already exceeds 95% on the task, no deterministic verifier exists, latency or cost constraints cannot absorb a 2x+ token multiplier, or your eval harness shows no measurable lift from adding it.

Oxagen is the ontology layer for AI agents — a typed, queryable, workspace-scoped knowledge graph that stores the reflections, facts, and episodes your agents need to actually improve across sessions. Read the docs to get an API key, or book a demo to see what persistent agent memory looks like in production.