
Engineering

How to Evaluate Self-Improving AI Agents

By Oxagen Team

  • AI Agents
  • Evaluation
  • Benchmarks
  • Self-Improvement
  • MLOps

A self-improving agent that has no eval harness has no way to verify it is improving. The team argues from anecdote. The agent drifts invisibly. Eventually someone notices the outputs are worse than they were two months ago, and nobody can reconstruct when or why.

The eval harness is not an optional engineering nicety. It is the mechanism that distinguishes a self-improving agent from one that is merely changing. Without it, the four improvement mechanisms — reflection, memory, skill acquisition, parameter updating — have no ground truth to improve toward.

This guide covers what to measure, how to build a minimal but complete eval suite, the specific failure modes that evals catch early, and the traps that make agents look better than they are on metrics that do not matter.

What you are measuring

Eval for a self-improving agent is different from eval for a static model. A static model has one output distribution. A self-improving agent has a trajectory — its output quality changes over time as a function of what it has accumulated. The eval must detect movement on that trajectory, not just score a snapshot.

Four things to measure:

Task accuracy. On a stable set of representative tasks, does the agent produce correct outputs? This is the baseline measure. "Correct" is always domain-specific — code tasks use test pass rates, factual tasks use answer match, structured output tasks use schema validation rates. The metric must be automatable.

Memory recall quality. For agents with persistent memory, does the agent retrieve the right information when it matters? Two sub-metrics: recall (does the agent surface the relevant memory?) and precision (does it avoid surfacing irrelevant memories that contaminate the answer?). Both matter; most evals measure only one.
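A minimal sketch of scoring a single retrieval on both sub-metrics, assuming each eval case lists the IDs of the memories that should have been retrieved (the ID-based interface is an assumption, not a standard):

```python
def recall_precision(retrieved_ids, expected_ids):
    """Score one retrieval: recall catches misses, precision catches contamination."""
    retrieved, expected = set(retrieved_ids), set(expected_ids)
    hits = retrieved & expected
    recall = len(hits) / len(expected) if expected else 1.0
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    return recall, precision

# One relevant memory surfaced, one irrelevant one alongside it:
recall_precision(["m1", "m7"], ["m1", "m2"])  # → (0.5, 0.5)
```

An agent that dumps its entire memory into context scores perfect recall and terrible precision, which is exactly the failure the second sub-metric exists to catch.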

Improvement rate. Over a fixed time window, is task accuracy increasing, flat, or decreasing? This is the number the "self-improving" claim is actually about. Most teams never compute it. Divide your eval history into weekly cohorts and fit a trend line to the weekly accuracy.

Regression rate. Does the agent get better on new tasks while getting worse on old ones? Self-improvement that trades accuracy on one task class for accuracy on another is not improvement — it is drift. The eval harness must run the full suite on every update, not just the new tasks.
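The improvement rate can be computed directly from logged runs. This sketch groups runs into weekly cohorts and fits a least-squares slope; the `(timestamp, accuracy)` tuple format and the weekly granularity are assumptions, one reasonable choice among several:

```python
from collections import defaultdict
from datetime import datetime

def improvement_rate(runs):
    """Slope of mean weekly accuracy, in accuracy points per week.
    runs: list of (iso_timestamp, accuracy) tuples from the eval log."""
    weekly = defaultdict(list)
    for ts, acc in runs:
        year, week, _ = datetime.fromisoformat(ts).isocalendar()
        weekly[(year, week)].append(acc)
    keys = sorted(weekly)  # chronological week cohorts
    ys = [sum(weekly[k]) / len(weekly[k]) for k in keys]
    xs = list(range(len(ys)))
    n = len(xs)
    if n < 2:
        return 0.0  # need at least two cohorts to measure a trend
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
```

A positive slope is the claim "self-improving" actually makes; zero or negative means the mechanisms are running but the trajectory is flat or drifting.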

The minimum viable eval suite

For a team starting from zero, the minimum viable eval suite has three parts: a task bank, a verifier, and a logging contract.

Task bank

Fifteen to twenty representative tasks, stable over time. "Stable" means the correct answer does not change. Do not use live data or time-dependent questions as eval tasks.

Composition for a typical agent:

  • 5 factual recall tasks. Questions the agent should answer from memory after accumulating a known set of facts. "What is the primary database technology in the customer's stack?" after ingesting a set of documents that contain the answer. Verifier: string match or structured comparison.

  • 5 multi-hop tasks. Questions that require combining two or more pieces of memory. "What is the VP of Engineering's preferred deployment cadence?" — requires knowing who the VP of Engineering is and what their preference is, from separate memory items. Verifier: structured comparison.

  • 5 tool-use tasks. Tasks that require the agent to call the right tool with the right arguments. Verifier: tool-call logs — did the agent call search_entities with the right workspace and query, or did it hallucinate an answer?

  • 3–5 generation tasks. Open-ended tasks where the agent produces a draft, a summary, or a recommendation. Verifier: LLM-as-judge with a fixed rubric. These are the least reliable eval tasks; use them for direction, not as primary success metrics.
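One way to keep the task bank machine-runnable is to pair each prompt with its verifier in a single record. The field names and the example task below are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalTask:
    task_id: str
    category: str  # "factual" | "multi_hop" | "tool_use" | "generation"
    prompt: str
    verifier: Callable[[str], float]  # agent output -> score in [0, 1]

# Example: a factual recall task with a deterministic verifier.
task = EvalTask(
    task_id="fact-003",
    category="factual",
    prompt="What is the primary database technology in the customer's stack?",
    verifier=lambda output: 1.0 if "postgres" in output.lower() else 0.0,
)
```

Freezing the dataclass enforces the "stable over time" rule: a task can be versioned out and replaced, but never mutated in place.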

Verifier

For each task, a function that takes the agent's output and returns a score between 0 and 1. The verifier must run without human input. LLM-as-judge is acceptable for generation tasks but must use a fixed rubric and a separate judge model — never the same model running the agent.

Write your verifiers before you run any evals. Teams that write verifiers after seeing agent outputs unconsciously calibrate the verifier to the agent's quirks, which defeats the point.
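For structured-output tasks, a schema-validation verifier can be as simple as checking required keys and types. A sketch, assuming the agent returns a JSON string (the `required` mapping is a hypothetical convention for this example):

```python
import json

def schema_verifier(output: str, required: dict) -> float:
    """Return 1.0 if the output parses as JSON and every required
    field is present with the expected type, else 0.0."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    ok = all(isinstance(data.get(key), typ) for key, typ in required.items())
    return 1.0 if ok else 0.0
```

Because it is deterministic and needs no human input, a verifier like this can gate deploys the same way a unit test does.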

Logging contract

Every eval run produces a record with:

  • Timestamp
  • Agent version (or git hash)
  • Per-task score
  • Aggregate accuracy
  • P50/P95 latency
  • Token cost per task

Store this in append-only format. Never overwrite prior runs. The historical record is what lets you compute improvement rate and detect regression.
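An append-only JSONL file satisfies the contract without extra infrastructure. A sketch with field names mirroring the list above (the function signature is illustrative):

```python
import json
import time

def log_eval_run(path, agent_version, per_task, p50_ms, p95_ms, cost_per_task):
    """Append one eval run as a JSON line. The file is opened in append
    mode only; prior runs are never rewritten, so the history stays
    intact for improvement-rate and regression analysis."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "agent_version": agent_version,
        "per_task_score": per_task,  # {task_id: score}
        "aggregate_accuracy": sum(per_task.values()) / len(per_task),
        "latency_ms": {"p50": p50_ms, "p95": p95_ms},
        "token_cost_per_task": cost_per_task,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

JSONL keeps each run independently parseable, which matters when you later want to replay only the cohorts from a given agent version.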

Running the eval

Run the eval on three triggers:

  1. Nightly. The agent has been accumulating data and running in production. The nightly run surfaces drift from memory accumulation — new facts that contradict old ones, entity resolution failures, retrieval degradation.

  2. On every significant memory update. If the agent ingests a new data source, runs a batch entity resolution, or updates its skill registry, run the eval immediately after. Memory changes are the most common cause of sudden accuracy regression.

  3. On every code change to the agent. Before merging a change to the reasoning loop, reflection logic, or retrieval layer, run the eval. Treat it the same as a test suite.

Interpreting results

Four patterns to watch for:

Consistent improvement. Accuracy increases week over week across all task categories. This is the goal. Verify that the improvement is real by checking whether the absolute number of tasks answered correctly is increasing, not just the percentage (the percentage can rise if you drop hard tasks from the bank).

Improvement on new tasks, regression on old ones. The agent has overfit to recent data. The memory accumulation process is crowding out older, still-relevant facts. This is the most common failure mode for agents with flat vector memory — recent embeddings push older ones below the retrieval cutoff.

Flat accuracy despite self-improvement mechanisms running. The mechanisms are running but not improving the right thing. The most common cause: the eval tasks are too easy and the agent was already near-ceiling. Add harder tasks. Alternatively, the verifier is broken — it is scoring "correct" outputs that are actually wrong. Spot-check verifier outputs.

Sudden regression. Something broke. Isolate: was it a code change, a memory update, or an infrastructure change? The logging contract tells you what changed on that run. Cross-reference with the git log and memory update timestamps.

The traps that make agents look better than they are

Eval leakage. The agent's memory contains facts from the eval tasks. Of course it gets them right — it memorized the answers. Prevent this by separating the eval data source from the production data source, or by using held-out eval tasks that the agent has never seen.

Distribution mismatch. The eval tasks were written when the agent was first deployed and no longer represent the tasks users are actually running. The agent improves on the old benchmark while regressing on real user tasks. Rotate 20–30% of eval tasks every quarter to track against current usage.

LLM-as-judge score inflation. Judge models score longer, more confident outputs higher regardless of correctness. If your primary eval metric is LLM-as-judge score and your agent is running a reflection loop, you will see apparent improvement that is actually confidence inflation. Complement LLM judge scores with at least one deterministic metric.

Goodhart's Law for agent memory. Once the agent knows what the eval tasks are, if it is accumulating memory, it will accumulate context that helps it answer those specific tasks better than others. This is benign if the eval tasks genuinely represent the task distribution — it is a problem if they do not.

Memory recall specifically

Memory recall is worth its own eval beyond general task accuracy. A purpose-built memory eval has two fixtures:

Known-fact fixture. Ingest a controlled set of facts into the agent's memory before the eval run. Each fact has an ID and a correct answer. Run retrieval queries. Score: fraction of facts correctly retrieved on-demand. This isolates the retrieval layer from the reasoning layer.

Multi-hop fixture. Ingest a set of related facts — person A works at company B, company B is a competitor of company C — and ask multi-hop questions. Score: fraction correctly answered. This tests whether the memory architecture supports graph traversal, not just lookup.

If the known-fact fixture passes but the multi-hop fixture fails, the problem is the memory architecture — flat vector retrieval without graph support. This is one of the most common root causes of plateaued agent performance in production.
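The known-fact fixture can be a plain list of (fact_id, query, expected_answer) triples scored against the retrieval layer alone. In this sketch, `retrieve` stands in for whatever retrieval interface your agent exposes; it is an assumed callable, not a real API:

```python
def run_known_fact_fixture(fixture, retrieve):
    """fixture: list of (fact_id, query, expected_answer) triples.
    retrieve: callable mapping a query string to retrieved text.
    Returns the fraction of facts correctly surfaced on demand."""
    hits = sum(
        1
        for _, query, expected in fixture
        if expected.lower() in retrieve(query).lower()
    )
    return hits / len(fixture)
```

Because `retrieve` bypasses the reasoning loop entirely, a low score here points at the retrieval layer, not the model.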

Run these fixtures weekly, separate from the main task accuracy eval. Memory recall degradation often precedes task accuracy degradation by days — it is an early warning signal.

Scaling the harness

As the agent matures, the eval harness grows. A few practices that prevent it from becoming a maintenance burden:

Version the task bank. Use semantic versioning for the eval task set. v1.0 is the initial 15 tasks. v1.1 adds 5 new tasks. Track which version ran in each eval log.

Separate fast and slow evals. Fast evals (10–15 tasks, deterministic verifiers) run on every deploy. Slow evals (50+ tasks, LLM-as-judge) run nightly. Do not gate deploys on slow evals — the latency kills iteration speed.

Alert on regression, not just failure. Set a regression threshold — 5 percentage points below the prior best on any task category — that triggers an alert. Do not alert on every run; alert when the trend breaks.
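The alert check falls out of the append-only log. A sketch using the 5-point threshold from the text, assuming per-category accuracies are stored as simple dicts per run:

```python
def regression_alerts(history, latest, threshold=0.05):
    """history: list of past runs, each {category: accuracy}.
    latest: {category: accuracy} for the current run.
    Alerts when a category falls more than `threshold` (5 points
    by default) below its prior best, rather than on any failure."""
    alerts = []
    for category, score in latest.items():
        prior_best = max((run.get(category, 0.0) for run in history), default=0.0)
        if prior_best - score > threshold:
            alerts.append((category, prior_best, score))
    return alerts
```

Comparing against the prior best, not the previous run, is what makes this a trend-break detector instead of a noisy per-run pager.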

FAQ

How many tasks do I need in my eval suite?

A minimum viable eval suite for a self-improving agent is 15–20 tasks covering factual recall, multi-hop recall, tool use, and generation. This is enough to detect improvement and regression without becoming a maintenance project.

What is the best verifier for agent eval?

Deterministic verifiers (string match, schema validation, test suite pass rates) are most reliable. LLM-as-judge is acceptable for generation tasks but must use a fixed rubric and a separate judge model. Never use the same model for the agent and the judge.

How do I prevent eval leakage?

Keep eval data sources separate from production data sources. Never ingest eval tasks into the agent's memory. Use held-out tasks that the agent has never processed.

How often should I run agent evals?

Nightly minimum. Also on every significant memory update and every code change to the agent. Fast deterministic evals should run on every deploy; slow LLM-judge evals nightly.

What does it mean if my agent's eval scores are flat despite running reflection?

Either the eval tasks are too easy (agent near-ceiling), the verifier is broken (scoring wrong answers as correct), or the reflection mechanism lacks a real verifier and is not producing meaningful improvements. Add harder tasks, spot-check the verifier, and ensure the reflection loop uses a deterministic verifier, not LLM self-assessment.

Oxagen is the ontology layer for AI agents — a typed, workspace-scoped knowledge graph that makes agent memory queryable, auditable, and improvable. Read the docs to get an API key, or book a demo to see production agent memory in practice.