Engineering
Self-Improving AI Agents: A Technical Overview
By Oxagen Team
- AI Agents
- Ontology
- Agent Memory
- Architecture
Every AI agent demo in 2026 claims some form of self-improvement. The marketing term is vague on purpose: "learns from experience," "gets smarter over time," "adapts to your workflow." Strip out the hand-waving and you are left with a small set of concrete mechanisms — some production-ready, most not — and one bottleneck that almost always goes unaddressed: memory.
This overview is for engineers and architects choosing whether to build a self-improving agent, and which parts of the stack to build first. It defines the term precisely, taxonomizes the four real mechanisms, grades each on production-readiness as of 2026, and identifies the component that decides whether any of it works.
What counts as self-improving
A self-improving agent is one whose performance on a stable task distribution improves across runs, without human re-prompting or code changes, via persistent changes to its stored memory, tool set, policy, or weights.
Four things are explicitly not self-improvement:
- Chain-of-thought reasoning. Thinking harder within a single request does not persist. The next request starts from zero.
- RAG over a static corpus. Retrieval changes the input but not the agent. Add a million documents and the agent is no smarter, just better-sourced.
- One-shot fine-tuning. A model that was fine-tuned once and shipped is not self-improving; it is a different model. The improvement is not ongoing.
- Human-in-the-loop retraining. Collecting feedback and periodically retraining is a good pipeline. It is not a self-improving agent. The human closed the loop.
The test is whether the agent, left alone on a task distribution, gets measurably better week over week. Most systems shipping today do not. That is not necessarily a problem — a well-built static agent with good retrieval beats a poorly-built self-improving one every time. But if a system is being sold as self-improving, these are the mechanisms actually doing the work.
Four mechanisms
1. Reflection-based improvement
The agent critiques its own output, stores the critique, and conditions on it next time. Reflexion, Self-Refine, and actor-critic architectures all sit here.
The pattern is: produce an output, run a verifier over it (another LLM call, a test suite, a structured check), store the critique, retry. On the next similar task, retrieve prior critiques as part of the prompt.
What it needs to work: a verifier. On tasks with cheap verifiers — code (tests pass or do not), math (answer is correct or not), structured output (schema validates or does not) — reflection is a real mechanism. On open-ended tasks with no verifier, reflection often regresses performance: the agent convinces itself its first answer was correct and adds confidence without adding accuracy.
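The loop can be sketched in a few lines. Here `generate` and `verify` are stand-in callables for the LLM call and the cheap verifier (a test suite, a schema check) — this is an illustrative shape, not any particular framework's API:

```python
# Sketch of a reflection loop over a verifiable task. `generate` and
# `verify` are injected stand-ins for an LLM call and a cheap verifier.
from typing import Callable

def reflect_loop(task: str,
                 generate: Callable[[str, list], str],
                 verify: Callable[[str], tuple],
                 memory: list,
                 max_tries: int = 3) -> str:
    """Produce an output, verify it, store the critique, retry."""
    output = ""
    for _ in range(max_tries):
        output = generate(task, memory)   # condition on past critiques
        ok, critique = verify(output)     # cheap verifier: tests, schema
        if ok:
            return output
        memory.append(f"{task}: {critique}")  # persists across runs
    return output
```

The key property is that `memory` outlives the loop: on the next similar task, prior critiques are already in the prompt, which is what makes this a mechanism rather than a retry wrapper.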
2. Memory-based improvement
The agent accumulates facts, episodes, and preferences across sessions and retrieves them on demand. Voyager's skill library, MemGPT, and the Generative Agents architecture all fit here.
The pattern is: after every interaction, write something to a persistent store. On future interactions, retrieve the relevant subset. Over time, the agent accumulates a model of the user, the domain, and the environment.
What it needs to work: a memory model that can answer structural queries, not just semantic ones. "What did the user tell me about their auth system?" is a vector query. "Which meetings included both the VP of Engineering and anyone from the security team?" is a multi-hop graph traversal. Most deployed agent memory is flat vector retrieval, which plateaus around a few thousand items and breaks entirely on multi-hop questions.
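A toy illustration of the difference, using plain dicts in place of a real graph store (names and data are invented): the multi-hop question becomes a deterministic traversal rather than a similarity search.

```python
# Toy multi-hop structural query: meeting -> attendee -> team.
# Illustrative data only; a production store would be a typed graph.
attended = {                    # meeting id -> set of attendee ids
    "m1": {"vp_eng", "alice"},
    "m2": {"vp_eng", "sec_bob"},
    "m3": {"sec_carol", "alice"},
}
team = {"vp_eng": "eng", "alice": "platform",
        "sec_bob": "security", "sec_carol": "security"}

# "Which meetings included both the VP of Engineering and anyone
# from the security team?" — exact and deterministic, no embeddings.
hits = [m for m, people in attended.items()
        if "vp_eng" in people
        and any(team[p] == "security" for p in people)]
# hits == ["m2"]
```

A top-5 vector query over meeting transcripts might surface `m2`, but it might equally surface `m1` because "VP" and "security" co-occur in the text; the structural query cannot be wrong in that way.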
3. Skill and tool acquisition
The agent writes new tools — functions, prompts, chains — and adds them to its own available set. Voyager's skill-writing loop, ToolMaker, and Code-as-Policies are the reference implementations.
The pattern is: on a new task, attempt it with existing tools; if none fit, write a new tool, test it, add it to the registry. Future similar tasks invoke the new tool directly.
What it needs to work: a sandbox. Letting an agent write and execute arbitrary code in a production environment is the kind of architectural choice that looks great until it does not. In sandboxed environments — Minecraft for Voyager, constrained function-writing for most production uses — skill acquisition is real.
4. Parameter updating
The agent triggers updates to its own weights, typically through RLAIF, online RLHF, or periodic fine-tuning on agent-generated data.
The pattern is: collect interaction traces, score them (by the agent itself or a judge model), update the policy. This is the mechanism that looks most like "learning" in the classical sense.
What it needs to work: ML infrastructure most teams do not have, plus a scoring signal that does not drift. Online parameter updates are not production-ready for most teams in 2026. The reward-hacking failure modes are well-documented and the ops overhead is substantial.
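The data-collection half of the loop is the easy part and can be sketched as follows. The judge is a stand-in callable, and the actual policy update (RLAIF, fine-tuning) is deliberately omitted — that is exactly the part that requires real ML infrastructure:

```python
# Sketch of the data side of parameter updating: score interaction traces
# with a judge and keep only high-scoring ones as training candidates.
# The judge is a stand-in; the policy update itself is omitted.
from typing import Callable

def build_training_set(traces: list,
                       judge: Callable[[dict], float],
                       threshold: float = 0.8) -> list:
    scored = [(judge(t), t) for t in traces]
    # In production, monitor the score distribution here: if it shifts,
    # the judge (not the agent) may be what changed — that is the drift
    # failure mode named above.
    return [t for score, t in scored if score >= threshold]
```

Even this half carries the reward-hacking risk: an agent that learns what the judge rewards, rather than what the task requires, poisons its own training set.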
Production readiness in 2026
A grade for each mechanism based on what is actually deployed, not what is in papers:
| Mechanism | Production status | Primary failure mode |
|---|---|---|
| Reflection | Ready, scoped | No verifier → silent regression |
| Memory accumulation | Ready, infra-dependent | Flat vector memory plateaus |
| Skill acquisition | Ready, sandboxed only | Execution escape |
| Parameter updating | Not ready | Reward hacking + ops cost |
The honest shipping order for most teams is: memory → reflection → skills → weights. Most teams get stuck at memory and ship a reflection loop over flat vector retrieval, which explains why their self-improving agent plateaus after week two.
The memory problem
Memory is the part every self-improvement story glosses over, and it is the part that determines whether anything else works.
An agent's memory has to solve four problems at once:
Typed recall. The agent needs to retrieve "all emails from Sarah mentioning the Q2 migration" as a structural query, not "top-5 semantic matches for 'Sarah Q2 migration.'" Flat vector retrieval answers the second; it cannot answer the first deterministically.
Temporal correctness. Facts change. The user's job title in 2025 may not be their job title in 2026. Memory systems that do not model time return confidently wrong answers on exactly the queries the user cares about most.
Entity resolution. "Sarah," "Sarah Chen," "schen@," and "S.C." are the same person. An agent that treats them as four entities has four disjoint memories. An agent that resolves them into one accumulates a real model. Vector-only stores do not resolve entities; they just retrieve near-duplicates.
Provenance. For an agent operating on a user's behalf — especially in enterprise settings — every stored fact needs to be traceable to its source. "Why do you think the deployment broke?" should answer with "meeting notes from March 12," not "vibes." This is an auditability and explainability requirement, not a nice-to-have.
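The four requirements can be made concrete with a minimal record shape. Field names here are illustrative, not any product's schema: an alias table handles entity resolution, validity intervals handle time, and every fact carries its source.

```python
# Minimal typed memory record: resolved entity, validity interval,
# provenance. Field names are illustrative, not a real schema.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class Observation:
    subject: str             # canonical entity id, post-resolution
    predicate: str
    value: str
    valid_from: date
    valid_to: Optional[date] # None = still believed true
    source: str              # provenance: where this fact came from

# Entity resolution: four surface forms, one canonical id.
aliases = {"Sarah": "e:sarah_chen", "Sarah Chen": "e:sarah_chen",
           "schen@": "e:sarah_chen", "S.C.": "e:sarah_chen"}

facts = [
    Observation("e:sarah_chen", "title", "Staff Engineer",
                date(2025, 1, 1), date(2026, 1, 1), "hr_feed_2025"),
    Observation("e:sarah_chen", "title", "Engineering Manager",
                date(2026, 1, 1), None, "meeting_notes_2026-01-12"),
]

def current(alias: str, predicate: str, today: date) -> Observation:
    """Typed, time-aware recall: the fact valid *today*, with its source."""
    sid = aliases[alias]
    return next(f for f in facts
                if f.subject == sid and f.predicate == predicate
                and f.valid_from <= today
                and (f.valid_to is None or today < f.valid_to))
```

Asking for `schen@`'s title in March 2026 returns the 2026 fact with its meeting-notes source; the same query against a flat vector store could return either title, and would return neither source.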
Solving these with a flat vector store requires progressively more heroic workarounds: hybrid search, metadata filters, reranking, multi-index architectures. Each layer adds complexity. At some point — typically a few thousand items per workspace, always before production scale — the right answer is a typed, Neo4j-backed knowledge graph backing the memory layer, with vector retrieval as a secondary index.
This is where ontology comes in. A typed, queryable graph stores entities with types, relationships with directionality and time, and observations with provenance. Multi-hop traversal is a native operation. Workspace-scoped isolation is a data-model property, not a retrieval hack. Oxagen exists because building this layer in-house is a distraction for most teams and a prerequisite for any serious self-improving agent.
Minimal architecture
Every self-improving agent, stripped down, has the same six components:
1. Reasoning loop. The LLM call that produces the next action. Typically a single frontier model; sometimes a small router plus a larger generator.
2. Memory store. Where facts and episodes persist across runs. Typed — entities, relationships, observations, with time and provenance.
3. Reflection mechanism. A verifier that can grade the agent's output. Test suite, schema validator, another LLM, or a human score. Without one, reflection is theater.
4. Skill and tool registry. The set of callable actions. Editable by the agent only through a sandboxed interface, and MCP-native so the same registry works across clients.
5. Eval harness. A stable set of tasks run on a schedule to detect improvement and regression. Without this, you cannot tell if the agent is getting better.
6. Observability. Traces, memory-write logs, tool invocations — queryable and auditable. Self-improving agents fail in subtle ways; you cannot debug what you cannot see.
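One way to keep the six components honest is to make them explicit seams in the code rather than implicit behavior. The shape below is illustrative — each component is injected as a callable, so any one can be swapped, stubbed, or disabled independently:

```python
# Illustrative wiring of the six components as explicit, injectable seams.
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentRuntime:
    reason: Callable        # 1. reasoning loop: task + context -> action
    memory_write: Callable  # 2. memory store: persist a typed fact
    verify: Callable        # 3. reflection verifier: output -> (ok, critique)
    tools: dict             # 4. skill/tool registry: name -> callable
    evals: list             # 5. eval harness: fixed zero-arg task checks
    log: Callable           # 6. observability: event name + payload

    def run_evals(self) -> float:
        """Pass rate over the fixed task set; run this on a schedule."""
        results = [task() for task in self.evals]
        return sum(results) / len(results)
```

Making the eval harness a first-class field, rather than a separate script someone runs when they remember, is what keeps it standing.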
The architectural mistake most teams make is skipping the fifth component. Without a standing eval harness, the agent drifts invisibly and teams argue about whether it is better or worse based on anecdotes.
What to build first
If you are starting from a static agent and want to add self-improvement, build in this order:
1. Typed memory. Commit to the data model before the behavior. Entities, relationships, observations, time, provenance.
2. Eval harness. A dozen tasks that you run nightly. Without this, none of the rest is falsifiable.
3. Reflection on verifiable subtasks. Start where the verifier is cheap — structured output, code, math.
4. Scoped skill acquisition. Only when the first three are stable and you have sandbox discipline.
5. Parameter updates. Only if you are a team that ships ML infrastructure. Most teams should not.
This order is unglamorous on purpose. The visible self-improvement is in steps 3–5. The work that decides whether any of it is real is in steps 1 and 2.
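The eval harness in step 2 reduces to something small: a fixed task set, a pass rate, and a comparison against the previous run. The threshold below is illustrative; the point is that "is it getting better?" becomes a computed answer, not a debate.

```python
# Minimal nightly eval: pass rate over a fixed task set, plus a
# week-over-week trend check. The delta threshold is illustrative.
def pass_rate(tasks: list, agent) -> float:
    """Fraction of fixed tasks the agent passes; tasks return True/False."""
    return sum(task(agent) for task in tasks) / len(tasks)

def trend(history: list, min_delta: float = 0.02) -> str:
    """Compare the latest run against the previous one."""
    if len(history) < 2:
        return "baseline"
    delta = history[-1] - history[-2]
    if delta > min_delta:
        return "improving"
    if delta < -min_delta:
        return "regressing"
    return "flat"
```

A self-improving agent whose `trend` reads "flat" for three weeks is, by this document's definition, not self-improving — which is exactly the kind of claim the harness exists to falsify.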
Further reading
- Memory Architectures for AI Agents: Vector, Graph, Hybrid — the memory component in depth
- Reflection in AI Agents: How Self-Critique Actually Works — the reflection component in depth
- Knowledge Graphs for Agent Memory: Design Patterns — the typed-memory approach end to end
- Deploying Self-Improving Agents: Production Checklist — infrastructure and procurement considerations
Oxagen is the ontology layer for AI agents — a typed, queryable, workspace-scoped knowledge graph that gives your agents the memory they need without building it from scratch. Read the docs to get an API key, or book a demo to see what production agent memory looks like.