Frameworks for Self-Improving Agents: A Comparison
By Oxagen Team
- AI Agents
- LangGraph
- AutoGen
- Agent Frameworks
- Architecture
Building a self-improving agent requires four components: a reasoning loop, a memory store, a reflection mechanism, and an eval harness. Frameworks differ not in whether they support these components, but in how opinionated they are, how far the abstractions hold, and where each one forces you to drop into raw code.
This comparison covers the five frameworks most teams reach for in 2026 — LangGraph, AutoGen, CrewAI, Haystack, and Semantic Kernel — evaluated on the four components that determine whether a self-improving agent actually ships. The goal is not to declare a winner. It is to clarify the tradeoffs so you can choose the right starting point for your architecture.
The four things that matter for self-improvement
Before the comparison, a clear breakdown of what each framework is being evaluated on:
Memory persistence and recall. Can the framework store and retrieve facts, episodes, and preferences across sessions? What is the retrieval primitive — vector, graph, structured query? Does it support entity resolution, temporal correctness, or provenance?
Reflection and critique. Does the framework provide built-in abstractions for agent self-critique, iterative refinement, or actor-critic patterns? Or does reflection require custom implementation?
MCP compatibility. Can the agent's tool registry plug into the Model Context Protocol? MCP-native tooling means the same agent works in Claude, ChatGPT, and other MCP clients without re-implementing tool schemas.
Production readiness. Observable, deployable, maintainable. Does the framework emit structured traces? Are there stable APIs for long-running agents? How much framework code is in the critical path at runtime?
LangGraph
LangGraph (part of the LangChain ecosystem) models agents as state machines — directed graphs where nodes are functions and edges are conditional transitions. The graph structure makes complex control flow explicit and debuggable.
Memory: LangGraph provides a Checkpoint abstraction that persists state between invocations. The state schema is user-defined, which means memory design is completely your responsibility. Out of the box, the framework offers thread-level persistence with SQLite or Postgres backends. It does not provide entity resolution, typed graph storage, or multi-hop retrieval — those require either custom code or an external memory layer.
Reflection: No built-in reflection pattern, but the graph structure makes it easy to add. A critique node — a function that takes the prior output and returns feedback — inserts cleanly into the state graph. The pattern is "emit to critique node, critique node writes to state, generation node conditions on critique." Implementing this takes about 30 lines, and the graph visualization makes the feedback loop easy to reason about.
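The wiring looks roughly like this. This is a framework-agnostic sketch of the pattern, using plain functions as nodes and a dict as state; in actual LangGraph the functions become graph nodes and the router becomes a conditional edge:

```python
# Sketch of the generate -> critique -> regenerate loop described above.
# Plain Python stands in for LangGraph's graph API so the control flow is
# visible end to end; node and field names are illustrative.

def generate(state: dict) -> dict:
    # Condition generation on any critique written to state in a prior pass.
    critique_text = state.get("critique", "")
    draft = f"draft v{state['attempts'] + 1}"
    if critique_text:
        draft += f" (revised per: {critique_text})"
    return {**state, "draft": draft, "attempts": state["attempts"] + 1}

def critique(state: dict) -> dict:
    # Toy critic: accept after the second attempt. A real critic calls an LLM.
    ok = state["attempts"] >= 2
    return {**state, "critique": "" if ok else "tighten the argument", "done": ok}

def route(state: dict) -> str:
    # The "conditional edge": loop back to generate until the critic approves.
    return "end" if state["done"] or state["attempts"] >= 5 else "generate"

state = {"attempts": 0}
while True:
    state = generate(state)
    state = critique(state)
    if route(state) == "end":
        break

print(state["draft"])
```

The useful property of the graph formulation is that the loop's termination condition lives in one routing function rather than being scattered across node code.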
MCP compatibility: LangGraph supports MCP tool servers through a tool integration layer. There are some rough edges around dynamic tool discovery and auth flows, but the path is documented and functional.
Production readiness: LangSmith traces LangGraph executions with full lineage. The framework has been deployed at meaningful scale. The stability story is better than it was in 2024, though LangChain's history of rapid API churn means pinning dependencies is mandatory.
Summary: Best choice for teams that want fine-grained control over agent state transitions and can build or plug in their own memory layer. The graph model pays dividends for complex reflection loops and branching evaluation logic.
AutoGen
AutoGen (Microsoft Research) is built around multi-agent conversation. Agents are message-passing entities — an orchestrator dispatches tasks to specialized sub-agents and assembles results. Reflection is natural in this model: a reviewer agent critiques the generator agent's output via structured messages.
Memory: AutoGen's memory story is the weakest of the five frameworks. The built-in Memory class is session-scoped and flat. Cross-session persistence requires external integration. The framework does not provide typed graph storage, entity resolution, or temporal retrieval. Several community extensions add vector retrieval via ChromaDB or Qdrant, but these are not first-party and vary in quality.
Reflection: AutoGen's native strength. The conversation model makes actor-critic and multi-reviewer patterns trivial — a reviewer agent simply responds to the generator's message with a critique, and the generator iterates. The framework handles message routing, turn-taking, and loop termination without custom orchestration. This is the most elegant reflection implementation of the five.
MCP compatibility: AutoGen 0.4+ added MCP tool support. The integration is newer than LangGraph's and has fewer documented edge cases. Basic tool registration works; complex auth flows and streaming tools require more care.
Production readiness: The conversation-threading model adds overhead at runtime. For high-volume, low-latency inference, the message-passing overhead is measurable. For low-frequency, high-reasoning-depth tasks, it is negligible. Observability is weaker than LangGraph/LangSmith — traces require custom telemetry or OpenTelemetry integration.
Summary: Best choice for tasks that naturally decompose into specialist agents critiquing each other. The reflection story is excellent. The memory story requires significant investment before production.
CrewAI
CrewAI organizes agents into crews — teams with roles, goals, and a process (sequential or hierarchical). The abstraction maps well to enterprise workflows: a research agent, an analyst agent, a writer agent, managed by an orchestrator.
Memory: CrewAI has invested in memory more than most frameworks. It supports short-term (in-session), long-term (cross-session with SQLite), entity memory (named entity tracking), and a UserMemory type for standing preferences. The implementation is flat vector retrieval over embedded memory items — no graph traversal, no entity resolution at the graph level, no temporal validity on stored facts. For simple use cases, this is sufficient. For production agents with hundreds of entities, the limitations surface predictably.
Reflection: Reflection is not a built-in primitive. The reviewer-agent pattern is possible — add a critic role to the crew — but it requires more explicit wiring than AutoGen. The hierarchical process type (manager agent overseeing task execution) provides a natural critique layer for long tasks.
MCP compatibility: CrewAI supports MCP tool servers via a tool wrapper, added in late 2025. Coverage is reasonable for standard tools; the integration is newer and less battle-tested than LangGraph's.
Production readiness: CrewAI runs synchronously by default; async support exists but is less mature. The framework is opinionated, which makes simple things simple and complex things rigid. When agent behavior deviates from the role/goal/crew pattern, the abstractions start working against you.
Summary: Best choice for teams mapping clear human-like role structures to agents. The memory abstractions are the most ergonomic of the five for getting something working quickly, but understand their limits before committing to them at scale.
Haystack
Haystack (deepset) started as a document retrieval framework and has grown into a general agent pipeline framework. Pipelines are directed acyclic graphs of components — retrievers, generators, classifiers — wired together with typed connections.
Memory: Haystack's strength is retrieval over document stores. The framework has mature integrations with Qdrant, Weaviate, Chroma, OpenSearch, and Elasticsearch. For agents that retrieve from documents (RAG), Haystack is the most production-ready option. For agents that accumulate and query their own memory — the use case this comparison is about — the framework is less opinionated and requires more custom plumbing.
Reflection: No built-in reflection patterns. Implementing a critique loop requires custom components and custom pipeline topology. The DAG model does not express cyclic feedback loops natively — a loop requires either a loop component or a Python wrapper around the pipeline. Doable, but more friction than LangGraph or AutoGen.
MCP compatibility: Experimental. Haystack tool integration focuses on OpenAI function calling and internal tool schemas. MCP-native tool discovery is not yet a first-class feature.
Production readiness: Excellent for RAG. Haystack has the most mature story for deploying document retrieval pipelines at scale, with Kubernetes-native deployment, monitoring integrations, and stable APIs. For self-improving agents specifically, it requires significant custom work.
Summary: Best choice for teams whose primary intelligence amplifier is document retrieval over a static corpus, not accumulated agent memory. For pure self-improvement use cases, it is the weakest fit.
Semantic Kernel
Semantic Kernel (Microsoft) is a framework for orchestrating AI models, plugins, and memories in enterprise .NET and Python applications. The enterprise focus means deeper integration with Azure services, Active Directory, and Microsoft 365.
Memory: Semantic Kernel has the most architecturally ambitious memory story — MemoryStore abstractions that support SQL, vector, and external stores. In practice, most deployments use flat vector retrieval via Azure AI Search. The enterprise-grade options (structured recall, entity tracking) are available but require significant configuration. MCP-native memory integration via Oxagen or similar external stores is straightforward given the abstraction layer.
Reflection: Not a native primitive. The planner component can retry failed plans, which is adjacent to reflection, but self-critique as an explicit mechanism requires custom implementation.
MCP compatibility: Semantic Kernel added first-class MCP tool support in late 2025. The Python SDK integration is cleaner than AutoGen's and roughly on par with LangGraph's.
Production readiness: Strong in .NET, adequate in Python. The enterprise deployment story (Azure-native, AAD auth, audit logging) is the best of the five frameworks for regulated environments. Python parity with .NET is ongoing.
Summary: Best choice for teams deploying to Azure infrastructure or integrating with Microsoft 365. The enterprise story is excellent; the self-improvement patterns require more custom investment than LangGraph or AutoGen.
Comparison at a glance
| Framework | Memory depth | Reflection ease | MCP support | Prod readiness |
|---|---|---|---|---|
| LangGraph | External | Medium | Good | Strong |
| AutoGen | Weak | Excellent | Good | Medium |
| CrewAI | Flat vector | Medium | Adequate | Medium |
| Haystack | RAG-strong | Low | Experimental | Strong (RAG) |
| Semantic Kernel | Configurable | Low | Good | Strong (Azure) |
The gap every framework shares: none of them ships a typed, graph-backed memory layer with entity resolution, temporal edges, and provenance out of the box. All five use flat vector retrieval as the default memory primitive, which is where production self-improving agents plateau.
What this means for the memory layer
A self-improving agent built on any of these frameworks will eventually outgrow the built-in memory abstraction. The failure is predictable: the agent accumulates a few thousand memory items, entity records fragment into duplicates, multi-hop questions fail silently, and the team spends engineering cycles on workarounds.
The pattern teams follow: pick a framework for its control flow story (LangGraph for complex graphs, AutoGen for multi-agent critique), and plug a typed external memory layer into the framework's memory abstraction point. The two decisions are separable. The framework determines how the agent thinks; the memory layer determines what the agent knows.
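Concretely, the plug-in is an adapter: implement whatever save/search interface the framework expects, and delegate to the external store. A hedged sketch with invented interface names (each framework's actual hook differs), using a toy in-memory graph in place of a real Neo4j-backed service:

```python
# Sketch of plugging an external graph-backed memory into a framework's memory
# abstraction point. `FrameworkMemory` stands in for whatever interface your
# framework expects; `GraphStore` stands in for a typed store behind an API.
# All names are illustrative assumptions.
from typing import Protocol

class FrameworkMemory(Protocol):
    # The shape most frameworks expect: free-text save and query.
    def save(self, text: str) -> None: ...
    def search(self, query: str, k: int) -> list[str]: ...

class GraphStore:
    # Toy stand-in for a typed graph store: facts attach to resolved entity
    # nodes instead of floating in a flat vector index.
    def __init__(self):
        self.nodes: dict[str, list[str]] = {}
    def upsert_fact(self, entity: str, fact: str) -> None:
        self.nodes.setdefault(entity.lower(), []).append(fact)
    def facts_for(self, entity: str) -> list[str]:
        return self.nodes.get(entity.lower(), [])

def extract_entity(text: str) -> str:
    # Toy extractor: first capitalized word. Real systems use NER plus alias
    # resolution; this is the piece flat stores skip entirely.
    for w in text.split():
        w = w.strip("?.,!")
        if w and w[0].isupper():
            return w
    return "unknown"

class GraphMemoryAdapter:
    """Implements the framework-facing interface, delegates to the graph."""
    def __init__(self, store: GraphStore):
        self.store = store
    def save(self, text: str) -> None:
        self.store.upsert_fact(extract_entity(text), text)
    def search(self, query: str, k: int = 5) -> list[str]:
        return self.store.facts_for(extract_entity(query))[:k]

mem: FrameworkMemory = GraphMemoryAdapter(GraphStore())
mem.save("Acme signed a three-year contract")
mem.save("Acme expanded to 40 seats")
print(mem.search("what do we know about Acme?", k=5))
```

Because the adapter satisfies the framework's interface, the framework never knows the store is graph-backed; that is what makes the two decisions separable.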
FAQ
Which agent framework is best for building self-improving agents?
LangGraph for precise control flow and complex reflection loops; AutoGen for multi-agent critique patterns. Neither ships a production-grade memory layer — plan to integrate one externally.
Does LangGraph support MCP?
Yes. LangGraph supports MCP tool servers with documented integration. Coverage is more mature than AutoGen's or CrewAI's as of 2026.
What is the main limitation of CrewAI's memory system?
CrewAI's long-term memory uses flat vector retrieval without entity resolution or temporal validity. This works well under a few hundred memories; it breaks on entity-heavy workspaces with thousands of items.
Can I use Haystack for self-improving agents?
Haystack excels at document retrieval (RAG) but is not optimized for self-improving agent memory. The framework lacks native reflection primitives and has limited MCP support. It is the weakest fit for this use case of the five.
How do I add graph-based memory to an agent framework?
All five frameworks have an external memory abstraction point. Replace the default vector store with a typed graph-backed store — Neo4j or similar — and implement the retrieval interface. Oxagen exposes this through an MCP-native API that any framework with MCP tool support can consume.
Further reading
- Self-Improving AI Agents: A Technical Overview — the mechanisms that make agents self-improving
- Memory Architectures for AI Agents: Vector, Graph, Hybrid — why flat vector memory is not enough
- Knowledge Graphs for Agent Memory: Design Patterns — the schema that replaces flat vector stores
- How to Evaluate Self-Improving AI Agents — measuring whether your framework choice is working
Oxagen is the memory layer for self-improving agents — a typed, Neo4j-backed knowledge graph that plugs into LangGraph, AutoGen, CrewAI, and any other MCP-compatible framework. Read the docs to get an API key, or book a demo to see production agent memory in practice.