Engineering
Deploying Self-Improving Agents: Production Checklist
By Oxagen Team
- AI Agents
- Production
- Deployment
- Infrastructure
- MLOps
Deploying a self-improving agent to production requires more than deploying an LLM API wrapper. The agent accumulates state, modifies its own behavior, and runs continuously — which means infrastructure that would be optional for a stateless agent becomes load-bearing. Memory persistence, observability, and security controls are not features to add later. They need to be in place before the first production request.
This checklist is organized by deployment phase. Work through it in order. Each item has a gate condition — if you cannot satisfy the gate, do not proceed to the next phase.
Phase 1: Memory infrastructure
1.1 Choose and configure the memory store
The minimum viable production memory store for a self-improving agent is a vector index plus workspace scoping. The recommended production store is a typed graph (Neo4j or equivalent) plus a vector index for hybrid retrieval.
Gate: can the memory store answer these three queries correctly?
- Single-entity lookup: "Retrieve all observations about Sarah Chen" → returns observations about Sarah Chen and no other entity
- Multi-hop query (graph only): "Which teams have Sarah Chen worked with?" → traverses relationships, not keyword search
- Workspace isolation: a query with `workspace_id=A` returns zero results from workspace B even if both contain identical entity names
If the memory store cannot answer query 1 correctly, fix entity resolution before proceeding. If it cannot answer query 2, accept the limitation and document it as a known constraint. If it cannot answer query 3, do not proceed — workspace leakage is a security failure, not a limitation.
1.2 Configure workspace-scoped access control
Every memory operation must be scoped to a workspace identifier. The workspace ID must come from an authenticated request context — never from user-supplied input that bypasses authentication.
Minimum configuration:
- Every read and write to the memory store includes `workspace_id` as a mandatory filter
- The filter is applied at the data layer (database query), not the application layer (Python code)
- Access tokens are scoped to one workspace; tokens cannot query across workspace boundaries
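The data-layer filter can be sketched with an in-memory SQLite table — the table layout and function name here are illustrative, not Oxagen's schema. The point is that `workspace_id` appears as a parameterized condition in the query itself, so no application-code path can skip it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (workspace_id TEXT, entity TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [("ws_a", "Sarah Chen", "leads the data team"),
     ("ws_b", "Sarah Chen", "works in sales")],  # same entity name, other workspace
)

def query_observations(workspace_id: str, entity: str) -> list[str]:
    # workspace_id is a mandatory, parameterized filter applied in the query itself
    rows = conn.execute(
        "SELECT text FROM observations WHERE workspace_id = ? AND entity = ?",
        (workspace_id, entity),
    ).fetchall()
    return [text for (text,) in rows]
```

A query for `ws_a` returns only `ws_a`'s observation; a query for a nonexistent workspace returns nothing — which is exactly the isolation gate from section 1.1.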
1.3 Set memory write policies
Decide before deployment:
| Policy | Recommended value | Reasoning |
|---|---|---|
| Minimum confidence for auto-write | 0.75 | Below this, extraction errors outweigh signal |
| Contradiction detection | Enabled | New observations that conflict with existing ones are queued, not auto-written |
| Memory retention period | 90 days (configurable) | Stale facts degrade retrieval quality; aged observations need revalidation |
| Max observations per entity | 500 | Prevents single entities from dominating retrieval |
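The policy table above reduces to a small decision function. This is a minimal sketch with the recommended values hard-coded; the `Observation` shape and the `"queue"` outcome for contradictions are assumptions about how your pipeline is wired:

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.75       # below this, extraction errors outweigh signal
MAX_OBS_PER_ENTITY = 500    # prevents one entity from dominating retrieval

@dataclass
class Observation:
    entity: str
    text: str
    confidence: float

def write_decision(obs: Observation, existing_count: int,
                   contradicts_existing: bool) -> str:
    """Return 'write', 'queue', or 'reject' per the policy table."""
    if obs.confidence < MIN_CONFIDENCE:
        return "reject"
    if contradicts_existing:
        return "queue"      # contradiction detection: queued for review, not auto-written
    if existing_count >= MAX_OBS_PER_ENTITY:
        return "reject"
    return "write"
```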
1.4 Verify memory round-trip latency
Write a test observation, query for it, and measure the round-trip time. Targets:
- Write latency: < 200ms P99
- Query latency: < 100ms P99 for top-5 retrieval
- Multi-hop traversal (graph): < 50ms P99 for 3-hop queries
If write latency is too high, consider async writes — the agent returns the output to the user immediately and the memory write happens in a background task. This is the recommended pattern for latency-sensitive applications.
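The async-write pattern looks like this with `asyncio` — `write_memory` and `handle_request` are stand-ins for your real memory client and agent handler. The response returns immediately; the write completes in the background:

```python
import asyncio

async def write_memory(observation: dict) -> None:
    # stand-in for the real memory-store write (network call in production)
    await asyncio.sleep(0.01)

async def handle_request(user_input: str) -> str:
    answer = f"answer for {user_input}"   # agent output (stubbed)
    # schedule the memory write without blocking the response
    asyncio.create_task(write_memory({"source": user_input, "text": answer}))
    return answer
```

In production, hold a reference to the task (or use a task group) and monitor write failures — a silently dropped background write is exactly the kind of failure the write-failure alert in section 2.3 exists to catch.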
Phase 2: Observability
2.1 Structured traces for every agent invocation
Every request through the agent must produce a structured trace with:
- Request ID
- Workspace ID (no PII)
- Task type / routing label
- Memory queries issued (query text, result count, latency)
- LLM calls made (model, token count, latency, success/failure)
- Reflection iterations (count, verifier result per iteration)
- Memory writes triggered (observation type, confidence)
- Total latency and cost
Do not log PII, user content, or memory observation text in traces. Log structure and metadata only.
2.2 Benchmark harness logging
The benchmark harness must write results to an append-only log with:
- ISO timestamp
- Agent version / git hash
- Per-task pass/fail
- Aggregate pass rate
- P50 and P95 latency per task
The log must be queryable. Store in a time-series database, a structured append-only file, or a logging service that supports time-range queries.
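The structured append-only file option is the simplest to start with. A minimal JSONL sketch (the percentile indexing here is nearest-rank, which is fine for benchmark-sized samples):

```python
import datetime
import json

def log_benchmark_run(path: str, agent_version: str,
                      results: dict, latencies_ms: list) -> dict:
    """Append one benchmark run to a JSONL log (append-only, time-queryable)."""
    latencies = sorted(latencies_ms)
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent_version": agent_version,                   # git hash
        "results": results,                               # {task_id: passed}
        "pass_rate": sum(results.values()) / len(results),
        "p50_ms": latencies[len(latencies) // 2],
        "p95_ms": latencies[int(len(latencies) * 0.95)],
    }
    with open(path, "a") as f:                            # append-only: never overwrite
        f.write(json.dumps(entry) + "\n")
    return entry
```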
2.3 Alerting thresholds
Configure alerts before going live:
| Alert | Threshold | Action |
|---|---|---|
| Benchmark pass rate drops > 5pp from prior best | Fire immediately | Page on-call |
| Memory write failure rate > 1% | Fire after 5 minutes | Investigate extraction pipeline |
| LLM API error rate > 0.5% | Fire after 5 minutes | Check provider status |
| P99 latency > 3x baseline | Fire after 10 minutes | Investigate memory query pattern |
| Daily memory write volume drops > 50% | Fire daily | Check ingestion pipeline |
2.4 Cost tracking
Self-improving agents have variable per-request costs because reflection adds LLM calls. Track:
- LLM tokens per request (input + output, broken down by node)
- Memory extraction tokens (the step that converts outputs to graph observations)
- Benchmark harness cost per run (LLM-as-judge calls if used)
Set a daily cost budget and alert at 80% and 100% of budget. Runaway reflection loops (verifier never passes, iterations exhaust budget) can multiply per-request cost by 5–8x without alerting if cost tracking is not in place.
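The budget check itself is small. Per-1K-token rates below are illustrative placeholders — substitute your provider's actual pricing:

```python
# Illustrative per-1K-token rates, NOT real provider pricing.
PRICE_PER_1K_USD = {"input": 0.003, "output": 0.015}

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Token cost for one request; sum across LLM + extraction + judge calls."""
    return (input_tokens / 1000) * PRICE_PER_1K_USD["input"] + \
           (output_tokens / 1000) * PRICE_PER_1K_USD["output"]

def budget_status(spent_usd: float, daily_budget_usd: float) -> str:
    """Alert level for current daily spend: warn at 80%, critical at 100%."""
    ratio = spent_usd / daily_budget_usd
    if ratio >= 1.0:
        return "critical"
    if ratio >= 0.8:
        return "warn"
    return "ok"
```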
Phase 3: Security controls
3.1 Input validation before memory ingestion
Every piece of content that enters the memory pipeline must be validated:
- Maximum content length per observation (recommended: 2,000 characters)
- Disallowed content patterns for instruction injection (prompts that attempt to override the extraction step's instructions)
- Source allowlist: only ingest content from approved sources; reject untrusted documents from external users unless they are handled in a sandboxed extraction environment
Instruction injection into the memory pipeline — an adversarial document that instructs the extraction model to write false facts — is the primary attack surface for self-improving agents. The mitigation is input validation at the ingestion boundary.
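A sketch of the ingestion-boundary check, assuming a pattern blocklist and source allowlist of your own. The patterns and source names below are illustrative — a production blocklist needs to be broader and regularly updated, and pattern matching alone will not catch every injection:

```python
import re

MAX_OBS_CHARS = 2000
# Illustrative injection patterns only; extend and maintain in production.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]
ALLOWED_SOURCES = {"crm_export", "support_tickets", "internal_wiki"}  # example allowlist

def validate_for_ingestion(content: str, source: str) -> tuple:
    """Return (ok, reason) for content entering the memory pipeline."""
    if source not in ALLOWED_SOURCES:
        return False, "source not on allowlist"
    if len(content) > MAX_OBS_CHARS:
        return False, "content exceeds max length"
    for pattern in INJECTION_PATTERNS:
        if pattern.search(content):
            return False, "instruction-injection pattern detected"
    return True, "ok"
```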
3.2 Memory write audit log
Every observation write must be logged with:
- Source document or interaction ID
- Extraction confidence
- Entity name and type
- Timestamp
- Agent version that wrote it
The audit log enables provenance queries ("why does the agent believe X?") and supports incident response when memory poisoning is detected.
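The audit entry and the provenance query it enables can be sketched as follows — field names are illustrative, and a real deployment would store entries in the same append-only log infrastructure as section 2.2:

```python
import datetime

def audit_record(source_id: str, entity: str, entity_type: str,
                 confidence: float, agent_version: str) -> dict:
    """One audit entry per observation write."""
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "source_id": source_id,        # document or interaction that produced the fact
        "entity": entity,
        "entity_type": entity_type,
        "confidence": confidence,
        "agent_version": agent_version,
    }

def provenance(audit_log: list, entity: str) -> list:
    """Answer 'why does the agent believe X?' by filtering the audit log."""
    return [rec for rec in audit_log if rec["entity"] == entity]
```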
3.3 Skill registry controls (if skill acquisition is enabled)
If the agent writes new tools to a skill registry:
- Sandbox execution: agent-written code runs in an isolated environment (separate container, no network access to production systems)
- Human review gate: no tool enters the production registry without review
- Tool execution limits: maximum CPU and memory per tool invocation, maximum wall time
- Registry size cap: maximum number of tools in the production registry (recommended: 50–100)
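The wall-time limit alone can be sketched with a subprocess timeout. This is only one layer — a production sandbox also needs a container boundary, CPU/memory limits, and no network egress to production systems, none of which this snippet provides:

```python
import subprocess
import sys

def run_tool_sandboxed(code: str, timeout_s: float = 5.0) -> tuple:
    """Run agent-written code in a separate process with a wall-time cap.
    Demonstrates only the timeout layer of sandboxing."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode == 0, proc.stdout
    except subprocess.TimeoutExpired:
        return False, "tool exceeded wall-time limit"
```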
3.4 Secrets and credentials
- The agent's memory API key must be scoped to the minimum required permissions (read + write observations; no workspace creation, no billing)
- API keys rotate on a fixed schedule (recommended: 90 days)
- No credentials in logs, traces, or memory observations
- API keys stored in a secrets manager; never in environment variables committed to source control
Phase 4: Operational readiness
4.1 Rollback plan
Every deployment must have a rollback plan that addresses:
- Memory rollback: if a bad memory update poisons the workspace, what is the procedure to restore to a prior checkpoint? Graph databases support point-in-time restore; configure backups before deployment.
- Agent version rollback: rolling back the code does not roll back accumulated memory. Document whether old agent code is compatible with current memory state.
- Benchmark regression response: if a deployment causes benchmark regression, what is the threshold for automatic rollback vs. manual review?
4.2 Incident response runbook
Document before deployment:
- Benchmark regression alert fires. Check: recent code deploy? Recent large memory update? Recent ingestion source added? Rollback to prior agent version if regression > 10pp.
- Memory poisoning detected. Check provenance audit log. Identify source. Soft-delete affected observations. Re-run entity resolution. Re-run benchmark.
- Workspace isolation failure. Halt agent immediately. Audit all queries in the past 24 hours. Identify affected workspaces. Notify affected tenants per your breach notification policy.
- Cost spike. Identify which task type is driving cost. Check reflection iteration count per task — runaway reflection is the most common cause. Apply a hard reflection iteration cap (2 maximum) if not already in place.
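The hard reflection cap from the cost-spike playbook is a one-line guard in the loop. A minimal sketch, with `verify` and `improve` standing in for your verifier and improver nodes:

```python
MAX_REFLECTIONS = 2  # hard cap: beyond two iterations, cost grows without quality

def reflect_with_cap(draft, verify, improve):
    """Verify/improve loop that cannot run away: at most MAX_REFLECTIONS rounds."""
    output = draft
    for _ in range(MAX_REFLECTIONS):
        ok, feedback = verify(output)
        if ok:
            break
        output = improve(output, feedback)
    return output
```

Even a verifier that never passes terminates after `MAX_REFLECTIONS` rounds, which bounds the worst-case token cost per request.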
4.3 Capacity planning
Self-improving agents have two growth curves that must be capacity-planned independently:
Request volume. Standard LLM API capacity planning. Scale inference compute with request volume.
Memory volume. The memory store grows continuously. Benchmark retrieval latency at 10x current observation count. If latency degrades unacceptably before reaching 10x, upgrade the memory infrastructure before hitting the threshold in production. A typed graph store scales to hundreds of millions of nodes with proper indexing; a flat vector store degrades measurably past 50,000–100,000 observations per workspace.
4.4 Final gate: go/no-go criteria
Do not deploy to production unless all of the following are true:
- Benchmark pass rate ≥ 80% on the full task set
- Memory round-trip latency meets the benchmarks in Phase 1
- Alerting is configured and tested (send a test alert manually)
- Audit logging is enabled and queryable
- Workspace isolation is verified by a test query across workspace boundaries
- Cost tracking is live and baseline cost per request is documented
- Rollback plan is documented and tested
- At least one person on the team has read the incident response runbook
FAQ
What is the minimum memory infrastructure for a production self-improving agent?
Minimum: a vector index with workspace-scoping and the ability to filter by entity name and observation type. Recommended: a typed graph store with vector indexing on observations, entity resolution, and temporal validity on relationships.
How do I prevent instruction injection into agent memory?
Validate all content at the ingestion boundary: maximum length, disallowed instruction patterns, and source allowlist. Run extraction in a system prompt context that explicitly instructs the model to extract facts, not execute instructions. A sandboxed extraction environment is the strongest mitigation.
What is the recommended reflection iteration cap?
Two iterations maximum in production. Beyond two iterations, marginal accuracy improvement is near zero while token cost continues to grow. A third iteration adds cost without adding quality on most real-world task distributions.
How often should backups of the memory store run?
Daily minimum. For high-value workspaces (enterprise, regulated), hourly is appropriate. Test restore quarterly — a backup that cannot be restored is not a backup.
What should I do if the benchmark pass rate drops suddenly?
Check in this order: (1) recent code deploy, (2) recent large memory update or ingestion source change, (3) memory store infrastructure (latency, availability). If the cause is not immediately obvious, roll back to the prior agent version and investigate from a stable baseline.
Further reading
- Self-Improving AI Agents: A Technical Overview — the mechanisms that make agents self-improving
- 5 Failure Modes in Self-Improving AI Agents — every checklist item here prevents one of these failure modes
- How to Evaluate Self-Improving AI Agents — the benchmark harness referenced throughout this checklist
- Build a Self-Improving AI Agent in Python: Walkthrough — the agent implementation this checklist is designed to deploy
Oxagen is the ontology layer for AI agents — a typed, workspace-scoped, Neo4j-backed knowledge graph with audit logging, entity resolution, and MCP-native access. Read the docs to get an API key, or book a demo for production deployment guidance and dedicated infrastructure options.