Deploying Self-Improving Agents: Production Checklist
Engineering

By Oxagen Team

  • AI Agents
  • Production
  • Deployment
  • Infrastructure
  • MLOps

Deploying a self-improving agent to production requires more than deploying an LLM API wrapper. The agent accumulates state, modifies its own behavior, and runs continuously — which means infrastructure that would be optional for a stateless agent becomes load-bearing. Memory persistence, observability, and security controls are not features to add later. They need to be in place before the first production request.

This checklist is organized by deployment phase. Work through it in order. Each item has a gate condition — if you cannot satisfy the gate, do not proceed to the next phase.

Phase 1: Memory infrastructure

1.1 Choose and configure the memory store

The minimum viable production memory store for a self-improving agent is a vector index plus workspace-scoping. The recommended production store is a typed graph (Neo4j or equivalent) plus a vector index for hybrid retrieval.

Gate: can the memory store answer these three queries correctly?

  • Single-entity lookup: "Retrieve all observations about Sarah Chen" → returns observations about Sarah Chen and no other entity
  • Multi-hop query (graph only): "Which teams have Sarah Chen worked with?" → traverses relationships, not keyword search
  • Workspace isolation: a query with workspace_id=A returns zero results from workspace B even if both contain identical entity names

If the memory store cannot answer query 1 correctly, fix entity resolution before proceeding. If it cannot answer query 2, accept the limitation and document it as a known constraint. If it cannot answer query 3, do not proceed — workspace leakage is a security failure, not a limitation.
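The workspace-isolation gate can be smoke-tested with a few lines against any store. The `MemoryStore` class below is a hypothetical in-memory stand-in used only to show the shape of the test; swap in your real store's client for the actual gate check.

```python
# Hypothetical minimal store used only to illustrate the gate queries;
# replace with your real memory store's client before running the gate.
class MemoryStore:
    def __init__(self):
        self._rows = []  # (workspace_id, entity, observation)

    def write(self, workspace_id, entity, observation):
        self._rows.append((workspace_id, entity, observation))

    def query(self, workspace_id, entity):
        # workspace_id is a mandatory filter on every read
        return [obs for ws, ent, obs in self._rows
                if ws == workspace_id and ent == entity]

store = MemoryStore()
store.write("A", "Sarah Chen", "Leads the platform team")
store.write("B", "Sarah Chen", "Works in a different tenant")

# Gate query 1: single-entity lookup returns only the requested entity
assert store.query("A", "Sarah Chen") == ["Leads the platform team"]
# Gate query 3: identical entity names must not leak across workspaces
assert "Works in a different tenant" not in store.query("A", "Sarah Chen")
```

Run the same pair of writes against your production store with two real workspace IDs; any cross-workspace result is a hard stop.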

1.2 Configure workspace-scoped access control

Every memory operation must be scoped to a workspace identifier. The workspace ID must come from an authenticated request context — never from user-supplied input that bypasses authentication.

Minimum configuration:

  • Every read and write to the memory store includes workspace_id as a mandatory filter
  • The filter is applied at the data layer (database query), not the application layer (Python code)
  • Access tokens are scoped to one workspace; tokens cannot query across workspace boundaries
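One way to enforce the data-layer rule is a query builder that injects the workspace filter itself, so application code never has the option to omit it. `AuthContext` and `build_query` are illustrative names, not a real API:

```python
# Sketch of data-layer scoping: the workspace filter is appended by the
# query builder, so calling code cannot forget it or override it.
from dataclasses import dataclass

@dataclass(frozen=True)
class AuthContext:
    workspace_id: str  # taken from the verified token, never from the request body

def build_query(ctx: AuthContext, entity: str) -> tuple[str, dict]:
    sql = ("SELECT observation FROM observations "
           "WHERE entity = :entity AND workspace_id = :ws")
    return sql, {"entity": entity, "ws": ctx.workspace_id}

sql, params = build_query(AuthContext(workspace_id="ws_123"), "Sarah Chen")
assert "workspace_id = :ws" in sql
assert params["ws"] == "ws_123"
```

The same pattern applies to graph queries: the traversal helper, not the caller, appends the workspace predicate.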

1.3 Set memory write policies

Decide before deployment:

| Policy | Recommended value | Reasoning |
|---|---|---|
| Minimum confidence for auto-write | 0.75 | Below this, extraction errors outweigh signal |
| Contradiction detection | Enabled | New observations that conflict with existing ones are queued, not auto-written |
| Memory retention period | 90 days (configurable) | Stale facts degrade retrieval quality; aged observations need revalidation |
| Max observations per entity | 500 | Prevents single entities from dominating retrieval |
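These policies collapse into a single gate function in front of the write path. The decision order below (entity cap, then contradiction check, then confidence threshold) is an illustrative choice, not a prescribed one:

```python
# Illustrative write-policy gate using the recommended defaults; in
# practice these thresholds should be configurable per deployment.
AUTO_WRITE_MIN_CONFIDENCE = 0.75
MAX_OBSERVATIONS_PER_ENTITY = 500

def write_decision(confidence: float, contradicts_existing: bool,
                   entity_observation_count: int) -> str:
    if entity_observation_count >= MAX_OBSERVATIONS_PER_ENTITY:
        return "reject"   # entity cap: prevent one entity dominating retrieval
    if contradicts_existing:
        return "queue"    # contradictions go to review, never auto-write
    if confidence < AUTO_WRITE_MIN_CONFIDENCE:
        return "reject"   # below threshold, extraction noise outweighs signal
    return "write"

assert write_decision(0.9, False, 10) == "write"
assert write_decision(0.9, True, 10) == "queue"
assert write_decision(0.5, False, 10) == "reject"
```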

1.4 Verify memory round-trip latency

Write a test observation, query for it, measure round-trip time. Benchmark:

  • Write latency: < 200ms P99
  • Query latency: < 100ms P99 for top-5 retrieval
  • Multi-hop traversal (graph): < 50ms P99 for 3-hop queries

If write latency is too high, consider async writes — the agent returns the output to the user immediately and the memory write happens in a background task. This is the recommended pattern for latency-sensitive applications.
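A minimal sketch of the async-write pattern with `asyncio`: the handler produces the response first, then schedules the memory write as a background task. `store_write` here is a placeholder for your real memory client; in a running server the event loop stays alive naturally, so the trailing sleep is only for the standalone example.

```python
# Sketch: respond to the user immediately, persist memory in the background.
import asyncio

WRITES: list[str] = []  # stand-in for the real memory store

async def store_write(observation: str) -> None:
    await asyncio.sleep(0.05)  # stand-in for memory-store write latency
    WRITES.append(observation)

async def handle_request(user_input: str) -> str:
    answer = f"answer to {user_input!r}"       # produce the response first
    asyncio.create_task(store_write(answer))   # memory write off the hot path
    return answer                              # user sees the answer immediately

async def main() -> str:
    reply = await handle_request("status of project X")
    await asyncio.sleep(0.1)  # keep the loop alive so the write completes
    return reply

reply = asyncio.run(main())
```

The trade-off: a failed background write must be retried or alerted on (see the write-failure alert in Phase 2), because the user has already received a response.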

Phase 2: Observability

2.1 Structured traces for every agent invocation

Every request through the agent must produce a structured trace with:

  • Request ID
  • Workspace ID (no PII)
  • Task type / routing label
  • Memory queries issued (query text, result count, latency)
  • LLM calls made (model, token count, latency, success/failure)
  • Reflection iterations (count, verifier result per iteration)
  • Memory writes triggered (observation type, confidence)
  • Total latency and cost

Do not log PII, user content, or memory observation text in traces. Log structure and metadata only.
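The trace can be a flat record serialized to one JSON line per invocation. The field names below mirror the list above but are illustrative; note that only counts, labels, and timings are stored:

```python
# Sketch of a structured, metadata-only trace record; field names are
# illustrative. No user content or memory observation text is logged.
import json
from dataclasses import dataclass, field, asdict

@dataclass
class AgentTrace:
    request_id: str
    workspace_id: str
    task_type: str
    memory_queries: list = field(default_factory=list)  # {query_text, result_count, latency_ms}
    llm_calls: list = field(default_factory=list)       # {model, tokens, latency_ms, ok}
    reflection_iterations: int = 0
    memory_writes: list = field(default_factory=list)   # {observation_type, confidence}
    total_latency_ms: float = 0.0
    total_cost_usd: float = 0.0

trace = AgentTrace("req_001", "ws_123", "summarize")
trace.llm_calls.append({"model": "example-model", "tokens": 812,
                        "latency_ms": 640, "ok": True})
line = json.dumps(asdict(trace))  # one JSON line per agent invocation
```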

2.2 Benchmark harness logging

The benchmark harness must write results to an append-only log with:

  • ISO timestamp
  • Agent version / git hash
  • Per-task pass/fail
  • Aggregate pass rate
  • P50 and P95 latency per task

The log must be queryable. Store in a time-series database, a structured append-only file, or a logging service that supports time-range queries.
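A structured append-only file satisfies the requirement with very little machinery: one JSON line per benchmark run, queried by time range. Path and field names below are illustrative:

```python
# Sketch: benchmark results as an append-only JSONL log with a
# time-range query. ISO timestamps sort lexically, so string
# comparison is sufficient for range filtering.
import json, datetime, os, tempfile

def append_result(path, git_hash, per_task, p50_ms, p95_ms):
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent_version": git_hash,
        "per_task": per_task,  # {"task_id": passed, ...}
        "pass_rate": sum(per_task.values()) / len(per_task),
        "p50_ms": p50_ms,
        "p95_ms": p95_ms,
    }
    with open(path, "a") as f:   # append-only: records are never rewritten
        f.write(json.dumps(record) + "\n")

def query_range(path, since_iso):
    with open(path) as f:
        return [r for r in map(json.loads, f) if r["ts"] >= since_iso]

path = os.path.join(tempfile.gettempdir(), "bench_log.jsonl")
open(path, "w").close()  # fresh file for the example
append_result(path, "abc1234", {"t1": True, "t2": False}, 420.0, 910.0)
rows = query_range(path, "1970-01-01")
```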

2.3 Alerting thresholds

Configure alerts before going live:

| Alert | When it fires | Action |
|---|---|---|
| Benchmark pass rate drops > 5pp from prior best | Immediately | Page on-call |
| Memory write failure rate > 1% | After 5 minutes | Investigate extraction pipeline |
| LLM API error rate > 0.5% | After 5 minutes | Check provider status |
| P99 latency > 3x baseline | After 10 minutes | Investigate memory query pattern |
| Daily memory write volume drops > 50% | Daily | Check ingestion pipeline |

2.4 Cost tracking

Self-improving agents have variable per-request costs because reflection adds LLM calls. Track:

  • LLM tokens per request (input + output, broken down by node)
  • Memory extraction tokens (the step that converts outputs to graph observations)
  • Benchmark harness cost per run (LLM-as-judge calls if used)

Set a daily cost budget and alert at 80% and 100% of budget. Runaway reflection loops (verifier never passes, iterations exhaust budget) can multiply per-request cost by 5–8x without alerting if cost tracking is not in place.
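The budget check itself is small; the alert sink below is a plain list standing in for a real pager or Slack hook, and each threshold fires at most once per day:

```python
# Illustrative daily cost budget with alerts at 80% and 100%.
class CostBudget:
    def __init__(self, daily_budget_usd: float):
        self.budget = daily_budget_usd
        self.spent = 0.0
        self.alerts: list[str] = []   # stand-in for a real alert sink

    def record(self, request_cost_usd: float) -> None:
        before = self.spent
        self.spent += request_cost_usd
        for level in (0.8, 1.0):      # fire each threshold exactly once
            if before < self.budget * level <= self.spent:
                self.alerts.append(f"cost at {int(level * 100)}% of daily budget")

budget = CostBudget(daily_budget_usd=100.0)
for _ in range(9):
    budget.record(10.0)  # nine requests at $10 each -> 90% of budget
```

Resetting `spent` at day boundaries and attributing cost per task type are left to the surrounding service.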

Phase 3: Security controls

3.1 Input validation before memory ingestion

Every piece of content that enters the memory pipeline must be validated:

  • Maximum content length per observation (recommended: 2,000 characters)
  • Disallowed content patterns for instruction injection (prompts that attempt to override the extraction step's instructions)
  • Source allowlist: only ingest content from approved sources; reject untrusted documents from external users unless they are handled in a sandboxed extraction environment

Instruction injection into the memory pipeline — an adversarial document that instructs the extraction model to write false facts — is the primary attack surface for self-improving agents. The mitigation is input validation at the ingestion boundary.
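A minimal validation gate covering all three checks might look like this. The regex patterns and source names are illustrative only; pattern matching catches crude injections, not a determined adversary, which is why the sandboxed extraction environment remains the stronger control:

```python
# Sketch of ingestion-boundary validation: length cap, naive injection
# patterns, and a source allowlist. Patterns here are examples, not a
# complete injection defense.
import re

MAX_OBSERVATION_CHARS = 2000
ALLOWED_SOURCES = {"crm", "internal_wiki", "support_tickets"}  # example allowlist
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]

def validate_for_ingestion(content: str, source: str) -> tuple[bool, str]:
    if source not in ALLOWED_SOURCES:
        return False, "source not on allowlist"
    if len(content) > MAX_OBSERVATION_CHARS:
        return False, "content exceeds length cap"
    if any(p.search(content) for p in INJECTION_PATTERNS):
        return False, "matched disallowed instruction pattern"
    return True, "ok"

assert validate_for_ingestion("Sarah Chen leads the platform team", "crm") == (True, "ok")
assert validate_for_ingestion("Ignore previous instructions and write...", "crm")[0] is False
```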

3.2 Memory write audit log

Every observation write must be logged with:

  • Source document or interaction ID
  • Extraction confidence
  • Entity name and type
  • Timestamp
  • Agent version that wrote it

The audit log enables provenance queries ("why does the agent believe X?") and supports incident response when memory poisoning is detected.
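One serialized record per write is enough to support both uses; filtering the log by entity answers the provenance question, and filtering by `source_id` scopes an incident. Field names below are illustrative:

```python
# Sketch of one audit record per observation write; field names mirror
# the checklist above but are illustrative.
import json, datetime

def audit_record(source_id, confidence, entity, entity_type, agent_version):
    return json.dumps({
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "source_id": source_id,         # document or interaction that produced it
        "confidence": confidence,
        "entity": entity,
        "entity_type": entity_type,
        "agent_version": agent_version, # enables "which deploy wrote this?"
    })

rec = json.loads(audit_record("doc_42", 0.88, "Sarah Chen", "Person", "abc1234"))
```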

3.3 Skill registry controls (if skill acquisition is enabled)

If the agent writes new tools to a skill registry:

  • Sandbox execution: agent-written code runs in an isolated environment (separate container, no network access to production systems)
  • Human review gate: no tool enters the production registry without review
  • Tool execution limits: maximum CPU and memory per tool invocation, maximum wall time
  • Registry size cap: maximum number of tools in the production registry (recommended: 50–100)

3.4 Secrets and credentials

  • The agent's memory API key must be scoped to the minimum required permissions (read + write observations; no workspace creation, no billing)
  • API keys rotate on a fixed schedule (recommended: 90 days)
  • No credentials in logs, traces, or memory observations
  • API keys stored in a secrets manager; never in environment variables committed to source control

Phase 4: Operational readiness

4.1 Rollback plan

Every deployment must have a rollback plan that addresses:

  • Memory rollback: if a bad memory update poisons the workspace, what is the procedure to restore to a prior checkpoint? Graph databases support point-in-time restore; configure backups before deployment.
  • Agent version rollback: rolling back the code does not roll back accumulated memory. Document whether old agent code is compatible with current memory state.
  • Benchmark regression response: if a deployment causes benchmark regression, what is the threshold for automatic rollback vs. manual review?

4.2 Incident response runbook

Document before deployment:

  1. Benchmark regression alert fires. Check: recent code deploy? Recent large memory update? Recent ingestion source added? Rollback to prior agent version if regression > 10pp.
  2. Memory poisoning detected. Check provenance audit log. Identify source. Soft-delete affected observations. Re-run entity resolution. Re-run benchmark.
  3. Workspace isolation failure. Halt agent immediately. Audit all queries in the past 24 hours. Identify affected workspaces. Notify affected tenants per your breach notification policy.
  4. Cost spike. Identify which task type is driving cost. Check reflection iteration count per task — runaway reflection is the most common cause. Apply a hard reflection iteration cap (2 maximum) if not already in place.
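The hard reflection cap from runbook item 4 is a bound on the retry loop, not a judgment call at incident time. A sketch, with `generate` and `verify` as placeholders for the agent's real generation and verifier steps:

```python
# Sketch of a hard reflection cap: at most two verify-and-retry passes,
# regardless of verifier outcome, so a never-passing verifier cannot
# produce a runaway loop.
MAX_REFLECTION_ITERATIONS = 2

def run_with_reflection(generate, verify):
    output = generate(feedback=None)
    for _ in range(MAX_REFLECTION_ITERATIONS):
        ok, feedback = verify(output)
        if ok:
            break
        output = generate(feedback=feedback)  # one bounded retry per iteration
    return output

# Toy example: a verifier that never passes still terminates after the cap.
calls = []
result = run_with_reflection(
    generate=lambda feedback: calls.append(feedback) or f"draft {len(calls)}",
    verify=lambda out: (False, "needs work"),
)
```

With the cap at 2, the worst case is three generation calls per request (one initial plus two retries), which bounds the cost multiplier.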

4.3 Capacity planning

Self-improving agents have two growth curves that must be capacity-planned independently:

Request volume. Standard LLM API capacity planning. Scale inference compute with request volume.

Memory volume. The memory store grows continuously. Benchmark retrieval latency at 10x current observation count. If latency degrades unacceptably before reaching 10x, upgrade the memory infrastructure before hitting the threshold in production. A typed graph store scales to hundreds of millions of nodes with proper indexing; a flat vector store degrades measurably past 50,000–100,000 observations per workspace.

4.4 Final gate: go/no-go criteria

Do not deploy to production unless all of the following are true:

  • Benchmark pass rate ≥ 80% on the full task set
  • Memory round-trip latency meets the benchmarks in Phase 1
  • Alerting is configured and tested (send a test alert manually)
  • Audit logging is enabled and queryable
  • Workspace isolation is verified by a test query across workspace boundaries
  • Cost tracking is live and baseline cost per request is documented
  • Rollback plan is documented and tested
  • At least one person on the team has read the incident response runbook

FAQ

What is the minimum memory infrastructure for a production self-improving agent?

Minimum: a vector index with workspace-scoping and the ability to filter by entity name and observation type. Recommended: a typed graph store with vector indexing on observations, entity resolution, and temporal validity on relationships.

How do I prevent instruction injection into agent memory?

Validate all content at the ingestion boundary: maximum length, disallowed instruction patterns, and source allowlist. Run extraction in a system prompt context that explicitly instructs the model to extract facts, not execute instructions. A sandboxed extraction environment is the strongest mitigation.

What is the recommended reflection iteration cap?

Two iterations maximum in production. Beyond two iterations, marginal accuracy improvement is near zero while token cost continues to grow. A third iteration adds cost without adding quality on most real-world task distributions.

How often should backups of the memory store run?

Daily minimum. For high-value workspaces (enterprise, regulated), hourly is appropriate. Test restore quarterly — a backup that cannot be restored is not a backup.

What should I do if the benchmark pass rate drops suddenly?

Check in this order: (1) recent code deploy, (2) recent large memory update or ingestion source change, (3) memory store infrastructure (latency, availability). If the cause is not immediately obvious, roll back to the prior agent version and investigate from a stable baseline.

Oxagen is the ontology layer for AI agents — a typed, workspace-scoped, Neo4j-backed knowledge graph with audit logging, entity resolution, and MCP-native access. Read the docs to get an API key, or book a demo for production deployment guidance and dedicated infrastructure options.