Engineering
Deploying Self-Improving Agents: Production Checklist
By Oxagen Team
- AI Agents
- Production
- Deployment
- Infrastructure
- MLOps
Deploying a self-improving agent to production requires more than deploying an LLM API wrapper. The agent accumulates state, modifies its own behavior, and runs continuously — which means infrastructure that would be optional for a stateless agent becomes load-bearing. Memory persistence, observability, and security controls are not features to add later. They need to be in place before the first production request.
This checklist is organized by deployment phase. Work through it in order. Each item has a gate condition — if you cannot satisfy the gate, do not proceed to the next phase.
Phase 1: Memory infrastructure
1.1 Choose and configure the memory store
The minimum viable production memory store for a self-improving agent is a vector index plus workspace scoping. The recommended production store is a typed graph (Neo4j or equivalent) plus a vector index for hybrid retrieval.
Gate: can the memory store answer these three queries correctly?
- Single-entity lookup: "Retrieve all observations about Sarah Chen" → returns observations about Sarah Chen and no other entity
- Multi-hop query (graph only): "Which teams have Sarah Chen worked with?" → traverses relationships, not keyword search
- Workspace isolation: a query with `workspace_id=A` returns zero results from workspace B even if both contain identical entity names
If the memory store cannot answer query 1 correctly, fix entity resolution before proceeding. If it cannot answer query 2, accept the limitation and document it as a known constraint. If it cannot answer query 3, do not proceed — workspace leakage is a security failure, not a limitation.
1.2 Configure workspace-scoped access control
Every memory operation must be scoped to a workspace identifier. The workspace ID must come from an authenticated request context — never from user-supplied input that bypasses authentication.
Minimum configuration:
- Every read and write to the memory store includes `workspace_id` as a mandatory filter
- The filter is applied at the data layer (database query), not the application layer (Python code)
- Access tokens are scoped to one workspace; tokens cannot query across workspace boundaries
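The data-layer filter can be sketched with an in-memory SQLite table — the table layout and function name here are illustrative, not Oxagen's schema. The point is that `workspace_id` appears as a parameterized condition in the query itself, so no application-code path can skip it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (workspace_id TEXT, entity TEXT, text TEXT)")
conn.executemany(
    "INSERT INTO observations VALUES (?, ?, ?)",
    [("ws_a", "Sarah Chen", "leads the data team"),
     ("ws_b", "Sarah Chen", "works in sales")],  # same entity name, other workspace
)

def query_observations(workspace_id: str, entity: str) -> list[str]:
    # workspace_id is a mandatory, parameterized filter applied in the query itself
    rows = conn.execute(
        "SELECT text FROM observations WHERE workspace_id = ? AND entity = ?",
        (workspace_id, entity),
    ).fetchall()
    return [text for (text,) in rows]
```

A query for `ws_a` returns only `ws_a`'s observation; a query for a nonexistent workspace returns nothing — which is exactly the isolation gate from section 1.1.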
1.3 Set memory write policies
Decide before deployment:
| Policy | Recommended value | Reasoning |
|---|---|---|
| Minimum confidence for auto-write | 0.75 | Below this, extraction errors outweigh signal |
| Contradiction detection | Enabled | New observations that conflict with existing ones are queued, not auto-written |
| Memory retention period | 90 days (configurable) | Stale facts degrade retrieval quality; aged observations need revalidation |
| Max observations per entity | 500 | Prevents single entities from dominating retrieval |
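The policy table above reduces to a small decision function. This is a minimal sketch with the recommended values hard-coded; the `Observation` shape and the `"queue"` outcome for contradictions are assumptions about how your pipeline is wired:

```python
from dataclasses import dataclass

MIN_CONFIDENCE = 0.75       # below this, extraction errors outweigh signal
MAX_OBS_PER_ENTITY = 500    # prevents one entity from dominating retrieval

@dataclass
class Observation:
    entity: str
    text: str
    confidence: float

def write_decision(obs: Observation, existing_count: int,
                   contradicts_existing: bool) -> str:
    """Return 'write', 'queue', or 'reject' per the policy table."""
    if obs.confidence < MIN_CONFIDENCE:
        return "reject"
    if contradicts_existing:
        return "queue"      # contradiction detection: queued for review, not auto-written
    if existing_count >= MAX_OBS_PER_ENTITY:
        return "reject"
    return "write"
```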
1.4 Verify memory round-trip latency
Write a test observation, query for it, and measure the round-trip time. Targets:
- Write latency: < 200ms P99
- Query latency: < 100ms P99 for top-5 retrieval
- Multi-hop traversal (graph): < 50ms P99 for 3-hop queries
If write latency is too high, consider async writes — the agent returns the output to the user immediately and the memory write happens in a background task. This is the recommended pattern for latency-sensitive applications.
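The async-write pattern looks like this with `asyncio` — `write_memory` and `handle_request` are stand-ins for your real memory client and agent handler. The response returns immediately; the write completes in the background:

```python
import asyncio

async def write_memory(observation: dict) -> None:
    # stand-in for the real memory-store write (network call in production)
    await asyncio.sleep(0.01)

async def handle_request(user_input: str) -> str:
    answer = f"answer for {user_input}"   # agent output (stubbed)
    # schedule the memory write without blocking the response
    asyncio.create_task(write_memory({"source": user_input, "text": answer}))
    return answer
```

In production, hold a reference to the task (or use a task group) and monitor write failures — a silently dropped background write is exactly the kind of failure the write-failure alert in section 2.3 exists to catch.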
Phase 2: Observability
2.1 Structured traces for every agent invocation
Every request through the agent must produce a structured trace with:
- Request ID
- Workspace ID (no PII)
- Task type / routing label
- Memory queries issued (query text, result count, latency)
- LLM calls made (model, token count, latency, success/failure)
- Reflection iterations (count, verifier result per iteration)
- Memory writes triggered (observation type, confidence)
- Total latency and cost
Do not log PII, user content, or memory observation text in traces. Log structure and metadata only.
2.2 Benchmark harness logging
The benchmark harness must write results to an append-only log with:
- ISO timestamp
- Agent version / git hash
- Per-task pass/fail
- Aggregate pass rate
- P50 and P95 latency per task
The log must be queryable. Store in a time-series database, a structured append-only file, or a logging service that supports time-range queries.
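The structured append-only file option is the simplest to start with. A minimal JSONL sketch (the percentile indexing here is nearest-rank, which is fine for benchmark-sized samples):

```python
import datetime
import json

def log_benchmark_run(path: str, agent_version: str,
                      results: dict, latencies_ms: list) -> dict:
    """Append one benchmark run to a JSONL log (append-only, time-queryable)."""
    latencies = sorted(latencies_ms)
    entry = {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "agent_version": agent_version,                   # git hash
        "results": results,                               # {task_id: passed}
        "pass_rate": sum(results.values()) / len(results),
        "p50_ms": latencies[len(latencies) // 2],
        "p95_ms": latencies[int(len(latencies) * 0.95)],
    }
    with open(path, "a") as f:                            # append-only: never overwrite
        f.write(json.dumps(entry) + "\n")
    return entry
```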
2.3 Alerting thresholds
Configure alerts before going live:
| Alert | Threshold | Action |
|---|---|---|
| Benchmark pass rate drops > 5pp from prior best | Fire immediately | Page on-call |
| Memory write failure rate > 1% | Fire after 5 minutes | Investigate extraction pipeline |
| LLM API error rate > 0.5% | Fire after 5 minutes | Check provider status |
| P99 latency > 3x baseline | Fire after 10 minutes | Investigate memory query pattern |
| Daily memory write volume drops > 50% | Fire daily | Check ingestion pipeline |
2.4 Cost tracking
Self-improving agents have variable per-request costs because reflection adds LLM calls. Track:
- LLM tokens per request (input + output, broken down by node)
- Memory extraction tokens (the step that converts outputs to graph observations)
- Benchmark harness cost per run (LLM-as-judge calls if used)
Set a daily cost budget and alert at 80% and 100% of budget. Runaway reflection loops (verifier never passes, iterations exhaust budget) can multiply per-request cost by 5–8x without alerting if cost tracking is not in place.
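The budget check itself is small. Per-1K-token rates below are illustrative placeholders — substitute your provider's actual pricing:

```python
# Illustrative per-1K-token rates, NOT real provider pricing.
PRICE_PER_1K_USD = {"input": 0.003, "output": 0.015}

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Token cost for one request; sum across LLM + extraction + judge calls."""
    return (input_tokens / 1000) * PRICE_PER_1K_USD["input"] + \
           (output_tokens / 1000) * PRICE_PER_1K_USD["output"]

def budget_status(spent_usd: float, daily_budget_usd: float) -> str:
    """Alert level for current daily spend: warn at 80%, critical at 100%."""
    ratio = spent_usd / daily_budget_usd
    if ratio >= 1.0:
        return "critical"
    if ratio >= 0.8:
        return "warn"
    return "ok"
```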
Phase 3: Security controls
3.1 Input validation before memory ingestion
Every piece of content that enters the memory pipeline must be validated:
- Maximum content length per observation (recommended: 2,000 characters)
- Disallowed content patterns for instruction injection (prompts that attempt to override the extraction step's instructions)
- Source allowlist: only ingest content from approved sources; reject untrusted documents from external users unless they are handled in a sandboxed extraction environment
Instruction injection into the memory pipeline — an adversarial document that instructs the extraction model to write false facts — is the primary attack surface for self-improving agents. The mitigation is input validation at the ingestion boundary.
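A sketch of the ingestion-boundary check, assuming a pattern blocklist and source allowlist of your own. The patterns and source names below are illustrative — a production blocklist needs to be broader and regularly updated, and pattern matching alone will not catch every injection:

```python
import re

MAX_OBS_CHARS = 2000
# Illustrative injection patterns only; extend and maintain in production.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.I),
    re.compile(r"you are now", re.I),
    re.compile(r"system prompt", re.I),
]
ALLOWED_SOURCES = {"crm_export", "support_tickets", "internal_wiki"}  # example allowlist

def validate_for_ingestion(content: str, source: str) -> tuple:
    """Return (ok, reason) for content entering the memory pipeline."""
    if source not in ALLOWED_SOURCES:
        return False, "source not on allowlist"
    if len(content) > MAX_OBS_CHARS:
        return False, "content exceeds max length"
    for pattern in INJECTION_PATTERNS:
        if pattern.search(content):
            return False, "instruction-injection pattern detected"
    return True, "ok"
```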
3.2 Memory write audit log
Every observation write must be logged with:
- Source document or interaction ID
- Extraction confidence
- Entity name and type
- Timestamp
- Agent version that wrote it
The audit log enables provenance queries ("why does the agent believe X?") and supports incident response when memory poisoning is detected.
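The audit entry and the provenance query it enables can be sketched as follows — field names are illustrative, and a real deployment would store entries in the same append-only log infrastructure as section 2.2:

```python
import datetime

def audit_record(source_id: str, entity: str, entity_type: str,
                 confidence: float, agent_version: str) -> dict:
    """One audit entry per observation write."""
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "source_id": source_id,        # document or interaction that produced the fact
        "entity": entity,
        "entity_type": entity_type,
        "confidence": confidence,
        "agent_version": agent_version,
    }

def provenance(audit_log: list, entity: str) -> list:
    """Answer 'why does the agent believe X?' by filtering the audit log."""
    return [rec for rec in audit_log if rec["entity"] == entity]
```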
3.3 Skill registry controls (if skill acquisition is enabled)
If the agent writes new tools to a skill registry:
- Sandbox execution: agent-written code runs in an isolated environment (separate container, no network access to production systems)
- Human review gate: no tool enters the production registry without review
- Tool execution limits: maximum CPU and memory per tool invocation, maximum wall time
- Registry size cap: maximum number of tools in the production registry (recommended: 50–100)
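The wall-time limit alone can be sketched with a subprocess timeout. This is only one layer — a production sandbox also needs a container boundary, CPU/memory limits, and no network egress to production systems, none of which this snippet provides:

```python
import subprocess
import sys

def run_tool_sandboxed(code: str, timeout_s: float = 5.0) -> tuple:
    """Run agent-written code in a separate process with a wall-time cap.
    Demonstrates only the timeout layer of sandboxing."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode == 0, proc.stdout
    except subprocess.TimeoutExpired:
        return False, "tool exceeded wall-time limit"
```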
3.4 Secrets and credentials
- The agent's memory API key must be scoped to the minimum required permissions (read + write observations; no workspace creation, no billing)
- API keys rotate on a fixed schedule (recommended: 90 days)
- No credentials in logs, traces, or memory observations
- API keys stored in a secrets manager; never in environment variables committed to source control
Phase 4: Operational readiness
4.1 Rollback plan
Every deployment must have a rollback plan that addresses:
- Memory rollback: if a bad memory update poisons the workspace, what is the procedure to restore to a prior checkpoint? Graph databases support point-in-time restore; configure backups before deployment.
- Agent version rollback: rolling back the code does not roll back accumulated memory. Document whether old agent code is compatible with current memory state.
- Benchmark regression response: if a deployment causes benchmark regression, what is the threshold for automatic rollback vs. manual review?
4.2 Incident response runbook
Document before deployment:
- Benchmark regression alert fires. Check: recent code deploy? Recent large memory update? Recent ingestion source added? Rollback to prior agent version if regression > 10pp.
- Memory poisoning detected. Check provenance audit log. Identify source. Soft-delete affected observations. Re-run entity resolution. Re-run benchmark.
- Workspace isolation failure. Halt agent immediately. Audit all queries in the past 24 hours. Identify affected workspaces. Notify affected tenants per your breach notification policy.
- Cost spike. Identify which task type is driving cost. Check reflection iteration count per task — runaway reflection is the most common cause. Apply a hard reflection iteration cap (2 maximum) if not already in place.
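The hard reflection cap from the cost-spike playbook is a one-line guard in the loop. A minimal sketch, with `verify` and `improve` standing in for your verifier and improver nodes:

```python
MAX_REFLECTIONS = 2  # hard cap: beyond two iterations, cost grows without quality

def reflect_with_cap(draft, verify, improve):
    """Verify/improve loop that cannot run away: at most MAX_REFLECTIONS rounds."""
    output = draft
    for _ in range(MAX_REFLECTIONS):
        ok, feedback = verify(output)
        if ok:
            break
        output = improve(output, feedback)
    return output
```

Even a verifier that never passes terminates after `MAX_REFLECTIONS` rounds, which bounds the worst-case token cost per request.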
4.3 Capacity planning
Self-improving agents have two growth curves that must be capacity-planned independently:
Request volume. Standard LLM API capacity planning. Scale inference compute with request volume.
Memory volume. The memory store grows continuously. Benchmark retrieval latency at 10x current observation count. If latency degrades unacceptably before reaching 10x, upgrade the memory infrastructure before hitting the threshold in production. A typed graph store scales to hundreds of millions of nodes with proper indexing; a flat vector store degrades measurably past 50,000–100,000 observations per workspace.
4.4 Final gate: go/no-go criteria
Do not deploy to production unless all of the following are true:
- Benchmark pass rate ≥ 80% on the full task set
- Memory round-trip latency meets the benchmarks in Phase 1
- Alerting is configured and tested (send a test alert manually)
- Audit logging is enabled and queryable
- Workspace isolation is verified by a test query across workspace boundaries
- Cost tracking is live and baseline cost per request is documented
- Rollback plan is documented and tested
- At least one person on the team has read the incident response runbook
FAQ
What is the minimum memory infrastructure for a production self-improving agent?
Minimum: a vector index with workspace-scoping and the ability to filter by entity name and observation type. Recommended: a typed graph store with vector indexing on observations, entity resolution, and temporal validity on relationships.
How do I prevent instruction injection into agent memory?
Validate all content at the ingestion boundary: maximum length, disallowed instruction patterns, and source allowlist. Run extraction in a system prompt context that explicitly instructs the model to extract facts, not execute instructions. A sandboxed extraction environment is the strongest mitigation.
What is the recommended reflection iteration cap?
Two iterations maximum in production. Beyond two iterations, marginal accuracy improvement is near zero while token cost continues to grow. A third iteration adds cost without adding quality on most real-world task distributions.
How often should backups of the memory store run?
Daily minimum. For high-value workspaces (enterprise, regulated), hourly is appropriate. Test restore quarterly — a backup that cannot be restored is not a backup.
What should I do if the benchmark pass rate drops suddenly?
Check in this order: (1) recent code deploy, (2) recent large memory update or ingestion source change, (3) memory store infrastructure (latency, availability). If the cause is not immediately obvious, roll back to the prior agent version and investigate from a stable baseline.
Further reading
- Self-Improving AI Agents: A Technical Overview — the mechanisms that make agents self-improving
- 5 Failure Modes in Self-Improving AI Agents — every checklist item here prevents one of these failure modes
- How to Evaluate Self-Improving AI Agents — the benchmark harness referenced throughout this checklist
- Build a Self-Improving AI Agent in Python: Walkthrough — the agent implementation this checklist is designed to deploy
Oxagen is the ontology layer for AI agents — a typed, workspace-scoped, Neo4j-backed knowledge graph with audit logging, entity resolution, and MCP-native access. Read the docs to get an API key, or book a demo for production deployment guidance and dedicated infrastructure options.