Blog
Engineering notes, benchmarks, and migration write-ups from the Oxagen team.
Engineering
AI Agent Benchmarks: What Actually Matters
What major agent benchmarks (SWE-bench, GAIA, AgentBench, τ-bench) actually measure — and why a high score does not predict production fit. Includes a benchmark red-flag checklist and a six-property spec for the in-domain benchmark you almost certainly need to build.
AI Agents · Agent Evaluation · Benchmarks
Engineering
7 Mistakes Teams Make Evaluating AI Agents
Seven evaluation mistakes that ship agents that pass evals but fail in production — endpoint-only scoring, static golden sets, uncalibrated LLM judges, ignored tool-call failures, and missing cost metrics. Each one paired with the fix.
AI Agents · Agent Evaluation · Evaluations
Engineering
LLM Evals vs Agent Evals: Key Differences
LLM evals score single prompt-completion exchanges; agent evals must grade trajectories — tool choice, recovery, termination, and cost — not just final answers.
AI Agents · LLM Evaluation · Agent Evaluation
Engineering
7 Mistakes Developers Make Building Ontologies for AI Agents
Seven concrete failure modes that show up when teams build typed knowledge graphs for AI agents — each one observable in agent behavior, each one fixable if you catch it before the corpus scales.
Ontology · AI Agents · Knowledge Graph
Engineering
How to Design a Typed Schema for Agent Memory
A step-by-step guide to designing the typed schema behind an AI agent's memory — with a worked example, the decisions that matter, and the anti-patterns that silently bite in production.
Ontology · Agent Memory · Schema Design
Engineering
MCP-Native Ontology: Connecting AI Agents to Structured Data
A hands-on tutorial for plugging a typed, workspace-scoped knowledge graph into Cursor, Claude Code, VS Code, Windsurf, and Codex over the Model Context Protocol — with one-line installers per client.
MCP · Model Context Protocol · Ontology
Engineering
Knowledge Graphs vs. RAG for AI Agents: When to Use Which
Vector RAG answers semantic similarity. A typed knowledge graph answers structural questions. Most production agents need both — here's the decision framework and where each one breaks.
Knowledge Graph · RAG · AI Agents
Engineering
What Is an Ontology for AI Agents?
The definitive guide to ontologies for AI agents — what they are, how they differ from flat vector retrieval, when agents need one, and what a production ontology looks like in practice.
Ontology · AI Agents · Knowledge Graph
Culture
Working at Oxagen: the builder’s mindset
Why we hire for slope over pedigree, how “any person can be the right person for any job” works in practice, and the benefits package that matches the intensity of building ontology infrastructure for agents.
Careers · Startups · Culture
Engineering
Static vs Self-Improving Agents: Production Tradeoffs
A decision framework for choosing between static and self-improving agents in production — when the operational overhead of self-improvement is justified and when a well-tuned static agent wins.
AI Agents · Architecture · Production
Engineering
5 Failure Modes in Self-Improving AI Agents
The five failure modes that appear most frequently in production self-improving agents — reflection collapse, memory poisoning, entity fragmentation, eval blindness, and skill entropy — and how to detect each one early.
AI Agents · Debugging · Self-Improvement
Engineering
How to Evaluate Self-Improving AI Agents
Designing an eval harness for self-improving agents — what metrics to track, how to detect silent drift, and the minimum viable eval suite that tells you if the agent is actually getting better.
AI Agents · Evaluation · Benchmarks
Engineering
Deploying Self-Improving Agents: Production Checklist
A production checklist for deploying self-improving agents — memory infrastructure requirements, observability gates, security controls, cost management, and the operational model for a running system.
AI Agents · Production · Deployment
Engineering
The Definitive Guide to Vibe Coding Platforms (2025)
An in-depth comparison of Claude Code, Cursor, v0.dev, Lovable, and Bolt.new — ranked and rated across every dimension that actually affects your workflow.
AI · Developer Tools · Vibe Coding
Engineering
Frameworks for Self-Improving Agents: A Comparison
LangGraph, AutoGen, CrewAI, and Haystack compared on memory abstractions, reflection support, MCP compatibility, and production readiness for self-improving agents.
AI Agents · LangGraph · AutoGen
Engineering
Build a Self-Improving AI Agent in Python: Walkthrough
A step-by-step walkthrough for building a self-improving AI agent in Python with LangGraph, a typed memory store, a reflection loop with a real verifier, and a nightly benchmark harness.
Python · AI Agents · Tutorial
Engineering
Knowledge Graphs for Agent Memory: Design Patterns
Concrete schema patterns, entity resolution strategies, and traversal techniques for modeling agent memory as a typed knowledge graph — with the tradeoffs that decide when a graph is worth it.
Knowledge Graph · Agent Memory · AI Agents
Engineering
Memory Architectures for AI Agents: Vector, Graph, Hybrid
Vector, graph, and hybrid memory architectures for AI agents compared on recall, latency, and operational cost — with the failure modes each one hits in production.
AI Agents · Agent Memory · Knowledge Graph
Engineering
Reflection in AI Agents: How Self-Critique Actually Works
Reflexion, Self-Refine, and actor-critic architectures explained with benchmark data on where reflection improves agent performance and where it silently regresses it.
AI Agents · Agent Reflection · Self-Improvement
Engineering
Self-Improving AI Agents: A Technical Overview
The four mechanisms behind self-improving agents, which of them are production-ready in 2026, and why memory is the bottleneck almost every implementation ignores.
AI Agents · Ontology · Agent Memory