Engineering
7 Mistakes Teams Make Evaluating AI Agents
By Mac Anderson
- AI Agents
- Agent Evaluation
- Evaluations
- Trajectory Evaluation
- Tool Use
- MLOps
Most teams building AI agents have an evaluation problem they don't know they have. The evals pass. The agent ships. Then it fails on real tasks in ways the eval never predicted.
This isn't bad luck — it's a measurement gap. Agent evaluation is genuinely harder than LLM evaluation, and the mistakes teams make are predictable. Here are the seven most common, and what to do instead.
Mistake 1: Evaluating outputs, not trajectories
What it looks like: You run the agent on 100 test tasks, check whether the final answer is correct, and report an accuracy number.
Why it fails: An agent can reach the right answer via a broken path — by hallucinating a tool call result, by getting lucky on a retry, by taking 15 steps where 4 were sufficient. It can also fail to reach the right answer despite making correct decisions at every step, because a tool returned bad data.
Evaluating only the endpoint tells you whether the agent got lucky, not whether it's reliable.
What to do instead: Capture the full trajectory for every eval run — every tool call, argument, return value, and decision. Score the trajectory, not just the result. At minimum, check: did the agent call the right tool for each step? Did it handle tool failures correctly? Did it terminate at the right point?
If you can't replay a run from its logged trajectory, you can't evaluate it.
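A minimal sketch of what that capture and scoring can look like in Python. The dataclass fields and check names are illustrative, not any particular framework's API:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Step:
    tool: str                  # which tool the agent called
    arguments: dict[str, Any]  # the arguments it constructed
    result: Any                # what the tool returned
    error: str | None = None   # populated when the call failed

@dataclass
class Trajectory:
    task_id: str
    steps: list[Step] = field(default_factory=list)
    final_answer: str = ""

def score_trajectory(traj: Trajectory, expected_tools: set[str]) -> dict:
    """Score the path, not just the endpoint."""
    called = [s.tool for s in traj.steps]
    return {
        # fraction of steps that used a tool the task actually calls for
        "tool_selection": sum(t in expected_tools for t in called) / max(len(called), 1),
        # an error on the final step means the agent never recovered from it
        "ended_on_error": bool(traj.steps) and traj.steps[-1].error is not None,
        # compare against a per-task step budget to catch runs that wander
        "step_count": len(traj.steps),
    }
```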
Mistake 2: Using a golden test set that never updates
What it looks like: You curate 200 test cases at launch, run them on every release, and track whether the score goes up or down.
Why it fails: After a few months, your agent has implicitly been tuned against those 200 cases. The score stops reflecting real-world performance and starts reflecting how well your system has memorized the test set. This is Goodhart's Law applied to evals: when a measure becomes a target, it ceases to be a good measure.
Real-world tasks drift. User requests change. New tool versions behave differently. A static golden set can't track any of this.
What to do instead: Treat your test set as a living artifact. Add new cases from production failures every sprint. Retire cases that no longer reflect real usage. Maintain a holdout set that your team never looks at until a major release. Log production task samples (without PII) and periodically add them to the eval set. A test set that grows with the product tells you something real.
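As a rough sketch of the rotation, assuming eval cases live in JSONL files (the paths, field names, and holdout split below are placeholders, not a prescribed format):

```python
import json
import random
from pathlib import Path

EVAL_SET = Path("evals/cases.jsonl")   # the living set, scored on every run
HOLDOUT = Path("evals/holdout.jsonl")  # nobody looks at this until a major release

def add_production_failures(failures: list[dict], holdout_fraction: float = 0.1) -> None:
    """Fold scrubbed production failures into the eval set each sprint.
    Cases are assumed to already have PII removed before they reach this point."""
    for case in failures:
        target = HOLDOUT if random.random() < holdout_fraction else EVAL_SET
        with target.open("a") as f:
            f.write(json.dumps(case) + "\n")
```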
Mistake 3: Ignoring tool-call failure modes
What it looks like: Your eval checks whether the agent completed the task, but not whether the tools it called were correct, necessary, or handled properly when they failed.
Why it fails: Tool failures are where agent reliability actually breaks down in production. An agent that calls a search API with a malformed query, receives an error, and then hallucinates the result it "would have" gotten is indistinguishable from a working agent in an output-only eval. It's catastrophically broken in practice.
The same problem applies to unnecessary tool calls (wasted cost and latency), incorrect argument construction, and missing error handling.
What to do instead: Add explicit tool-call evaluation. For each step in the trajectory, score:
- Tool selection accuracy: Was this the right tool for this step?
- Argument validity: Were the arguments correctly constructed?
- Error handling: When the tool failed or returned unexpected output, did the agent respond correctly?
- Redundancy: Did the agent call this tool when it already had the information it needed?
This requires logging the full tool call trace, not just the final answer. If your eval harness doesn't support this yet, start with logging — scoring can come later.
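Here is one way the per-step checks might look, reusing the Step and Trajectory shapes sketched under Mistake 1. The spec format (allowed tools plus an argument validator per task) is an assumption about how you describe expected behavior:

```python
import json

def score_tool_calls(trajectory, spec) -> list[dict]:
    """Per-step checks over a trajectory.
    `spec` carries the allowed tools and an argument validator for this task."""
    seen = set()
    rows = []
    for step in trajectory.steps:
        key = (step.tool, json.dumps(step.arguments, sort_keys=True))
        rows.append({
            "tool_selection": step.tool in spec["allowed_tools"],   # right tool for the step?
            "argument_validity": spec["validate_args"](step.tool, step.arguments),
            "redundant": key in seen,                                # identical call already made
            "errored": step.error is not None,                       # surface failures for review
        })
        seen.add(key)
    return rows
```

The error-handling check is the hardest of the four to automate; flagging errored steps for manual or judge review is a reasonable first pass.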
Mistake 4: Treating eval as a one-time gate
What it looks like: Eval runs before a release. If it passes, the agent ships. Nobody looks at eval data between releases.
Why it fails: Agent behavior degrades continuously in production. The underlying LLM gets updated. Tool APIs change their response formats. New user request patterns emerge that your test set didn't cover. Latency and cost creep up as prompts get longer or tool usage patterns shift.
One-time gating misses all of this. By the time the next eval runs, you've lost weeks of signal.
What to do instead: Run a subset of your eval suite on every deployment, not just major releases. Set up automated alerts for regressions on key metrics (task success rate, tool call accuracy, cost per task). Log and review a sample of production runs weekly — not to fix individual failures, but to spot patterns that should feed back into the test set. Eval is a continuous monitoring discipline, not a deployment gate.
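A regression check of this kind can be a few lines in CI. The baseline values and tolerances below are placeholders you would tune to your own metrics:

```python
BASELINE = {"task_success_rate": 0.87, "tool_call_accuracy": 0.92, "cost_per_task_usd": 0.11}
TOLERANCE = {"task_success_rate": -0.03, "tool_call_accuracy": -0.03, "cost_per_task_usd": 0.02}

def check_regressions(current: dict[str, float]) -> list[str]:
    """Compare this deployment's eval metrics to the stored baseline and return
    every metric that moved past its tolerance (negative = allowed drop,
    positive = allowed increase)."""
    alerts = []
    for metric, baseline in BASELINE.items():
        delta = current[metric] - baseline
        limit = TOLERANCE[metric]
        regressed = delta < limit if limit < 0 else delta > limit
        if regressed:
            alerts.append(f"{metric}: {baseline:.2f} -> {current[metric]:.2f}")
    return alerts
```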
Mistake 5: Using LLM-as-judge without calibrating the judge
What it looks like: You use GPT-4 or Claude to grade your agent's outputs. The grading prompt says something like "Rate this response from 1-5 based on accuracy and helpfulness." You trust the scores.
Why it fails: LLM-as-judge is a powerful technique, but it imports all the biases and blindspots of the judge model. Without calibration, you don't know whether the judge is actually measuring what you care about, whether it consistently scores the same response the same way (reliability), or whether it's biased toward verbose, confident-sounding outputs regardless of accuracy (a known failure mode).
Uncalibrated LLM judges can give you high scores on wrong answers and low scores on correct ones — and you won't know.
What to do instead:
- Calibrate against human judgments. Score 100 cases with humans and with your LLM judge. Measure agreement. If Cohen's kappa is below 0.6, your judge is unreliable.
- Use structured rubrics, not open-ended prompts. "Rate accuracy 1-5" is worse than a rubric with specific criteria for each score level.
- Test for known failure modes. Add adversarial cases: confidently wrong answers, correct answers in unexpected formats, answers that sound good but are factually off. See if the judge catches them.
- Use judge ensembles for important decisions. If a score is a deployment gate, run multiple judge prompts and require agreement.
LLM-as-judge is fine for development-time feedback, but it needs calibration before it can be a reliable eval signal.
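For the calibration step, a sketch using scikit-learn's cohen_kappa_score. The 0.6 cutoff mirrors the rubric above, and the score lists are assumed to come from grading the same cases once with humans and once with the judge:

```python
from sklearn.metrics import cohen_kappa_score

def judge_is_calibrated(human_scores: list[int], judge_scores: list[int]) -> bool:
    """Agreement between human graders and the LLM judge on the same cases.
    Below 0.6, treat the judge's scores as unreliable."""
    kappa = cohen_kappa_score(human_scores, judge_scores)
    print(f"Cohen's kappa: {kappa:.2f}")
    return kappa >= 0.6
```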
Mistake 6: Optimizing for benchmark score instead of task success rate
What it looks like: You're tracking SWE-bench or GAIA scores as your primary quality metric. The numbers go up. The product doesn't noticeably improve.
Why it fails: Public benchmarks measure performance on a specific, curated task distribution. Your users have a different task distribution. Optimizing for SWE-bench makes your agent better at SWE-bench tasks — which may have minimal overlap with what your users actually ask it to do.
This is a specific form of Goodhart's Law: the benchmark is a proxy for quality, not quality itself. As soon as you optimize the proxy directly, the proxy and the real thing decouple.
What to do instead: Use public benchmarks for orientation (how does my agent compare to alternatives?) and for regression detection (did a model update hurt general capability?). Use your own task success rate — measured on tasks that represent real user requests — as the primary metric you optimize for.
If you don't have a production task distribution yet, build a synthetic one from user interviews, support tickets, and domain expert input. A small, domain-specific eval set of 50 representative tasks is more useful than 1,000 benchmark tasks from a different distribution.
Mistake 7: Skipping cost and latency as eval dimensions
What it looks like: Your eval measures whether the agent succeeds. It doesn't measure how long it takes or how much it costs per task.
Why it fails: An agent that achieves 90% task success rate at $0.50 per task and 45-second latency is not the same product as one that achieves 90% at $0.05 per task and 8-second latency. For most production use cases, the second agent is 10x better — even though the eval score is identical.
Cost and latency are first-class quality dimensions. They determine whether the agent is viable at scale, whether users wait for it or abandon it, and whether the unit economics work.
What to do instead: Add three metrics to every eval run:
- Cost per task — total token cost for a complete task (input + output across all turns)
- Steps per task — the number of LLM calls and tool invocations to complete the task (a proxy for latency and cost efficiency)
- Time to completion — wall-clock time from task start to final answer
Track these alongside accuracy. An improvement that raises accuracy by 2% but doubles cost per task is not obviously a win — you need the numbers to make that decision.
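A sketch of how these three numbers might be computed from a logged run. The per-token prices and the run's field names are assumptions about your logging format:

```python
def efficiency_metrics(run: dict, usd_per_input_token: float, usd_per_output_token: float) -> dict:
    """Cost, step count, and wall-clock time for one eval run.
    `run` is assumed to carry per-step token counts and start/end timestamps in seconds."""
    cost = sum(
        step["input_tokens"] * usd_per_input_token + step["output_tokens"] * usd_per_output_token
        for step in run["steps"]
    )
    return {
        "cost_per_task_usd": round(cost, 4),
        "steps_per_task": len(run["steps"]),
        "time_to_completion_s": run["ended_at"] - run["started_at"],
    }
```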
The common thread
Every mistake on this list comes from the same root cause: applying LLM evaluation thinking to an agent evaluation problem. LLM evals are output-focused, static, and single-turn. Agent evals need to be trajectory-focused, continuous, and multi-dimensional.
The fix is not to throw out what you have — it's to extend it. Start with trajectory logging. Add tool-call accuracy to your scoring. Set a recurring cadence for eval instead of treating it as a gate. Calibrate your judge. Measure cost.
None of these require a new framework. They require the discipline to instrument more and measure more.
Start here
If you're doing one thing after reading this: add trajectory logging to your next agent run. Capture every tool call, argument, return value, and decision. Store it somewhere queryable. You can't fix what you can't see — and right now, most teams can't see their agents' trajectories at all.
A typed knowledge graph is a natural substrate for this — every tool call becomes a typed node, every argument and return value becomes a property, and the causal sequence becomes a set of edges you can traverse to replay or compare runs. That's the substrate Oxagen provides via its MCP-native ontology layer; whether you build it yourself or plug it in, the shape of the data is the same.
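Purely as an illustration of that shape (not Oxagen's actual API), here is what a trajectory-as-graph might look like with networkx, reusing the Trajectory sketch from Mistake 1:

```python
import networkx as nx

def trajectory_to_graph(traj) -> nx.DiGraph:
    """Each tool call becomes a typed node; consecutive calls are linked by
    FOLLOWED_BY edges so a run can be replayed step by step or diffed against
    another run. The node and edge type names are illustrative, not a fixed ontology."""
    g = nx.DiGraph()
    for i, step in enumerate(traj.steps):
        g.add_node(i, type="ToolCall", tool=step.tool,
                   arguments=step.arguments, result=step.result, error=step.error)
        if i > 0:
            g.add_edge(i - 1, i, type="FOLLOWED_BY")
    return g
```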
From there, the path is straightforward: score the trajectories, build a living test set, and add cost and latency to your metrics. Each step is incremental. The compounding effect is a reliable agent instead of one that passes evals and breaks in production.
Further reading
- LLM Evals vs Agent Evals: Key Differences — why endpoint-only scoring misses the trajectory failures this post addresses
- How to Evaluate Self-Improving AI Agents — eval harness design when the agent itself changes between runs
- 7 Mistakes Building Ontologies for AI Agents — the schema-design counterpart to this post, for the typed graph that stores your trajectories