Engineering
LLM Evals vs Agent Evals: Key Differences
By Mac Anderson
- AI Agents
- LLM Evaluation
- Agent Evaluation
- Evaluations
- Tool Use
- MLOps
If your team is evaluating an AI agent the same way you'd evaluate a language model, you're measuring the wrong thing — and you won't find out until something breaks in production.
LLM evaluation is a solved-enough problem. You have a model, you have a prompt, you measure output quality against a reference. Single input, single output. Clean.
Agent evaluation is not that. An agent makes decisions, calls tools, adjusts based on intermediate results, and pursues a goal across multiple steps. Measuring only the final output — whether you got the right answer — tells you almost nothing about whether the agent is reliable, efficient, or safe to run at scale.
This article draws the line between the two clearly, so you know exactly what you need to add when you move from evaluating a model to evaluating a system.
What LLM evaluation measures
A standard LLM eval has three components: a prompt (or prompt template), an expected output (or rubric), and a scoring function.
You run the prompt across a dataset of inputs, collect outputs, and grade them — against a reference string (exact match, ROUGE, BLEU), against a rubric (LLM-as-judge), or by a human panel.
The output is a distribution of scores across your dataset. You use that to compare model versions, detect regressions, and make go/no-go decisions on deployments.
This works because the unit of evaluation is a single exchange: prompt → completion. The model has no memory of previous turns, no tools, no environment to interact with. It produces text and stops.
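To make the shape concrete, here is a minimal sketch of that loop in Python. The dataset, the `complete()` stub, and the exact-match scorer are all placeholders for whatever your stack actually uses:

```python
# Minimal single-exchange LLM eval: prompt -> completion -> score.

def complete(prompt: str) -> str:
    # Stub: replace with a call to your model provider.
    return "positive"

dataset = [
    {"prompt": "Classify the sentiment: 'Great product!'", "expected": "positive"},
    {"prompt": "Classify the sentiment: 'Arrived broken.'", "expected": "negative"},
]

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected else 0.0

scores = [exact_match(complete(row["prompt"]), row["expected"]) for row in dataset]
print(f"accuracy: {sum(scores) / len(scores):.2%}")
```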
LLM eval methods you're probably already using:
- Exact match / F1 — for classification and extraction tasks
- ROUGE / BLEU — for summarization (increasingly replaced by LLM-as-judge)
- LLM-as-judge — a separate LLM grades outputs against a rubric; works well when you don't have ground truth (see the sketch after this list)
- Human eval — the ground truth, expensive, used for calibration
- Embedding similarity — for semantic correctness without string matching
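As referenced above, here is a minimal LLM-as-judge sketch. The rubric, the `judge_complete()` stub, and the 1-to-5 scale are illustrative assumptions, not a standard:

```python
# LLM-as-judge: a separate model grades each output against a rubric.

JUDGE_PROMPT = """You are grading a summary against a rubric.
Rubric: covers the key facts, no unsupported claims, under 100 words.
Summary:
{summary}
Respond with a single integer from 1 (fails the rubric) to 5 (meets it)."""

def judge_complete(prompt: str) -> str:
    # Stub: replace with a call to your judge model.
    return "4"

def judge_score(summary: str) -> int:
    raw = judge_complete(JUDGE_PROMPT.format(summary=summary))
    return int(raw.strip())  # real code should parse defensively
```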
These methods are useful. They're also completely insufficient for agents.
What agent evaluation must add
An agent is not a model — it's a system. Between the initial task and the final result, an agent:
- Decides which tool to call (and when)
- Parses tool output and updates its internal state
- Decides whether its current progress is sufficient or whether to try again
- Manages context across multiple turns
- Terminates (correctly, or incorrectly)
Each of these is an independent failure mode. An agent can produce a correct final answer via a broken trajectory — it got lucky. Alternatively, it can fail to reach the right answer despite making good decisions at every step, because a tool returned a bad result. Measuring only the endpoint catches neither of these cases.
The trajectory problem
The core difference between LLM and agent evaluation is that agents have trajectories. A trajectory is the full sequence of actions an agent took to reach a result: which tools it called, in what order, with what arguments, and what it received back.
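One way to pin this down is a step-level record. This schema is a sketch, not a standard; the field names are ours, but they map directly onto the four parts of a trajectory just listed:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TrajectoryStep:
    tool: str                    # which tool the agent called
    arguments: dict[str, Any]    # with what arguments
    result: Any                  # what it received back
    error: str | None = None     # set when the tool call failed

@dataclass
class Trajectory:
    task: str
    steps: list[TrajectoryStep] = field(default_factory=list)
    final_output: str | None = None  # the endpoint an LLM eval would grade
```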
Evaluating a trajectory means asking questions that don't exist in LLM eval:
- Did the agent call the right tool for this step?
- Did it call tools in an efficient order, or did it take unnecessary detours?
- When a tool failed, did it recover correctly or did it hallucinate a result?
- Did it know when to stop?
You can't answer any of these by looking at the final output.
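A trajectory-level check makes those questions scoreable. The sketch below reuses the `Trajectory` schema above; `expected_tools` and `max_steps` are hypothetical grading criteria you would define per task:

```python
def grade_trajectory(traj: Trajectory, expected_tools: list[str], max_steps: int) -> dict:
    called = [s.tool for s in traj.steps]
    # Steps that errored with no later retry of the same tool count as unrecovered.
    unrecovered = [
        i for i, s in enumerate(traj.steps)
        if s.error is not None
        and not any(later.tool == s.tool for later in traj.steps[i + 1:])
    ]
    return {
        "right_tools": called[: len(expected_tools)] == expected_tools,  # right tool per step?
        "efficient": len(traj.steps) <= max_steps,                       # no unnecessary detours?
        "unrecovered_errors": unrecovered,                               # recovered from failures?
        "terminated_with_output": traj.final_output is not None,         # knew when to stop?
    }
```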
What agent evaluation must measure
| Dimension | What to ask | LLM eval covers this? |
|---|---|---|
| Task completion | Did the agent achieve the goal? | Partially (endpoint only) |
| Tool use accuracy | Were tool calls correct and necessary? | No |
| Trajectory efficiency | Did it take the minimal path? | No |
| Error recovery | Did it handle tool failures correctly? | No |
| Termination | Did it stop at the right point? | No |
| Cost | What did it spend to get there? | No |
If your eval only covers the first row, you're evaluating an agent the way you'd evaluate a search engine: did it return the right document? That's not the same as knowing whether it's safe to run on your production data.
The same task, two eval mindsets
Here's a concrete example. The task: "Research the last 3 funding rounds for Anthropic and summarize the key investors."
LLM eval mindset:
You run the agent 50 times, collect the final summaries, and grade them against a reference answer. Score: 82% accuracy.
What you don't know:
- Whether the agent called a web search tool or hallucinated the data
- Whether it searched 3 times or 30 times
- Whether it recovered when a search returned a 429 error
- Whether it flagged uncertainty when a source contradicted another
Agent eval mindset:
You capture the full trajectory for each run. You grade:
- Tool call selection (did it use the right search API or try to make up results?)
- Search query quality (were the queries specific enough?)
- Data extraction accuracy (did it pull the right numbers from the results?)
- Conflict resolution (what did it do when two sources disagreed?)
- Final synthesis quality (same as the LLM eval)
Score: 71% task success rate, but you now know that 40% of failures happen at the tool-call-selection step — a fixable problem, not a model quality problem.
The LLM eval told you the output was mostly right. The agent eval told you why it was wrong and where to fix it.
When LLM evals are still useful inside an agent pipeline
LLM evals don't disappear when you move to agent evaluation — they become components of it.
If your agent has a summarization step, you can LLM-eval that step in isolation. If it has a classification step (route this to tool A or tool B), you can LLM-eval that decision node. Modular evals at the step level catch regressions in individual components before they compound across a 10-step trajectory.
The rule: LLM eval at the component level, agent eval at the system level. Both are necessary. Neither is sufficient alone.
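As a sketch of what component-level eval looks like, a routing decision node can be graded exactly like a classifier. The `route()` stub and the case set are placeholders for your agent's actual routing step:

```python
# Component-level LLM eval of one decision node: route to tool A or tool B.
routing_cases = [
    {"task": "What were Anthropic's last funding rounds?", "expected": "web_search"},
    {"task": "Summarize the report I just pasted.", "expected": "summarizer"},
]

def route(task: str) -> str:
    # Stub: replace with the agent's actual routing prompt + model call.
    return "web_search"

correct = sum(route(case["task"]) == case["expected"] for case in routing_cases)
print(f"routing accuracy: {correct}/{len(routing_cases)}")
```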
Tooling that crosses over (and tooling that doesn't)
| Tool | LLM eval | Agent eval | Notes |
|---|---|---|---|
| Promptfoo | Strong | Limited | Primarily prompt/completion; trajectory support is nascent |
| Braintrust | Strong | Partial | Supports logging multi-step runs; scoring is still mostly output-focused |
| LangSmith | Partial | Growing | Trajectory capture is good; evaluation scoring still mostly manual |
| Inspect AI (UK AISI) | Yes | Yes | Built for agent evals; steep setup curve |
| Custom harness | Full control | Full control | Required for any non-standard tool inventory or grading criteria |
The honest answer is that the tooling for agent evaluation is still catching up to the problem. Most mature eval frameworks were designed for LLMs and have been extended toward agents. If your agent uses non-standard tools, calls internal APIs, or has domain-specific success criteria, you'll likely need a custom harness for the trajectory-level evaluation — even if you use an off-the-shelf tool for component-level scoring.
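For orientation, a custom harness usually reduces to a loop like the following sketch, which reuses the earlier `Trajectory`, `grade_trajectory`, and `judge_score` sketches; `run_agent()` is a placeholder for however you drive your agent:

```python
def run_agent(task: str) -> Trajectory:
    # Drive your agent here, appending a TrajectoryStep for every tool call.
    ...

def evaluate(cases: list[dict]) -> list[dict]:
    results = []
    for case in cases:
        traj = run_agent(case["task"])
        results.append({
            "task": case["task"],
            # System level: grade the path, not just the endpoint.
            "trajectory": grade_trajectory(traj, case["expected_tools"], case["max_steps"]),
            # Component level: ordinary LLM eval on the final synthesis.
            "output": judge_score(traj.final_output or ""),
        })
    return results
```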
What to do this week
If you're running LLM evals on an agent today, add three things:
- Trajectory logging — capture every tool call, argument, and return value. If you can't replay a run, you can't evaluate it.
- A tool-call accuracy check — for each step, was the tool it called the right one? Were the arguments valid? Did it handle the response correctly?
- A termination check — did the agent stop when it had enough information, or did it over-run? Did it stop too early?
These three additions don't require a new eval framework. They require capturing more data per run and adding two scoring functions to what you already have.
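As a starting point, those two scoring functions can be as simple as the following sketch, again shaped to the `Trajectory` schema above; the allowed-tool set and step bounds are hypothetical thresholds you would tune per task:

```python
def tool_call_accuracy(traj: Trajectory, allowed_tools: set[str]) -> float:
    # Fraction of steps where the tool was in the allowed set, the arguments
    # were well-formed, and the call did not error.
    if not traj.steps:
        return 0.0
    valid = sum(
        1 for s in traj.steps
        if s.tool in allowed_tools and isinstance(s.arguments, dict) and s.error is None
    )
    return valid / len(traj.steps)

def termination_check(traj: Trajectory, min_steps: int, max_steps: int) -> str:
    # Crude bounds check; tighten with task-specific stop conditions.
    if traj.final_output is None:
        return "never_terminated"
    if len(traj.steps) < min_steps:
        return "stopped_too_early"
    if len(traj.steps) > max_steps:
        return "over_ran"
    return "ok"
```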
From there, the path is: add checkpoint grading for multi-step tasks, add cost tracking, and eventually build or adopt a dedicated agent eval harness. But you can start with trajectory logging and two new metrics today.
Further reading
- How to Evaluate AI Agents: A Complete Framework — the eval harness and metrics taxonomy this piece builds on
- 10 Metrics Every AI Agent Eval Should Track — what to measure first in a workspace-scoped agent stack
- AI Agent Benchmarks: What Actually Matters — distribution mismatch, judge inflation, and other benchmark traps
Oxagen is the ontology layer for AI agents — a typed, workspace-scoped knowledge graph that makes agent memory queryable, auditable, and improvable. Read the docs to get an API key, or book a demo to see production agent memory in practice.