Engineering
LLM Evals vs Agent Evals: Key Differences
By Mac Anderson
- AI Agents
- LLM Evaluation
- Agent Evaluation
- Evaluations
- Tool Use
- MLOps
If your team is evaluating an AI agent the same way you'd evaluate a language model, you're measuring the wrong thing — and you won't find out until something breaks in production.
LLM evaluation is a solved-enough problem. You have a model, you have a prompt, you measure output quality against a reference. Single input, single output. Clean.
Agent evaluation is not that. An agent makes decisions, calls tools, adjusts based on intermediate results, and pursues a goal across multiple steps. Measuring only the final output — whether you got the right answer — tells you almost nothing about whether the agent is reliable, efficient, or safe to run at scale.
This article draws the line between the two clearly, so you know exactly what you need to add when you move from evaluating a model to evaluating a system.
What LLM evaluation measures
A standard LLM eval has three components: a prompt (or prompt template), an expected output (or rubric), and a scoring function.
You run the prompt across a dataset of inputs, collect outputs, and grade them — against a reference string (exact match, ROUGE, BLEU), against a rubric (LLM-as-judge), or by a human panel.
The output is a distribution of scores across your dataset. You use that to compare model versions, detect regressions, and make go/no-go decisions on deployments.
This works because the unit of evaluation is a single exchange: prompt → completion. The model has no memory of previous turns, no tools, no environment to interact with. It produces text and stops.
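To make the shape concrete, here is a minimal sketch of that loop in Python. The dataset, the `complete()` stub, and the exact-match scorer are all placeholders for whatever your stack actually uses:

```python
# Minimal single-exchange LLM eval: prompt -> completion -> score.

def complete(prompt: str) -> str:
    # Stub: replace with a call to your model provider.
    return "positive"

dataset = [
    {"prompt": "Classify the sentiment: 'Great product!'", "expected": "positive"},
    {"prompt": "Classify the sentiment: 'Arrived broken.'", "expected": "negative"},
]

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip().lower() == expected else 0.0

scores = [exact_match(complete(row["prompt"]), row["expected"]) for row in dataset]
print(f"accuracy: {sum(scores) / len(scores):.2%}")
```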
LLM eval methods you're probably already using:
- Exact match / F1 — for classification and extraction tasks
- ROUGE / BLEU — for summarization (increasingly replaced by LLM-as-judge)
- LLM-as-judge — a separate LLM grades outputs against a rubric; works well when you don't have ground truth (see the sketch after this list)
- Human eval — the ground truth, expensive, used for calibration
- Embedding similarity — for semantic correctness without string matching
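As referenced above, here is a minimal LLM-as-judge sketch. The rubric, the `judge_complete()` stub, and the 1-to-5 scale are illustrative assumptions, not a standard:

```python
# LLM-as-judge: a separate model grades each output against a rubric.

JUDGE_PROMPT = """You are grading a summary against a rubric.
Rubric: covers the key facts, no unsupported claims, under 100 words.
Summary:
{summary}
Respond with a single integer from 1 (fails the rubric) to 5 (meets it)."""

def judge_complete(prompt: str) -> str:
    # Stub: replace with a call to your judge model.
    return "4"

def judge_score(summary: str) -> int:
    raw = judge_complete(JUDGE_PROMPT.format(summary=summary))
    return int(raw.strip())  # real code should parse defensively
```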
These methods are useful. They're also completely insufficient for agents.
What agent evaluation must add
An agent is not a model — it's a system. Between the initial task and the final result, an agent:
- Decides which tool to call (and when)
- Parses tool output and updates its internal state
- Decides whether its current progress is sufficient or whether to try again
- Manages context across multiple turns
- Terminates (correctly, or incorrectly)
Each of these is an independent failure mode. An agent can produce a correct final answer via a broken trajectory — it got lucky. Alternatively, it can fail to reach the right answer despite making good decisions at every step, because a tool returned a bad result. Measuring only the endpoint catches neither of these cases.
The trajectory problem
The core difference between LLM and agent evaluation is that agents have trajectories. A trajectory is the full sequence of actions an agent took to reach a result: which tools it called, in what order, with what arguments, and what it received back.
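One way to pin this down is a step-level record. This schema is a sketch, not a standard; the field names are ours, but they map directly onto the four parts of a trajectory just listed:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class TrajectoryStep:
    tool: str                    # which tool the agent called
    arguments: dict[str, Any]    # with what arguments
    result: Any                  # what it received back
    error: str | None = None     # set when the tool call failed

@dataclass
class Trajectory:
    task: str
    steps: list[TrajectoryStep] = field(default_factory=list)
    final_output: str | None = None  # the endpoint an LLM eval would grade
```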
Evaluating a trajectory means asking questions that don't exist in LLM eval:
- Did the agent call the right tool for this step?
- Did it call tools in an efficient order, or did it take unnecessary detours?
- When a tool failed, did it recover correctly or did it hallucinate a result?
- Did it know when to stop?
You can't answer any of these by looking at the final output.
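A trajectory-level check makes those questions scoreable. The sketch below reuses the `Trajectory` schema above; `expected_tools` and `max_steps` are hypothetical grading criteria you would define per task:

```python
def grade_trajectory(traj: Trajectory, expected_tools: list[str], max_steps: int) -> dict:
    called = [s.tool for s in traj.steps]
    # Steps that errored with no later retry of the same tool count as unrecovered.
    unrecovered = [
        i for i, s in enumerate(traj.steps)
        if s.error is not None
        and not any(later.tool == s.tool for later in traj.steps[i + 1:])
    ]
    return {
        "right_tools": called[: len(expected_tools)] == expected_tools,  # right tool per step?
        "efficient": len(traj.steps) <= max_steps,                       # no unnecessary detours?
        "unrecovered_errors": unrecovered,                               # recovered from failures?
        "terminated_with_output": traj.final_output is not None,         # knew when to stop?
    }
```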
What agent evaluation must measure
| Dimension | What to ask | LLM eval covers this? |
|---|---|---|
| Task completion | Did the agent achieve the goal? | Partially (endpoint only) |
| Tool use accuracy | Were tool calls correct and necessary? | No |
| Trajectory efficiency | Did it take the minimal path? | No |
| Error recovery | Did it handle tool failures correctly? | No |
| Termination | Did it stop at the right point? | No |
| Cost | What did it spend to get there? | No |
If your eval only covers the first row, you're evaluating an agent the way you'd evaluate a search engine: did it return the right document? That's not the same as knowing whether it's safe to run on your production data.
The same task, two eval mindsets
Here's a concrete example. The task: "Research the last 3 funding rounds for Anthropic and summarize the key investors."
LLM eval mindset:
You run the agent 50 times, collect the final summaries, and grade them against a reference answer. Score: 82% accuracy.
What you don't know:
- Whether the agent called a web search tool or hallucinated the data
- Whether it searched 3 times or 30 times
- Whether it recovered when a search returned a 429 error
- Whether it flagged uncertainty when a source contradicted another
Agent eval mindset:
You capture the full trajectory for each run. You grade:
- Tool call selection (did it use the right search API or try to make up results?)
- Search query quality (were the queries specific enough?)
- Data extraction accuracy (did it pull the right numbers from the results?)
- Conflict resolution (what did it do when two sources disagreed?)
- Final synthesis quality (same as the LLM eval)
Score: 71% task success rate, but you now know that 40% of failures happen at the tool-call-selection step — a fixable problem, not a model quality problem.
The LLM eval told you the output was mostly right. The agent eval told you why it was wrong and where to fix it.
When LLM evals are still useful inside an agent pipeline
LLM evals don't disappear when you move to agent evaluation — they become components of it.
If your agent has a summarization step, you can LLM-eval that step in isolation. If it has a classification step (route this to tool A or tool B), you can LLM-eval that decision node. Modular evals at the step level catch regressions in individual components before they compound across a 10-step trajectory.
The rule: LLM eval at the component level, agent eval at the system level. Both are necessary. Neither is sufficient alone.
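As a sketch of what component-level eval looks like, a routing decision node can be graded exactly like a classifier. The `route()` stub and the case set are placeholders for your agent's actual routing step:

```python
# Component-level LLM eval of one decision node: route to tool A or tool B.
routing_cases = [
    {"task": "What were Anthropic's last funding rounds?", "expected": "web_search"},
    {"task": "Summarize the report I just pasted.", "expected": "summarizer"},
]

def route(task: str) -> str:
    # Stub: replace with the agent's actual routing prompt + model call.
    return "web_search"

correct = sum(route(case["task"]) == case["expected"] for case in routing_cases)
print(f"routing accuracy: {correct}/{len(routing_cases)}")
```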
Tooling that crosses over (and tooling that doesn't)
| Tool | LLM eval | Agent eval | Notes |
|---|---|---|---|
| Promptfoo | Strong | Limited | Primarily prompt/completion; trajectory support is nascent |
| Braintrust | Strong | Partial | Supports logging multi-step runs; scoring is still mostly output-focused |
| LangSmith | Partial | Growing | Trajectory capture is good; evaluation scoring still mostly manual |
| Inspect AI (UK AISI) | Yes | Yes | Built for agent evals; steep setup curve |
| Custom harness | Full control | Full control | Required for any non-standard tool inventory or grading criteria |
The honest answer is that the tooling for agent evaluation is still catching up to the problem. Most mature eval frameworks were designed for LLMs and have been extended toward agents. If your agent uses non-standard tools, calls internal APIs, or has domain-specific success criteria, you'll likely need a custom harness for the trajectory-level evaluation — even if you use an off-the-shelf tool for component-level scoring.
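For orientation, a custom harness usually reduces to a loop like the following sketch, which reuses the earlier `Trajectory`, `grade_trajectory`, and `judge_score` sketches; `run_agent()` is a placeholder for however you drive your agent:

```python
def run_agent(task: str) -> Trajectory:
    # Drive your agent here, appending a TrajectoryStep for every tool call.
    ...

def evaluate(cases: list[dict]) -> list[dict]:
    results = []
    for case in cases:
        traj = run_agent(case["task"])
        results.append({
            "task": case["task"],
            # System level: grade the path, not just the endpoint.
            "trajectory": grade_trajectory(traj, case["expected_tools"], case["max_steps"]),
            # Component level: ordinary LLM eval on the final synthesis.
            "output": judge_score(traj.final_output or ""),
        })
    return results
```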
What to do this week
If you're running LLM evals on an agent today, add three things:
- Trajectory logging — capture every tool call, argument, and return value. If you can't replay a run, you can't evaluate it.
- A tool-call accuracy check — for each step, was the tool it called the right one? Were the arguments valid? Did it handle the response correctly?
- A termination check — did the agent stop when it had enough information, or did it over-run? Did it stop too early?
These three additions don't require a new eval framework. They require capturing more data per run and adding two scoring functions to what you already have.
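As a starting point, those two scoring functions can be as simple as the following sketch, again shaped to the `Trajectory` schema above; the allowed-tool set and step bounds are hypothetical thresholds you would tune per task:

```python
def tool_call_accuracy(traj: Trajectory, allowed_tools: set[str]) -> float:
    # Fraction of steps where the tool was in the allowed set, the arguments
    # were well-formed, and the call did not error.
    if not traj.steps:
        return 0.0
    valid = sum(
        1 for s in traj.steps
        if s.tool in allowed_tools and isinstance(s.arguments, dict) and s.error is None
    )
    return valid / len(traj.steps)

def termination_check(traj: Trajectory, min_steps: int, max_steps: int) -> str:
    # Crude bounds check; tighten with task-specific stop conditions.
    if traj.final_output is None:
        return "never_terminated"
    if len(traj.steps) < min_steps:
        return "stopped_too_early"
    if len(traj.steps) > max_steps:
        return "over_ran"
    return "ok"
```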
From there, the path is: add checkpoint grading for multi-step tasks, add cost tracking, and eventually build or adopt a dedicated agent eval harness. But you can start with trajectory logging and two new metrics today.
Further reading
- How to Evaluate AI Agents: A Complete Framework — the eval harness and metrics taxonomy this piece builds on
- 10 Metrics Every AI Agent Eval Should Track — what to measure first in a workspace-scoped agent stack
- AI Agent Benchmarks: What Actually Matters — distribution mismatch, judge inflation, and other benchmark traps
Oxagen is the ontology layer for AI agents — a typed, workspace-scoped knowledge graph that makes agent memory queryable, auditable, and improvable. Read the docs to get an API key, or book a demo to see production agent memory in practice.