# AI Agent Evaluation: How to Know If Your Agent Actually Works

> Source: <https://pub.towardsai.net/ai-agent-evaluation-how-to-know-if-your-agent-actually-works-c30ad0f08a08?source=rss----98111c9905da---4>
> Published: 2026-06-30 16:01:02+00:00

Last year I pushed an agent into production that looked brilliant in demos. It wrote flawless code, summarized tickets, and answered questions like a senior engineer at 3am. Then it silently miscategorized 1,200 support tickets over a weekend because someone changed the dropdown values in our CRM. The model was fine. My evaluation was garbage.

That failure taught me something I should have known from years of shipping software: the thing you do not measure is the thing that breaks. Model evaluation gave us the illusion of safety, but agents are not chatbots. They plan, they call tools, they make decisions over time. And almost nobody was testing that properly.

If you are building agents right now and your test suite is “does the LLM say something reasonable,” this article is for you. I want to walk through what actually works for evaluating agent behavior in production, with all the scar tissue I collected along the way.

Model evaluation has become a comfortable science. You have your benchmarks, your leaderboards, your MMLU scores and MATH-500 numbers. But those benchmarks measure a frozen output for a fixed input. Agents do not behave that way. An agent takes a prompt, reasons through it, calls a tool, gets a result, maybe calls another tool, and eventually produces an output. Non-deterministically. Every single time.

That means you are not evaluating a model anymore. You are evaluating a system. A system with state, side effects, and execution paths that multiply every time you add a tool. One research team measured their agents at over 400 unique execution traces for a single task with 5 tools available. Five tools does not sound like much until you realize each call happens in sequence and each result shapes the next decision.

The other problem is that correctness in the agent world is fuzzy. A model either gets the right answer to a math problem or it does not. An agent might complete the goal with extra steps, or optimize for the wrong variable, or succeed technically while wasting $15 worth of API credits on redundant research. I have seen agents achieve 95% task completion while costing 10x what a simpler prompt would have.

So we need a framework that treats the entire execution as a unit of evaluation. Input, plan, tool calls, and output. All of it together. That shift is what this article covers.

Before you pick a framework, you need to decide what you care about. I see three metrics that almost every production agent should track.

**Task completion** is the obvious one. Did the agent actually do what you asked? But here is where it gets tricky: you need to separate goal completion from path completion. Sometimes the agent reaches the right answer by doing something absurd along the way. I had a data analysis agent that correctly identified trends by loading a 2 GB CSV into its context window instead of writing a SQL aggregate query. Technically correct. Practically a disaster.

**Cost** is the metric nobody wants to look at until they get the bill. Every tool call, every reasoning token, every retry loop adds up. One agent I audited was spending 340 tokens on preamble before even looking at the user’s question. That is not a model problem, that is a configuration problem. But you will never see it if your evaluation only checks the final output.

**Latency** is the silent killer of user trust. You can have a perfect agent, but if it takes 90 seconds to respond while thinking through edge cases, your users will close the tab. I now track p50, p95, and p99 latency per agent task type. The gap between p50 and p99 tells you more about your real-world reliability than any mean.
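A minimal sketch of the percentile tracking described above, assuming latencies are already collected per task type as plain lists of seconds. The function names and the nearest-rank method are illustrative choices, not something the setup above prescribes.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(values)
    # Multiply before dividing so integer percentiles stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_by_task_type):
    """Map each task type to its p50, p95, and p99 latency in seconds."""
    return {
        task_type: {f"p{p}": percentile(samples, p) for p in (50, 95, 99)}
        for task_type, samples in latencies_by_task_type.items()
    }
```

With only a handful of samples, p95 and p99 collapse onto the slowest run, so the tail numbers only become meaningful once each task type has a few hundred executions behind it.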

A practical starting point is to log these three metrics for every single agent execution. Build a simple event log that captures task ID, completion status, total cost in dollars, and wall-clock time. Once you have that, you can start asking real questions. Which task types are eating budget? Where does latency spike? Which agents complete the goal but take the scenic route?
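One way such an event log could look is a JSON Lines file with one record per execution. This is a sketch under assumptions: the field names, the `task_type` field (added here so the "which task types are eating budget" question can be answered), and the file format are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class ExecutionRecord:
    task_id: str
    task_type: str  # assumed extra field, used for grouping
    completed: bool
    cost_usd: float
    wall_clock_seconds: float


def append_record(path, record):
    """Append one execution as a single JSON line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def load_records(path):
    """Read every execution back from the log, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [ExecutionRecord(**json.loads(line)) for line in f if line.strip()]


def cost_by_task_type(records):
    """Total spend per task type, to see where the budget goes."""
    totals = {}
    for record in records:
        totals[record.task_type] = totals.get(record.task_type, 0.0) + record.cost_usd
    return totals
```

An append-only file keeps the logging path trivial; the same records can be moved into a database later without changing what is captured.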

Here is where I see most teams go wrong. They write a few happy-path tests, run them once, and call it done. Agent test suites need to be living artifacts that grow with your system. I treat them like integration tests for a distributed system, because that is what an agent is.

My approach is to build three layers. The first layer is **unit tests for individual tools**. Each tool your agent can call should have its own test suite that verifies it returns correct output for known inputs. This is boring and essential. When your web search tool starts returning cached results, you want to know before your agent starts citing last week’s news.
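As an illustration of the stale-search case, a tool-level test might look like the sketch below. It assumes each search result carries a `fetched_at` timestamp and that 24 hours is the freshness budget; both are assumptions, and `fake_web_search` is a hypothetical stand-in for the real tool call.

```python
from datetime import datetime, timedelta, timezone

MAX_RESULT_AGE = timedelta(hours=24)  # assumed freshness budget, tune per tool
NOW = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)  # fixed clock keeps tests deterministic


def stale_results(results, now, max_age=MAX_RESULT_AGE):
    """Return the results whose fetched_at timestamp is older than max_age."""
    return [r for r in results if now - r["fetched_at"] > max_age]


def fake_web_search(query, age):
    """Hypothetical stand-in for the real tool: one result of a given age."""
    return [{"query": query, "url": "https://example.com/news", "fetched_at": NOW - age}]


def test_fresh_results_are_accepted():
    results = fake_web_search("agent evaluation", age=timedelta(hours=1))
    assert stale_results(results, NOW) == []


def test_week_old_cache_is_flagged():
    results = fake_web_search("agent evaluation", age=timedelta(days=7))
    assert stale_results(results, NOW) == results
```

The functions follow the pytest naming convention, so a runner would pick them up as written. In a real suite the fake would be replaced by a call to the actual tool.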

The second layer is **scenario tests for common tasks**. These are full end-to-end runs where you give the agent a realistic prompt and check the output against expected criteria. I keep these in a YAML file that looks roughly like this:

```
scenarios:
  - name: "summarize_ticket_urgent"
    input: "Summarize this support ticket and assign priority: [ticket text]"
    expected:
      contains: ["priority: high", "refund"]
      not_contains: ["priority: low"]
      max_tokens: 500
      max_tool_calls: 3
  - name: "research_competitor_pricing"
    input: "Find current pricing for [competitor] enterprise tier"
    expected:
      contains: ["pricing", "enterprise"]
      max_latency_seconds: 30
      max_cost_usd: 0.15
```
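A sketch of how the `expected` block of a scenario could be checked against one agent run. It assumes the YAML has already been loaded into a plain dict and that the run is summarized in a hypothetical `RunResult` record; neither is taken from a specific framework.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical summary of one agent run."""
    output: str
    tokens: int = 0
    tool_calls: int = 0
    latency_seconds: float = 0.0
    cost_usd: float = 0.0


def check_scenario(expected, result):
    """Return a list of failure messages; an empty list means the scenario passed."""
    failures = []
    text = result.output.lower()  # case-insensitive phrase matching
    for phrase in expected.get("contains", []):
        if phrase.lower() not in text:
            failures.append(f"missing phrase: {phrase!r}")
    for phrase in expected.get("not_contains", []):
        if phrase.lower() in text:
            failures.append(f"forbidden phrase present: {phrase!r}")
    # Each budget key is optional; only the ones present in the scenario are enforced.
    actuals = {
        "max_tokens": result.tokens,
        "max_tool_calls": result.tool_calls,
        "max_latency_seconds": result.latency_seconds,
        "max_cost_usd": result.cost_usd,
    }
    for key, actual in actuals.items():
        if key in expected and actual > expected[key]:
            failures.append(f"{key} exceeded: {actual} > {expected[key]}")
    return failures
```

Returning every failure rather than stopping at the first makes a broken scenario easier to diagnose from a single CI log.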

The third layer is **adversarial tests**. These are prompts designed to break your agent. Ambiguous requests, contradictory instructions, edge cases in your domain. I add at least one adversarial test every time I find a production failure. That CRM ticket disaster I mentioned? It became a test case that now runs on every deploy.

One thing I learned the hard way: do not hard-code expected outputs. Agents are non-deterministic, so exact string matching will give you flaky tests. Instead, use semantic checks. Does the output contain the key facts? Does it avoid the forbidden phrases? Is it within the cost budget? These fuzzy checks are more work to set up, but they do not break every time the model gets a minor update.

Some things are hard to check programmatically. Is this summary actually good? Did the agent reason correctly? For these questions, I use another LLM as a judge. It sounds circular, but it works surprisingly well if you are careful about how you set it up.

The pattern is straightforward. You take the agent’s full execution trace, including the input, the plan, the tool calls, and the final output. You feed that to a judge LLM along with a rubric that describes what good looks like. The judge returns a score and a rationale.

Here is the rubric template I settled on after about a dozen iterations:

```
You are evaluating an AI agent's response to a task.

Task: {task_input}
Agent Output: {agent_output}
Tool Calls Made: {tool_calls}

Score the following dimensions from 1-5:
1. Goal completion: Did the agent achieve what was asked?
2. Efficiency: Did the agent avoid unnecessary steps or tool calls?
3. Accuracy: Are the facts in the output correct?
4. Safety: Did the agent avoid harmful or inappropriate actions?

Provide a one-sentence rationale for each score.
```
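The template can be filled and its reply parsed roughly as follows. This is a sketch: the final "Reply with a JSON object" line is an added assumption so the scores are machine-readable, the dimension keys are hypothetical, and the call to the judge model itself is left out because it depends on the provider.

```python
import json

# Rubric from above, plus one assumed line asking for JSON so the reply can be parsed.
RUBRIC = """You are evaluating an AI agent's response to a task.

Task: {task_input}
Agent Output: {agent_output}
Tool Calls Made: {tool_calls}

Score the following dimensions from 1-5:
1. Goal completion: Did the agent achieve what was asked?
2. Efficiency: Did the agent avoid unnecessary steps or tool calls?
3. Accuracy: Are the facts in the output correct?
4. Safety: Did the agent avoid harmful or inappropriate actions?

Provide a one-sentence rationale for each score.
Reply with a JSON object keyed by goal_completion, efficiency, accuracy,
and safety, each holding "score" and "rationale"."""

DIMENSIONS = ("goal_completion", "efficiency", "accuracy", "safety")


def build_judge_prompt(task_input, agent_output, tool_calls):
    """Fill the rubric with one execution trace."""
    return RUBRIC.format(
        task_input=task_input,
        agent_output=agent_output,
        tool_calls=json.dumps(tool_calls, indent=2),
    )


def parse_judge_reply(reply):
    """Turn the judge's JSON reply into {dimension: (score, rationale)}."""
    data = json.loads(reply)
    scores = {}
    for dim in DIMENSIONS:
        score = data[dim]["score"]
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"{dim}: score must be an integer from 1 to 5, got {score!r}")
        scores[dim] = (score, str(data[dim]["rationale"]))
    return scores
```

Validating the reply strictly is deliberate: a judge that returns a malformed or out-of-range score should fail loudly rather than slip a bogus number into the metrics.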

The key insight is that the judge needs context. If you just hand it the final output, it cannot tell whether the agent took a reasonable path. Give it the full trace, and suddenly it can evaluate process, not just product.

I also learned to judge with a different model than the one the agent runs on. If you use the same model, you get a self-reinforcement effect where the judge is biased toward the agent’s reasoning style. I use a cheaper, faster model for the judge and reserve the expensive model for the agent itself. The cost savings alone justify the setup.

One warning: LLM judges are not ground truth. They are a proxy. I calibrate mine by having humans score a sample of 50 traces, then checking that the judge agrees with human judgment at least 85% of the time. If it drops below that, something has drifted and I need to revisit the rubric.
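The calibration check can be sketched as an agreement rate over paired scores. What counts as "agreeing" is not pinned down above, so this sketch assumes a judge score within one point of the human score counts as agreement; the `tolerance` parameter and function names are hypothetical.

```python
def agreement_rate(human_scores, judge_scores, tolerance=1):
    """Fraction of traces where the judge is within `tolerance` points of the human."""
    if len(human_scores) != len(judge_scores):
        raise ValueError("human and judge score lists must be the same length")
    if not human_scores:
        raise ValueError("need at least one scored trace")
    agreed = sum(
        1 for human, judge in zip(human_scores, judge_scores)
        if abs(human - judge) <= tolerance
    )
    return agreed / len(human_scores)


def judge_is_calibrated(human_scores, judge_scores, threshold=0.85, tolerance=1):
    """True when agreement meets the threshold (85% in the setup described above)."""
    return agreement_rate(human_scores, judge_scores, tolerance) >= threshold
```

Setting `tolerance=0` demands exact matches, which is stricter and will usually report lower agreement on a 1-5 scale.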

This is the part that saved my sanity. Agents change behavior when you update prompts, swap models, or modify tool descriptions. Sometimes the change is intentional. Sometimes it is a silent regression that only shows up three weeks later when a customer complains.

I run a regression suite on every pull request that touches agent code. The suite replays a fixed set of scenarios and compares the new behavior against a baseline. If completion rate drops, or cost spikes, or latency increases beyond a threshold, the build fails.

The tricky part is defining “same behavior” for a non-deterministic system. I do not compare exact outputs. Instead, I compare distributions. If the agent used to complete a task in 3–5 tool calls and now it takes 8–12, that is a regression even if the final output looks fine. I track the distribution of key metrics over the last 20 runs and flag anything that shifts by more than two standard deviations.
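A sketch of the two-standard-deviation check, assuming the history of a metric (tool calls, cost, or latency) is kept as a simple list with the newest run last. The function name and the handling of very short or perfectly flat histories are assumptions.

```python
from statistics import mean, stdev


def is_shifted(history, new_value, window=20, max_sigmas=2.0):
    """Flag new_value if it sits more than max_sigmas standard deviations
    from the mean of the last `window` runs."""
    recent = history[-window:]
    if len(recent) < 2:
        return False  # not enough data to estimate a spread
    mu = mean(recent)
    sigma = stdev(recent)  # sample standard deviation
    if sigma == 0:
        return new_value != mu  # flat baseline: any change is a shift
    return abs(new_value - mu) > max_sigmas * sigma
```

For a baseline that hovers between 3 and 5 tool calls, a run with 5 calls passes and a run with 10 is flagged, which matches the 3–5 versus 8–12 example.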

Here is a simplified version of the check I run in CI:

``` python
def check_regression(scenario, new_result, baseline_stats):
    # Check if a new agent result regresses from baseline.
    checks = {
        'completion': new_result.completed == baseline_stats['completed'],
        'cost': new_result.total_cost <= baseline_stats['cost_p95'] * 1.2,
        'latency': new_result.latency_seconds <= baseline_stats['latency_p95'] * 1.3,
        'tool_calls': new_result.tool_calls <= baseline_stats['tool_calls_max'] + 2,
    }
    failures = [k for k, v in checks.items() if not v]
    return len(failures) == 0, failures
```

The thresholds are intentionally generous. I would rather catch real regressions than chase noise. But every false positive I investigate teaches me something about how my agent behaves, so I do not mind the occasional alert.

You do not have to build everything from scratch. Several frameworks exist for agent evaluation, and each has a different philosophy. Here is what I found after spending a few weeks with the main options.

**LangSmith** gives you tracing, evaluation, and dataset management in one package. It integrates natively with LangChain, but you can use it with any agent if you instrument your code. The evaluation builder is flexible, and the trace viewer is genuinely useful for debugging. The downside is that you are locked into their platform, and pricing scales with trace volume.

**Braintrust** takes a more developer-centric approach. It treats evaluations as code, version-controlled alongside your agent. I like this because it means my test suite lives in the same repo as my agent, and changes to evaluation criteria go through code review. The flip side is that you need to be comfortable writing evaluation logic in TypeScript or Python.

**Arize Phoenix** focuses on observability. It is less about structured evaluation and more about understanding what your agent is doing in production. If you want to know why your agent started failing on Tuesdays, this is the tool. It pairs well with a separate evaluation framework for the structured test suite.

**OpenAI Evals** is the open-source option if you are already in the OpenAI ecosystem. It is lightweight and extensible, but you will need to build your own orchestration layer for running suites in CI. I use it for quick experiments but not for production regression testing.

My current setup uses Braintrust for the structured test suite and Arize Phoenix for production observability. That combination gives me both the rigor of version-controlled tests and the visibility to catch issues that only appear at scale. Your mileage will vary depending on your stack, but the principle holds: separate your test-time evaluation from your production monitoring.

The final piece is automation. All the tests in the world do not help if someone has to remember to run them. I treat agent evaluation as a first-class CI step, right next to unit tests and linting.

My pipeline has three stages. The first stage runs on every commit and checks tool-level unit tests. These are fast, usually under 30 seconds, and they catch the obvious breakages. The second stage runs on pull requests and executes the full scenario suite, including LLM-as-judge evaluations. This takes a few minutes and costs a few dollars in API credits, but it catches the subtle regressions. The third stage runs on merge to main and replays the adversarial suite, then publishes the results to a dashboard that the team reviews weekly.

One practical tip: cache your test results. If a scenario has not changed and the agent code has not changed, there is no reason to re-run the evaluation. I hash the scenario definition plus the agent configuration and use that as a cache key. This cut our CI costs by about 60%.
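The cache key can be sketched as a SHA-256 hash over a canonical JSON dump of the scenario and the agent configuration. The in-memory dict and the `cached_eval` helper are hypothetical; a CI setup would more likely persist results between runs.

```python
import hashlib
import json


def cache_key(scenario, agent_config):
    """Stable hash of the scenario definition plus the agent configuration."""
    payload = json.dumps(
        {"scenario": scenario, "agent_config": agent_config},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),   # whitespace must not change the hash
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_eval(cache, scenario, agent_config, run):
    """Return the cached result for this key, calling run() only on a miss."""
    key = cache_key(scenario, agent_config)
    if key not in cache:
        cache[key] = run()
    return cache[key]
```

The key is only as good as what goes into `agent_config`. Since prompts, model choice, and tool descriptions all change agent behavior, each of them needs to be part of the hashed configuration or the cache will serve stale results.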

Another tip: set budgets. I give each CI run a maximum cost in dollars. If the evaluation suite exceeds that budget, the build fails with a clear message about which scenarios are too expensive. This forces the team to keep test scenarios lean and prevents a single expensive evaluation from eating the entire CI budget.
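A budget gate along these lines could run at the end of the evaluation stage. This is a sketch: the exception name, the message format, and the choice to list the three most expensive scenarios are assumptions.

```python
class BudgetExceeded(Exception):
    """Raised when an evaluation run costs more than its budget."""


def enforce_budget(scenario_costs, max_total_usd, top_n=3):
    """Return the total cost, or raise with the most expensive scenarios named."""
    total = sum(scenario_costs.values())
    if total <= max_total_usd:
        return total
    # Sort by cost, highest first, so the message points at the worst offenders.
    worst = sorted(scenario_costs.items(), key=lambda item: item[1], reverse=True)
    offenders = ", ".join(f"{name}=${cost:.2f}" for name, cost in worst[:top_n])
    raise BudgetExceeded(
        f"evaluation cost ${total:.2f} exceeds budget ${max_total_usd:.2f}; "
        f"most expensive scenarios: {offenders}"
    )
```

Letting the exception propagate makes the CI step exit non-zero, and the message tells the team which scenarios to trim first.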

Agent evaluation is not a solved problem. The frameworks I described above are good enough to catch most issues, but they are not perfect. I still find edge cases that slip through, and I suspect that will be true for a while.

What I can tell you is that the teams who take evaluation seriously are the ones shipping reliable agents. The teams who treat it as an afterthought are the ones getting paged at 2am because their agent decided to email the entire customer list.

Start with the basics. Log your executions. Build a small scenario suite. Add a judge. Run it in CI. Then iterate. The perfect evaluation framework does not exist, but a good-enough framework that runs on every deploy will catch 90% of your problems before your users do.

If you want a deeper dive into any of these patterns, I have been documenting my agent evaluation setup on my blog, including the full YAML schema for scenario tests and the CI configuration I use in production. The link is in my bio.

What is the biggest blind spot in your current agent testing? I would genuinely love to hear what is breaking for you, because I guarantee it is something I have not thought of yet.

[AI Agent Evaluation: How to Know If Your Agent Actually Works](https://pub.towardsai.net/ai-agent-evaluation-how-to-know-if-your-agent-actually-works-c30ad0f08a08) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.
