Most teams I talk to have "evals." I ask them where the evals run. The answer is almost always the same: a notebook, a dashboard, a spreadsheet someone updates after a bad week. That is not an eval suite. That is a museum.
Here is the opinion I will defend for the rest of this post: if your agent's quality checks cannot block a merge, they are decorative. The entire value of an eval is that it stops a regression before it reaches a user. A score you read on Monday about a deploy you shipped Friday is a postmortem, not a gate.
We gate code with unit tests. We gate APIs with contract tests. We gate infra with terraform plan. Then we take the single most non-deterministic component in the stack, an LLM agent that can silently change behavior when a vendor ships a new checkpoint, and we let it through on vibes. That asymmetry is the actual bug.
The failure isn't that engineers are lazy. It's that manual eval runs degrade under exactly the conditions where you need them most:
gpt-4o, but gpt-4o is not a constant: providers roll checkpoints. Your prompt is identical and your behavior shifted anyway.
Every one of these passes code review. Every one of these is caught by a regression suite that runs on the PR. The fix is boring and it works: treat agent behavior like any other thing you'd protect with CI.
A CI gate for agents needs two things, and people consistently build only one.
The first is a scorer: something that takes the agent's output for a fixed set of inputs and returns a pass/fail signal. That means deterministic checks for the things that must be true (valid JSON, no banned claims, required fields present), and model-as-judge for the fuzzy stuff (was the answer actually helpful, did it stay on policy). I lean on agent-eval for this layer: define cases, attach assertions, get a gate result you can hand to process.exit.
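To make the deterministic half concrete, here is a minimal, dependency-free sketch of what such a scorer can look like. Every name in it (Check, gate, hasFields, notContains) is hypothetical and chosen for illustration; it is not the agent-eval API, only the general shape of "run fixed checks, return pass/fail plus the names of whatever failed."

```typescript
// Hypothetical names throughout: a sketch of the idea, not the agent-eval API.
type Check = { name: string; run: (output: string) => boolean };
type GateResult = { passed: boolean; failingChecks: string[] };

// Passes when the raw output parses as JSON.
const isValidJson: Check = {
  name: "valid_json",
  run: (output) => {
    try {
      JSON.parse(output);
      return true;
    } catch {
      return false;
    }
  },
};

// Passes when the output is a JSON object containing every required key.
function hasFields(fields: string[]): Check {
  return {
    name: "required_fields",
    run: (output) => {
      try {
        const parsed: unknown = JSON.parse(output);
        if (typeof parsed !== "object" || parsed === null) return false;
        return fields.every((f) => f in parsed);
      } catch {
        return false;
      }
    },
  };
}

// Passes when none of the banned phrases appear (case-insensitive).
function notContains(banned: string[]): Check {
  return {
    name: "no_banned_phrases",
    run: (output) => {
      const lower = output.toLowerCase();
      return banned.every((b) => !lower.includes(b.toLowerCase()));
    },
  };
}

// Runs every check and reports which ones failed by name.
function gate(output: string, checks: Check[]): GateResult {
  const failingChecks = checks.filter((c) => !c.run(output)).map((c) => c.name);
  return { passed: failingChecks.length === 0, failingChecks };
}

// Usage with made-up outputs.
const checks = [isValidJson, hasFields(["answer", "sources"]), notContains(["as an AI"])];
const good = gate('{"answer":"Refunds take 5 days","sources":["kb/12"]}', checks);
const bad = gate("As an AI, I cannot verify that.", checks);
console.log(good.passed, bad.failingChecks);
```

Checks like these are cheap enough to run on every PR, and because the result carries the names of the failing checks, the CI log says what broke instead of just that something did.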
The second half is the one teams skip, and it's why their CI gate gets ripped out within a month. When a case fails in CI, the score alone is useless. judge_helpfulness: 0.4 on case #17 tells you nothing actionable at 2am. You need the trace: which model call ran, what the resolved prompt actually was after template interpolation, which tool the agent picked, what arguments it passed, what raw payload came back. That's what AgentLens captures: the full execution trace of how the agent reached the output that agent-eval just flagged.
This is the part to internalize: agent-eval scores the destination, AgentLens records the route. A red eval without a trace is a smoke alarm with no map of the house. The two ship as a unit because a gate you can't debug is a gate your team learns to ignore. The score tells you that case #17 regressed; the trace tells you the prompt edit changed tool selection from search_db to web_fetch, which is why the answer went stale. One number, one root cause, same workflow.
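The "route" half can be illustrated without any tracing library. The sketch below assumes a trace can be reduced to an ordered list of tool calls; the ToolCall shape and the firstDivergence helper are hypothetical, and AgentLens's real trace format may look different. It compares a known-good run against the current one and reports the first step where the chosen tool differs.

```typescript
// Hypothetical trace shape for illustration; not AgentLens's actual format.
type ToolCall = { tool: string; args: Record<string, unknown> };
type Divergence = { step: number; baseline?: ToolCall; current?: ToolCall };

// Returns the first step where the two runs chose different tools
// (or where one run ended early), or null if the tool sequences match.
// Only tool names are compared here; arguments are ignored.
function firstDivergence(baseline: ToolCall[], current: ToolCall[]): Divergence | null {
  const steps = Math.max(baseline.length, current.length);
  for (let i = 0; i < steps; i++) {
    const b = baseline[i];
    const c = current[i];
    if (!b || !c || b.tool !== c.tool) {
      return { step: i, baseline: b, current: c };
    }
  }
  return null;
}

// Usage with made-up traces mirroring the search_db -> web_fetch example.
const before: ToolCall[] = [{ tool: "search_db", args: { query: "refund policy" } }];
const after: ToolCall[] = [{ tool: "web_fetch", args: { url: "https://example.com/refunds" } }];
const diff = firstDivergence(before, after);
console.log(diff);
```

A diff like this turns "case #17 regressed" into "step 0 switched tools," which is the kind of lead an on-call engineer can act on.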
Here's a stripped-down CI harness. Golden inputs live in version control, agent-eval scores each output, AgentLens wraps the run so any failure carries its trace, and a non-zero exit code blocks the merge.
import { evaluate, assert } from "agent-eval";
import { trace, flush } from "agentlens";
import { runAgent } from "../src/agent";
import goldens from "./goldens.json";

// Each golden: { id, input, mustContain?, policy }
async function runSuite() {
  const results = await Promise.all(
    goldens.map((g) =>
      // AgentLens wraps the run: every model + tool step is recorded
      // under this trace id, with resolved inputs and raw outputs.
      trace({ name: g.id }, async (span) => {
        const output = await runAgent(g.input);

        const report = await evaluate({
          input: g.input,
          output,
          checks: [
            // Deterministic gates: cheap, zero flake.
            assert.isValidJson(),
            assert.contains(g.mustContain ?? []),
            assert.notContains(["as an AI", "I cannot verify"]),
            // Model-as-judge for the fuzzy contract.
            assert.judge({
              criterion: g.policy,
              threshold: 0.7,
            }),
          ],
        });

        // Attach the eval verdict to the trace so a CI failure
        // links straight to the steps that produced it.
        span.setAttribute("eval.passed", report.passed);
        span.setAttribute("eval.score", report.score);

        return { id: g.id, ...report, traceUrl: span.url };
      })
    )
  );

  await flush();

  const failed = results.filter((r) => !r.passed);
  for (const f of failed) {
    console.error(`FAIL ${f.id} score=${f.score.toFixed(2)}`);
    console.error(`  why: ${f.failingChecks.join(", ")}`);
    console.error(`  trace: ${f.traceUrl}`); // jump to the route, not just the score
  }

  if (failed.length > 0) {
    console.error(`\n${failed.length}/${results.length} cases regressed.`);
    process.exit(1); // <- this is the whole point: block the merge
  }

  console.log(`All ${results.length} cases passed.`);
}
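
// If the harness itself throws (network error, bad golden file),
// fail the build explicitly instead of relying on runtime defaults.
runSuite().catch((err) => {
  console.error(err);
  process.exit(1);
});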
Wire that into a workflow step and you're done:
- name: Agent regression gate
  run: npx tsx evals/run.ts
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
The failure output isn't score: 0.4. It's FAIL refund_policy_edge score=0.55, why=judge_policy, plus a trace link that drops you onto the exact tool call where the agent went off-script. That difference, score-plus-trace versus score-alone, is what makes engineers trust the gate instead of routing around it.
Two practical guardrails, because a slow or flaky gate gets disabled:
Stop treating eval scores as analytics you review and start treating them as tests that fail the build. The bar is not "do we have evals." The bar is "can a bad prompt edit fail to merge." If the answer is no, you don't have an eval suite; you have a dashboard you'll stop opening.
Score the output with agent-eval. Capture the route with AgentLens. Exit non-zero when it regresses. That's the whole discipline, and it's the difference between catching the drift on the PR and explaining it to a customer.