[ARTICLE · art-28774] src=dev.to ↗ pub= topic=ai-agents verified=true sentiment=↑ positive

Put Your Agent Evals in CI or Stop Calling Them Evals

A developer argues that agent evaluations must run in CI to block regressions before deployment, not as post-hoc dashboards. The post introduces agent-eval for scoring outputs and AgentLens for execution traces, advocating for treating agent behavior like any other CI-gated component.

5 min read · published Jun 16, 2026

Most teams I talk to have "evals." I ask them where the evals run. The answer is almost always the same: a notebook, a dashboard, a spreadsheet someone updates after a bad week. That is not an eval suite. That is a museum.

Here is the opinion I will defend for the rest of this post: if your agent's quality checks cannot block a merge, they are decorative. The entire value of an eval is that it stops a regression before it reaches a user. A score you read on Monday about a deploy you shipped Friday is a postmortem, not a gate.

We gate code with unit tests. We gate APIs with contract tests. We gate infra with terraform plan. Then we take the single most non-deterministic component in the stack (an LLM agent that can silently change behavior when a vendor ships a new checkpoint) and we let it through on vibes. That asymmetry is the actual bug.

The failure isn't that engineers are lazy. It's that manual eval runs degrade under exactly the conditions where you need them most:

... gpt-4o, but gpt-4o is not a constant: providers roll checkpoints. Your prompt is identical and your behavior shifted anyway.

Every one of these passes code review. Every one of these is caught by a regression suite that runs on the PR. The fix is boring and it works: treat agent behavior like any other thing you'd protect with CI.
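To make "regressed" concrete, one option is to commit a baseline score per golden case and compare each PR run against it. The sketch below assumes that setup; the Scores shape, the findRegressions helper, and the 0.05 tolerance are all hypothetical names and values for illustration, not part of agent-eval or AgentLens.

```typescript
// Hypothetical shapes: not part of agent-eval or AgentLens.
type Scores = Record<string, number>; // case id -> score in [0, 1]

interface Regression {
  id: string;
  baseline: number;
  current: number;
}

// Returns every case whose score fell by more than `tolerance`.
// A baseline case missing from the current run is treated as score 0.
function findRegressions(
  baseline: Scores,
  current: Scores,
  tolerance = 0.05 // assumed value; tune to your judge's noise
): Regression[] {
  const regressions: Regression[] = [];
  for (const [id, base] of Object.entries(baseline)) {
    const now = current[id] ?? 0;
    if (base - now > tolerance) {
      regressions.push({ id, baseline: base, current: now });
    }
  }
  return regressions;
}

const baselineScores: Scores = { refund_policy_edge: 0.9, greeting: 0.8 };
const currentScores: Scores = { refund_policy_edge: 0.55, greeting: 0.79 };

// Only refund_policy_edge dropped by more than the tolerance.
console.log(findRegressions(baselineScores, currentScores).map((r) => r.id));
```

The tolerance is there so ordinary judge noise does not turn the build red; a checkpoint roll that moves a case from 0.9 to 0.55 still does.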

A CI gate for agents needs two things, and people consistently build only one.

The first is a scorer: something that takes the agent's output for a fixed set of inputs and returns a pass/fail signal, with deterministic checks for the things that must be true (valid JSON, no banned claims, required fields present) and model-as-judge for the fuzzy stuff (was the answer actually helpful, did it stay on policy). I lean on agent-eval for this layer: define cases, attach assertions, get a gate result you can hand to process.exit.
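For a feel of what the deterministic half involves, here is a minimal sketch in plain TypeScript. It is not the agent-eval API; checkOutput, CheckResult, and the check names are hypothetical, and the only assumption is that the agent returns a JSON object as a string.

```typescript
// Hypothetical helper; this is not the agent-eval API.
interface CheckResult {
  name: string;
  passed: boolean;
}

// Runs three deterministic checks on a raw agent output string.
function checkOutput(
  raw: string,
  requiredFields: string[],
  bannedPhrases: string[]
): CheckResult[] {
  let parsed: unknown = undefined;
  let validJson = true;
  try {
    parsed = JSON.parse(raw);
  } catch {
    validJson = false;
  }

  // Required fields only make sense on a non-null, non-array JSON object.
  const isObject =
    typeof parsed === "object" && parsed !== null && !Array.isArray(parsed);
  const hasFields =
    isObject &&
    requiredFields.every((f) => f in (parsed as Record<string, unknown>));

  // Case-insensitive substring match against the raw text.
  const lower = raw.toLowerCase();
  const clean = bannedPhrases.every((p) => !lower.includes(p.toLowerCase()));

  return [
    { name: "valid_json", passed: validJson },
    { name: "required_fields", passed: hasFields },
    { name: "no_banned_phrases", passed: clean },
  ];
}

const checks = checkOutput(
  '{"answer":"Refunds take 5 days","source":"kb"}',
  ["answer", "source"],
  ["as an AI"]
);
const failing = checks.filter((c) => !c.passed).map((c) => c.name);
console.log(failing.length === 0 ? "pass" : `fail: ${failing.join(", ")}`);
```

Checks like these cost nothing to run and never flake, which is why they belong ahead of any judge call in the gate.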

The second half is the one teams skip, and it's why their CI gate gets ripped out within a month. When a case fails in CI, the score alone is useless. A bare judge_helpfulness: 0.4 on case #17 tells you nothing actionable at 2am. You need the trace: which model call ran, what the resolved prompt actually was after template interpolation, which tool the agent picked, what arguments it passed, what raw payload came back. That's what AgentLens captures: the full execution trace of how the agent reached the output that agent-eval just flagged.

This is the part to internalize: agent-eval scores the destination, AgentLens records the route. A red eval without a trace is a smoke alarm with no map of the house. The two ship as a unit because a gate you can't debug is a gate your team learns to ignore. The score tells you that case #17 regressed; the trace tells you the prompt edit changed tool selection from search_db to web_fetch, which is why the answer went stale. One number, one root cause, same workflow.
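To show why the route matters, here is a rough sketch of the comparison a trace makes possible: given the tool steps from a known-good run and from the failing run, find the first point where tool selection diverged. The ToolStep shape and firstToolDivergence are hypothetical; AgentLens's real trace schema may look quite different.

```typescript
// Hypothetical trace shape; AgentLens's real schema may differ.
interface ToolStep {
  tool: string;
  args: Record<string, unknown>;
}

interface Divergence {
  index: number;
  expected: string | null; // tool in the known-good run, null if it had no step here
  actual: string | null; // tool in the failing run, null if it had no step here
}

// Returns the first step where tool selection differs,
// or null if both runs picked the same tools in the same order.
function firstToolDivergence(
  baseline: ToolStep[],
  current: ToolStep[]
): Divergence | null {
  const length = Math.max(baseline.length, current.length);
  for (let i = 0; i < length; i++) {
    const expected = baseline[i]?.tool ?? null;
    const actual = current[i]?.tool ?? null;
    if (expected !== actual) {
      return { index: i, expected, actual };
    }
  }
  return null;
}

const goodRun: ToolStep[] = [
  { tool: "search_db", args: { query: "refund window" } },
];
const failingRun: ToolStep[] = [
  { tool: "web_fetch", args: { url: "https://example.com/refunds" } },
];

// Reports index 0, expected search_db, actual web_fetch.
console.log(firstToolDivergence(goodRun, failingRun));
```

A score cannot tell you this; only the recorded steps can.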

Here's a stripped-down CI harness. Golden inputs live in version control, agent-eval scores each output, AgentLens wraps the run so any failure carries its trace, and a non-zero exit code blocks the merge.

import { evaluate, assert } from "agent-eval";
import { trace, flush } from "agentlens";
import { runAgent } from "../src/agent";
import goldens from "./goldens.json";

// Each golden: { id, input, mustContain?, policy }
async function runSuite() {
  const results = await Promise.all(
    goldens.map((g) =>
      // AgentLens wraps the run: every model + tool step is recorded
      // under this trace id, with resolved inputs and raw outputs.
      trace({ name: g.id }, async (span) => {
        const output = await runAgent(g.input);

        const report = await evaluate({
          input: g.input,
          output,
          checks: [
            // Deterministic gates: cheap, zero flake.
            assert.isValidJson(),
            assert.contains(g.mustContain ?? []),
            assert.notContains(["as an AI", "I cannot verify"]),
            // Model-as-judge for the fuzzy contract.
            assert.judge({
              criterion: g.policy,
              threshold: 0.7,
            }),
          ],
        });

        // Attach the eval verdict to the trace so a CI failure
        // links straight to the steps that produced it.
        span.setAttribute("eval.passed", report.passed);
        span.setAttribute("eval.score", report.score);
        return { id: g.id, ...report, traceUrl: span.url };
      })
    )
  );

  await flush();

  const failed = results.filter((r) => !r.passed);
  for (const f of failed) {
    console.error(`FAIL ${f.id}  score=${f.score.toFixed(2)}`);
    console.error(`  why:   ${f.failingChecks.join(", ")}`);
    console.error(`  trace: ${f.traceUrl}`); // jump to the route, not just the score
  }

  if (failed.length > 0) {
    console.error(`\n${failed.length}/${results.length} cases regressed.`);
    process.exit(1); // <- this is the whole point: block the merge
  }
  console.log(`All ${results.length} cases passed.`);
}

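// A crash in the harness itself must also fail the gate.
runSuite().catch((err) => {
  console.error(err);
  process.exit(1);
});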

Wire that into a workflow step and you're done:

- name: Agent regression gate
  run: npx tsx evals/run.ts
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The failure output isn't score: 0.4. It's FAIL refund_policy_edge score=0.55, why=judge_policy, plus a trace link that drops you onto the exact tool call where the agent went off-script. That difference, score-plus-trace versus score-alone, is what makes engineers trust the gate instead of routing around it.

Two practical guardrails, because a slow or flaky gate gets disabled:

Stop treating eval scores as analytics you review and start treating them as tests that fail the build. The bar is not "do we have evals." The bar is "can a bad prompt edit fail to merge." If the answer is no, you don't have an eval suite; you have a dashboard you'll stop opening.

Score the output with agent-eval. Capture the route with AgentLens. Exit non-zero when it regresses. That's the whole discipline, and it's the difference between catching the drift on the PR and explaining it to a customer.
