Goodhart's Law Comes for Your Agent Evals: Why Your Green Dashboard Stops Meaning Anything

A developer warns that Goodhart's Law undermines agent evaluation suites when pass/fail scores become targets, causing teams to optimize for test performance rather than real-world quality. The post advocates linking evaluation scores with execution traces—using tools like agent-eval and AgentLens—so that every gate decision is auditable and the measure remains meaningful.

There is a specific moment in the life of every agent team that nobody puts on the roadmap. You build an eval suite. It catches real bugs. You wire it into CI as a release gate. The dashboard goes green. And then, somewhere over the next three months, the green stops meaning anything, while everyone keeps treating it like it does.

This is Goodhart's Law, and it is coming for your agent evals whether you plan for it or not. "When a measure becomes a target, it ceases to be a good measure." The day your eval suite becomes the thing that decides what ships, it stops being a neutral measurement of quality and becomes a target your team optimizes toward. That is not a hypothetical risk. It is the default trajectory, and most teams only notice after a "fully passing" release lands in production and quietly makes everything worse.

The decay is boring, which is exactly why it's dangerous. Here's the usual sequence:

The endpoint is an agent with a 98% pass rate that is measurably worse for users, because the score is now measuring how well the agent satisfies the test, not how well it does the work. The map replaced the territory.

The cleanest signal that Goodhart has arrived is this: a release passes the gate, and nobody on the team can explain why a specific borderline case passed. It just did. The score is a number with no narrative behind it.

That's the real problem. A pass/fail bit is not a measurement you can reason about. It's a measurement you can only trust or distrust. And trust, unaudited, always decays toward green.

This is exactly the seam where the two tools I lean on have to work as one unit, not as separate dashboards.

agent-eval scores and gates the output. It runs the deterministic checks, the model-as-judge rubrics, and the drift and hallucination signals, and it returns a verdict on the output.

AgentLens captures the trace of how the agent got there.
Every model call and tool step, the resolved inputs after templating (not the raw template), and the raw outputs before any post-processing.

Neither half is sufficient alone, and that's the entire point. A bare eval score is a target waiting to be gamed. A bare trace is forensic data with no verdict attached. You need agent-eval's score anchored to AgentLens's trace so that every gate decision comes with a "show me why". When a borderline case flips, you don't argue about whether the eval is too strict. You open the trace, see the resolved prompt and the exact tool output, and find out whether the agent actually reasoned correctly or got lucky on a phrasing.

That linkage is what keeps the measure honest. The eval tells you the gate flipped; the trace tells you whether the flip was earned.

The anti-pattern is a gate that returns a boolean and nothing else:

```ts
// Goodhart bait: a verdict with no evidence behind it.
async function gate(testCase: TestCase): Promise<boolean> {
  const output = await runAgent(testCase.input);
  return judge(output, testCase.expected) >= 0.8; // green or red, no "why"
}
```

The fix is to make the score and the trace travel together, so a passing case is auditable, not just countable:

```ts
import { evaluate } from "agent-eval";
import { trace } from "agentlens";

interface GatedResult {
  passed: boolean;
  score: number;
  traceId: string;  // the receipt
  heldOut: boolean; // was this case ever debugged against?
}

async function gatedRun(testCase: TestCase): Promise<GatedResult> {
  // AgentLens records every model + tool step, resolved inputs, raw outputs.
  const session = trace.start({ caseId: testCase.id });
  const output = await runAgent(testCase.input, { trace: session });

  // agent-eval scores the OUTPUT: deterministic checks + judge rubric + drift.
  const verdict = await evaluate(output, {
    expected: testCase.expected,
    checks: ["schema", "grounding", "drift"],
    judge: "rubric-v3",
  });

  await session.attach({ verdict }); // bind score <-> trace

  return {
    passed: verdict.score >= 0.8,
    score: verdict.score,
    traceId: session.id,       // open this to see WHY it passed
    heldOut: testCase.heldOut, // overfit guard, see below
  };
}
```

Two things in that snippet are doing the anti-Goodhart work. The traceId means no pass is unexplainable: every green is one click from its own evidence. And heldOut is the discipline that keeps the suite from collapsing into a training set.

Tooling won't save you from Goodhart on its own. The process around it has to hold the line:

Quarantine a held-out set you never debug against. If you've ever opened the trace for a case to fix a failure, that case is burned for measurement. It's now a regression test, not an evaluation. Keep a rotating set you only ever score, never tune toward. When held-out and debugged scores diverge, that gap is your overfit, measured directly.

Treat eval edits like production changes. Loosening an assertion to get green is a code change with a blast radius. It needs a diff, a reviewer, and a one-line justification anchored to a trace: "this case was wrong because the trace shows X," not "this was flaky."

Mine new cases from production traces, not your imagination. The cases you invent reflect failures you can already picture. The cases in your AgentLens traces reflect what users actually trigger. Promote real, surprising traces into the held-out set continuously, so the suite keeps measuring a moving target instead of a frozen one.

A green eval dashboard is not evidence that your agent is good. It is evidence that your agent satisfies your evals, and those are only the same thing while you're actively defending the gap between them. The teams that ship reliable agents aren't the ones with the highest pass rates.
They're the ones who can pull up any green checkmark and explain, from the trace, exactly why it earned the pass. agent-eval gives you the verdict; AgentLens gives you the receipt. Keep them bound together, keep a real held-out set, and your dashboard might actually keep meaning something six months from now. Most won't. Now you know why.
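
For the held-out discipline described earlier, the gap between debugged and held-out pass rates can be computed directly from gate results. What follows is a minimal sketch, not part of agent-eval or AgentLens: `ScoredCase`, `OverfitReport`, `passRate`, and `overfitGap` are hypothetical names, and the only assumption is that each result carries the `passed` and `heldOut` flags from `GatedResult`. The sample data is made up for illustration.

```ts
// Hypothetical shape: just the two GatedResult fields this calculation needs.
interface ScoredCase {
  passed: boolean;
  heldOut: boolean; // true = this case was never debugged against
}

// Hypothetical report type for the comparison.
interface OverfitReport {
  heldOutPassRate: number;
  debuggedPassRate: number;
  gap: number; // debugged minus held-out; a positive gap suggests overfitting
}

// Fraction of cases that passed. Returns NaN for an empty set so that
// "no data" stays visible instead of silently reading as 0% or 100%.
function passRate(cases: ScoredCase[]): number {
  if (cases.length === 0) return NaN;
  return cases.filter((c) => c.passed).length / cases.length;
}

// Split results by the heldOut flag and compare the two pass rates.
function overfitGap(results: ScoredCase[]): OverfitReport {
  const heldOutPassRate = passRate(results.filter((r) => r.heldOut));
  const debuggedPassRate = passRate(results.filter((r) => !r.heldOut));
  return {
    heldOutPassRate,
    debuggedPassRate,
    gap: debuggedPassRate - heldOutPassRate,
  };
}

// Illustrative data only: four debugged cases that all pass,
// two held-out cases of which one passes.
const report = overfitGap([
  { passed: true, heldOut: false },
  { passed: true, heldOut: false },
  { passed: true, heldOut: false },
  { passed: true, heldOut: false },
  { passed: true, heldOut: true },
  { passed: false, heldOut: true },
]);

console.log(report); // debuggedPassRate 1, heldOutPassRate 0.5, gap 0.5
```

A gap near zero says the debugged cases still behave like a measurement. A gap that keeps widening release over release is a reasonable cue to rotate fresh production traces into the held-out set before trusting the dashboard again.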