Goodhart's Law Comes for Your Agent Evals: Why Your Green Dashboard Stops Meaning Anything

A developer warns that Goodhart's Law undermines agent evaluation suites when pass/fail scores become targets, causing teams to optimize for test performance rather than real-world quality. The post advocates linking evaluation scores with execution traces, using tools like agent-eval and AgentLens, so that every gate decision is auditable and the measure remains meaningful.

There is a specific moment in the life of every agent team that nobody puts on the roadmap. You build an eval suite. It catches real bugs. You wire it into CI as a release gate. The dashboard goes green. And then, somewhere over the next three months, the green stops meaning anything, while everyone keeps treating it like it does.

This is Goodhart's Law, and it is coming for your agent evals whether you plan for it or not.

"When a measure becomes a target, it ceases to be a good measure."

The day your eval suite becomes the thing that decides what ships, it stops being a neutral measurement of quality and becomes a target your team optimizes toward. That is not a hypothetical risk. It is the default trajectory, and most teams only notice after a "fully passing" release lands in production and quietly makes everything worse.

The decay is boring, which is exactly why it's dangerous. Here's the usual sequence:

The endpoint is an agent with a 98% pass rate that is measurably worse for users, because the score is now measuring how well the agent satisfies the test, not how well it does the work. The map replaced the territory.

The cleanest signal that Goodhart has arrived is this: a release passes the gate, and nobody on the team can explain why a specific borderline case passed. It just did. The score is a number with no narrative behind it.

That's the real problem. A pass/fail bit is not a measurement you can reason about. It's a measurement you can only trust or distrust.
And trust, unaudited, always decays toward green.

This is exactly the seam where the two tools I lean on have to work as one unit, not as separate dashboards.

agent-eval scores and gates the output. It runs the deterministic checks, the model-as-judge rubrics, and the drift and hallucination signals, and it returns a verdict.

AgentLens captures the trace of how the agent got there: every model call and tool step, the resolved inputs after templating, not the raw template, and the raw outputs before any post-processing.

Neither half is sufficient alone, and that's the entire point. A bare eval score is a target waiting to be gamed. A bare trace is forensic data with no verdict attached. You need agent-eval's score anchored to AgentLens's trace so that every gate decision carries a "show me why" attached to it. When a borderline case flips, you don't argue about whether the eval is too strict. You open the trace, see the resolved prompt and the exact tool output, and find out whether the agent actually reasoned correctly or got lucky on a phrasing.

That linkage is what keeps the measure honest. The eval tells you the gate flipped; the trace tells you whether the flip was earned.

The anti-pattern is a gate that returns a boolean and nothing else:

// Goodhart bait: a verdict with no evidence behind it.
async function gate(testCase: TestCase): Promise