# Green Evals, Wrong Answers

> Source: <https://pub.towardsai.net/green-evals-wrong-answers-3b5c65606442?source=rss----98111c9905da---4>
> Published: 2026-06-15 07:15:05+00:00

If you ship an LLM product, you have probably had this week: the eval dashboard is green, and a user just sent a screenshot of a wrong answer. When we audited the eval suite for our wealth management assistant, the explanation was sitting in plain sight. The production suite carried roughly 62 assertions about which tool the pipeline selected, and one assertion about the answer the user actually read.

That ratio was an accident of incentives, and I suspect it is common. Routing is easy to assert: the pipeline either called `get_household_accounts` or it didn't. Answers are awkward: they vary run to run, they are long, and asserting on them feels brittle. So teams measure what is easy, the score goes green, and the product stays unproven. This article covers what it took to close that gap on a production assistant for financial advisors: answer-level evals with a defensible pass criterion, a merge gate that survives nondeterminism without going soft, coverage mechanisms for the changes no test exercises, and the operational surprise nobody warns you about, which is that your evals and your users share a quota.

Within days of adding a dedicated answer suite, it caught a bug that tool-level scoring could never see. Our task-creation flow normalized “urgent” to a priority of “High” when fields were collected over multiple turns, but passed the raw string through when the model filled the field in a single turn. Every route assertion passed. Every tool assertion passed. The confirmation the user saw read “Priority: urgent” on one path and “Priority: High” on the other, and only an assertion on the rendered reply caught the inconsistency. It was the third answer-level bug that week; the route and tool layers caught none of them.
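To make the shape of that bug concrete, here is a minimal sketch of a reply-level check that would catch it. Every name in it (`normalize_priority`, `render_confirmation`, the alias table, the two path functions) is a hypothetical stand-in rather than the production code; the only point is that the assertion targets the string the user reads.

```python
# Hypothetical reconstruction of the bug: two code paths, one of which
# skips normalization. None of these names come from the real pipeline.
PRIORITY_ALIASES = {"urgent": "High", "high": "High", "normal": "Medium", "low": "Low"}


def normalize_priority(raw: str) -> str:
    """Map a free-text priority onto a canonical label; unknown values pass through."""
    return PRIORITY_ALIASES.get(raw.strip().lower(), raw)


def render_confirmation(priority: str) -> str:
    return f"Task created. Priority: {priority}"


def multi_turn_path(raw: str) -> str:
    # Fields collected over several turns go through normalization.
    return render_confirmation(normalize_priority(raw))


def single_turn_path_buggy(raw: str) -> str:
    # The model filled the field in one turn; the raw string leaks through.
    return render_confirmation(raw)


def answer_check(reply: str) -> bool:
    """Answer-level assertion: the reply must show the canonical label."""
    return "Priority: High" in reply and "Priority: urgent" not in reply


print(answer_check(multi_turn_path("urgent")))         # True
print(answer_check(single_turn_path_buggy("urgent")))  # False
```

Both paths would satisfy a route assertion and a tool assertion, since the same tool is called with a plausible argument either way. Only the check on the rendered reply distinguishes them.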

The pass criterion took more design than the suite itself. We landed on three layers. Excluded substrings are always hard failures: these are safety assertions, like a different client’s household leaking into an answer, and softening them defeats their purpose. Required substrings carry a per-case threshold for the fraction of key facts that must appear, defaulting to all of them. Above both sits an LLM judge that grades the full answer against a reference for faithfulness to the source data and match to the expected content, on a 1-to-5 scale with a per-case passing score.
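The three layers can be sketched roughly as follows. This is an illustration under assumptions, not our implementation: the `AnswerCase` fields, the case-insensitive matching, the default passing score of 4, and the `judge` callable (a stand-in for the LLM judge call) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class AnswerCase:
    required: List[str]                                  # key facts expected in the answer
    excluded: List[str] = field(default_factory=list)    # strings that must never appear
    required_fraction: float = 1.0                       # default: every key fact must appear
    judge_pass_score: int = 4                            # assumed per-case threshold on the 1-to-5 scale


def evaluate_answer(
    answer: str, case: AnswerCase, judge: Callable[[str], int]
) -> Tuple[bool, str]:
    text = answer.lower()  # case-insensitive matching is an assumption

    # Layer 1: excluded substrings are hard failures and are never softened.
    for s in case.excluded:
        if s.lower() in text:
            return False, f"excluded substring present: {s!r}"

    # Layer 2: the fraction of required facts must meet the per-case threshold.
    if case.required:
        hits = sum(1 for s in case.required if s.lower() in text)
        fraction = hits / len(case.required)
        if fraction < case.required_fraction:
            return False, f"required coverage {fraction:.2f} < {case.required_fraction:.2f}"

    # Layer 3: the judge grades the full answer on a 1-to-5 scale.
    score = judge(answer)
    if score < case.judge_pass_score:
        return False, f"judge score {score} < {case.judge_pass_score}"

    return True, "pass"
```

The ordering is deliberate in this sketch: the cheap deterministic checks run first, so a safety failure never depends on, or pays for, a judge call.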

We rejected embedding similarity and ROUGE as gates, and I would push back on any team using them that way. Similarity measures closeness to a reference phrasing. A tersely worded correct answer scores low; a fluent hallucination scores high. Keyword coverage and a judge track correctness. Similarity tracks prose style, which is the one thing you should let the model vary.
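A toy example shows the failure mode. The function below is a crude ROUGE-1-style unigram overlap, not the real ROUGE implementation, and the three sentences are invented for illustration.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Crude ROUGE-1-style score: F1 over whitespace-split, lowercased tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example data.
reference = "The household has three accounts with a combined balance of $1,250,000 as of March 31."
terse_correct = "3 accounts, $1,250,000 total."
fluent_wrong = "The household has three accounts with a combined balance of $2,150,000 as of March 31."
key_fact = "$1,250,000"

# The wrong answer differs from the reference by one token, so overlap rewards it.
print(unigram_f1(fluent_wrong, reference) > unigram_f1(terse_correct, reference))  # True

# A required-substring check ranks the two answers the other way round.
print(key_fact in terse_correct, key_fact in fluent_wrong)  # True False
```

An embedding model would not behave identically to token overlap, but the underlying problem is the same: both reward resemblance to the reference wording, and a single wrong digit barely changes resemblance.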

The calibration lesson came fast: assert facts, not phrasing. We shipped a check that a follow-up answer repeats the account name, and it flapped because the render legitimately varies. The tool call had already proven the right account stayed in focus across turns. If a deterministic assertion proves the behavior, locking the wording on top of it buys flakiness and nothing else.

The harder problem is making answer evals block merges. Live model calls are nondeterministic, and our gate started blocking its own pull requests on a case that failed two out of two runs inside the full suite and then passed six out of six in isolation, with routing confidence between 0.72 and 0.85 each time. That is a case sitting on a decision boundary, and every LLM product has them.

The two policies teams reach for first are both wrong, for math reasons rather than taste. “Just re-run it” trains engineers to re-run past real regressions; there is no record, no quorum, and a human under deadline resolves every ambiguity in favor of merging. Automatic full-suite retries that pass if any run passes are worse: a real regression that fails 60% of the time slips through three attempts about 78% of the time. That policy tunes the gate to admit exactly the breaks it exists to stop.
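The 78% figure follows from treating each run as an independent draw and counting three attempts in total; both are assumptions of this sketch. Under a pass-if-any-run-passes policy, the regression is only stopped when every attempt fails.

```python
def any_pass_probability(pass_rate: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent runs passes."""
    return 1.0 - (1.0 - pass_rate) ** attempts


# A regression that fails 60% of the time passes a single run 40% of the time.
slip = any_pass_probability(pass_rate=0.4, attempts=3)
print(f"{slip:.3f}")  # 0.784
```

Adding attempts only makes it worse: with a fourth run the same regression slips through about 87% of the time.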

What separates a flake from a regression is whether the failure reproduces. Our gate scores each run as a delta against a frozen baseline of known failures, so only new failures matter. When one appears, the gate re-runs just the failing conversations (the whole conversation, because multi-turn context drives the behavior) five times in parallel and adjudicates on the reproduce rate. Four or five passes out of five is a flake, forgiven. Zero or one is a confirmed break, blocked. Two or three passes is the honest verdict that the case is unstable, which blocks with an instruction to either harden the case or move it into the baseline.
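The adjudication rule is small enough to sketch. The thresholds below are the ones described above; everything else (the function names, the `rerun` callable that replays a whole conversation and returns pass or fail, the use of a thread pool for the parallel re-runs) is a hypothetical illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Set

RERUNS = 5  # the quorum size described in the text


def adjudicate(passes: int) -> str:
    """Classify a new failure by how many of the five re-runs passed."""
    if passes >= 4:
        return "flake"        # forgiven
    if passes <= 1:
        return "regression"   # confirmed break, blocks
    return "unstable"         # blocks: harden the case or baseline it


def gate(
    current_failures: Set[str],
    baseline: Set[str],
    rerun: Callable[[str], bool],
) -> Dict[str, str]:
    """Re-run only the new failures, five times each, and return a verdict per conversation."""
    verdicts = {}
    for conv_id in sorted(current_failures - baseline):  # delta against the frozen baseline
        with ThreadPoolExecutor(max_workers=RERUNS) as pool:
            results = list(pool.map(rerun, [conv_id] * RERUNS))
        verdicts[conv_id] = adjudicate(sum(results))
    return verdicts


def blocks_merge(verdicts: Dict[str, str]) -> bool:
    return any(v != "flake" for v in verdicts.values())
```

Note that `unstable` blocks just as `regression` does. The difference is in the instruction attached to the verdict, not in whether the merge proceeds.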

The baseline and the retry quorum do different jobs, and conflating them is how gates rot. The baseline holds known failures you have decided to live with. The quorum absorbs sampling noise on cases that are correct most of the time. Our boundary case passed roughly 80% of runs, so it stays out of the baseline and the quorum absorbs it on most re-runs. The unsolved part is the 70-to-90% pass-rate band: no retry count makes a case there crisp, and the gate will occasionally block a decent one. In our experience that forced conversation is the feature, because the alternative is a case that silently flaps until everyone stops trusting red.
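A quick binomial calculation shows why that band stays uncomfortable. It assumes each of the five re-runs is an independent draw with a fixed pass rate, which is a simplification; the verdict thresholds are the ones from the gate described above.

```python
from math import comb
from typing import Dict

RERUNS = 5


def verdict_probabilities(pass_rate: float) -> Dict[str, float]:
    """Probability of each gate verdict for a case with the given true pass rate."""
    pmf = [
        comb(RERUNS, k) * pass_rate**k * (1 - pass_rate) ** (RERUNS - k)
        for k in range(RERUNS + 1)
    ]
    return {
        "flake": sum(pmf[4:]),        # 4 or 5 passes: forgiven
        "unstable": sum(pmf[2:4]),    # 2 or 3 passes: blocked as unstable
        "regression": sum(pmf[:2]),   # 0 or 1 passes: blocked as a break
    }


for rate in (0.7, 0.8, 0.9):
    forgiven = verdict_probabilities(rate)["flake"]
    print(f"pass rate {rate:.0%}: forgiven {forgiven:.2f}, blocked {1 - forgiven:.2f}")
# pass rate 70%: forgiven 0.53, blocked 0.47
# pass rate 80%: forgiven 0.74, blocked 0.26
# pass rate 90%: forgiven 0.92, blocked 0.08
```

Under these assumptions, a case that is right 80% of the time still gets blocked on roughly one adjudication in four, which is the forced conversation in numbers.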

One change made the coverage problem impossible to ignore. A pull request modified tool descriptions and data fixtures, both of which change model behavior. The tests described in the PR were never committed. The eval floor passed anyway, because no existing case exercised the changed surface. Every check we had was green, and the change was covered by nothing.

The diagnosis is that an LLM application has several independent coverage surfaces, and a change can fall through the gaps between them. Line coverage tells you whether new Python executes under a test, and says nothing about a YAML fixture or a prompt. Eval coverage tells you whether a tool or route has a case exercising it, and says nothing about whether that case asserts anything meaningful. Assertion quality tells you whether a test would notice a broken implementation, and nothing about surfaces no test touches at all.

We now enforce each surface separately. A diff-coverage gate fails a PR only on new uncovered lines, so there is no tax on legacy debt. Structural guards assert that every routable tool and route has an eval case and that read tools assert answer content; their first audit found a write tool with no eval case at all and eleven read tools whose evals never checked an answer. Those guards run as ratchets: existing debt is baselined in an explicit allowlist, new entries are blocked, and when a listed item becomes covered the guard errors until it is removed, so the list only shrinks. Nightly mutation testing covers assertion quality on the deterministic safety modules; its first run showed that our post-generation verification logic executed under tests that would not have noticed it breaking. And a deterministic advisory flags any PR that touches behavior surfaces without touching tests or eval fixtures, which is precisely the shape of the change that started this.
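The ratchet behavior is the least obvious of these, so here is a minimal sketch of it. The function name, the error wording, and the tool names other than `get_household_accounts` are hypothetical; how the sets are collected from a real tool registry and eval fixtures is left out.

```python
from typing import List, Set


def check_ratchet(all_tools: Set[str], covered: Set[str], allowlist: Set[str]) -> List[str]:
    """Return guard errors; an empty list means the guard passes."""
    errors = []
    uncovered = all_tools - covered

    # New debt is blocked: an uncovered tool must already be in the allowlist.
    for tool in sorted(uncovered - allowlist):
        errors.append(f"{tool}: no eval case and not in the allowlist")

    # Paid-off debt must be struck from the list, so the list only shrinks.
    for tool in sorted(allowlist & covered):
        errors.append(f"{tool}: now covered; remove it from the allowlist")

    return errors


tools = {"get_household_accounts", "create_task", "get_positions"}

# Baselined debt is tolerated.
print(check_ratchet(tools, covered={"get_household_accounts"},
                    allowlist={"create_task", "get_positions"}))  # []

# Someone adds an eval case for create_task but leaves it in the allowlist.
print(check_ratchet(tools, covered={"get_household_accounts", "create_task"},
                    allowlist={"create_task", "get_positions"}))
# ['create_task: now covered; remove it from the allowlist']
```

The second error looks pedantic, but it is what keeps the allowlist from becoming a permanent exemption: an entry cannot linger after the work is done and quietly excuse a later regression in coverage.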

The principle under all four mechanisms is that missing coverage should fail structurally. That pull request proved that author discipline and reviewer vigilance both fail, on the same change, on the same day.

The operational lessons were the ones no eval paper mentions. Live suites cost real time and money: ours runs about ten minutes and roughly 580,000 tokens per invocation. We made the cadence deliberate, with a per-PR gate scoped to behavior-relevant paths and an hourly job that first checks whether anything merged in the past hour and skips the expensive run otherwise.
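The skip check in the hourly job amounts to a few lines. In this sketch the merge timestamps are passed in directly; in practice they would come from the version-control host or CI system, which is not shown. The function names and the example timestamps are hypothetical; the token figure is the one quoted above.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable

TOKENS_PER_RUN = 580_000  # approximate figure quoted above


def should_run_live_suite(
    merge_times: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(hours=1),
) -> bool:
    """Run the expensive suite only if something merged inside the window."""
    cutoff = now - window
    return any(cutoff <= t <= now for t in merge_times)


def worst_case_daily_tokens(runs_per_day: int = 24) -> int:
    """Upper bound on daily spend if every scheduled slot actually runs."""
    return runs_per_day * TOKENS_PER_RUN


now = datetime(2026, 6, 15, 12, 0, tzinfo=timezone.utc)
quiet = [datetime(2026, 6, 15, 9, 30, tzinfo=timezone.utc)]
busy = quiet + [datetime(2026, 6, 15, 11, 45, tzinfo=timezone.utc)]

print(should_run_live_suite(quiet, now))  # False
print(should_run_live_suite(busy, now))   # True
print(worst_case_daily_tokens())          # 13920000
```

The worst case of roughly 14 million tokens a day is what the skip check protects against on days when nothing merges.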

Then production users started seeing “The assistant service is rate-limited right now.” Our eval jobs and our product hit the same hosted model deployment, and several concurrent eval runs were consuming the quota that serves real advisors, while the eval runs themselves starved: calls hanging 42 seconds, six of 91 completing. We cancelled the runs, disabled the per-PR live gate, and kept the hourly job until the quota is split. If your evals call the deployment your users depend on, your CI load is production load, and a busy merge day is a denial-of-service you ran on yourself.

The path filter bit us separately. The per-PR gate triggered on prompts, pipeline stages, and fixtures, but not on the services directory holding the verification module, the one component that can hard-block an answer for an ungrounded number or a compliance violation. A regex bug in that file shipped without the gate ever running. If a path can change what a user sees, it belongs in the trigger list, and coarse directory globs age better than curated file lists.
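The gap is easy to reproduce in miniature. The directory names and patterns below are hypothetical, and a real CI system applies its own path-filter syntax; the sketch uses Python's `fnmatch` only to show how a trigger list misses a directory it never names.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List

# Hypothetical trigger lists; the real repository layout is not shown in this article.
ORIGINAL_TRIGGERS = ["prompts/*", "pipeline/stages/*", "fixtures/*"]
WIDENED_TRIGGERS = ORIGINAL_TRIGGERS + ["services/*"]


def gate_should_run(changed_files: Iterable[str], patterns: List[str]) -> bool:
    """True if any changed path matches any trigger pattern."""
    # fnmatch's "*" also matches "/", so "services/*" covers nested files too.
    return any(fnmatchcase(path, pat) for path in changed_files for pat in patterns)


changed = ["services/verification.py"]
print(gate_should_run(changed, ORIGINAL_TRIGGERS))  # False: the gate never runs
print(gate_should_run(changed, WIDENED_TRIGGERS))   # True

# A coarse glob keeps matching when files are added or moved within the directory.
print(gate_should_run(["services/checks/grounding.py"], WIDENED_TRIGGERS))  # True
```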

Hand-written utterances get you the cases you already imagined. Every meaningful improvement to our suites came from importing someone else’s view of the product: QA’s release test plans, which converted directly into eval conversations and revealed an entire domain that worked but was proven nowhere; about 1,800 rows of production traffic from the previous product generation, where the thumbs-down rows are a free catalog of known-bad answers; question libraries from colleagues, kept as separately named suites so provenance survives; and flagged turns from our own interactive use, where every confirmed bug becomes an eval case before its fix merges.
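As one illustration of the traffic channel, the sketch below turns thumbs-down rows from a feedback export into candidate eval cases under a named suite. The CSV columns, the suite name, the output fields, and the two sample rows are all invented; a real export would look different, and each candidate still needs a human to write its expected facts.

```python
import csv
import io
from typing import Dict, List

# Invented sample rows standing in for a production feedback export.
SAMPLE = '''question,answer,feedback
"What is the Smith household balance?","It is $1,250,000.",up
"Which accounts are in the Smith household?","I could not find that household.",down
'''


def cases_from_feedback(csv_text: str, suite: str) -> List[Dict[str, object]]:
    """Keep only thumbs-down rows and tag each with its suite so provenance survives."""
    cases = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        if row["feedback"].strip().lower() != "down":
            continue
        cases.append({
            "suite": suite,
            "id": f"{suite}-{i}",
            "utterance": row["question"],
            "known_bad_answer": row["answer"],  # what the old system said
            "needs_reference": True,            # expected facts still to be written
        })
    return cases


for case in cases_from_feedback(SAMPLE, suite="prod-thumbs-down"):
    print(case["id"], "|", case["utterance"])
# prod-thumbs-down-1 | Which accounts are in the Smith household?
```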

That last channel matters more than its size suggests. Corrections (“no, I meant Feb 1”), adversarial follow-ups, and multi-turn drill-downs all surfaced in live use first. Batch suites are a record of failures someone already hit, and treating them as an accumulating ledger rather than a designed artifact is what made them worth gating on.

For a consumer chatbot, much of this would be over-engineering. For an assistant that answers questions about clients’ money, the gate is the difference between “we believe it works” and “we can show, on every merge, against a frozen record of every failure we have ever seen, that it still works.” The second sentence is the one a compliance officer can act on. The parts we have not solved (the unstable band, the quota split, a judge threshold that is a calibrated guess) are written into the gate’s output rather than hidden, because an eval system that overstates its own certainty has the same defect as the model it is checking.

*The author works on AI and data infrastructure for wealth management at Advisor360°, where these gates run on every pull request of an advisor-facing assistant.*

[Green Evals, Wrong Answers](https://pub.towardsai.net/green-evals-wrong-answers-3b5c65606442) was originally published in [Towards AI](https://pub.towardsai.net) on Medium.