A regression shipped green last month. The eval suite ran in CI, scored 0.94, the gate passed, we merged. Two days later support flagged that the summariser had started dropping the final line of multi-part answers. The eval should have caught it. The eval had not actually run on the new behaviour. It scored a cached result from three commits earlier, and the cache key was wrong.
This is the eval-infra bug nobody warns you about, because it only shows up after you optimise for speed. The eval itself was fine. The caching around it lied.
Our eval suite makes model calls, and model calls are slow and cost money. On a 600-case suite with an LLM-judge pass, a full run was about nine minutes and a few dollars. Running that on every push, including doc-only commits, was wasteful, so we cached: if nothing that affects a case's result changed, reuse the previous score.
That is the right instinct. The bug was in the definition of "nothing that affects the result changed."
Our key was a hash of two things: the test input (the prompt variables for that case) and the prompt template. If both matched a prior run, we served the cached score.
Here is what the key did not include: the model snapshot. We pinned the model by an alias in config, and when we bumped that alias to a new dated snapshot, the prompt template and the test inputs were byte-for-byte identical. Same key. The cache served scores generated by the old model for a suite running against the new one. The new model had the regression. The cache had the old model's clean scores. Green.
The rule a cache key has to obey is simple to say and easy to get wrong: the key must include every input that can change the output. For an eval case that is at least the test input, the prompt template, the model identity (the dated snapshot, not the alias), the judge model identity if you grade with one, and the eval config that controls scoring. Miss any one and a change to that input silently reuses a stale result.
This is the part you can lift. The cache key is a hash over the full tuple of result-affecting inputs, and the model identity is resolved to its concrete snapshot before hashing, not left as the floating alias.
import hashlib, json
def eval_cache_key(case, prompt_template, model_snapshot, judge_snapshot, eval_config):
payload = {
"input": case["vars"],
"prompt": prompt_template,
"model": model_snapshot,
"judge": judge_snapshot,
"eval_config": eval_config, # thresholds, rubric, metric set
"schema": 2, # bump to invalidate everything on purpose
}
blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
return hashlib.sha256(blob.encode()).hexdigest()
Two things that matter more than they look:
sort_keys=True
so the hash is stable regardless of dict ordering. Without it the "same" inputs produce different keys and you cache nothing, which is the opposite failure but still a failure.schema
integer. When you change the cache logic itself, or you just want to force a clean rerun, bump it. It is a manual kill switch for the whole cache that does not require deleting files.And resolve the alias to the snapshot at the top of the run, once:
model = "gpt-4o"
model_snapshot = resolve_snapshot("gpt-4o") # -> "gpt-4o-2024-08-06"
The second half of the fix is what happens on a cache miss or an ambiguous state. Ours failed open: if anything about the cache lookup threw, we treated it as "no entry, but also do not block," and in one code path that quietly meant "pass." A cache is a performance optimisation. It must never be able to produce a green that a real run would not. On any miss, any error, any version mismatch, the correct behaviour is run the eval for real. Slower is the acceptable failure. Green-by-accident is not.
We also added a cheap guard: the cache stores which model snapshot produced each score, and the runner asserts that the stored snapshot matches the current one before trusting any cached entry. If they differ, the entry is ignored and the case re-runs. That single assertion would have caught the original bug on its own.
The embarrassing number: the regression was live for nine days. Not because it was subtle in production, support caught it fast, but because when we went to the eval to confirm, the eval still said 0.94, so we spent two of those days looking everywhere except the cache. A gate that lies costs you more than a gate you do not have, because you trust it while it points you the wrong way.
When an eval passes something production then breaks, before you touch the model or the rubric: