{"slug": "the-stale-eval-fixture-that-passed-a-broken-model", "title": "The stale eval fixture that passed a broken model", "summary": "An engineer at a company discovered that their eval suite's caching mechanism was using an incorrect cache key that omitted the model snapshot, causing stale cached scores to pass a broken model. The bug allowed a regression to ship because the cache served results from an older model version. The fix involves including the resolved model snapshot in the cache key and ensuring the cache never produces a false positive.", "body_md": "A regression shipped green last month. The eval suite ran in CI, scored 0.94, the gate passed, we merged. Two days later support flagged that the summariser had started dropping the final line of multi-part answers. The eval should have caught it. The eval had not actually run on the new behaviour. It scored a cached result from three commits earlier, and the cache key was wrong.\n\nThis is the eval-infra bug nobody warns you about, because it only shows up after you optimise for speed. The eval itself was fine. The caching around it lied.\n\nOur eval suite makes model calls, and model calls are slow and cost money. On a 600-case suite with an LLM-judge pass, a full run was about nine minutes and a few dollars. Running that on every push, including doc-only commits, was wasteful, so we cached: if nothing that affects a case's result changed, reuse the previous score.\n\nThat is the right instinct. The bug was in the definition of \"nothing that affects the result changed.\"\n\nOur key was a hash of two things: the test input (the prompt variables for that case) and the prompt template. If both matched a prior run, we served the cached score.\n\nHere is what the key did not include: the model snapshot. We pinned the model by an alias in config, and when we bumped that alias to a new dated snapshot, the prompt template and the test inputs were byte-for-byte identical. Same key. 
The cache served scores generated by the old model for a suite running against the new one. The new model had the regression. The cache had the old model's clean scores. Green.\n\nThe rule a cache key has to obey is simple to say and easy to get wrong: the key must include every input that can change the output. For an eval case that is at least the test input, the prompt template, the model identity (the dated snapshot, not the alias), the judge model identity if you grade with one, and the eval config that controls scoring. Miss any one and a change to that input silently reuses a stale result.\n\nThis is the part you can lift. The cache key is a hash over the full tuple of result-affecting inputs, and the model identity is resolved to its concrete snapshot before hashing, not left as the floating alias.\n\n```python\nimport hashlib, json\n\ndef eval_cache_key(case, prompt_template, model_snapshot, judge_snapshot, eval_config):\n    # model_snapshot / judge_snapshot are the resolved dated ids\n    # (e.g. \"gpt-4o-2024-08-06\"), NEVER the moving alias (\"gpt-4o\").\n    payload = {\n        \"input\": case[\"vars\"],\n        \"prompt\": prompt_template,\n        \"model\": model_snapshot,\n        \"judge\": judge_snapshot,\n        \"eval_config\": eval_config,   # thresholds, rubric, metric set\n        \"schema\": 2,                  # bump to invalidate everything on purpose\n    }\n    blob = json.dumps(payload, sort_keys=True, separators=(\",\", \":\"))\n    return hashlib.sha256(blob.encode()).hexdigest()\n```\n\nTwo things that matter more than they look:\n\n- `sort_keys=True`, so the hash is stable regardless of dict ordering. Without it the \"same\" inputs can produce different keys and you cache nothing, which is the opposite failure but still a failure.\n- The `schema` integer. When you change the cache logic itself, or you just want to force a clean rerun, bump it. 
It is a manual kill switch for the whole cache that does not require deleting files.\n\nAnd resolve the alias to the snapshot at the top of the run, once:\n\n```python\n# Wrong: model id is the alias, so a provider-side snapshot bump is invisible.\nmodel = \"gpt-4o\"\n\n# Right: resolve to the concrete dated snapshot and key on THAT.\nmodel_snapshot = resolve_snapshot(\"gpt-4o\")  # -> \"gpt-4o-2024-08-06\"\n```\n\nThe second half of the fix is what happens on a cache miss or an ambiguous state. Ours failed open: if anything about the cache lookup threw, we treated it as \"no entry, but also do not block,\" and in one code path that quietly meant \"pass.\" A cache is a performance optimisation. It must never be able to produce a green that a real run would not. On any miss, any error, any version mismatch, the correct behaviour is to run the eval for real. Slower is the acceptable failure. Green-by-accident is not.\n\nWe also added a cheap guard: the cache stores which model snapshot produced each score, and the runner asserts that the stored snapshot matches the current one before trusting any cached entry. If they differ, the entry is ignored and the case re-runs. That single assertion would have caught the original bug on its own.\n\nThe embarrassing number: the regression was live for nine days. Not because it was subtle in production (support caught it fast), but because when we went to the eval to confirm, the eval still said 0.94, so we spent two of those days looking everywhere except the cache. 
A gate that lies costs you more than a gate you do not have, because you trust it while it points you the wrong way.\n\nWhen an eval passes something production then breaks, before you touch the model or the rubric:", "url": "https://wpnews.pro/news/the-stale-eval-fixture-that-passed-a-broken-model", "canonical_source": "https://dev.to/ethanwritesai/the-stale-eval-fixture-that-passed-a-broken-model-5e21", "published_at": "2026-06-29 17:13:40+00:00", "updated_at": "2026-06-29 17:49:15.838928+00:00", "lang": "en", "topics": ["machine-learning", "mlops", "developer-tools"], "entities": ["OpenAI", "GPT-4"], "alternates": {"html": "https://wpnews.pro/news/the-stale-eval-fixture-that-passed-a-broken-model", "markdown": "https://wpnews.pro/news/the-stale-eval-fixture-that-passed-a-broken-model.md", "text": "https://wpnews.pro/news/the-stale-eval-fixture-that-passed-a-broken-model.txt", "jsonld": "https://wpnews.pro/news/the-stale-eval-fixture-that-passed-a-broken-model.jsonld"}}