The stale eval fixture that passed a broken model

Source: dev.to · Published: 2026-06-29T17:13Z · Topic: machine-learning

An engineer at a company discovered that their eval suite's caching mechanism was using an incorrect cache key that omitted the model snapshot, causing stale cached scores to pass a broken model. The bug allowed a regression to ship because the cache served results from an older model version. The fix involves including the resolved model snapshot in the cache key and ensuring the cache never produces a false positive.

A regression shipped green last month. The eval suite ran in CI, scored 0.94, the gate passed, we merged. Two days later support flagged that the summariser had started dropping the final line of multi-part answers. The eval should have caught it. The eval had not actually run on the new behaviour. It scored a cached result from three commits earlier, and the cache key was wrong.

This is the eval-infra bug nobody warns you about, because it only shows up after you optimise for speed. The eval itself was fine. The caching around it lied.

Our eval suite makes model calls, and model calls are slow and cost money. On a 600-case suite with an LLM-judge pass, a full run was about nine minutes and a few dollars. Running that on every push, including doc-only commits, was wasteful, so we cached: if nothing that affects a case's result changed, reuse the previous score.

That is the right instinct. The bug was in the definition of "nothing that affects the result changed."

Our key was a hash of two things: the test input (the prompt variables for that case) and the prompt template. If both matched a prior run, we served the cached score.

Here is what the key did not include: the model snapshot. We pinned the model by an alias in config, and when we bumped that alias to a new dated snapshot, the prompt template and the test inputs were byte-for-byte identical. Same key. The cache served scores generated by the old model for a suite running against the new one. The new model had the regression. The cache had the old model's clean scores. Green.
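To make the collision concrete, here is a minimal reconstruction of the flawed key. It is a sketch, not the original code: the function name `flawed_cache_key`, the shape of `config`, and the snapshot names are all hypothetical.

```python
import hashlib
import json

def flawed_cache_key(case, prompt_template, config):
    # config["model"] is deliberately NOT hashed, mirroring the bug described above.
    payload = {"input": case["vars"], "prompt": prompt_template}
    blob = json.dumps(payload, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

case = {"vars": {"question": "Summarise the three-part answer."}}
template = "Summarise: {question}"

# Hypothetical snapshot names standing in for "before" and "after" the alias bump.
config_old = {"model": "summariser-2024-01-01"}
config_new = {"model": "summariser-2024-06-01"}

key_old = flawed_cache_key(case, template, config_old)
key_new = flawed_cache_key(case, template, config_new)
collision = key_old == key_new  # True: the new model is scored with the old model's result
```

Nothing in the inputs to the hash moved when the model did, so the lookup had no way to notice.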

The rule a cache key has to obey is simple to say and easy to get wrong: the key must include every input that can change the output. For an eval case that is at least the test input, the prompt template, the model identity (the dated snapshot, not the alias), the judge model identity if you grade with one, and the eval config that controls scoring. Miss any one and a change to that input silently reuses a stale result.

This is the part you can lift. The cache key is a hash over the full tuple of result-affecting inputs, and the model identity is resolved to its concrete snapshot before hashing, not left as the floating alias.

import hashlib, json

def eval_cache_key(case, prompt_template, model_snapshot, judge_snapshot, eval_config):
    payload = {
        "input": case["vars"],
        "prompt": prompt_template,
        "model": model_snapshot,
        "judge": judge_snapshot,
        "eval_config": eval_config,   # thresholds, rubric, metric set
        "schema": 2,                  # bump to invalidate everything on purpose
    }
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode()).hexdigest()

Two things that matter more than they look:

sort_keys=True, so the hash is stable regardless of dict ordering. Without it the "same" inputs produce different keys and you cache nothing, which is the opposite failure but still a failure.

The schema integer. When you change the cache logic itself, or you just want to force a clean rerun, bump it. It is a manual kill switch for the whole cache that does not require deleting files.

And resolve the alias to the snapshot at the top of the run, once:

model = "gpt-4o"  # floating alias from config; never hash this directly
model_snapshot = resolve_snapshot(model)  # -> "gpt-4o-2024-08-06"
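The body of `resolve_snapshot` is not shown above, so the following is only a stand-in under stated assumptions: a hypothetical `ALIAS_TO_SNAPSHOT` table plays the role of whatever your provider or config reports as the concrete snapshot, and `RunContext` and `start_run` are illustrative names. The point it sketches is resolving once and carrying the result through the run.

```python
from dataclasses import dataclass

# Hypothetical alias table; a real run would ask the provider or read pinned config.
ALIAS_TO_SNAPSHOT = {"gpt-4o": "gpt-4o-2024-08-06"}

def resolve_snapshot(alias: str) -> str:
    # Fail loudly on an unknown alias rather than silently hashing the alias itself.
    try:
        return ALIAS_TO_SNAPSHOT[alias]
    except KeyError:
        raise ValueError(f"cannot resolve model alias {alias!r} to a snapshot") from None

@dataclass(frozen=True)
class RunContext:
    model_alias: str
    model_snapshot: str

def start_run(model_alias: str) -> RunContext:
    # Resolve exactly once so every case in the run hashes the same snapshot.
    return RunContext(model_alias, resolve_snapshot(model_alias))

ctx = start_run("gpt-4o")
```

Freezing the context means no later code path can swap the alias back in by accident.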

The second half of the fix is what happens on a cache miss or an ambiguous state. Ours failed open: if anything about the cache lookup threw, we treated it as "no entry, but also do not block," and in one code path that quietly meant "pass." A cache is a performance optimisation. It must never be able to produce a green that a real run would not. On any miss, any error, any version mismatch, the correct behaviour is to run the eval for real. Slower is the acceptable failure. Green-by-accident is not.
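A fail-closed lookup can be sketched like this. The names are assumptions for illustration: `score_case` is hypothetical, `cache` is any dict-like store, and `run_eval` is a zero-argument callable that performs the real model call and returns a score.

```python
def score_case(key, cache, run_eval):
    """Return a score for one case; the cache may only ever save time."""
    try:
        entry = cache.get(key)
    except Exception:
        entry = None  # a broken lookup is treated as a miss, never as a pass

    if isinstance(entry, dict) and isinstance(entry.get("score"), (int, float)):
        return entry["score"]

    # Miss, error, or malformed entry: run the eval for real.
    score = run_eval()
    try:
        cache[key] = {"score": score}
    except Exception:
        pass  # failing to write the cache must not change the result
    return score
```

Every path that does not end in a well-formed hit ends in `run_eval()`, so the worst case is a slower run rather than an unearned green.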

We also added a cheap guard: the cache stores which model snapshot produced each score, and the runner asserts that the stored snapshot matches the current one before trusting any cached entry. If they differ, the entry is ignored and the case re-runs. That single assertion would have caught the original bug on its own.
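The guard might look roughly like the sketch below. The entry layout (`score` and `model_snapshot` fields) and the function name `trusted_score` are assumptions, the snapshot strings are only illustrative, and it returns `None` instead of raising so the caller can fall through to a real run.

```python
def trusted_score(entry, current_snapshot):
    """Return the cached score only if the same snapshot produced it, else None."""
    if not isinstance(entry, dict):
        return None
    if entry.get("model_snapshot") != current_snapshot:
        return None  # produced by a different model: ignore and re-run the case
    return entry.get("score")

entry = {"score": 0.94, "model_snapshot": "gpt-4o-2024-05-13"}
stale = trusted_score(entry, "gpt-4o-2024-08-06")  # None, the case re-runs
fresh = trusted_score(entry, "gpt-4o-2024-05-13")  # 0.94
```

Because the check reads the stored snapshot rather than the key, it still holds if the key derivation is wrong again in some future change.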

The embarrassing number: the regression was live for nine days. Not because it was subtle in production (support caught it fast), but because when we went to the eval to confirm, the eval still said 0.94, so we spent two of those days looking everywhere except the cache. A gate that lies costs you more than a gate you do not have, because you trust it while it points you the wrong way.

When an eval passes something production then breaks, before you touch the model or the rubric:

Read original on dev.to → dev.to/ethanwritesai/the-stale-eval-fixture-that…
