Your Model Upgrade Broke Three Workflows and the Tests Still Passed A developer warns that model upgrades can silently break agent workflows even when test suites pass, distinguishing the problem from drift. The post introduces a 'golden trace' approach with three tiers of regression detection: externally verifiable facts, statistical comparison against a fixed baseline, and model-as-judge used only for subjective evaluation. The proposed system runs deterministic checks synchronously to gate deployments while reserving slower judge-based analysis for offline review. Every team that runs agent evals eventually hits the same wall: your suite was green on Friday, you bumped the model from gpt-4.1 to gpt-4.1-2026-05 , and on Monday three workflows behave differently. Not broken in any way an exception catches. Just... different. A tool call that used to fire now doesn't. A summary that used to cite the doc now paraphrases it. Your pass rate is still 94%. Nobody knows what changed. This is the regression problem, and it is not the same as drift. Drift is what happens to a fixed configuration over time — the world moves, your data moves, and the agent's behavior slowly decays against a baseline it was never re-anchored to. I've written about catching that before. Regression is different. Regression is the discontinuity you introduce yourself: a model version bump, a prompt edit, a new tool in the registry, a temperature change someone made "to see what happens." The config changed at a known timestamp. The behavior changed with it. And because most agent output is non-deterministic prose, the change hides in the 6% you weren't looking at. The instinct is to throw a model-as-judge at it: "compare the old answer to the new answer, tell me if it got worse." Resist that instinct. A judge comparing two outputs gives you an opinion about which one reads better. It does not give you a fact about what structurally changed. And if you're judging the agent's reasoning trace, you've made it circular — the judge and the judged share a substrate, so there is no independent ground truth. You're asking a model to grade a model with no anchor. What you actually want is a golden trace : a pinned, known-good trajectory that the new version has to be compared against on axes the agent can't talk its way out of. The mistake people make is ranking evidence cheap-to-expensive. The axis that matters is independent → corruptible — how much the agent can forge the signal. Tier 1 — proof the agent can't fake. Did the new version still call the fetch invoice tool when the task required it? Did it still produce valid JSON against the schema? Did it finish inside the timeout? Did the file it claims to have written actually exist on disk? These are externally observable facts. The agent's prose cannot argue with them. If the golden trace called two tools and the new run calls one, that's a regression — full stop, no judgment required. Tier 2 — statistical signal vs a baseline the agent didn't author. Embedding similarity between the new output and the golden output. Did the diff actually change something meaningful, or is it cosmetic? Length and repetition deltas. None of this needs a model's opinion; it needs the old artifact as a fixed reference point. The agent didn't write the golden trace, so it can't game the comparison. Tier 3 — model-as-judge. Yes, there's a subjective tail: "is this summary still faithful to the source?" can't always be reduced to a number. But Tier 3 is a signal, never a verdict , and it has two hard constraints. It's offline only — it's metered, slow, and non-deterministic, so it can never sit in the hot path gating a deploy. And it may only inspect artifacts the judged agent didn't get to write — the final output against the source document, never the agent's own chain-of-thought, because grading reasoning with another model is the circular trap. Tier 1 + Tier 2 are your real-time gate: deterministic, effectively free, fast enough to block a CI run. They catch the overwhelming majority of regressions — the missing tool call, the broken schema, the answer that went from grounded to hand-wavy. Reserve the judge for the ~20% subjective tail, and label its output "opinion, not evidence" so nobody mistakes a 7/10 for a green light. Here's a regression gate built on this split. agent-eval scores the new run against the golden trace; the Tier 1+2 checks run synchronously and can fail the build, while the judge is fired off-line. js import { evaluate, tier1, tier2 } from "agent-eval"; import { loadTrace } from "agentlens"; interface RegressionResult { passed: boolean; blocking: string ; // Tier 1+2 failures — these fail CI advisory: string ; // Tier 3 opinions — logged, never blocking } async function checkRegression golden: string, // pinned trace id, known-good candidate: string, // new run after the model/prompt bump : Promise