Your Model Upgrade Broke Three Workflows and the Tests Still Passed

A developer warns that model upgrades can silently break agent workflows even when test suites pass, distinguishing the problem from drift. The post introduces a 'golden trace' approach with three tiers of regression detection: externally verifiable facts, statistical comparison against a fixed baseline, and model-as-judge used only for subjective evaluation. The proposed system runs deterministic checks synchronously to gate deployments while reserving slower judge-based analysis for offline review.

Every team that runs agent evals eventually hits the same wall: your suite was green on Friday, you bumped the model from gpt-4.1 to gpt-4.1-2026-05 , and on Monday three workflows behave differently. Not broken in any way an exception catches. Just... different. A tool call that used to fire now doesn't. A summary that used to cite the doc now paraphrases it. Your pass rate is still 94%. Nobody knows what changed. This is the regression problem, and it is not the same as drift. Drift is what happens to a fixed configuration over time — the world moves, your data moves, and the agent's behavior slowly decays against a baseline it was never re-anchored to. I've written about catching that before. Regression is different. Regression is the discontinuity you introduce yourself: a model version bump, a prompt edit, a new tool in the registry, a temperature change someone made "to see what happens." The config changed at a known timestamp. The behavior changed with it. And because most agent output is non-deterministic prose, the change hides in the 6% you weren't looking at. The instinct is to throw a model-as-judge at it: "compare the old answer to the new answer, tell me if it got worse." Resist that instinct. A judge comparing two outputs gives you an opinion about which one reads better. It does not give you a fact about what structurally changed. And if you're judging the agent's reasoning trace, you've made it circular — the judge and the judged share a substrate, so there is no independent ground truth. You're asking a model to grade a model with no anchor. What you actually want is a golden trace : a pinned, known-good trajectory that the new version has to be compared against on axes the agent can't talk its way out of. The mistake people make is ranking evidence cheap-to-expensive. The axis that matters is independent → corruptible — how much the agent can forge the signal. Tier 1 — proof the agent can't fake. Did the new version still call the fetch invoice tool when the task required it? Did it still produce valid JSON against the schema? Did it finish inside the timeout? Did the file it claims to have written actually exist on disk? These are externally observable facts. The agent's prose cannot argue with them. If the golden trace called two tools and the new run calls one, that's a regression — full stop, no judgment required. Tier 2 — statistical signal vs a baseline the agent didn't author. Embedding similarity between the new output and the golden output. Did the diff actually change something meaningful, or is it cosmetic? Length and repetition deltas. None of this needs a model's opinion; it needs the old artifact as a fixed reference point. The agent didn't write the golden trace, so it can't game the comparison. Tier 3 — model-as-judge. Yes, there's a subjective tail: "is this summary still faithful to the source?" can't always be reduced to a number. But Tier 3 is a signal, never a verdict , and it has two hard constraints. It's offline only — it's metered, slow, and non-deterministic, so it can never sit in the hot path gating a deploy. And it may only inspect artifacts the judged agent didn't get to write — the final output against the source document, never the agent's own chain-of-thought, because grading reasoning with another model is the circular trap. Tier 1 + Tier 2 are your real-time gate: deterministic, effectively free, fast enough to block a CI run. They catch the overwhelming majority of regressions — the missing tool call, the broken schema, the answer that went from grounded to hand-wavy. Reserve the judge for the ~20% subjective tail, and label its output "opinion, not evidence" so nobody mistakes a 7/10 for a green light. Here's a regression gate built on this split. agent-eval scores the new run against the golden trace; the Tier 1+2 checks run synchronously and can fail the build, while the judge is fired off-line. js import { evaluate, tier1, tier2 } from "agent-eval"; import { loadTrace } from "agentlens"; interface RegressionResult { passed: boolean; blocking: string ; // Tier 1+2 failures — these fail CI advisory: string ; // Tier 3 opinions — logged, never blocking } async function checkRegression golden: string, // pinned trace id, known-good candidate: string, // new run after the model/prompt bump : Promise<RegressionResult { // AgentLens gives us the full trace: every model + tool step, // resolved inputs, raw outputs — data the agent never got to rewrite. const g = await loadTrace golden ; const c = await loadTrace candidate ; const blocking: string = ; // Tier 1: externally observable proof. No model in the loop. if g.toolCalls.length == c.toolCalls.length { blocking.push tool-call count changed: ${g.toolCalls.length} - ${c.toolCalls.length} , ; } if tier1.validJson c.finalOutput, g.schema { blocking.push "output no longer matches schema" ; } if c.durationMs g.durationMs 1.5 { blocking.push latency regressed ${g.durationMs}ms - ${c.durationMs}ms ; } // Tier 2: statistical signal vs the golden artifact the agent didn't author. const sim = tier2.embeddingSimilarity c.finalOutput, g.finalOutput ; if sim < 0.82 { blocking.push output diverged from golden cosine ${sim.toFixed 2 } ; } // Tier 3: judge — OFFLINE, advisory only. Inspects final output vs the // SOURCE doc, never the candidate's own reasoning trace that'd be circular . const opinion = await evaluate.judge { artifact: c.finalOutput, groundTruth: c.sourceDoc, // not g.reasoning — independence matters rubric: "faithfulness-to-source", mode: "offline", } ; return { passed: blocking.length === 0, blocking, advisory: opinion.score < 7 ? judge faithfulness ${opinion.score}/10 : , }; } Notice what's load-bearing here. The Tier 1+2 checks only work because agentlens captured the whole trace — the actual tool calls, the resolved inputs, the raw outputs — and not a post-hoc summary the agent narrated about itself. If all you logged was the final string, you have nothing unforgeable to diff against. The trace is what makes the regression debuggable : when the gate fails on "tool-call count changed," you can open both trajectories side by side and see exactly which step the new model skipped. That's the actual division of labor. AgentLens captures how the agent got there — every model and tool step, every resolved input, every raw output. agent-eval scores what it produced — the tiered gate above. One without the other is half a system: a trace with no scoring is just expensive logging, and a score with no trace is a red light you can't diagnose. Pin a golden trace for every workflow you care about. Re-run it on every model bump and prompt change. Fail the build on Tier 1+2 — the structural facts the agent can't forge. Let the judge whisper its opinion about the subjective tail, offline, clearly labeled as a signal and not a verdict. The teams that get burned by a model upgrade aren't the ones whose evals failed. They're the ones whose evals passed because they were grading vibes instead of pinning facts. Pin the facts. The model will change under you whether you're watching or not.