How a model upgrade silently broke our extraction prompt (and how we caught it)

The article describes a real-world incident where a model upgrade from GPT-4o to GPT-4.1 silently broke a customer support ticket summarization system by renaming a JSON field from "urgency" to "urgency_level," causing all tickets to be incorrectly tagged as "low" urgency for two weeks. The author identifies three common failure patterns when upgrading LLMs: format drift, reasoning regression, and tone shift. The article promotes a tool called PromptFork that can catch such regressions by running a test suite of representative prompts against a baseline model before shipping any model or prompt changes.

A friend's product summarizes customer support tickets using a carefully tuned LLM prompt. It worked perfectly on GPT-4o for six months. Then OpenAI deprecated 4o, the team migrated to GPT-4.1, ran a smoke test in the playground, said "looks fine," and shipped.

Two weeks later a customer escalated: "Your urgency tagging is wrong on basically everything since last Wednesday."

The prompt asked for `{"intent": "...", "urgency": "low|medium|high"}`. On 4o, the model returned exactly that. On 4.1, it started returning `{"intent": "...", "urgency_level": "..."}`. The two are semantically identical, but the downstream classifier was indexing on `urgency` and silently fell through to a default value of "low" on 100% of new tickets.

Nobody saw it because:

- The prompt didn't error. JSON parsed. Fields existed.
- The unit tests checked the prompt *string*, not the prompt *output*.
- The integration tests mocked the LLM call.
- The output was indistinguishable from "everything's fine and quiet."

This is the silent regression problem. Code has tests; prompts have vibes.

## Three categories of model-swap failure

After looking at a dozen of these incidents, the failures cluster into three groups. Knowing which kind you're looking at tells you what to test.

1. **Format drift.** The model decides to rename a field, drop a field, add a field you didn't ask for, or change list ordering. JSON still parses. Your downstream code breaks.
2. **Reasoning regression.** The model is "improved" but loses a hidden constraint your prompt depended on. Classic example: GPT-4 reliably extracted all requirements from a contract; GPT-4-Turbo extracted "the most important ones," dropping 15-20% of clauses. The format was fine. The data was wrong.
3. **Tone shift.** Less common but expensive. The new model's outputs are more verbose, less verbose, friendlier, blunter. If anything downstream (another model, a regex, a fuzzy matcher) was tuned to the old tone, it breaks.
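The incident above turned on a silent default: a renamed field parsed cleanly and the classifier quietly fell back to "low". One way to make that kind of format drift loud is a strict check at the parsing boundary. What follows is a minimal sketch, not the team's actual code: the function names `validate_summary` and `parse_summary` are hypothetical, and the expected fields are assumed from the prompt contract quoted earlier.

```python
import json

# Assumed contract, taken from the prompt described in the article.
EXPECTED_KEYS = {"intent", "urgency"}
ALLOWED_URGENCY = {"low", "medium", "high"}


def validate_summary(raw: str) -> list[str]:
    """Return a list of problems found in one model response (empty means OK)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    missing = EXPECTED_KEYS - set(data)  # dropped or renamed fields
    extra = set(data) - EXPECTED_KEYS    # fields the prompt never asked for
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")
    if "urgency" in data:
        urgency = data["urgency"]
        if not isinstance(urgency, str) or urgency not in ALLOWED_URGENCY:
            problems.append(f"invalid urgency: {urgency!r}")
    return problems


def parse_summary(raw: str) -> dict:
    """Parse a response, raising instead of falling back to a default value."""
    problems = validate_summary(raw)
    if problems:
        raise ValueError("; ".join(problems))
    return json.loads(raw)
```

Fed the 4.1-style response `{"intent": "refund", "urgency_level": "high"}`, `validate_summary` reports both a missing `urgency` field and an unexpected `urgency_level` field, and `parse_summary` raises a `ValueError` rather than tagging the ticket "low". A check like this only covers format drift; reasoning regressions and tone shifts need output-level comparisons.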
## What the team should have had

A test suite of 30 representative tickets, each with an expected JSON shape. On model swap day:

```bash
$ promptfork test summarize_ticket --baseline gpt-4o
→ running v12 across gpt-4.1 vs baseline gpt-4o
✗ 30/30 ok, but 6 regressions detected
  - urgency field renamed: 6 cases
  - severity 2 functional
```

Six lines. Seven seconds. Two-week customer-facing bug avoided.

## How to actually do this

The setup for the team that got bitten took four minutes:

```bash
pip install promptfork

# Save the current production prompt (version 1)
promptfork push summarize_ticket \
  --file prompts/summarize.txt \
  --message "current prod"

# Pin 30 real tickets from your support inbox
for t in tickets/*.json; do
  name=$(basename "$t" .json)
  promptfork add-test summarize_ticket "$name" \
    --input ticket="$(cat "$t")" \
    --rubric "must return urgency in {low,medium,high}"
done

# Run baseline on 4o
promptfork test summarize_ticket --models gpt-4o

# Now upgrade: push the new prompt as v2 (or keep v1 and swap models)
# Run with v1 (4o) as the baseline, get an LLM-judge regression report
promptfork test summarize_ticket --baseline 1 --models gpt-4.1
```

That's it. The `--baseline` flag is what catches drift: it pulls the baseline output, runs the candidate, and asks Claude Haiku to compare them under a strict "only flag strictly worse" rubric.

## The CI version

The same command in a GitHub Action means no prompt change ever ships without running against a known-good baseline:

```yaml
- uses: shaunvand/promptfork-cli@v0
  with:
    prompt: summarize_ticket
    baseline: 1
    api-key: ${{ secrets.PROMPTFORK_API_KEY }}
```

The action exits non-zero on regression. Branch protection blocks the merge.

If you ship LLM features, you need this. The first time it catches a silent regression, it pays for itself a hundred times over.

PromptFork has a free tier (3 prompts, 50 runs/mo) at https://promptfork.online/diff. Set it up in five minutes, sleep better forever.
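The baseline idea itself is small enough to sketch without any particular tool. The following is a tool-agnostic illustration, not PromptFork's implementation: instead of an LLM judge it uses a simple deterministic comparison of top-level JSON field names between a baseline model and a candidate. The names `find_format_regressions`, `old_model`, and `new_model` are hypothetical, and the two stand-in functions would be replaced by real model calls.

```python
import json
from typing import Callable

# A "model" here is any function from ticket text to a raw response string.
Model = Callable[[str], str]


def field_names(raw: str) -> set[str]:
    """Top-level JSON field names of a response; empty set if not a JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return set()
    return set(data) if isinstance(data, dict) else set()


def find_format_regressions(
    tickets: dict[str, str], baseline: Model, candidate: Model
) -> dict[str, str]:
    """Map ticket name -> description for every ticket whose field names changed."""
    regressions = {}
    for name, text in tickets.items():
        before = field_names(baseline(text))
        after = field_names(candidate(text))
        if before != after:
            dropped = sorted(before - after)
            added = sorted(after - before)
            regressions[name] = f"dropped {dropped}, added {added}"
    return regressions


# Stand-ins for real model calls (assumptions for illustration only).
def old_model(ticket: str) -> str:
    return json.dumps({"intent": "refund", "urgency": "high"})


def new_model(ticket: str) -> str:
    return json.dumps({"intent": "refund", "urgency_level": "high"})


def main() -> int:
    """Return a process exit code: 1 if any regression was found, else 0."""
    tickets = {"ticket-001": "I was charged twice and need a refund today."}
    found = find_format_regressions(tickets, old_model, new_model)
    for name, detail in found.items():
        print(f"{name}: {detail}")
    return 1 if found else 0
```

In CI, a script like this would end with `sys.exit(main())` so that a non-zero exit code fails the job. A field-name comparison only catches format drift; reasoning regressions and tone shifts need a rubric or a judge that looks at the content of the outputs.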