{"slug": "how-a-model-upgrade-silently-broke-our-extraction-prompt-and-how-we-caught-it", "title": "How a model upgrade silently broke our extraction prompt (and how we caught it)", "summary": "The article describes a real-world incident where a model upgrade from GPT-4o to GPT-4.1 silently broke a customer support ticket summarization system by renaming a JSON field from \"urgency\" to \"urgency_level,\" causing all tickets to be incorrectly tagged as \"low\" urgency for two weeks. The author identifies three common failure patterns when upgrading LLMs: format drift, reasoning regression, and tone shift. The article promotes a tool called PromptFork that can catch such regressions by running a test suite of representative prompts against a baseline model before shipping any model or prompt changes.", "body_md": "A friend's product summarizes customer support tickets using a fine-tuned LLM prompt. It worked perfectly on GPT-4o for six months. Then OpenAI deprecated 4o, the team migrated to GPT-4.1, ran a smoke test in the playground, said \"looks fine,\" and shipped.\n\nTwo weeks later a customer escalated: \"Your urgency tagging is wrong on basically everything since last Wednesday.\"\n\nThe prompt asked for `{\"intent\": \"...\", \"urgency\": \"low|medium|high\"}`. On 4o, the model returned exactly that. On 4.1, it started returning `{\"intent\": \"...\", \"urgency_level\": \"...\"}`. That is semantically identical, but the downstream classifier was indexing on `urgency` and silently fell through to a default value of \"low\" on 100% of new tickets.\n\nNobody saw it because:\n\n- The prompt didn't error. JSON parsed. Fields existed.\n- The unit tests checked the *prompt string*, not the *prompt output*.\n- The integration tests mocked the LLM call.\n- The output was indistinguishable from \"everything's fine and quiet.\"\n\nThis is the silent regression problem. 
Code has tests; prompts have vibes.\n\n## Three categories of model-swap failure\n\nAfter looking at a dozen of these incidents, the failures cluster into three groups. Knowing which kind you're looking at tells you what to test.\n\n**1. Format drift.** The model decides to rename a field, drop a field, add a field you didn't ask for, or change list ordering. JSON still parses. Your downstream code breaks.\n\n**2. Reasoning regression.** The model is \"improved\" but loses a hidden constraint your prompt depended on. Classic example: GPT-4 reliably extracted *all* requirements from a contract; GPT-4-Turbo extracted \"the most important ones,\" dropping 15-20% of clauses. The format was fine. The data was wrong.\n\n**3. Tone shift.** Less common but expensive. The new model's outputs are more verbose, less verbose, friendlier, blunter. If anything downstream (another model, a regex, a fuzzy matcher) was tuned to the old tone, it breaks.\n\n## What the team should have had\n\nA test suite of 30 representative tickets, each with an expected JSON shape.\n\nOn model swap day:\n\n``` bash\n$ promptfork test summarize_ticket --baseline gpt-4o\n→ running v12 across [gpt-4.1] vs baseline [gpt-4o]\n✗ 30/30 ok, but 6 regressions detected\n  - urgency_field_renamed: 6 cases\n  - severity 2 (functional)\n```\n\nSix lines. Seven seconds. 
Two-week customer-facing bug avoided.\n\n## How to actually do this\n\nThe setup for the team that got bitten took four minutes:\n\n``` bash\npip install promptfork\n\n# Save the current production prompt, version 1\npromptfork push summarize_ticket \\\n  --file prompts/summarize.txt \\\n  --message \"current prod\"\n\n# Pin 30 real tickets from your support inbox\nfor t in tickets/*.json; do\n  name=$(basename \"$t\" .json)\n  promptfork add-test summarize_ticket \"$name\" \\\n    --input ticket=\"$(cat \"$t\")\" \\\n    --rubric \"must return urgency in {low,medium,high}\"\ndone\n\n# Run baseline on 4o\npromptfork test summarize_ticket --models gpt-4o\n\n# Now upgrade — push the new prompt as v2 (or keep v1 and swap models)\n# Run with v1 (4o) as the baseline, get an LLM-judge regression report\npromptfork test summarize_ticket --baseline 1 --models gpt-4.1\n```\n\nThat's it. The `--baseline` flag is what catches drift: it pulls the baseline output, runs the candidate, and asks Claude Haiku to compare them under a strict \"only flag strictly worse\" rubric.\n\n## The CI version\n\nThe same command in a GitHub Action means *no prompt change ever ships* without running against a known-good baseline:\n\n``` yaml\n- uses: shaunvand/promptfork-cli@v0\n  with:\n    prompt: summarize_ticket\n    baseline: 1\n    api-key: ${{ secrets.PROMPTFORK_API_KEY }}\n```\n\nThe action exits non-zero on regression. Branch protection blocks the merge.\n\nIf you ship LLM features, you need this. The first time it catches a silent regression, it pays for itself a hundred times over. 
PromptFork has a free tier (3 prompts, 50 runs/mo) at [https://promptfork.online/diff](https://promptfork.online/diff). Set it up in five minutes, sleep better forever.", "url": "https://wpnews.pro/news/how-a-model-upgrade-silently-broke-our-extraction-prompt-and-how-we-caught-it", "canonical_source": "https://dev.to/shaun_vd_7562913ba77e1e0b/how-a-model-upgrade-silently-broke-our-extraction-prompt-and-how-we-caught-it-40ol", "published_at": "2026-05-23 08:57:46+00:00", "updated_at": "2026-05-23 09:02:00.994350+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "machine-learning", "developer-tools", "enterprise-software"], "entities": ["OpenAI", "GPT-4o", "GPT-4.1", "GPT-4", "GPT-4-Turbo"], "alternates": {"html": "https://wpnews.pro/news/how-a-model-upgrade-silently-broke-our-extraction-prompt-and-how-we-caught-it", "markdown": "https://wpnews.pro/news/how-a-model-upgrade-silently-broke-our-extraction-prompt-and-how-we-caught-it.md", "text": "https://wpnews.pro/news/how-a-model-upgrade-silently-broke-our-extraction-prompt-and-how-we-caught-it.txt", "jsonld": "https://wpnews.pro/news/how-a-model-upgrade-silently-broke-our-extraction-prompt-and-how-we-caught-it.jsonld"}}