# How a model upgrade silently broke our extraction prompt (and how we caught it)

> Source: <https://dev.to/shaun_vd_7562913ba77e1e0b/how-a-model-upgrade-silently-broke-our-extraction-prompt-and-how-we-caught-it-40ol>
> Published: 2026-05-23 08:57:46+00:00

A friend's product summarizes customer support tickets using a fine-tuned LLM
prompt. It worked perfectly on GPT-4o for six months. Then OpenAI deprecated
4o, the team migrated to GPT-4.1, ran a smoke test in the playground, said
"looks fine," and shipped.

Two weeks later a customer escalated: "Your urgency tagging is wrong on
basically everything since last Wednesday."

The prompt asked for `{"intent": "...", "urgency": "low|medium|high"}`. On
4o, the model returned exactly that. On 4.1, it started returning
`{"intent": "...", "urgency_level": "..."}`. The two are semantically
identical, but the downstream classifier was indexing on `urgency` and silently
fell through to a default value of "low" on 100% of new tickets.
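The fall-through is easy to reproduce. A minimal sketch, assuming the downstream code read the field with a defaulted lookup (the real classifier isn't shown here); `lenient_urgency` and `strict_urgency` are hypothetical names for illustration:

```python
import json

ALLOWED_URGENCY = {"low", "medium", "high"}


def lenient_urgency(raw: str) -> str:
    """Hypothetical stand-in for the downstream lookup: a missing key becomes "low"."""
    return json.loads(raw).get("urgency", "low")


def strict_urgency(raw: str) -> str:
    """Fail loudly when the field is missing, renamed, or out of range."""
    data = json.loads(raw)
    if "urgency" not in data:
        raise KeyError(f"missing 'urgency'; got keys {sorted(data)}")
    if data["urgency"] not in ALLOWED_URGENCY:
        raise ValueError(f"unexpected urgency value: {data['urgency']!r}")
    return data["urgency"]


old_output = '{"intent": "refund", "urgency": "high"}'
new_output = '{"intent": "refund", "urgency_level": "high"}'

print(lenient_urgency(old_output))  # high
print(lenient_urgency(new_output))  # low (the silent fall-through)
```

With the strict version, the renamed field raises a `KeyError` on the first ticket instead of quietly tagging everything "low".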

Nobody saw it because:

- The prompt didn't error. JSON parsed. Fields existed.
- The unit tests checked the *prompt string*, not the *prompt output*.
- The integration tests mocked the LLM call.
- The output was indistinguishable from "everything's fine and quiet."

This is the silent regression problem. Code has tests; prompts have vibes.

## Three categories of model-swap failure

After looking at a dozen of these incidents, the failures cluster into three
groups. Knowing which kind you're looking at tells you what to test.

**1. Format drift.** The model decides to rename a field, drop a field, add
a field you didn't ask for, or change list ordering. JSON still parses. Your
downstream code breaks.

**2. Reasoning regression.** The model is "improved" but loses a hidden
constraint your prompt depended on. Classic example: GPT-4 reliably extracted
*all* requirements from a contract; GPT-4-Turbo extracted "the most important
ones," dropping 15-20% of clauses. The format was fine. The data was wrong.

**3. Tone shift.** Less common but expensive. The new model's outputs are
more verbose, less verbose, friendlier, blunter. If anything downstream
(another model, a regex, a fuzzy matcher) was tuned to the old tone, it
breaks.
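Each category suggests its own cheap check. The sketch below is illustrative only: the function names are hypothetical, `key_drift` looks at top-level keys only, `recall` assumes exact (case-insensitive) matching against a hand-labelled list, and `length_ratio` covers verbosity, which is just one aspect of tone.

```python
import json
from statistics import median


def key_drift(baseline_raw: str, candidate_raw: str) -> dict[str, list[str]]:
    """Format drift: top-level keys that disappeared from or appeared in the candidate."""
    baseline = json.loads(baseline_raw)
    candidate = json.loads(candidate_raw)
    return {
        "missing": sorted(baseline.keys() - candidate.keys()),
        "added": sorted(candidate.keys() - baseline.keys()),
    }


def recall(expected: list[str], extracted: list[str]) -> float:
    """Reasoning regression: share of expected items present in the extraction."""
    if not expected:
        return 1.0
    found = {item.strip().lower() for item in extracted}
    hits = sum(1 for item in expected if item.strip().lower() in found)
    return hits / len(expected)


def length_ratio(baseline_outputs: list[str], candidate_outputs: list[str]) -> float:
    """Tone shift (verbosity only): median candidate word count over median baseline."""
    base = median(len(text.split()) for text in baseline_outputs)
    cand = median(len(text.split()) for text in candidate_outputs)
    if base == 0:
        raise ValueError("baseline outputs are empty")
    return cand / base


print(key_drift(
    '{"intent": "refund", "urgency": "high"}',
    '{"intent": "refund", "urgency_level": "high"}',
))
# {'missing': ['urgency'], 'added': ['urgency_level']}
```

Any non-empty `missing` list, a recall below 1.0 on a task that requires every item, or a length ratio far from 1.0 is a reason to look closer before switching models.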

## What the team should have had

A test suite of 30 representative tickets, each with an expected JSON shape.
On model swap day:

``` bash
$ promptfork test summarize_ticket --baseline gpt-4o
→ running v12 across [gpt-4.1] vs baseline [gpt-4o]
✗ 30/30 ok, but 6 regressions detected
  - urgency_field_renamed: 6 cases
  - severity 2 (functional)
```

Six lines. Seven seconds. Two-week customer-facing bug avoided.
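The shape check behind such a suite does not depend on any particular tool. A minimal sketch, assuming each pinned ticket is a name mapped to its text and the model call is passed in as a plain function; `check_shape`, `run_suite`, and `call_model` are hypothetical names for illustration, not part of PromptFork:

```python
import json
from typing import Callable

EXPECTED_KEYS = {"intent", "urgency"}
ALLOWED_URGENCY = {"low", "medium", "high"}


def check_shape(raw: str) -> list[str]:
    """Return the problems found in one model output; an empty list means it passed."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    if set(data) != EXPECTED_KEYS:
        problems.append(f"keys {sorted(data)} != {sorted(EXPECTED_KEYS)}")
    elif not isinstance(data["urgency"], str) or data["urgency"] not in ALLOWED_URGENCY:
        problems.append(f"urgency {data['urgency']!r} not allowed")
    return problems


def run_suite(
    tickets: dict[str, str], call_model: Callable[[str], str]
) -> dict[str, list[str]]:
    """Run every ticket through call_model and collect failures by ticket name."""
    failures = {}
    for name, text in tickets.items():
        problems = check_shape(call_model(text))
        if problems:
            failures[name] = problems
    return failures


def fake_model(ticket_text: str) -> str:
    """Stub standing in for a real API call; mimics the renamed field."""
    return '{"intent": "refund", "urgency_level": "high"}'


print(run_suite({"ticket_001": "My order never arrived."}, fake_model))
```

Swapping `fake_model` for a function that calls the real model turns this into the test the team was missing: it checks the prompt's output, not the prompt string.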

## How to actually do this

The setup for the team that got bitten took four minutes:

``` bash
pip install promptfork

# Save the current production prompt, version 1
promptfork push summarize_ticket \
  --file prompts/summarize.txt \
  --message "current prod"

# Pin 30 real tickets from your support inbox
for t in tickets/*.json; do
  name=$(basename "$t" .json)
  promptfork add-test summarize_ticket "$name" \
    --input ticket="$(cat "$t")" \
    --rubric "must return urgency in {low,medium,high}"
done

# Run baseline on 4o
promptfork test summarize_ticket --models gpt-4o

# Now upgrade — push the new prompt as v2 (or keep v1 and swap models)
# Run with v1 (4o) as the baseline, get an LLM-judge regression report
promptfork test summarize_ticket --baseline 1 --models gpt-4.1
```

That's it. The `--baseline` flag is what catches drift: it pulls the
baseline output, runs the candidate, and asks Claude Haiku to compare them
under a strict "only flag strictly worse" rubric.

## The CI version

The same command in a GitHub Action means *no prompt change ever ships*
without running against a known-good baseline:

``` yaml
- uses: shaunvand/promptfork-cli@v0
  with:
    prompt: summarize_ticket
    baseline: 1
    api-key: ${{ secrets.PROMPTFORK_API_KEY }}
```

The action exits non-zero on regression. Branch protection blocks the merge.

If you ship LLM features, you need this. The first time it catches a silent
regression, it pays for itself a hundred times over. PromptFork has a free
tier (3 prompts, 50 runs/mo) at [https://promptfork.online/diff](https://promptfork.online/diff). Set it up
in five minutes, sleep better forever.
