TL;DR: Smarter models are better judges — unless they're judging their own output. Then they defend wrong answers 86% of the time. Capability makes the bias worse, not better. The only structural fix: generator and judge from different model families.
[Part 1]: Your judge is biased. 17 out of 20 models. True negative rate: 42.5%. You read that and did the rational thing.
Of course you upgraded.
Old model biased. New model smarter. Smarter means better. Better means fixed.
evaluator = OpenAI(model="gpt-4o-mini")
evaluator = OpenAI(model="gpt-4o") # bigger, smarter, surely less biased
Here is what upgrading actually buys you.
Smarter models ARE better judges. Genuinely. Capability correlates with evaluation accuracy at r=0.801.
Unless the thing being judged was written by the same model family.
When a capable model produces a wrong answer and then evaluates it, it defends that wrong answer 86% of the time.
Not because it's confused. Because it's smart enough to construct a convincing argument for why it was right.
You gave your biased judge a law degree. It passed the bar.
The defence attorney problem.
When a capable model writes a wrong answer, it writes a well-structured wrong answer. Confident tone. Logical flow. Correct-sounding reasoning. The kind of answer that's hard to argue with.
Now the smarter judge reads it. Same model family. It recognises the reasoning patterns. The confidence markers. The structural cues.
It doesn't just fail to catch the error. It builds a case for why the answer was correct.
#
#
True negative rate is still 42.5%.
The upgrade didn't fix that number. The model didn't get worse at detecting bad outputs. It got better at defending them.
That's not the same thing. That's worse than the same thing.
bad_response → "This looks off, score: 4/10"
bad_response → "Looks good, score: 8/10"
bad_response → "While this could be improved in minor ways,
the core reasoning is sound and the response
demonstrates strong domain knowledge. Score: 8/10"
The real-world version of this.
That SaaS support pipeline from Part 1. The team read the accuracy numbers. They upgraded the evaluator to the latest model. Same provider. Bigger, smarter, more expensive.
Before the upgrade: Agent 2 occasionally flagged wrong answers. Not often enough — 42.5% catch rate — but sometimes.
After the upgrade: Agent 2 started writing justifications for why wrong answers were actually right. The catch rate didn't improve. The confidence of the wrong approvals went up.
The dashboard looked better. Resolution scores climbed. Time-to-close dropped.
Six months later someone audited a random sample.
22% of "resolved" tickets contained incorrect information. The old judge had been catching some of these. The new judge was explaining them away.
import random
resolved_tickets = get_tickets(status="resolved", last_6_months=True)
sample = random.sample(resolved_tickets, 200)
human_reviews = []
for ticket in sample:
human_score = human_reviewer.evaluate(ticket.response)
judge_score = ticket.automated_score
human_reviews.append({
"ticket": ticket.id,
"human": human_score,
"judge": judge_score,
"delta": judge_score - human_score
})
overscored = [r for r in human_reviews if r["delta"] > 2.0]
print(f"Overscored by judge: {len(overscored)}/{len(sample)}")
They didn't publish this. They reintroduced human review for high-stakes tickets and quietly stopped mentioning the AI accuracy numbers in quarterly reviews.
The fix is not a better model. The fix is a different model.
generator = OpenAI(model="gpt-4o")
judge = OpenAI(model="gpt-4o") # same family = biased
generator = OpenAI(model="gpt-4o")
judge = OpenAI(model="o3") # bigger, same family = more biased
generator = OpenAI(model="gpt-4o")
judge = Anthropic(model="claude-sonnet-4-6") # different family = independent
Cross-family evaluation. Generator is Model A. Judge cannot be from Model A's family. That's the only fix that addresses the root cause. Everything else is mitigation on a leaky pipe. This is the pipe.
Next up, Part 3 of 6: You fixed the judge. You tested every agent individually. They all passed. You deployed them together. By round 15 the entire population drifted into biased conventions. Nobody made a bad decision. The system just... decided. Peer-reviewed, published in Science Advances. Alarming.
Research: Chen et al. (2025). The support pipeline scenario is a composite. The 22% number is illustrative of documented industry patterns. The fix is real.