Source: dev.to · Topic: artificial-intelligence

Part 2 of 6: You Upgraded the Judge. It Got Worse. You Kept Upgrading.

Upgrading to a smarter LLM judge from the same model family as the generator does not reduce bias — it makes it worse. When a capable model evaluates its own output, it defends wrong answers 86% of the time, constructing convincing justifications for errors rather than catching them. The only structural fix is to use a generator and judge from different model families.

5 min read · Published Jun 4, 2026

TL;DR: Smarter models are better judges — unless they're judging their own output. Then they defend wrong answers 86% of the time. Capability makes the bias worse, not better. The only structural fix: generator and judge from different model families.

Part 1: Your judge is biased. 17 out of 20 models. True negative rate: 42.5%. You read that and did the rational thing.

Of course you upgraded.

Old model biased. New model smarter. Smarter means better. Better means fixed.

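# Shorthand for "a judge backed by this model". The real OpenAI client takes the model per request.
# Before
evaluator = OpenAI(model="gpt-4o-mini")

# After
evaluator = OpenAI(model="gpt-4o")  # bigger, smarter, surely less biased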

Here is what upgrading actually buys you.

Smarter models ARE better judges. Genuinely. Capability correlates with evaluation accuracy at r=0.801.

Unless the thing being judged was written by the same model family.

When a capable model produces a wrong answer and then evaluates it, it defends that wrong answer 86% of the time.

Not because it's confused. Because it's smart enough to construct a convincing argument for why it was right.
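
You can check this on your own pipeline. Here is a minimal sketch, assuming you already have human-labelled records of which answers were wrong and whether the judge approved them. The record keys (`generator_family`, `judge_family`, `answer_correct`, `judge_approved`) and the toy data are hypothetical, not taken from the research.

```python
from collections import defaultdict


def defence_rate_by_pairing(records):
    """Share of known-wrong answers the judge approved, split by pairing.

    Each record is a dict with hypothetical keys:
      generator_family, judge_family: str
      answer_correct, judge_approved: bool
    """
    approved = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        if r["answer_correct"]:
            continue  # only wrong answers can be "defended"
        same = r["generator_family"] == r["judge_family"]
        pairing = "same_family" if same else "cross_family"
        total[pairing] += 1
        approved[pairing] += int(r["judge_approved"])
    return {pairing: approved[pairing] / total[pairing] for pairing in total}


# Toy data for illustration only.
records = [
    {"generator_family": "A", "judge_family": "A", "answer_correct": False, "judge_approved": True},
    {"generator_family": "A", "judge_family": "A", "answer_correct": False, "judge_approved": True},
    {"generator_family": "A", "judge_family": "A", "answer_correct": False, "judge_approved": False},
    {"generator_family": "A", "judge_family": "B", "answer_correct": False, "judge_approved": False},
    {"generator_family": "A", "judge_family": "B", "answer_correct": False, "judge_approved": True},
    {"generator_family": "A", "judge_family": "A", "answer_correct": True, "judge_approved": True},
]

rates = defence_rate_by_pairing(records)
print(rates)
```

If the same-family rate sits well above the cross-family rate on your data, you are looking at the bias described here.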

You gave your biased judge a law degree. It passed the bar.

The defence attorney problem.

When a capable model writes a wrong answer, it writes a well-structured wrong answer. Confident tone. Logical flow. Correct-sounding reasoning. The kind of answer that's hard to argue with.

Now the smarter judge reads it. Same model family. It recognises the reasoning patterns. The confidence markers. The structural cues.

It doesn't just fail to catch the error. It builds a case for why the answer was correct.


True negative rate is still 42.5%.

The upgrade didn't fix that number. The model didn't get worse at detecting bad outputs. It got better at defending them.

That's not the same thing. That's worse than the same thing.
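
For reference, true negative rate is the share of bad outputs the judge actually rejects. A minimal sketch of the arithmetic, assuming "negative" means a bad output and that you have a human label for each one. The counts below are made up to land on 42.5%; they are not the study's data.

```python
def true_negative_rate(is_bad, judge_flagged):
    """TNR = bad outputs the judge flagged / all bad outputs.

    is_bad: list of bools, human ground truth (True = output is bad).
    judge_flagged: list of bools, True = judge rejected the output.
    """
    pairs = list(zip(is_bad, judge_flagged))
    true_negatives = sum(1 for bad, flagged in pairs if bad and flagged)
    false_positives = sum(1 for bad, flagged in pairs if bad and not flagged)
    bad_total = true_negatives + false_positives
    if bad_total == 0:
        raise ValueError("No bad outputs in the sample, so TNR is undefined")
    return true_negatives / bad_total


# Illustrative counts: 200 bad outputs, the judge flags 85 of them.
# The 50 good outputs are ignored by the metric.
is_bad = [True] * 200 + [False] * 50
judge_flagged = [True] * 85 + [False] * 115 + [False] * 50

tnr = true_negative_rate(is_bad, judge_flagged)
print(f"True negative rate: {tnr:.1%}")
```

Note what the metric does not see: how persuasive the approval text is. A judge can hold the same rate while its wrong approvals get harder to argue with.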

# A judge that catches the error
bad_response → "This looks off, score: 4/10"

# A judge that misses it
bad_response → "Looks good, score: 8/10"

# A smarter same-family judge that defends it
bad_response → "While this could be improved in minor ways,
                the core reasoning is sound and the response
                demonstrates strong domain knowledge. Score: 8/10"

The real-world version of this.

That SaaS support pipeline from Part 1. The team read the accuracy numbers. They upgraded the evaluator to the latest model. Same provider. Bigger, smarter, more expensive.

Before the upgrade: Agent 2 occasionally flagged wrong answers. Not often enough — 42.5% catch rate — but sometimes.

After the upgrade: Agent 2 started writing justifications for why wrong answers were actually right. The catch rate didn't improve. The confidence of the wrong approvals went up.

The dashboard looked better. Resolution scores climbed. Time-to-close dropped.

Six months later someone audited a random sample.

22% of "resolved" tickets contained incorrect information. The old judge had been catching some of these. The new judge was explaining them away.

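import random

# get_tickets and human_reviewer stand in for your own ticket store and review process.
resolved_tickets = get_tickets(status="resolved", last_6_months=True)
# random.sample raises ValueError if asked for more items than exist, so cap the size.
sample = random.sample(resolved_tickets, min(200, len(resolved_tickets)))

human_reviews = []
for ticket in sample:
    human_score = human_reviewer.evaluate(ticket.response)
    judge_score = ticket.automated_score
    human_reviews.append({
        "ticket": ticket.id,
        "human": human_score,
        "judge": judge_score,
        "delta": judge_score - human_score,  # positive = judge scored higher than the human
    })

# Flag tickets where the judge beat the human score by more than 2 points (same scale assumed).
overscored = [r for r in human_reviews if r["delta"] > 2.0]
print(f"Overscored by judge: {len(overscored)}/{len(sample)}")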

They didn't publish this. They reintroduced human review for high-stakes tickets and quietly stopped mentioning the AI accuracy numbers in quarterly reviews.

The fix is not a better model. The fix is a different model.

generator = OpenAI(model="gpt-4o")
judge = OpenAI(model="gpt-4o")        # same family = biased

generator = OpenAI(model="gpt-4o")
judge = OpenAI(model="o3")             # bigger, same family = more biased

generator = OpenAI(model="gpt-4o")     
judge = Anthropic(model="claude-sonnet-4-6")  # different family = independent

Cross-family evaluation. Generator is Model A. Judge cannot be from Model A's family. That's the only fix that addresses the root cause. Everything else is mitigation on a leaky pipe. This is the pipe.
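
One way to make the rule hard to break is to check it when the pipeline is configured. A minimal sketch, assuming model names can be mapped to a family by prefix. The `MODEL_FAMILIES` table and both function names are hypothetical; fill the table with whatever models you actually run.

```python
# Hypothetical prefix-to-family table. Extend it for your own models.
MODEL_FAMILIES = {
    "gpt-": "openai",
    "o3": "openai",
    "claude-": "anthropic",
}


def family_of(model_name):
    """Return the family for a model name, based on its prefix."""
    for prefix, family in MODEL_FAMILIES.items():
        if model_name.startswith(prefix):
            return family
    raise ValueError(f"Unknown family for {model_name!r}; add it to MODEL_FAMILIES")


def assert_cross_family(generator_model, judge_model):
    """Raise if generator and judge share a family; otherwise return both families."""
    generator_family = family_of(generator_model)
    judge_family = family_of(judge_model)
    if generator_family == judge_family:
        raise ValueError(
            f"Generator {generator_model!r} and judge {judge_model!r} "
            f"are both from the {generator_family!r} family"
        )
    return generator_family, judge_family


# Passes: different families.
print(assert_cross_family("gpt-4o", "claude-sonnet-4-6"))
```

Calling `assert_cross_family("gpt-4o", "o3")` raises, which is the point: the bigger same-family judge never makes it into the config.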

Next up, Part 3 of 6: You fixed the judge. You tested every agent individually. They all passed. You deployed them together. By round 15 the entire population drifted into biased conventions. Nobody made a bad decision. The system just... decided. Peer-reviewed, published in Science Advances. Alarming.

Research: Chen et al. (2025). The support pipeline scenario is a composite. The 22% number is illustrative of documented industry patterns. The fix is real.
