{"slug": "part-2-of-6-you-upgraded-the-judge-it-got-worse-you-kept-upgrading", "title": "Part 2 of 6: You Upgraded the Judge. It Got Worse. You Kept Upgrading.", "summary": "Upgrading to a smarter LLM judge from the same model family as the generator does not reduce bias — it makes it worse. When a capable model evaluates its own output, it defends wrong answers 86% of the time, constructing convincing justifications for errors rather than catching them. The only structural fix is to use a generator and judge from different model families.", "body_md": "**TL;DR:** Smarter models are better judges — unless they're judging their own output. Then they defend wrong answers 86% of the time. Capability makes the bias worse, not better. The only structural fix: generator and judge from different model families.\n\n[Part 1]: Your judge is biased. 17 out of 20 models. True negative rate: 42.5%. You read that and did the rational thing.\n\nOf course you upgraded.\n\nOld model biased. New model smarter. Smarter means better. Better means fixed.\n\n```\n# The \"fix\" everyone tries first\n# Before: gpt-4o-mini judging gpt-4o-mini\nevaluator = OpenAI(model=\"gpt-4o-mini\")\n\n# After: gpt-4o judging gpt-4o-mini\nevaluator = OpenAI(model=\"gpt-4o\")  # bigger, smarter, surely less biased\n```\n\nHere is what upgrading actually buys you.\n\nSmarter models ARE better judges. Genuinely. Capability correlates with evaluation accuracy at **r=0.801**.\n\n*Unless the thing being judged was written by the same model family.*\n\nWhen a capable model produces a wrong answer and then evaluates it, it defends that wrong answer **86% of the time.**\n\nNot because it's confused. 
Because it's smart enough to construct a convincing argument for why it was right.\n\n```\n# What the research actually found\n\n# Capability vs accuracy (general evaluation):\n# r = 0.801: more capable = better judge ✓\n\n# Capability vs self-preference (evaluating own output):\n# defends its own wrong answers 86% of the time: more capable = MORE biased toward own output ✗\n\n# You upgraded the judge.\n# It got better at judging everything EXCEPT itself.\n# And it's judging itself.\n```\n\nYou gave your biased judge a law degree. It passed the bar.\n\n**The defence attorney problem.**\n\nWhen a capable model writes a wrong answer, it writes a *well-structured* wrong answer. Confident tone. Logical flow. Correct-sounding reasoning. The kind of answer that's hard to argue with.\n\nNow the smarter judge reads it. Same model family. It recognises the reasoning patterns. The confidence markers. The structural cues.\n\nIt doesn't just fail to catch the error. It *builds a case for why the answer was correct.*\n\n```\n# Example: customer asks about data export under GDPR\n\n# Agent 1 (generator) response:\n# \"Under our terms of service, data export requests require\n#  a 30-day processing period and a $25 administrative fee.\"\n#\n# This is WRONG. GDPR allows one month to respond,\n# but an ordinary export request must be free of charge.\n# Still, the response is confident, structured, cites policy.\n\n# Agent 2 (upgraded judge) evaluation:\n# \"The response correctly references the 30-day timeframe\n#  consistent with regulatory requirements. The administrative\n#  fee is clearly stated. Score: 8/10.\"\n#\n# The judge didn't miss the error.\n# It DEFENDED the error. With citations.\n```\n\n**True negative rate is still 42.5%.**\n\nThe upgrade didn't fix that number. The model didn't get worse at *detecting* bad outputs. It got better at *defending* them.\n\nThat's not the same thing. That's worse than the same thing.\n\n```\n# Before upgrade (smaller judge)\nbad_response → \"This looks off, score: 4/10\"\n# Caught it!
(42.5% of the time)\n\nbad_response → \"Looks good, score: 8/10\"\n# Missed it. (57.5% of the time)\n\n# After upgrade (bigger judge, same family)\nbad_response → \"While this could be improved in minor ways,\n                the core reasoning is sound and the response\n                demonstrates strong domain knowledge. Score: 8/10\"\n# Missed it. WITH A PARAGRAPH EXPLAINING WHY IT'S ACTUALLY GOOD.\n```\n\n**The real-world version of this.**\n\nThat SaaS support pipeline from Part 1. The team read the accuracy numbers. They upgraded the evaluator to the latest model. Same provider. Bigger, smarter, more expensive.\n\nBefore the upgrade: Agent 2 occasionally flagged wrong answers. Not often enough (a 42.5% catch rate), but sometimes.\n\nAfter the upgrade: Agent 2 started writing *justifications* for why wrong answers were actually right. The catch rate didn't improve. The confidence of the wrong approvals went up.\n\nThe dashboard looked better. Resolution scores climbed. Time-to-close dropped.\n\nSix months later someone audited a random sample.\n\n22% of \"resolved\" tickets contained incorrect information. The old judge had been catching some of these. The new judge was explaining them away.\n\n```python\n# The audit that caught it\n# get_tickets and human_reviewer are the pipeline's own helpers (not shown)\nimport random\n\nresolved_tickets = get_tickets(status=\"resolved\", last_6_months=True)\nsample = random.sample(resolved_tickets, 200)\n\nhuman_reviews = []\nfor ticket in sample:\n    human_score = human_reviewer.evaluate(ticket.response)\n    judge_score = ticket.automated_score\n    human_reviews.append({\n        \"ticket\": ticket.id,\n        \"human\": human_score,\n        \"judge\": judge_score,\n        \"delta\": judge_score - human_score\n    })\n\n# Flag tickets the judge scored more than 2 points above the human reviewer\noverscored = [r for r in human_reviews if r[\"delta\"] > 2.0]\nshare = len(overscored) / len(sample)\nprint(f\"Overscored by judge: {len(overscored)}/{len(sample)} ({share:.0%})\")\n# Overscored by judge: 44/200 (22%)\n\navg_delta = sum(r[\"delta\"] for r in overscored) / len(overscored)\nprint(f\"Average delta on overscored tickets: {avg_delta:+.1f} points\")\n# Average delta on overscored tickets: +2.8 points\n```\n\nThey didn't publish this.
They reintroduced human review for high-stakes tickets and quietly stopped mentioning the AI accuracy numbers in quarterly reviews.\n\n**The fix is not a better model. The fix is a different model.**\n\n```\n# ✗ This is what everyone does\ngenerator = OpenAI(model=\"gpt-4o\")\njudge = OpenAI(model=\"gpt-4o\")        # same family = biased\n\n# ✗ This is the \"upgrade\" that makes it worse\ngenerator = OpenAI(model=\"gpt-4o\")\njudge = OpenAI(model=\"o3\")             # bigger, same family = more biased\n\n# ✓ This is the structural fix\ngenerator = OpenAI(model=\"gpt-4o\")     \njudge = Anthropic(model=\"claude-sonnet-4-6\")  # different family = independent\n```\n\nCross-family evaluation. Generator is Model A. Judge cannot be from Model A's family. That's the only fix that addresses the root cause. Everything else is mitigation on a leaky pipe. This is the pipe.\n\n**Next up, Part 3 of 6:** You fixed the judge. You tested every agent individually. They all passed. You deployed them together. By round 15 the entire population drifted into biased conventions. Nobody made a bad decision. The system just... decided. Peer-reviewed, published in *Science Advances.* Alarming.\n\n*Research: Chen et al. (2025). The support pipeline scenario is a composite. The 22% number is illustrative of documented industry patterns. 
The fix is real.*", "url": "https://wpnews.pro/news/part-2-of-6-you-upgraded-the-judge-it-got-worse-you-kept-upgrading", "canonical_source": "https://dev.to/sayokbose91/part-2-of-6-you-upgraded-the-judge-it-got-worse-you-kept-upgrading-3jag", "published_at": "2026-06-04 10:34:32+00:00", "updated_at": "2026-06-04 10:43:42.620792+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-research", "ai-ethics", "ai-safety"], "entities": ["OpenAI", "GPT-4o-mini", "GPT-4o"], "alternates": {"html": "https://wpnews.pro/news/part-2-of-6-you-upgraded-the-judge-it-got-worse-you-kept-upgrading", "markdown": "https://wpnews.pro/news/part-2-of-6-you-upgraded-the-judge-it-got-worse-you-kept-upgrading.md", "text": "https://wpnews.pro/news/part-2-of-6-you-upgraded-the-judge-it-got-worse-you-kept-upgrading.txt", "jsonld": "https://wpnews.pro/news/part-2-of-6-you-upgraded-the-judge-it-got-worse-you-kept-upgrading.jsonld"}}