# Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

> Source: <https://dev.to/saurav_bhattacharya/who-grades-the-grader-your-llm-judge-is-an-unvalidated-model-in-production-pfi>
> Published: 2026-06-27 01:02:32+00:00

Everybody's eval stack has the same load-bearing assumption nobody audits: that the model-as-judge is telling the truth.

You wrote deterministic checks for the easy stuff: schema valid, no PII, latency under budget. Then you hit the subjective stuff ("is this answer actually helpful," "did the agent follow the user's intent," "is this summary faithful to the source") and you reached for an LLM judge, because what else are you going to do. Now a model grades your model. And here's the part that should keep you up at night: **you never validated the grader.** You're shipping or blocking releases based on a 0–10 score from a prompt you wrote in twenty minutes, and you have no idea whether that score correlates with anything a human would agree with.

I've watched teams trust a green judge dashboard for months, then discover the judge was handing out 8s to answers users hated. The judge wasn't broken in an obvious way. It was just *uncalibrated*, and uncalibrated graders fail silently, which is the worst way to fail.

Say it plainly: your LLM judge is a non-deterministic model making consequential decisions in your release pipeline. That is the exact thing you spent the last year learning to distrust. Somehow when it's wearing a lab coat and called an "evaluator," people grant it authority they'd never give the agent itself.

Three ways judges quietly lie:

None of these show up on a dashboard that only plots the average score. They show up when you go looking, and most teams never look, because the judge produces a clean metric and clean metrics feel like ground truth.

The fix isn't "stop using LLM judges." They're genuinely useful and you can't human-label every run. The fix is to **treat the judge as a system under test with its own ground-truth set.** You need a labeled golden set (a few hundred examples scored by humans you trust) and you measure your judge's agreement with those humans. Use Cohen's kappa, not raw accuracy, because raw agreement is inflated when most answers are "fine."
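To make the inflation concrete, here is a minimal sketch of quadratic-weighted Cohen's kappa for integer scores. The function name `quadraticWeightedKappa` and the sample data are illustrative assumptions, not part of any library or of the pipeline described in this post.

```typescript
// Quadratic-weighted Cohen's kappa for integer scores in 0..max.
// Assumption: both raters score on the same integer scale.
function quadraticWeightedKappa(human: number[], model: number[], max = 10): number {
  if (human.length === 0 || human.length !== model.length) {
    throw new Error("human and model must be non-empty and the same length");
  }
  for (const s of [...human, ...model]) {
    if (!Number.isInteger(s) || s < 0 || s > max) {
      throw new Error(`score ${s} is not an integer in 0..${max}`);
    }
  }
  const n = human.length;
  const k = max + 1;
  const histHuman = new Array<number>(k).fill(0);
  const histModel = new Array<number>(k).fill(0);
  let observed = 0; // total squared disagreement actually seen
  for (let i = 0; i < n; i++) {
    histHuman[human[i]]++;
    histModel[model[i]]++;
    observed += (human[i] - model[i]) ** 2;
  }
  // Squared disagreement expected by chance: the raters keep their own
  // score distributions but are otherwise independent.
  let expected = 0;
  for (let a = 0; a < k; a++) {
    for (let b = 0; b < k; b++) {
      expected += ((histHuman[a] * histModel[b]) / n) * (a - b) ** 2;
    }
  }
  // Kappa is undefined (0/0) when both raters gave the same single score
  // to everything; this sketch returns 1 because they agree on every item.
  if (expected === 0) return 1;
  return 1 - observed / expected;
}

// Hypothetical data: humans rate 18 answers an 8 and 2 answers a 2.
// The judge hands out an 8 to everything.
const humanScores = [...new Array<number>(18).fill(8), ...new Array<number>(2).fill(2)];
const judgeScores = new Array<number>(20).fill(8);
console.log(quadraticWeightedKappa(humanScores, judgeScores)); // 0
```

In this made-up sample the judge matches the human score exactly on 18 of 20 items, so raw agreement is 90%, yet kappa is 0: a judge that always answers 8 carries no information about which answers are bad. That gap is the reason to prefer a chance-corrected statistic.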

Here's the calibration check I run before any judge is allowed to gate anything:

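```ts
import { judge } from "./llm-judge";

type Labeled = { input: string; output: string; humanScore: number };

// Quadratic-weighted agreement: penalize big disagreements more than small ones.
// Note: this is raw agreement; it is not chance-corrected the way Cohen's kappa is.
function weightedAgreement(human: number[], model: number[], max = 10): number {
  if (human.length === 0 || human.length !== model.length) {
    throw new Error("human and model scores must be non-empty and equal in length");
  }
  let sum = 0;
  for (let i = 0; i < human.length; i++) {
    const w = ((human[i] - model[i]) ** 2) / (max ** 2);
    sum += 1 - w;
  }
  return sum / human.length; // 1.0 = perfect, lower = drifting from humans
}

// Position-bias probe: judge must agree with itself when we flip the order.
// Assumes judge.compare returns the winning position: "a" = first argument, "b" = second.
async function positionBias(pairs: { a: string; b: string }[]): Promise<number> {
  if (pairs.length === 0) return 0; // nothing to probe
  let flips = 0;
  for (const { a, b } of pairs) {
    const fwd = await judge.compare(a, b);   // "a" | "b"
    const rev = await judge.compare(b, a);   // "a" | "b" (b is now first)
    // Consistent = the same underlying answer wins in both orders.
    const consistent = (fwd === "a" && rev === "b") || (fwd === "b" && rev === "a");
    if (!consistent) flips++;
  }
  return flips / pairs.length; // want this near 0
}

// Placeholder for buildPairs, which the original snippet calls but does not define.
// Assumption: pairing adjacent golden outputs is enough for the probe; swap in your own pairing.
function buildPairs(golden: Labeled[]): { a: string; b: string }[] {
  const pairs: { a: string; b: string }[] = [];
  for (let i = 0; i + 1 < golden.length; i += 2) {
    pairs.push({ a: golden[i].output, b: golden[i + 1].output });
  }
  return pairs;
}

export async function certifyJudge(golden: Labeled[]) {
  // Score every golden example with the judge (all requests are issued concurrently).
  const scored = await Promise.all(
    golden.map(async (g) => (await judge.score(g.input, g.output)).value),
  );
  const agreement = weightedAgreement(golden.map((g) => g.humanScore), scored);
  const bias = await positionBias(buildPairs(golden));

  const passed = agreement >= 0.85 && bias <= 0.1;
  if (!passed) {
    throw new Error(
      `Judge not certified: agreement=${agreement.toFixed(2)} (need >=0.85), ` +
      `positionBias=${bias.toFixed(2)} (need <=0.10). Do not gate releases with this judge.`,
    );
  }
  return { agreement, bias };
}
```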

This runs in CI on a schedule, not just once. Judges drift the same way agents do (the provider updates the underlying model, your prompt template gets edited, your data distribution shifts), and a judge that agreed with humans in March can quietly diverge by June. If you only calibrated once at the start, you don't have a calibrated judge; you have a historical artifact.
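One way to decide when a certification has gone stale is to record what the judge looked like when it last passed and compare that against the current setup. The sketch below is a rough illustration under assumptions: the `JudgeCertification` and `JudgeConfig` types, the `recertificationReasons` function, and the seven-day default are all hypothetical names and values, not part of any tool mentioned here.

```typescript
// Hypothetical record of the last successful certification.
type JudgeCertification = {
  modelId: string;     // provider model identifier the judge was certified against
  promptHash: string;  // hash of the judge prompt template at certification time
  certifiedAt: Date;
  agreement: number;   // agreement with human labels at certification time
};

// Hypothetical description of the judge as currently deployed.
type JudgeConfig = { modelId: string; promptHash: string };

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns every reason the judge should be re-certified; an empty array means
// the last certification still covers the current configuration.
function recertificationReasons(
  last: JudgeCertification,
  current: JudgeConfig,
  now: Date,
  maxAgeDays = 7, // assumed cadence; pick whatever your release rhythm needs
): string[] {
  const reasons: string[] = [];
  if (last.modelId !== current.modelId) reasons.push("judge model changed");
  if (last.promptHash !== current.promptHash) reasons.push("judge prompt template changed");
  const ageDays = (now.getTime() - last.certifiedAt.getTime()) / MS_PER_DAY;
  if (ageDays > maxAgeDays) reasons.push(`certification is ${Math.floor(ageDays)} days old`);
  return reasons;
}

// Example: certified in March, prompt edited since, checked in June.
const lastCert: JudgeCertification = {
  modelId: "judge-model-v1",
  promptHash: "abc123",
  certifiedAt: new Date("2026-03-01T00:00:00Z"),
  agreement: 0.91,
};
const staleReasons = recertificationReasons(
  lastCert,
  { modelId: "judge-model-v1", promptHash: "def456" },
  new Date("2026-06-01T00:00:00Z"),
);
console.log(staleReasons);
// [ 'judge prompt template changed', 'certification is 92 days old' ]
```

The config comparison only catches changes you can see. A provider can update a model behind the same identifier, and a shift in your data leaves both fields untouched, which is why the age check (and the scheduled run behind it) stays in even when nothing appears to have changed.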

Here's where the two halves of the workflow lock together, because a kappa of 0.6 is a smoke alarm, not a diagnosis.

[agent-eval](https://www.npmjs.com/) is what runs the scoring and the gate: it's the layer holding your deterministic checks, your model-as-judge, the golden set, and the `certifyJudge` step above. It's the thing that tells you the judge agreement dropped below 0.85 and refuses to let the release through. That's the signal. But a failing number with no context is just an argument waiting to happen ("the judge is wrong," "no, the agent regressed"), and nobody can settle it.

That's the job of [AgentLens](https://www.npmjs.com/). It captures the full trace behind every score: the exact prompt the judge saw, the candidate output, the resolved rubric, the judge's raw completion.

That's the loop. **agent-eval scores and gates; AgentLens shows the trace so the score is debuggable.** Without the trace, a bad judge score is unfalsifiable: you can't tell a judge problem from an agent problem, so you end up trusting the number you should be interrogating. With it, every disagreement between judge and human becomes a concrete, inspectable artifact instead of a meeting.
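As a rough sketch of what "inspectable artifact" can mean in practice, the snippet below models a per-score trace record and pulls out the cases where judge and human disagree most. The `JudgeTrace` shape and the `largestDisagreements` function are assumptions made for illustration; they are not the schema or API of AgentLens or agent-eval.

```typescript
// Hypothetical trace record; field names are illustrative only.
type JudgeTrace = {
  id: string;
  judgePrompt: string;     // the exact prompt the judge saw
  candidateOutput: string; // the output being graded
  rubric: string;          // the resolved rubric
  rawCompletion: string;   // the judge's unparsed response
  judgeScore: number;
  humanScore?: number;     // present only for golden-set examples
};

// Return the traces where the judge and the human label differ the most,
// so a reviewer can read the prompt and raw completion behind each gap.
function largestDisagreements(traces: JudgeTrace[], topN = 5): JudgeTrace[] {
  return traces
    .filter((t): t is JudgeTrace & { humanScore: number } => t.humanScore !== undefined)
    .sort(
      (x, y) =>
        Math.abs(y.judgeScore - y.humanScore) - Math.abs(x.judgeScore - x.humanScore),
    )
    .slice(0, topN);
}

// Made-up traces for demonstration.
const sampleTraces: JudgeTrace[] = [
  { id: "t1", judgePrompt: "...", candidateOutput: "...", rubric: "...", rawCompletion: "...", judgeScore: 8, humanScore: 7 },
  { id: "t2", judgePrompt: "...", candidateOutput: "...", rubric: "...", rawCompletion: "...", judgeScore: 9, humanScore: 2 },
  { id: "t3", judgePrompt: "...", candidateOutput: "...", rubric: "...", rawCompletion: "...", judgeScore: 6 },
];
console.log(largestDisagreements(sampleTraces, 2).map((t) => t.id)); // [ 't2', 't1' ]
```

Sorting by the size of the gap puts the most informative cases first: a 9 from the judge against a 2 from a human is where reading the raw completion tends to reveal whether the rubric, the prompt, or the agent is at fault.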

If you're using a model-as-judge and you can't state your judge's agreement with human labels as a number, you are not running evals. You're running a vibe check with extra steps and a false sense of rigor. The judge is the most trusted, least audited component in your entire pipeline, and "the LLM said it was good" is doing a lot of unexamined work in your release decisions.

Certify the judge. Re-certify on a schedule. Keep the traces so every score can be challenged. A grader you haven't validated isn't measuring quality; it's laundering an opinion into a metric, and your green dashboard is the receipt.
