# Who Grades the Grader? Your LLM Judge Is an Unvalidated Model in Production

> Source:
> Published: 2026-06-27 01:02:32+00:00

Everybody's eval stack has the same load-bearing assumption nobody audits: that the model-as-judge is telling the truth.

You wrote deterministic checks for the easy stuff: schema valid, no PII, latency under budget. Then you hit the subjective stuff ("is this answer actually helpful," "did the agent follow the user's intent," "is this summary faithful to the source") and you reached for an LLM judge, because what else are you going to do.

Now a model grades your model. And here's the part that should keep you up at night: **you never validated the grader.** You're shipping or blocking releases based on a 0-10 score from a prompt you wrote in twenty minutes, and you have no idea if that score correlates with anything a human would agree with.

I've watched teams trust a green judge dashboard for months, then discover the judge was handing out 8s to answers users hated. The judge wasn't broken in an obvious way. It was just *uncalibrated*, and uncalibrated graders fail silently, which is the worst way to fail.

Say it plainly: your LLM judge is a non-deterministic model making consequential decisions in your release pipeline. That is the exact thing you spent the last year learning to distrust. Somehow when it's wearing a lab coat and called an "evaluator," people grant it authority they'd never give the agent itself.

Three ways judges quietly lie:

None of these show up on a dashboard that only plots the average score. They show up when you go looking, and most teams never look, because the judge produces a clean metric and clean metrics feel like ground truth.

The fix isn't "stop using LLM judges." They're genuinely useful and you can't human-label every run.
The fix is to **treat the judge as a system under test with its own ground-truth set.** You need a labeled golden set (a few hundred examples scored by humans you trust) and you measure your judge's agreement with those humans. Cohen's kappa, not raw accuracy, because raw agreement is inflated when most answers are "fine."

Here's the calibration check I run before any judge is allowed to gate anything. Note that `weightedAgreement` is a simplified quadratic-weighted agreement score: it penalizes large disagreements more than small ones, but it is not chance-corrected, so it is not kappa itself.

```ts
import { judge } from "./llm-judge";

type Labeled = { input: string; output: string; humanScore: number };

// Quadratic-weighted agreement: penalize big disagreements more than small ones.
// Not chance-corrected, so this is a simpler score than Cohen's kappa.
function weightedAgreement(human: number[], model: number[], max = 10): number {
  let num = 0, den = 0;
  for (let i = 0; i < human.length; i++) {
    const w = ((human[i] - model[i]) ** 2) / (max ** 2);
    num += 1 - w;
    den += 1;
  }
  return num / den; // 1.0 = perfect, lower = drifting from humans
}

// Position-bias probe: judge must agree with itself when we flip the order.
// The check assumes compare() returns the winning slot: "a" = first argument.
async function positionBias(pairs: { a: string; b: string }[]): Promise<number> {
  let flips = 0;
  for (const { a, b } of pairs) {
    const fwd = await judge.compare(a, b); // "a" | "b"
    const rev = await judge.compare(b, a); // "a" | "b" (b is now first)
    const consistent =
      (fwd === "a" && rev === "b") || (fwd === "b" && rev === "a");
    if (!consistent) flips++;
  }
  return flips / pairs.length; // want this near 0
}

// Assumption: buildPairs is not shown in the original. This stand-in pairs
// each golden output with the next one so the probe has something to compare.
function buildPairs(golden: Labeled[]): { a: string; b: string }[] {
  return golden.slice(1).map((g, i) => ({ a: golden[i].output, b: g.output }));
}

export async function certifyJudge(golden: Labeled[]) {
  // Score every golden example with the judge under test.
  const scored = await Promise.all(
    golden.map(async (g) => (await judge.score(g.input, g.output)).value),
  );
  const agreement = weightedAgreement(golden.map((g) => g.humanScore), scored);
  const bias = await positionBias(buildPairs(golden));

  const passed = agreement >= 0.85 && bias <= 0.1;
  if (!passed) {
    throw new Error(
      `Judge not certified: agreement=${agreement.toFixed(2)} (need >=0.85), ` +
        `positionBias=${bias.toFixed(2)} (need <=0.10). Do not gate releases with this judge.`,
    );
  }
  return { agreement, bias };
}
```

This runs in CI on a schedule, not just once. Judges drift the same way agents do: the provider updates the underlying model, your prompt template gets edited, your data distribution shifts. A judge that agreed with humans in March can quietly diverge by June. If you only calibrated once at the start, you don't have a calibrated judge; you have a historical artifact.

Here's where the two halves of the workflow lock together, because a kappa of 0.6 is a smoke alarm, not a diagnosis.

[agent-eval](https://www.npmjs.com/) is what runs the scoring and the gate. It's the layer holding your deterministic checks, your model-as-judge, the golden set, and the `certifyJudge` step above. It's the thing that tells you the judge agreement dropped below 0.85 and refuses to let the release through. That's the signal.

But a failing number with no context is just an argument waiting to happen ("the judge is wrong," "no, the agent regressed") and nobody can settle it. That's the job of [AgentLens](https://www.npmjs.com/). It captures the full trace behind every score: the exact prompt the judge saw, the candidate output, the resolved rubric, the judge's raw completion.

That's the loop. **agent-eval scores and gates; AgentLens shows the trace so the score is debuggable.** Without the trace, a bad judge score is unfalsifiable: you can't tell a judge problem from an agent problem, so you end up trusting the number you should be interrogating. With it, every disagreement between judge and human becomes a concrete, inspectable artifact instead of a meeting.

If you're using a model-as-judge and you can't state your judge's agreement with human labels as a number, you are not running evals. You're running a vibe check with extra steps and a false sense of rigor.
The judge is the most trusted, least audited component in your entire pipeline, and "the LLM said it was good" is doing a lot of unexamined work in your release decisions.

Certify the judge. Re-certify on a schedule. Keep the traces so every score can be challenged. A grader you haven't validated isn't measuring quality; it's laundering an opinion into a metric, and your green dashboard is the receipt.
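
The calibration discussion recommends Cohen's kappa because raw agreement is inflated when most answers are "fine," while the `weightedAgreement` helper in the calibration check is not chance-corrected. As a rough sketch of what a chance-corrected version could look like, here is quadratic-weighted kappa for integer scores. The function name `quadraticWeightedKappa` is hypothetical and not part of agent-eval, and the sketch assumes both humans and judge emit integer scores in `[0, max]`. The 0.85 threshold from the gate was set for the simpler score and would likely need re-tuning for kappa.

```typescript
// Hypothetical helper (not from the original post): quadratic-weighted
// Cohen's kappa for integer ratings in [0, max].
// 1 = perfect agreement, 0 = no better than chance given each rater's score mix.
function quadraticWeightedKappa(human: number[], model: number[], max = 10): number {
  if (human.length === 0 || human.length !== model.length) {
    throw new Error("rating arrays must be non-empty and of equal length");
  }
  const k = max + 1;
  const n = human.length;
  // observed[i][j] = how often the human gave i while the judge gave j.
  const observed: number[][] = Array.from({ length: k }, () => new Array<number>(k).fill(0));
  const humanHist = new Array<number>(k).fill(0);
  const modelHist = new Array<number>(k).fill(0);

  for (let i = 0; i < n; i++) {
    const h = human[i];
    const m = model[i];
    for (const r of [h, m]) {
      if (!Number.isInteger(r) || r < 0 || r > max) {
        throw new Error(`rating ${r} is not an integer in [0, ${max}]`);
      }
    }
    observed[h][m] += 1;
    humanHist[h] += 1;
    modelHist[m] += 1;
  }

  let observedDisagreement = 0;
  let expectedDisagreement = 0;
  for (let i = 0; i < k; i++) {
    for (let j = 0; j < k; j++) {
      const w = ((i - j) ** 2) / (max ** 2); // same quadratic penalty as weightedAgreement
      const expected = (humanHist[i] * modelHist[j]) / n; // count expected by chance
      observedDisagreement += w * observed[i][j];
      expectedDisagreement += w * expected;
    }
  }
  // If both raters only ever used the same single score, kappa is undefined.
  if (expectedDisagreement === 0) return NaN;
  return 1 - observedDisagreement / expectedDisagreement;
}
```

To see why the correction matters, take human labels of nine 8s and one 2 against a judge that always answers 8. `weightedAgreement` returns 0.964, comfortably above the 0.85 gate, while this function returns 0: the judge carries no information about which answer the humans disliked. Returning `NaN` in the undefined case is a design choice so that a `>=` threshold fails closed.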