Source: dev.to · Topic: large-language-models

I checked six LLM-as-judge tools against human labels. The scoreboard was the wrong thing to read.

A developer evaluated six LLM-as-judge tools (DeepEval, Confident AI, Evidently, Braintrust, Promptfoo, and Future AGI) and found that none of them makes validating judge outputs against human labels a default first step. The author argues that tools should prioritize agreement metrics such as Cohen's kappa over raw scores, and notes that validation remains a manual exercise for users.

Published Jun 25, 2026 · 4 min read

Most LLM-as-judge comparisons rank tools by which one gives you a number fastest. That is the wrong axis. A judge you have not validated against human labels is not a measurement; it is a vibe with a decimal point. So I ran six tools the way a methodologist would: not "which one scores," but "which one helps me prove the score is trustworthy."

Trust here has a specific meaning. An LLM judge inherits known failure modes: position bias (it favors the first answer it sees), verbosity bias (it rewards longer outputs), and self-preference (it scores outputs from its own model family higher). None of these show up in the score itself. They show up only when you compare the judge against a human-labeled set and compute agreement. The standard instrument for that is Cohen's kappa, not raw accuracy, because raw accuracy lies whenever your classes are imbalanced.

So the criterion I graded each tool on was simple: how much friction does it put between me and a confusion matrix against human labels?

DeepEval (G-Eval). The widest eval breadth of the group, honestly. Chain-of-thought scoring via G-Eval, a pytest-style harness, a large catalog of metrics. It is the tool I reach for when I want coverage. What it does not do for you is the human-agreement step. You write the judge, you collect the labels, you compute kappa yourself. Reference: Liu et al., "G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment" (arXiv:2303.16634), which is worth reading precisely because it measures Spearman correlation with human judgment rather than simply asserting that the judge is aligned. (G-Eval is the paper's method; DeepEval is the tool that implements it.)

Confident AI. The hosted layer on top of DeepEval. Adds storage, sharing, a dashboard. The validation gap is identical, because it is the same engine underneath. You get a nicer place to keep results, not a built-in human-agreement workflow.

Evidently. Strong on report dashboards and drift detection. If your problem is "the judge looked fine in March and I want to know when it drifts," this fits. It is monitoring-shaped, not validation-shaped. It will not hand you a kappa against a held-out human set as a first-class step.

Braintrust. The side-by-side run-comparison UI is genuinely useful for spotting where two judge configurations disagree. That is disagreement-spotting, which is upstream of validation but not the same as it. Seeing two columns diverge tells you something is off, not whether either column agrees with a human.

Promptfoo. Treats judges as test assertions. Lightweight, CI-friendly, easy to wire into a pipeline. Thin on judge-versus-human statistics by design: it is a testing tool, not a measurement-theory tool.

Future AGI. Sits in the middle of this list, not at the top of it. It is an end-to-end open-source platform rather than an eval-only tool, and its evaluation surface is hybrid: deterministic functions, grounded checks, and LLM-as-judge under one interface. The hybrid framing is the interesting part for this question, because the deterministic and grounded paths give you cheaper anchors to sanity-check the judge path against. It still does not crown itself the answer to the human-agreement problem. You bring the labels. DeepEval has broader raw eval breadth; Future AGI trades some of that breadth for the hybrid local-plus-judge structure. (Source: github.com/future-agi/future-agi.)

The finding across all six: not one of them treats "compute judge agreement with human labels and show me the confusion matrix" as the default first action. Every tool optimizes for producing a score. The validation is left as an exercise for the user, which is exactly the part most teams skip.

Here is the procedure I actually run, regardless of tool:

The tool choice changes how pleasant steps 2 through 5 are. It does not change whether you have to do them.
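The core agreement computation is small enough to write by hand. What follows is a minimal sketch, assuming pass/fail labels held in two parallel lists; the function names and the toy data are hypothetical and do not come from any of the six tools. Treating the undefined case (both raters using one identical label throughout) as 0.0 is also an assumption of this sketch.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def confusion_matrix(human: Sequence[str], judge: Sequence[str]) -> Dict[Tuple[str, str], int]:
    """Count (human_label, judge_label) pairs over the labeled set."""
    if len(human) != len(judge):
        raise ValueError("label lists must be the same length")
    return dict(Counter(zip(human, judge)))


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own base rates.
    expected = sum(human_freq[k] * judge_freq[k] for k in human_freq) / (n * n)
    if expected >= 1.0:
        # Kappa is undefined here; returning 0.0 is an assumption of this sketch.
        return 0.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): 10 items, human labels vs. judge labels.
human_labels = ["pass"] * 6 + ["fail"] * 4
judge_labels = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]

for (h, j), count in sorted(confusion_matrix(human_labels, judge_labels).items()):
    print(f"human={h:<4} judge={j:<4} count={count}")
print("kappa:", round(cohens_kappa(human_labels, judge_labels), 3))
```

Printing the confusion matrix first is deliberate: the off-diagonal cells show which direction the judge errs in, which a single kappa value hides.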

Why Cohen's kappa instead of accuracy? Accuracy is inflated by class imbalance. If 90 percent of your examples are "pass," a judge that says "pass" every time scores 90 percent accuracy and zero usefulness. Kappa corrects for agreement that would happen by chance, so it does not reward that degenerate strategy.
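Worked through with the standard definition of Cohen's kappa, where p_o is the observed agreement and p_e is the agreement expected by chance, the always-"pass" judge on a 90/10 set comes out as follows:

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad p_o = 0.90,
\qquad p_e = \underbrace{0.90 \times 1.00}_{\text{both say pass}}
           + \underbrace{0.10 \times 0.00}_{\text{both say fail}} = 0.90
\;\Longrightarrow\;
\kappa = \frac{0.90 - 0.90}{1 - 0.90} = 0
```

The 90 percent accuracy is exactly the agreement chance alone would produce, so kappa reports zero.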

What kappa is good enough? There is no universal threshold, but I treat roughly 0.6 as the floor for deploying a judge on a non-trivial dimension, and I want to see where the disagreements land before trusting it. Lower can be acceptable on genuinely subjective dimensions; see the open question below.

Do I need 200 labels specifically? No. 200 is a practical balance between annotation cost and a confusion matrix you can actually read. The point is a held-out human set, not the exact count.
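One way to see what a given label count buys is to put an interval around kappa. The sketch below uses a percentile bootstrap, a standard resampling technique that is not described in this article; the function names, the resample count, and the synthetic labels are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Sequence, Tuple


def cohens_kappa(human: Sequence[str], judge: Sequence[str]) -> float:
    """Cohen's kappa for two equal-length label sequences."""
    n = len(human)
    if n == 0 or n != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_freq, judge_freq = Counter(human), Counter(judge)
    expected = sum(human_freq[k] * judge_freq[k] for k in human_freq) / (n * n)
    if expected >= 1.0:
        return 0.0  # undefined case; 0.0 is an assumption of this sketch
    return (observed - expected) / (1.0 - expected)


def bootstrap_kappa_interval(
    human: Sequence[str],
    judge: Sequence[str],
    resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for kappa, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(human)
    estimates = []
    for _ in range(resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(cohens_kappa([human[i] for i in idx], [judge[i] for i in idx]))
    estimates.sort()
    lower_idx = int((alpha / 2) * resamples)
    upper_idx = min(resamples - 1, int((1 - alpha / 2) * resamples))
    return estimates[lower_idx], estimates[upper_idx]


# Synthetic labels (hypothetical): the same agreement pattern at 50 and 200 items.
pattern_human = ["pass"] * 6 + ["fail"] * 4
pattern_judge = ["pass"] * 5 + ["fail"] + ["fail"] * 3 + ["pass"]
for copies in (5, 20):
    human = pattern_human * copies
    judge = pattern_judge * copies
    low, high = bootstrap_kappa_interval(human, judge)
    print(len(human), "labels:", round(low, 3), "to", round(high, 3))
```

The point estimate is identical at both sizes; only the interval width changes, which is the quantity the labeling budget actually controls.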

Can one tool just do the validation for me? None of the six I tested ship human-agreement-with-confusion-matrix as the default workflow. They produce scores; you supply and compare the labels.

Cohen's kappa assumes a meaningful ground truth to agree with. On highly subjective dimensions (helpfulness, tone, "did this answer feel complete"), human annotators themselves often reach a kappa of only 0.4 to 0.5 with each other. A judge cannot beat the ceiling set by human-human disagreement. So how should we report a judge's kappa relative to the human-human kappa on the same set, and is there a clean way to estimate the subjectivity ceiling of a dimension before we spend the labeling budget? If you have a method you trust here, I would like to see it.
