[ARTICLE · art-20617] src=dev.to pub= topic=large-language-models verified=true sentiment=· neutral

Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept

A team switched their LLM-as-judge evaluation from a single 1-to-5 helpfulness scale to four binary per-criterion assertions, raising Cohen's kappa from 0.47 to 0.78. The change required reworking the CI pipeline's pass-threshold logic, dashboards, and judge prompts, with the team settling on two threshold patterns: one for daily CI gates and another for weekly deep checks. The switch also reduced human spot-check time from 90 to 50 minutes per week by enabling faster per-criterion review.

2 min read · published Jun 3, 2026

A few months back our LLM-as-judge ran on a 1-to-5 helpfulness scale. The CI gate stayed green because we were averaging that score, but spot-checks against human labels put Cohen's kappa at 0.47. The rubric was the problem, not the tooling: the same labellers, re-rating on per-criterion binary questions, reached 0.78. The CI pipeline then had to learn the new shape. This post covers the engineering work that came after the methodology decision.
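For readers who have not computed it by hand, here is a minimal sketch of Cohen's kappa for one criterion: observed agreement between judge and human, corrected for the agreement expected by chance. The function name and the sample labels are illustrative assumptions, not the team's code or data.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical labels for one binary criterion (not data from the post).
judge = ["yes", "yes", "no", "yes", "no", "no", "yes", "yes"]
human = ["yes", "yes", "no", "no", "no", "no", "yes", "yes"]
kappa = cohens_kappa(judge, human)
print(f"kappa = {kappa:.2f}")
```

With binary labels the same function runs once per criterion, which is what makes per-criterion kappa tracking cheap to compute.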

Not a war story. Pattern share.

# Before: one 5-class rubric
assertions:
  - type: llm-rubric
    rubric: "Score 1-5 on helpfulness"

# After: four binary per-criterion rubrics
assertions:
  - type: llm-rubric
    rubric: "Is the answer accurate? (yes/no)"
  - type: llm-rubric
    rubric: "Is the answer grounded in the context? (yes/no)"
  - type: llm-rubric
    rubric: "Does the answer follow the required format? (yes/no)"
  - type: llm-rubric
    rubric: "Does the answer address the question asked? (yes/no)"

The first thing that breaks: your existing pass-threshold logic. The old gate was "if avg-score is below 3.5, fail." The new gate has 4 separate signals.
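A rough sketch of the shape change, assuming judge output has already been parsed into Python values. The criterion keys (`accurate`, `grounded`, `format`, `on_topic`) and the sample data are hypothetical; only the "average below 3.5 fails" rule comes from the old gate described above.

```python
from statistics import mean

# Hypothetical short names for the four binary criteria.
CRITERIA = ("accurate", "grounded", "format", "on_topic")

def old_gate(scores, threshold=3.5):
    """Old gate: pass unless the average 1-5 score is below the threshold."""
    return mean(scores) >= threshold

def pass_rates(results):
    """Per-criterion pass rate from a list of {criterion: bool} dicts."""
    return {c: sum(r[c] for r in results) / len(results) for c in CRITERIA}

# Illustrative data only. The average looks healthy...
scores = [5, 5, 4, 2, 4]
old_ok = old_gate(scores)

# ...while per-criterion results expose where the failures are.
results = [
    {"accurate": True,  "grounded": True, "format": True,  "on_topic": True},
    {"accurate": True,  "grounded": True, "format": True,  "on_topic": True},
    {"accurate": False, "grounded": True, "format": True,  "on_topic": True},
    {"accurate": False, "grounded": True, "format": False, "on_topic": True},
]
rates = pass_rates(results)
print(old_ok, rates)
```

Whatever threshold pattern sits on top, it now takes a dict of four rates as input rather than one float, so every consumer of the old score needs touching.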

We tried three threshold patterns:

We landed on option 2 for the daily CI gate and option 3 for the weekly deep check. Option 1 we dropped after a week of false positives.

What got harder:

(a) The dashboards. The old Datadog panel was one line. The new one is 4 lines plus a weighted-score line. Operators have to learn the new layout.

(b) The judge prompt itself. Each binary criterion needs its own prompt. We started with copy-paste-and-tweak; that was a mistake. The criteria need to be debated upfront and the prompts written carefully. Otherwise rater drift sneaks back in at the prompt level.

(c) Calibration set labelling cost. 4x the labels per trace. We compensated by reducing the calibration set from 200 traces to 100 traces. Still got stable kappa.
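One way to keep the per-criterion prompts from (b) from drifting is to store each one as its own named template, stamp verdicts with a content hash, and parse the judge's reply strictly. This is a sketch under assumptions: the dictionary keys, the "Reply with exactly" wording, and both helper functions are hypothetical, and only the four rubric questions come from the config above.

```python
import hashlib

# Hypothetical per-criterion judge prompts; the reply instruction is illustrative.
JUDGE_PROMPTS = {
    "accurate": "Is the answer accurate? Reply with exactly 'yes' or 'no'.",
    "grounded": "Is the answer grounded in the context? Reply with exactly 'yes' or 'no'.",
    "format": "Does the answer follow the required format? Reply with exactly 'yes' or 'no'.",
    "on_topic": "Does the answer address the question asked? Reply with exactly 'yes' or 'no'.",
}

def prompt_version(criterion):
    """Short content hash identifying which prompt text produced a verdict."""
    text = JUDGE_PROMPTS[criterion]
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:8]

def parse_verdict(raw):
    """Map judge output to True/False; raise on anything other than yes/no."""
    text = raw.strip().lower().rstrip(".")
    if text == "yes":
        return True
    if text == "no":
        return False
    raise ValueError(f"unparseable judge verdict: {raw!r}")

print(prompt_version("accurate"), parse_verdict(" Yes. "))
```

Raising on ambiguous replies, rather than defaulting to a fail, keeps judge formatting problems from being counted as model regressions.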

What got easier:

(a) Debugging regressions. When accuracy kappa drops while groundedness holds, the prompt change most likely broke generation, not retrieval. The single-number score was averaging away that signal.

(b) Per-criterion alerting. Format compliance kappa cratering at 3am means the JSON parser broke. Set up a dedicated alert. Page on it.

(c) The human spot-check loop. Reviewing per-criterion is faster than re-reading the full 5-class rubric. Our weekly calibration job dropped from 90 minutes to 50.
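The debugging and alerting points in (a) and (b) both reduce to comparing per-criterion numbers against a baseline and naming the criterion that moved. A minimal sketch, assuming the metrics arrive as plain dicts; the function name, the 0.05 tolerance, and the sample values are all hypothetical, and the values could be kappa or pass rate.

```python
def regressed_criteria(baseline, current, max_drop=0.05):
    """Return criteria whose metric fell by more than max_drop versus baseline."""
    return sorted(
        c for c in baseline
        # A criterion missing from the current run counts as a full drop.
        if baseline[c] - current.get(c, 0.0) > max_drop
    )

# Illustrative values only (not figures from the post).
baseline = {"accurate": 0.92, "grounded": 0.90, "format": 0.99, "on_topic": 0.95}
current = {"accurate": 0.80, "grounded": 0.91, "format": 0.99, "on_topic": 0.94}

flagged = regressed_criteria(baseline, current)
print(flagged)
```

Here only `accurate` is flagged while `grounded` holds, which is the pattern that points at generation rather than retrieval. A dedicated alert per criterion would route on the returned names.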

The CI plumbing is the straightforward part. The harder work goes into the judge prompts themselves. Each binary criterion deserves the same care as a feature prompt: write it deliberately, version it in git, calibrate it against humans, and watch the per-criterion kappa over time.

Default to 3 or 4 criteria. We tried 6 and the labelling cost killed us. 2 hides too much. 4 was the sweet spot in our data; your traces may need a different number.

Anyone else done this switch? What criteria did you settle on, and how did the threshold tuning go?
