Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept


Source: dev.to · Published 2026-06-03T17:24Z · Topic: large-language-models

A team switched their LLM-as-judge evaluation from a single 1-to-5 helpfulness scale to four binary per-criterion assertions, raising Cohen's kappa from 0.47 to 0.78. The change required reworking the CI pipeline's pass-threshold logic, dashboards, and judge prompts, with the team settling on two threshold patterns: one for daily CI gates and another for weekly deep checks. The switch also reduced human spot-check time from 90 to 50 minutes per week by enabling faster per-criterion review.

A few months back, our LLM-as-judge scored answers on a single 1-to-5 helpfulness scale. The CI gate stayed green because we averaged that score. Spot-checks against human labels put Cohen's kappa at 0.47. The rubric was the problem, not the tooling: with the same labellers re-rating on per-criterion binary questions, kappa rose to 0.78. The CI pipeline then had to learn the new shape. This post covers the engineering work that came after the methodology decision.
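For readers who have not computed it by hand, here is a minimal sketch of Cohen's kappa for one binary criterion. It assumes kappa is measured between human labels and judge verdicts on the same traces; the function name and the ten sample labels are illustrative, not data from our pipeline.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical labels for one criterion ("accurate?") on ten traces.
human = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge = [1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(human, judge)
print(f"kappa = {kappa:.2f}")
```

Raw agreement in this toy sample is 80%, but kappa is lower because it discounts the agreement two raters would reach by chance given how often each says yes.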

Not a war story. Pattern share.

Before, one graded rubric:

assertions:
  - type: llm-rubric
    rubric: "Score 1-5 on helpfulness"

After, four binary rubrics:

assertions:
  - type: llm-rubric
    rubric: "Is the answer accurate? (yes/no)"
  - type: llm-rubric
    rubric: "Is the answer grounded in the context? (yes/no)"
  - type: llm-rubric
    rubric: "Does the answer follow the required format? (yes/no)"
  - type: llm-rubric
    rubric: "Does the answer address the question asked? (yes/no)"
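A yes/no rubric only helps if the judge's reply is turned into a boolean strictly. The sketch below shows one way to do that; `parse_verdict` is a hypothetical helper, and it assumes the judge is told to start its reply with "yes" or "no". Anything else raises, so an ambiguous reply shows up as an error rather than a silent fail.

```python
import re

def parse_verdict(raw: str) -> bool:
    """Map a judge reply to True (yes) or False (no); raise on anything else."""
    # Accept only a leading whole word "yes" or "no", case-insensitively.
    match = re.match(r"\s*(yes|no)\b", raw, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"unparseable judge reply: {raw!r}")
    return match.group(1).lower() == "yes"

print(parse_verdict("Yes, every claim is supported by the context."))
print(parse_verdict("no"))
```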

The first thing that breaks: your existing pass-threshold logic. The old gate was "if avg-score is below 3.5, fail." The new gate has 4 separate signals.

We tried three threshold patterns:

We landed on option 2 for the daily CI gate and option 3 for the weekly deep check. Option 1 we dropped after a week of false positives.
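To make "4 separate signals" concrete, here is a sketch of two ways a gate can consume them: a per-criterion pass-rate floor and a single weighted score. These are illustrations, not necessarily the options we tried; the criterion names, floors, and weights are all assumed values.

```python
CRITERIA = ("accurate", "grounded", "format", "on_topic")

def pass_rates(results):
    """results: one dict per trace, mapping criterion name -> bool verdict."""
    if not results:
        raise ValueError("no results to gate on")
    return {c: sum(r[c] for r in results) / len(results) for c in CRITERIA}

def floor_gate(rates, floors):
    """Return the criteria below their floor; an empty list means the gate passes."""
    return [c for c in CRITERIA if rates[c] < floors[c]]

def weighted_score(rates, weights):
    """Collapse the four pass rates into one number for a trend line."""
    total = sum(weights[c] for c in CRITERIA)
    return sum(rates[c] * weights[c] for c in CRITERIA) / total

# Hypothetical verdicts for four traces.
results = [
    {"accurate": True, "grounded": True, "format": True, "on_topic": True},
    {"accurate": True, "grounded": False, "format": True, "on_topic": True},
    {"accurate": False, "grounded": True, "format": True, "on_topic": True},
    {"accurate": True, "grounded": True, "format": True, "on_topic": False},
]
rates = pass_rates(results)
failing = floor_gate(rates, {"accurate": 0.8, "grounded": 0.7, "format": 0.95, "on_topic": 0.7})
score = weighted_score(rates, {"accurate": 2, "grounded": 2, "format": 1, "on_topic": 1})
print(failing, round(score, 3))
```

A floor gate names the criterion that failed, which is what a CI log needs. A weighted score is easier to plot but can hide a single criterion collapsing.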

What got harder:

(a) The dashboards. The old Datadog panel was one line. The new one has 4 lines plus a weighted-score line. Operators have to learn the new layout.

(b) The judge prompt itself. Each binary criterion needs its own prompt. We started with copy-paste-and-tweak; that was a mistake. The criteria need to be debated upfront and the prompts written carefully. Otherwise rater drift sneaks back in at the prompt level.

(c) Calibration set labelling cost. 4x the labels per trace. We compensated by reducing the calibration set from 200 traces to 100 traces. Still got stable kappa.
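For point (b), one way to avoid copy-paste-and-tweak is to keep each criterion as a small versioned record and build every judge prompt from the same template. This is a sketch under that assumption; the `Criterion` fields, the template wording, and the sample definition are illustrative, not our production prompts.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Criterion:
    name: str
    version: str     # bumped whenever the wording changes
    question: str    # the single yes/no question this judge answers
    definition: str  # what counts as "yes", agreed before prompts are written

def build_prompt(criterion: Criterion, context: str, answer: str) -> str:
    """Render one judge prompt; every criterion shares this template."""
    return (
        f"You are grading one criterion only: {criterion.name} "
        f"(rubric {criterion.version}).\n"
        f"Definition: {criterion.definition}\n\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}\n\n"
        f"{criterion.question} Reply with exactly 'yes' or 'no'."
    )

GROUNDED = Criterion(
    name="grounded",
    version="v1",
    question="Is the answer grounded in the context?",
    definition="Every factual claim in the answer is supported by the context.",
)
prompt = build_prompt(GROUNDED, context="Paris is the capital of France.", answer="Paris.")
print(prompt)
```

Keeping the definition next to the question makes a wording change show up as a reviewable diff, and the version string lets a kappa shift be traced back to a specific rubric revision.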

What got easier:

(a) Debugging regressions. When accuracy kappa drops while groundedness holds, the prompt change most likely broke generation, not retrieval. The single-number score was averaging away that signal.

(b) Per-criterion alerting. Format compliance kappa cratering at 3am means the JSON parser broke. Set up a dedicated alert. Page on it.

(c) The human spot-check loop. Reviewing per-criterion is faster than re-reading the full 5-class rubric. Our weekly calibration job dropped from 90 minutes to 50.
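Points (a) and (b) both come down to comparing each criterion against its own baseline. The sketch below does that with per-criterion pass rates, which is an assumption: pass rates need no human labels, so they can be checked on every run, and the same comparison works on per-criterion kappa values from the calibration job. The function name, the 0.05 threshold, and the sample numbers are illustrative.

```python
def criterion_drops(baseline, current, max_drop=0.05):
    """Return {criterion: drop} for each criterion that fell by more than max_drop."""
    return {
        name: round(baseline[name] - current[name], 4)
        for name in baseline
        if baseline[name] - current[name] > max_drop
    }

# Hypothetical pass rates before and after a prompt change.
baseline = {"accurate": 0.92, "grounded": 0.90, "format": 0.99, "on_topic": 0.95}
current = {"accurate": 0.78, "grounded": 0.89, "format": 0.99, "on_topic": 0.94}
drops = criterion_drops(baseline, current)
print(drops)
```

Here only accuracy is flagged while groundedness holds, which is the pattern that points at generation rather than retrieval.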

The CI plumbing is the straightforward part. The harder work goes into the judge prompts themselves. Each binary criterion deserves the same care as a feature prompt: write it deliberately, version it in git, calibrate it against humans, and watch the per-criterion kappa over time.

Default to 3 or 4 criteria. We tried 6 and the labelling cost killed us. Two hides too much. Four was the sweet spot in our data; your traces may need a different number.

Anyone else done this switch? What criteria did you settle on, and how did the threshold tuning go?

Read original on dev.to → dev.to/ethanwritesai/switching-our-llm-as-judge-…