A few months back our LLM-as-judge ran on a 1-to-5 helpfulness scale. The CI gate stayed green because we were averaging that score. Spot-checking against humans put Cohen's kappa at 0.47. The rubric was the problem, not the tooling. Same labellers re-rating on per-criterion binary got to 0.78. The CI pipeline had to learn the new shape. This post is the engineering work that came after the methodology decision.
Not a war story. Pattern share.
assertions:
- type: llm-rubric
rubric: "Score 1-5 on helpfulness"
assertions:
- type: llm-rubric
rubric: "Is the answer accurate? (yes/no)"
- type: llm-rubric
rubric: "Is the answer grounded in the context? (yes/no)"
- type: llm-rubric
rubric: "Does the answer follow the required format? (yes/no)"
- type: llm-rubric
rubric: "Does the answer address the question asked? (yes/no)"
The first thing that breaks: your existing pass-threshold logic. The old gate was "if avg-score is below 3.5, fail." The new gate has 4 separate signals.
We tried three threshold patterns:
We landed on option 2 for the daily CI gate and option 3 for the weekly deep check. Option 1 we dropped after a week of false positives.
(a) The dashboards. The old Datadog panel was one line. The new one is 4 lines plus a weighted-score line. Operators have to learn the new layout.
(b) The judge prompt itself. Each binary criterion needs its own prompt. We started with copy-paste-and-tweak; that was a mistake. The criteria need to be debated upfront and the prompts written carefully. Otherwise rater drift sneaks back in at the prompt level.
(c) Calibration set labelling cost. 4x the labels per trace. We compensated by reducing the calibration set from 200 traces to 100 traces. Still got stable kappa.
(a) Debugging regressions. When accuracy kappa drops while groundedness holds, the prompt change broke generation, not retrieval. The single-number score was averaging away the signal.
(b) Per-criterion alerting. Format compliance kappa cratering at 3am means the JSON parser broke. Set up a dedicated alert. Page on it.
(c) The human spot-check loop. Reviewing per-criterion is faster than re-reading the full 5-class rubric. Our weekly calibration job dropped from 90 minutes to 50.
The CI plumbing is the straightforward part. The harder work goes into the judge prompts themselves. Each binary criterion deserves the same care as a feature prompt: write it deliberately, version it in git, calibrate it against humans, and watch the per-criterion kappa over time.
Default to 3 or 4 criteria. We tried 6 and the labelling cost killed us. 2 hides too much. 4 was the sweet spot in our data; your traces may need different.
Anyone else done this switch? What criteria did you settle on, and how did the threshold tuning go?