{"slug": "switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept", "title": "Switching our LLM-as-judge from 5-class to binary in CI: the patterns we kept", "summary": "A team switched their LLM-as-judge evaluation from a single 1-to-5 helpfulness scale to four binary per-criterion assertions, raising Cohen's kappa from 0.47 to 0.78. The change required reworking the CI pipeline's pass-threshold logic, dashboards, and judge prompts, with the team settling on two threshold patterns: one for daily CI gates and another for weekly deep checks. The switch also reduced human spot-check time from 90 to 50 minutes per week by enabling faster per-criterion review.", "body_md": "A few months back our LLM-as-judge ran on a 1-to-5 helpfulness scale. The CI gate stayed green because we were averaging that score. Spot-checking against humans put Cohen's kappa at 0.47. The rubric was the problem, not the tooling. Same labellers re-rating on per-criterion binary got to 0.78. The CI pipeline had to learn the new shape. This post is the engineering work that came after the methodology decision.\n\nNot a war story. Pattern share.\n\n```\n# Before: single 5-class assertion\nassertions:\n  - type: llm-rubric\n    rubric: \"Score 1-5 on helpfulness\"\n\n# After: 4 binary assertions per criterion\nassertions:\n  - type: llm-rubric\n    rubric: \"Is the answer accurate? (yes/no)\"\n  - type: llm-rubric\n    rubric: \"Is the answer grounded in the context? (yes/no)\"\n  - type: llm-rubric\n    rubric: \"Does the answer follow the required format? (yes/no)\"\n  - type: llm-rubric\n    rubric: \"Does the answer address the question asked? (yes/no)\"\n```\n\nThe first thing that breaks: your existing pass-threshold logic. The old gate was \"if avg-score is below 3.5, fail.\" The new gate has 4 separate signals.\n\nWe tried three threshold patterns:\n\nWe landed on option 2 for the daily CI gate and option 3 for the weekly deep check. 
Option 1 we dropped after a week of false positives.\n\nThree other things broke or got more expensive:\n\n(a) The dashboards. The old Datadog panel was one line. The new one is 4 lines plus a weighted-score line. Operators have to learn the new layout.\n\n(b) The judge prompt itself. Each binary criterion needs its own prompt. We started with copy-paste-and-tweak; that was a mistake. The criteria need to be debated upfront and the prompts written carefully. Otherwise rater drift sneaks back in at the prompt level.\n\n(c) Calibration set labelling cost. 4x the labels per trace. We compensated by reducing the calibration set from 200 traces to 100 traces. Still got stable kappa.\n\nThree things got easier:\n\n(a) Debugging regressions. When accuracy kappa drops while groundedness holds, the prompt change broke generation, not retrieval. The single-number score was averaging away the signal.\n\n(b) Per-criterion alerting. Format compliance kappa cratering at 3am means the JSON parser broke. Set up a dedicated alert. Page on it.\n\n(c) The human spot-check loop. Reviewing per-criterion is faster than re-reading the full 5-class rubric. Our weekly calibration job dropped from 90 minutes to 50.\n\nThe CI plumbing is the straightforward part. The harder work goes into the judge prompts themselves. Each binary criterion deserves the same care as a feature prompt: write it deliberately, version it in git, calibrate it against humans, and watch the per-criterion kappa over time.\n\nDefault to 3 or 4 criteria. We tried 6 and the labelling cost killed us. 2 hides too much. 4 was the sweet spot in our data; your traces may need a different number.\n\nAnyone else done this switch? 
What criteria did you settle on, and how did the threshold tuning go?", "url": "https://wpnews.pro/news/switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept", "canonical_source": "https://dev.to/ethanwritesai/switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept-43i6", "published_at": "2026-06-03 17:24:59+00:00", "updated_at": "2026-06-03 17:42:07.500014+00:00", "lang": "en", "topics": ["large-language-models", "mlops", "artificial-intelligence", "natural-language-processing", "ai-tools"], "entities": ["Datadog"], "alternates": {"html": "https://wpnews.pro/news/switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept", "markdown": "https://wpnews.pro/news/switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept.md", "text": "https://wpnews.pro/news/switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept.txt", "jsonld": "https://wpnews.pro/news/switching-our-llm-as-judge-from-5-class-to-binary-in-ci-the-patterns-we-kept.jsonld"}}