In Part 2 I built the laziest possible LLM judge — a tiny model (Qwen2.5-1.5B
) and a one-line rubric — and it agreed with human votes only ~43% of the time, crammed every score into a 7–8 band, and tied a third of the comparisons humans had no trouble separating.
Two things were wrong with that judge, and people usually fix only one:
I fixed each independently and measured the effect. The result wasn't the tidy "write a better rubric, it's free" story I expected — it was more interesting than that.
A genuinely large judge doesn't fit a free Kaggle GPU, and fighting transformers versions / OOM / sharding is exactly the yak-shaving real teams skip by calling a hosted endpoint. So the big judge runs on OpenRouter — one OpenAI-compatible API across many models, so swapping the judge is a one-line BIG_ID
change. The small baseline still runs locally (no reason to spend API calls on a 1.5B model).
Two things keep the calls cheap and short: cap the output (max_tokens=160
) and turn reasoning off (these models reason by default, which bloats output). Plus a small retry on the occasional 429:
BIG_ID = 'deepseek/deepseek-v4-pro' # one-line swap; also ran qwen/qwen3-32b
def big_judge(question, answer, rubric, max_tokens=160, retries=4):
kw = dict(model=BIG_ID, messages=build_messages(question, answer, rubric),
temperature=0, max_tokens=max_tokens)
for attempt in range(retries):
try:
try: # disable reasoning (OpenRouter-specific); fall back if rejected
resp = or_client.chat.completions.create(
extra_body={'reasoning': {'enabled': False}}, **kw)
except Exception as inner:
if 'reasoning' in str(inner).lower():
resp = or_client.chat.completions.create(**kw)
else:
raise
return parse_score(resp.choices[0].message.content or ''), None
except Exception as e:
if ('rate' in str(e).lower() or '429' in str(e)) and attempt < retries - 1:
time.sleep(2 * (attempt + 1)); continue
return float('nan'), None
Since the API calls are network-bound, the 2x2 runner fans them out across a thread pool (ThreadPoolExecutor
), so each big-judge condition finishes in a fraction of the sequential time. (Lesson learned the hard way on an earlier provider: with max_tokens=512
and no reasoning cap, a reasoning model spent ~4.5K tokens thinking per call and blew straight through that provider's rate limit. Capping output is the biggest lever.)
The naive rubric is what most people write and stop at:
NAIVE_RUBRIC = (
'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. '
'Respond EXACTLY as:\nSCORE: <number>'
)
The good rubric names explicit criteria, anchors the scale (what a 2/5/8/10 mean), and demands reasoning before the score:
GOOD_RUBRIC = (
'You are an expert evaluator. Judge the answer on CORRECTNESS, COMPLETENESS, and '
'INSTRUCTION-FOLLOWING. Use the FULL 1-10 scale, anchored:\n'
' 1-2 = wrong/irrelevant. 3-4 = major errors. 5-6 = partial.\n'
' 7-8 = correct, minor issues. 9-10 = fully correct and on-task.\n'
'A confident, fluent answer that is factually WRONG must score 1-2, not high. '
'First one sentence of reasoning, then:\nREASON: <one sentence>\nSCORE: <number>'
)
Same human-voted Chatbot Arena pairs as Part 2 (N=30), same independent single-answer scoring. The only things that change are model and rubric. To make sure the effect wasn't a quirk of one model, I ran the big judge twice — deepseek/deepseek-v4-pro
and qwen/qwen3-32b
— via OpenRouter. The small baseline is the same local Qwen2.5-1.5B
in both.
Big judge = DeepSeek:
| Condition | Agreement (decisive) | Agreement (overall) | Ties | Scale |
|---|---|---|---|---|
| small + naive | 67% | 47% | 9/30 | 2–10 |
| small + good rubric | ||||
| 54% ⬇ | ||||
| 43% | 6/30 | 1–10 | ||
| big + naive | 65% | 37% | 10/30 | 1–10 |
| big + good rubric | ||||
| 79% ⬆ | ||||
| 50% | 7/30 | 1–10 |
Big judge = Qwen 32B (same pattern, milder):
| Condition | Agreement (decisive) | Ties |
|---|---|---|
| small + naive | 67% | 9/30 |
| small + good rubric | 54% ⬇ | 6/30 |
| big + naive | 70% | 7/30 |
| big + good rubric | 71% ⬆ | 4/30 |
Read the rubric column carefully, on both. The good rubric hurt the small model (67%→54% — same on both runs) but helped the big one (DeepSeek: 65%→79%, a +14pt jump; Qwen: 70%→71% but with far fewer ties). The detailed, multi-criteria instructions that sharpened a capable model just confused the 1.5B.
One more thing the DeepSeek run exposes: big + naive
landed at 65% decisive / 37% overall — no better than the small model, and its worst tie count. A bigger, pricier judge with a lazy rubric bought nothing. The leap to 79% only came when the big model and a real rubric were used together.
I expected "a better rubric is the cheap win." The data said something more useful: a good rubric is an instruction, and the model has to be capable enough to follow it.
big+naive
actually landed at 67%/65% — flat — with its worst tie count).So the two fixes aren't independent levers you can add up. Hand a precise rubric to a weak model and you can make your eval worse than doing nothing; pay for a big model and skip the rubric and you've bought nothing. The best judge was the combination — big model and real rubric (DeepSeek hit 79%) — but the instructive results are the two traps on either side of it.
An LLM judge is an instrument: the model is the sensor, the rubric is the calibration. A precise calibration on a cheap sensor can read worse than no calibration at all. Specify both, and always check against human labels — because intuition (mine included) gets this wrong.
Three episodes, one thread: a metric is only as honest as the conditions you measured it under.
Evaluation isn't a box you tick once and quote forever — it's an instrument you specify, calibrate, and keep checking against ground truth, because the convenient number will always flatter you. Thanks for following along.
📓 Full runnable notebook on Kaggle: [https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric]
Built with Hugging Face Transformers (small judge, local) + OpenRouter (big judges: deepseek-v4-pro and qwen3-32b). Data: LMSYS Chatbot Arena. Questions or corrections welcome in the comments.