A Better LLM Judge? The Rubric Made My Small Model Worse

A developer found that improving the rubric for a small LLM judge (Qwen2.5-1.5B) did not increase its agreement with human votes, which remained around 43%. However, swapping to a larger model (DeepSeek or Qwen3-32B) via OpenRouter significantly boosted agreement to over 70%, showing that model size matters more than rubric quality for LLM-as-judge tasks.

In Part 2 https://dev.to/sumanpro/llm-as-a-judge-i-built-one-from-scratch-then-checked-it-against-humans-4p4k I built the laziest possible LLM judge — a tiny model Qwen2.5-1.5B and a one-line rubric — and it agreed with human votes only ~43% of the time, crammed every score into a 7–8 band, and tied a third of the comparisons humans had no trouble separating. Two things were wrong with that judge, and people usually fix only one: I fixed each independently and measured the effect. The result wasn't the tidy "write a better rubric, it's free" story I expected — it was more interesting than that. A genuinely large judge doesn't fit a free Kaggle GPU, and fighting transformers versions / OOM / sharding is exactly the yak-shaving real teams skip by calling a hosted endpoint. So the big judge runs on OpenRouter — one OpenAI-compatible API across many models, so swapping the judge is a one-line BIG ID change. The small baseline still runs locally no reason to spend API calls on a 1.5B model . Two things keep the calls cheap and short: cap the output max tokens=160 and turn reasoning off these models reason by default, which bloats output . Plus a small retry on the occasional 429: BIG ID = 'deepseek/deepseek-v4-pro' one-line swap; also ran qwen/qwen3-32b def big judge question, answer, rubric, max tokens=160, retries=4 : kw = dict model=BIG ID, messages=build messages question, answer, rubric , temperature=0, max tokens=max tokens for attempt in range retries : try: try: disable reasoning OpenRouter-specific ; fall back if rejected resp = or client.chat.completions.create extra body={'reasoning': {'enabled': False}}, kw except Exception as inner: if 'reasoning' in str inner .lower : resp = or client.chat.completions.create kw else: raise return parse score resp.choices 0 .message.content or '' , None except Exception as e: if 'rate' in str e .lower or '429' in str e and attempt < retries - 1: time.sleep 2 attempt + 1 ; continue return float 'nan' , None Since the API calls are network-bound, the 2x2 runner fans them out across a thread pool ThreadPoolExecutor , so each big-judge condition finishes in a fraction of the sequential time. Lesson learned the hard way on an earlier provider: with max tokens=512 and no reasoning cap, a reasoning model spent ~4.5K tokens thinking per call and blew straight through that provider's rate limit. Capping output is the biggest lever. The naive rubric is what most people write and stop at: NAIVE RUBRIC = 'Score from 1 terrible to 10 excellent based on correctness and helpfulness. ' 'Respond EXACTLY as:\nSCORE: <number ' The good rubric names explicit criteria, anchors the scale what a 2/5/8/10 mean , and demands reasoning before the score: GOOD RUBRIC = 'You are an expert evaluator. Judge the answer on CORRECTNESS, COMPLETENESS, and ' 'INSTRUCTION-FOLLOWING. Use the FULL 1-10 scale, anchored:\n' ' 1-2 = wrong/irrelevant. 3-4 = major errors. 5-6 = partial.\n' ' 7-8 = correct, minor issues. 9-10 = fully correct and on-task.\n' 'A confident, fluent answer that is factually WRONG must score 1-2, not high. ' 'First one sentence of reasoning, then:\nREASON: <one sentence \nSCORE: <number ' Same human-voted Chatbot Arena pairs as Part 2 N=30 , same independent single-answer scoring. The only things that change are model and rubric. To make sure the effect wasn't a quirk of one model, I ran the big judge twice — deepseek/deepseek-v4-pro and qwen/qwen3-32b — via OpenRouter. The small baseline is the same local Qwen2.5-1.5B in both. Big judge = DeepSeek: | Condition | Agreement decisive | Agreement overall | Ties | Scale | |---|---|---|---|---| | small + naive | 67% | 47% | 9/30 | 2–10 | small + good rubric | 54% ⬇ | 43% | 6/30 | 1–10 | | big + naive | 65% | 37% | 10/30 | 1–10 | big + good rubric | 79% ⬆ | 50% | 7/30 | 1–10 | Big judge = Qwen 32B same pattern, milder : | Condition | Agreement decisive | Ties | |---|---|---| | small + naive | 67% | 9/30 | | small + good rubric | 54% ⬇ | 6/30 | | big + naive | 70% | 7/30 | | big + good rubric | 71% ⬆ | 4/30 | Read the rubric column carefully, on both. The good rubric hurt the small model 67%→54% — same on both runs but helped the big one DeepSeek: 65%→79%, a +14pt jump; Qwen: 70%→71% but with far fewer ties . The detailed, multi-criteria instructions that sharpened a capable model just confused the 1.5B. One more thing the DeepSeek run exposes: big + naive landed at 65% decisive / 37% overall — no better than the small model , and its worst tie count. A bigger, pricier judge with a lazy rubric bought nothing. The leap to 79% only came when the big model and a real rubric were used together. I expected "a better rubric is the cheap win." The data said something more useful: a good rubric is an instruction, and the model has to be capable enough to follow it. big+naive actually landed at 67%/65% — flat — with its worst tie count .So the two fixes aren't independent levers you can add up. Hand a precise rubric to a weak model and you can make your eval worse than doing nothing; pay for a big model and skip the rubric and you've bought nothing. The best judge was the combination — big model and real rubric DeepSeek hit 79% — but the instructive results are the two traps on either side of it. An LLM judge is an instrument: the model is the sensor, the rubric is the calibration. A precise calibration on a cheap sensor can read worse than no calibration at all. Specify both, and always check against human labels — because intuition mine included gets this wrong. Three episodes, one thread: a metric is only as honest as the conditions you measured it under. Evaluation isn't a box you tick once and quote forever — it's an instrument you specify, calibrate, and keep checking against ground truth, because the convenient number will always flatter you. Thanks for following along. 📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric Built with Hugging Face Transformers small judge, local + OpenRouter big judges: deepseek-v4-pro and qwen3-32b . Data: LMSYS Chatbot Arena. Questions or corrections welcome in the comments.