# A Better LLM Judge? The Rubric Made My Small Model Worse > Source: > Published: 2026-06-29 08:07:48+00:00 In [Part 2](https://dev.to/sumanpro/llm-as-a-judge-i-built-one-from-scratch-then-checked-it-against-humans-4p4k) I built the laziest possible LLM judge — a tiny model (`Qwen2.5-1.5B` ) and a one-line rubric — and it agreed with human votes only ~43% of the time, crammed every score into a 7–8 band, and tied a third of the comparisons humans had no trouble separating. Two things were wrong with that judge, and people usually fix only one: I fixed each independently and measured the effect. The result wasn't the tidy "write a better rubric, it's free" story I expected — it was more interesting than that. A genuinely large judge doesn't fit a free Kaggle GPU, and fighting transformers versions / OOM / sharding is exactly the yak-shaving real teams skip by calling a hosted endpoint. So the big judge runs on **OpenRouter** — one OpenAI-compatible API across many models, so swapping the judge is a one-line `BIG_ID` change. The small baseline still runs locally (no reason to spend API calls on a 1.5B model). Two things keep the calls cheap and short: cap the output (`max_tokens=160` ) and turn reasoning off (these models reason by default, which bloats output). Plus a small retry on the occasional 429: ``` BIG_ID = 'deepseek/deepseek-v4-pro' # one-line swap; also ran qwen/qwen3-32b def big_judge(question, answer, rubric, max_tokens=160, retries=4): kw = dict(model=BIG_ID, messages=build_messages(question, answer, rubric), temperature=0, max_tokens=max_tokens) for attempt in range(retries): try: try: # disable reasoning (OpenRouter-specific); fall back if rejected resp = or_client.chat.completions.create( extra_body={'reasoning': {'enabled': False}}, **kw) except Exception as inner: if 'reasoning' in str(inner).lower(): resp = or_client.chat.completions.create(**kw) else: raise return parse_score(resp.choices[0].message.content or ''), None except Exception as e: if ('rate' in str(e).lower() or '429' in str(e)) and attempt < retries - 1: time.sleep(2 * (attempt + 1)); continue return float('nan'), None ``` Since the API calls are network-bound, the 2x2 runner fans them out across a thread pool (`ThreadPoolExecutor` ), so each big-judge condition finishes in a fraction of the sequential time. (Lesson learned the hard way on an earlier provider: with `max_tokens=512` and no reasoning cap, a reasoning model spent ~4.5K tokens *thinking* per call and blew straight through that provider's rate limit. Capping output is the biggest lever.) The naive rubric is what most people write and stop at: ``` NAIVE_RUBRIC = ( 'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. ' 'Respond EXACTLY as:\nSCORE: ' ) ``` The good rubric names explicit criteria, **anchors the scale** (what a 2/5/8/10 mean), and demands reasoning before the score: ``` GOOD_RUBRIC = ( 'You are an expert evaluator. Judge the answer on CORRECTNESS, COMPLETENESS, and ' 'INSTRUCTION-FOLLOWING. Use the FULL 1-10 scale, anchored:\n' ' 1-2 = wrong/irrelevant. 3-4 = major errors. 5-6 = partial.\n' ' 7-8 = correct, minor issues. 9-10 = fully correct and on-task.\n' 'A confident, fluent answer that is factually WRONG must score 1-2, not high. ' 'First one sentence of reasoning, then:\nREASON: \nSCORE: ' ) ``` Same human-voted Chatbot Arena pairs as Part 2 (N=30), same independent single-answer scoring. The only things that change are model and rubric. To make sure the effect wasn't a quirk of one model, I ran the big judge **twice** — `deepseek/deepseek-v4-pro` and `qwen/qwen3-32b` — via OpenRouter. The small baseline is the same local `Qwen2.5-1.5B` in both. **Big judge = DeepSeek:** | Condition | Agreement (decisive) | Agreement (overall) | Ties | Scale | |---|---|---|---|---| | small + naive | 67% | 47% | 9/30 | 2–10 | small + good rubric | 54% ⬇ | 43% | 6/30 | 1–10 | | big + naive | 65% | 37% | 10/30 | 1–10 | big + good rubric | 79% ⬆ | 50% | 7/30 | 1–10 | **Big judge = Qwen 32B (same pattern, milder):** | Condition | Agreement (decisive) | Ties | |---|---|---| | small + naive | 67% | 9/30 | | small + good rubric | 54% ⬇ | 6/30 | | big + naive | 70% | 7/30 | | big + good rubric | 71% ⬆ | 4/30 | Read the rubric column carefully, on both. The good rubric **hurt the small model** (67%→54% — same on both runs) but **helped the big one** (DeepSeek: 65%→79%, a +14pt jump; Qwen: 70%→71% but with far fewer ties). The detailed, multi-criteria instructions that sharpened a capable model just *confused* the 1.5B. One more thing the DeepSeek run exposes: `big + naive` landed at 65% decisive / 37% overall — **no better than the small model**, and its worst tie count. A bigger, pricier judge with a lazy rubric bought nothing. The leap to 79% only came when the big model *and* a real rubric were used together. I expected "a better rubric is the cheap win." The data said something more useful: **a good rubric is an instruction, and the model has to be capable enough to follow it.** `big+naive` actually landed at 67%/65% — flat — with its worst tie count).So the two fixes aren't independent levers you can add up. Hand a precise rubric to a weak model and you can make your eval *worse* than doing nothing; pay for a big model and skip the rubric and you've bought nothing. The best judge was the combination — big model **and** real rubric (DeepSeek hit 79%) — but the instructive results are the two traps on either side of it. An LLM judge is an instrument: the model is the sensor, the rubric is the calibration. A precise calibration on a cheap sensor can read worse than no calibration at all. Specify both, and always check against human labels — because intuition (mine included) gets this wrong. Three episodes, one thread: **a metric is only as honest as the conditions you measured it under.** Evaluation isn't a box you tick once and quote forever — it's an instrument you specify, calibrate, and keep checking against ground truth, because the convenient number will always flatter you. Thanks for following along. 📓 **Full runnable notebook on Kaggle:** [[https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric](https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric)] *Built with Hugging Face Transformers (small judge, local) + OpenRouter (big judges: deepseek-v4-pro and qwen3-32b). Data: LMSYS Chatbot Arena. Questions or corrections welcome in the comments.*