{"slug": "a-better-llm-judge-the-rubric-made-my-small-model-worse", "title": "A Better LLM Judge? The Rubric Made My Small Model Worse", "summary": "A developer found that giving a small LLM judge (Qwen2.5-1.5B) a more detailed rubric lowered its agreement with human votes on decisive comparisons (67% to 54%), and that swapping to a larger model (DeepSeek or Qwen3-32B) via OpenRouter while keeping the naive rubric bought little. Only the larger model combined with the detailed rubric improved agreement (up to 79% with DeepSeek), suggesting that model and rubric have to be specified together for LLM-as-judge tasks.", "body_md": "In [Part 2](https://dev.to/sumanpro/llm-as-a-judge-i-built-one-from-scratch-then-checked-it-against-humans-4p4k) I built the laziest possible LLM judge, a tiny model (`Qwen2.5-1.5B`) and a one-line rubric, and it agreed with human votes only ~43% of the time, crammed every score into a 7–8 band, and tied a third of the comparisons humans had no trouble separating.\n\nTwo things were wrong with that judge, the model and the rubric, and people usually fix only one.\n\nI fixed each independently and measured the effect. The result wasn't the tidy \"write a better rubric, it's free\" story I expected; it was more interesting than that.\n\nA genuinely large judge doesn't fit a free Kaggle GPU, and fighting transformers versions / OOM / sharding is exactly the yak-shaving real teams skip by calling a hosted endpoint. So the big judge runs on **OpenRouter**: one OpenAI-compatible API across many models, so swapping the judge is a one-line `BIG_ID` change. The small baseline still runs locally (no reason to spend API calls on a 1.5B model).\n\nTwo things keep the calls cheap and short: cap the output (`max_tokens=160`) and turn reasoning off (these models reason by default, which bloats output). 
Plus a small retry on the occasional 429:\n\n```\nimport time\n\nBIG_ID = 'deepseek/deepseek-v4-pro'   # one-line swap; also ran qwen/qwen3-32b\n\ndef big_judge(question, answer, rubric, max_tokens=160, retries=4):\n    # or_client, build_messages and parse_score are defined elsewhere (not shown here)\n    kw = dict(model=BIG_ID, messages=build_messages(question, answer, rubric),\n              temperature=0, max_tokens=max_tokens)\n    for attempt in range(retries):\n        try:\n            try:   # disable reasoning (OpenRouter-specific); fall back if rejected\n                resp = or_client.chat.completions.create(\n                    extra_body={'reasoning': {'enabled': False}}, **kw)\n            except Exception as inner:\n                if 'reasoning' in str(inner).lower():\n                    resp = or_client.chat.completions.create(**kw)\n                else:\n                    raise\n            return parse_score(resp.choices[0].message.content or ''), None\n        except Exception as e:\n            # back off and retry on rate limits; any other error yields a NaN score\n            if ('rate' in str(e).lower() or '429' in str(e)) and attempt < retries - 1:\n                time.sleep(2 * (attempt + 1))\n                continue\n            return float('nan'), None\n```\n\nSince the API calls are network-bound, the 2x2 runner fans them out across a thread pool (`ThreadPoolExecutor`), so each big-judge condition finishes in a fraction of the sequential time. (Lesson learned the hard way on an earlier provider: with `max_tokens=512` and no reasoning cap, a reasoning model spent ~4.5K tokens *thinking* per call and blew straight through that provider's rate limit. Capping output is the biggest lever.)\n\nThe naive rubric is what most people write and stop at:\n\n```\nNAIVE_RUBRIC = (\n    'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. '\n    'Respond EXACTLY as:\\nSCORE: <number>'\n)\n```\n\nThe good rubric names explicit criteria, **anchors the scale** (what a 2, 5, 8, or 10 means), and demands reasoning before the score:\n\n```\nGOOD_RUBRIC = (\n    'You are an expert evaluator. 
Judge the answer on CORRECTNESS, COMPLETENESS, and '\n    'INSTRUCTION-FOLLOWING. Use the FULL 1-10 scale, anchored:\\n'\n    '  1-2 = wrong/irrelevant.  3-4 = major errors.  5-6 = partial.\\n'\n    '  7-8 = correct, minor issues.  9-10 = fully correct and on-task.\\n'\n    'A confident, fluent answer that is factually WRONG must score 1-2, not high. '\n    'First one sentence of reasoning, then:\\nREASON: <one sentence>\\nSCORE: <number>'\n)\n```\n\nSame human-voted Chatbot Arena pairs as Part 2 (N=30), same independent single-answer scoring. The only things that change are model and rubric. To make sure the effect wasn't a quirk of one model, I ran the big judge **twice**, with `deepseek/deepseek-v4-pro` and `qwen/qwen3-32b`, via OpenRouter. The small baseline is the same local `Qwen2.5-1.5B` in both.\n\n**Big judge = DeepSeek:**\n\n| Condition | Agreement (decisive) | Agreement (overall) | Ties | Scale |\n|---|---|---|---|---|\n| small + naive | 67% | 47% | 9/30 | 2–10 |\n| small + good rubric | 54% ⬇ | 43% | 6/30 | 1–10 |\n| big + naive | 65% | 37% | 10/30 | 1–10 |\n| big + good rubric | 79% ⬆ | 50% | 7/30 | 1–10 |\n\n**Big judge = Qwen 32B (same pattern, milder):**\n\n| Condition | Agreement (decisive) | Ties |\n|---|---|---|\n| small + naive | 67% | 9/30 |\n| small + good rubric | 54% ⬇ | 6/30 |\n| big + naive | 70% | 7/30 |\n| big + good rubric | 71% ⬆ | 4/30 |\n\nCompare the naive and good-rubric rows in both tables. The good rubric **hurt the small model** (67%→54%, identical in both tables because the small baseline is shared) but **helped the big one** (DeepSeek: 65%→79%, a +14pt jump; Qwen: 70%→71%, with ties dropping from 7 to 4). The detailed, multi-criteria instructions that sharpened a capable model just *confused* the 1.5B.\n\nOne more thing the DeepSeek run exposes: `big + naive` landed at 65% decisive / 37% overall, **no better than the small model**, and with the highest tie count in the table (10/30). A bigger, pricier judge with a lazy rubric bought nothing. 
The leap to 79% only came when the big model *and* a real rubric were used together.\n\nI expected \"a better rubric is the cheap win.\" The data said something more useful: **a good rubric is an instruction, and the model has to be capable enough to follow it.**\n\n(In the DeepSeek run, `big+naive` landed at 65% decisive against the small model's 67%: flat, and with the worst tie count of any condition.)\n\nSo the two fixes aren't independent levers you can add up. Hand a precise rubric to a weak model and you can make your eval *worse* than doing nothing; pay for a big model and skip the rubric and you've bought nothing. The best judge was the combination, big model **and** real rubric (DeepSeek hit 79%), but the instructive results are the two traps on either side of it.\n\nAn LLM judge is an instrument: the model is the sensor, the rubric is the calibration. A precise calibration on a cheap sensor can read worse than no calibration at all. Specify both, and always check against human labels, because intuition (mine included) gets this wrong.\n\nThree episodes, one thread: **a metric is only as honest as the conditions you measured it under.**\n\nEvaluation isn't a box you tick once and quote forever. It's an instrument you specify, calibrate, and keep checking against ground truth, because the convenient number will always flatter you. Thanks for following along.\n\n📓 **Full runnable notebook on Kaggle:** [https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric](https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric)\n\n*Built with Hugging Face Transformers (small judge, local) + OpenRouter (big judges: deepseek-v4-pro and qwen3-32b). Data: LMSYS Chatbot Arena. 
Questions or corrections welcome in the comments.*", "url": "https://wpnews.pro/news/a-better-llm-judge-the-rubric-made-my-small-model-worse", "canonical_source": "https://dev.to/sumanpro/a-better-llm-judge-the-rubric-made-my-small-model-worse-311f", "published_at": "2026-06-29 08:07:48+00:00", "updated_at": "2026-06-29 08:27:41.970205+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-tools", "developer-tools"], "entities": ["Qwen2.5-1.5B", "DeepSeek", "Qwen3-32B", "OpenRouter", "Chatbot Arena", "Kaggle"], "alternates": {"html": "https://wpnews.pro/news/a-better-llm-judge-the-rubric-made-my-small-model-worse", "markdown": "https://wpnews.pro/news/a-better-llm-judge-the-rubric-made-my-small-model-worse.md", "text": "https://wpnews.pro/news/a-better-llm-judge-the-rubric-made-my-small-model-worse.txt", "jsonld": "https://wpnews.pro/news/a-better-llm-judge-the-rubric-made-my-small-model-worse.jsonld"}}