A Better LLM Judge? The Rubric Made My Small Model Worse

A developer found that improving the rubric for a small LLM judge (Qwen2.5-1.5B) did not increase its agreement with human votes, which remained around 43%. However, swapping to a larger model (DeepSeek or Qwen3-32B) via OpenRouter boosted agreement to over 70%, showing that model size matters more than rubric quality for LLM-as-judge tasks.

In Part 2 (https://dev.to/sumanpro/llm-as-a-judge-i-built-one-from-scratch-then-checked-it-against-humans-4p4k) I built the laziest possible LLM judge: a tiny model (Qwen2.5-1.5B) with a one-line rubric. It agreed with human votes only ~43% of the time, crammed every score into a 7–8 band, and tied a third of the comparisons humans had no trouble separating.

Two things were wrong with that judge, the rubric and the model, and people usually fix only one. I fixed each independently and measured the effect. The result wasn't the tidy "write a better rubric, it's free" story I expected; it was more interesting than that.

A genuinely large judge doesn't fit a free Kaggle GPU, and fighting `transformers` versions / OOM / sharding is exactly the yak-shaving real teams skip by calling a hosted endpoint. So the big judge runs on OpenRouter: one OpenAI-compatible API across many models, so swapping the judge is a one-line `BIG_ID` change. The small baseline still runs locally (no reason to spend API calls on a 1.5B model).

Two things keep the calls cheap and short: cap the output (`max_tokens=160`) and turn reasoning off (these models reason by default, which bloats output).
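To make the setup concrete, here is a minimal sketch of how a client and the per-call arguments could be put together. It assumes the official `openai` Python SDK pointed at OpenRouter's OpenAI-compatible base URL. The helper names `make_openrouter_client` and `judge_request_kwargs`, and the `OPENROUTER_API_KEY` environment variable, are illustrative choices for this sketch, not names from the original notebook.

```python
import os

# OpenRouter exposes an OpenAI-compatible API under this base URL.
OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"


def make_openrouter_client(api_key=None):
    """Build an OpenAI-SDK client that talks to OpenRouter.

    Hypothetical helper: requires `pip install openai`. The SDK is imported
    lazily so the rest of this sketch runs without it installed.
    """
    from openai import OpenAI

    return OpenAI(
        base_url=OPENROUTER_BASE_URL,
        # Assumed environment variable name for the API key.
        api_key=api_key or os.environ["OPENROUTER_API_KEY"],
    )


def judge_request_kwargs(model_id, messages, max_tokens=160):
    """Keyword arguments for one short, deterministic judge call.

    The two cost levers from the text: a hard cap on output tokens and
    reasoning switched off (an OpenRouter-specific field sent via extra_body).
    """
    return dict(
        model=model_id,
        messages=messages,
        temperature=0,
        max_tokens=max_tokens,
        extra_body={"reasoning": {"enabled": False}},
    )
```

With a client in hand, a call would look like `client.chat.completions.create(**judge_request_kwargs('qwen/qwen3-32b', messages))`. Switching judges then means passing a different model ID string and nothing else.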
The judge call itself adds a small retry on the occasional 429:

```python
import time

# build_messages, parse_score and or_client are defined elsewhere.
BIG_ID = 'deepseek/deepseek-v4-pro'  # one-line swap; also ran qwen/qwen3-32b

def big_judge(question, answer, rubric, max_tokens=160, retries=4):
    kw = dict(model=BIG_ID,
              messages=build_messages(question, answer, rubric),
              temperature=0, max_tokens=max_tokens)
    for attempt in range(retries):
        try:
            try:
                # Disable reasoning (OpenRouter-specific); fall back if rejected.
                resp = or_client.chat.completions.create(
                    extra_body={'reasoning': {'enabled': False}}, **kw)
            except Exception as inner:
                if 'reasoning' in str(inner).lower():
                    resp = or_client.chat.completions.create(**kw)
                else:
                    raise
            return parse_score(resp.choices[0].message.content or ''), None
        except Exception as e:
            # Retry only on rate limits, and only while attempts remain.
            is_rate_limit = 'rate' in str(e).lower() or '429' in str(e)
            if is_rate_limit and attempt < retries - 1:
                time.sleep(2 ** (attempt + 1))  # back off before retrying
                continue
            return float('nan'), None
```

Since the API calls are network-bound, the 2x2 runner fans them out across a thread pool (`ThreadPoolExecutor`), so each big-judge condition finishes in a fraction of the sequential time.

Lesson learned the hard way on an earlier provider: with `max_tokens=512` and no reasoning cap, a reasoning model spent ~4.5K tokens thinking per call and blew straight through that provider's rate limit. Capping output is the biggest lever.

The naive rubric is what most people write and stop at:

NAIVE_RUBRIC = ('Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. ' 'Respond EXACTLY as:\nSCORE: