# A Better LLM Judge? The Rubric Made My Small Model Worse

> Source: <https://dev.to/sumanpro/a-better-llm-judge-the-rubric-made-my-small-model-worse-311f>
> Published: 2026-06-29 08:07:48+00:00

In [Part 2](https://dev.to/sumanpro/llm-as-a-judge-i-built-one-from-scratch-then-checked-it-against-humans-4p4k) I built the laziest possible LLM judge — a tiny model (`Qwen2.5-1.5B`

) and a one-line rubric — and it agreed with human votes only ~43% of the time, crammed every score into a 7–8 band, and tied a third of the comparisons humans had no trouble separating.

Two things were wrong with that judge, and people usually fix only one:

I fixed each independently and measured the effect. The result wasn't the tidy "write a better rubric, it's free" story I expected — it was more interesting than that.

A genuinely large judge doesn't fit a free Kaggle GPU, and fighting transformers versions / OOM / sharding is exactly the yak-shaving real teams skip by calling a hosted endpoint. So the big judge runs on **OpenRouter** — one OpenAI-compatible API across many models, so swapping the judge is a one-line `BIG_ID`

change. The small baseline still runs locally (no reason to spend API calls on a 1.5B model).

Two things keep the calls cheap and short: cap the output (`max_tokens=160`

) and turn reasoning off (these models reason by default, which bloats output). Plus a small retry on the occasional 429:

```
BIG_ID = 'deepseek/deepseek-v4-pro'   # one-line swap; also ran qwen/qwen3-32b

def big_judge(question, answer, rubric, max_tokens=160, retries=4):
    kw = dict(model=BIG_ID, messages=build_messages(question, answer, rubric),
              temperature=0, max_tokens=max_tokens)
    for attempt in range(retries):
        try:
            try:   # disable reasoning (OpenRouter-specific); fall back if rejected
                resp = or_client.chat.completions.create(
                    extra_body={'reasoning': {'enabled': False}}, **kw)
            except Exception as inner:
                if 'reasoning' in str(inner).lower():
                    resp = or_client.chat.completions.create(**kw)
                else:
                    raise
            return parse_score(resp.choices[0].message.content or ''), None
        except Exception as e:
            if ('rate' in str(e).lower() or '429' in str(e)) and attempt < retries - 1:
                time.sleep(2 * (attempt + 1)); continue
            return float('nan'), None
```

Since the API calls are network-bound, the 2x2 runner fans them out across a thread pool (`ThreadPoolExecutor`

), so each big-judge condition finishes in a fraction of the sequential time. (Lesson learned the hard way on an earlier provider: with `max_tokens=512`

and no reasoning cap, a reasoning model spent ~4.5K tokens *thinking* per call and blew straight through that provider's rate limit. Capping output is the biggest lever.)

The naive rubric is what most people write and stop at:

```
NAIVE_RUBRIC = (
    'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. '
    'Respond EXACTLY as:\nSCORE: <number>'
)
```

The good rubric names explicit criteria, **anchors the scale** (what a 2/5/8/10 mean), and demands reasoning before the score:

```
GOOD_RUBRIC = (
    'You are an expert evaluator. Judge the answer on CORRECTNESS, COMPLETENESS, and '
    'INSTRUCTION-FOLLOWING. Use the FULL 1-10 scale, anchored:\n'
    '  1-2 = wrong/irrelevant.  3-4 = major errors.  5-6 = partial.\n'
    '  7-8 = correct, minor issues.  9-10 = fully correct and on-task.\n'
    'A confident, fluent answer that is factually WRONG must score 1-2, not high. '
    'First one sentence of reasoning, then:\nREASON: <one sentence>\nSCORE: <number>'
)
```

Same human-voted Chatbot Arena pairs as Part 2 (N=30), same independent single-answer scoring. The only things that change are model and rubric. To make sure the effect wasn't a quirk of one model, I ran the big judge **twice** — `deepseek/deepseek-v4-pro`

and `qwen/qwen3-32b`

— via OpenRouter. The small baseline is the same local `Qwen2.5-1.5B`

in both.

**Big judge = DeepSeek:**

| Condition | Agreement (decisive) | Agreement (overall) | Ties | Scale |
|---|---|---|---|---|
| small + naive | 67% | 47% | 9/30 | 2–10 |
small + good rubric
|
54% ⬇ |
43% | 6/30 | 1–10 |
| big + naive | 65% | 37% | 10/30 | 1–10 |
big + good rubric
|
79% ⬆ |
50% | 7/30 | 1–10 |

**Big judge = Qwen 32B (same pattern, milder):**

| Condition | Agreement (decisive) | Ties |
|---|---|---|
| small + naive | 67% | 9/30 |
| small + good rubric | 54% ⬇ | 6/30 |
| big + naive | 70% | 7/30 |
| big + good rubric | 71% ⬆ | 4/30 |

Read the rubric column carefully, on both. The good rubric **hurt the small model** (67%→54% — same on both runs) but **helped the big one** (DeepSeek: 65%→79%, a +14pt jump; Qwen: 70%→71% but with far fewer ties). The detailed, multi-criteria instructions that sharpened a capable model just *confused* the 1.5B.

One more thing the DeepSeek run exposes: `big + naive`

landed at 65% decisive / 37% overall — **no better than the small model**, and its worst tie count. A bigger, pricier judge with a lazy rubric bought nothing. The leap to 79% only came when the big model *and* a real rubric were used together.

I expected "a better rubric is the cheap win." The data said something more useful: **a good rubric is an instruction, and the model has to be capable enough to follow it.**

`big+naive`

actually landed at 67%/65% — flat — with its worst tie count).So the two fixes aren't independent levers you can add up. Hand a precise rubric to a weak model and you can make your eval *worse* than doing nothing; pay for a big model and skip the rubric and you've bought nothing. The best judge was the combination — big model **and** real rubric (DeepSeek hit 79%) — but the instructive results are the two traps on either side of it.

An LLM judge is an instrument: the model is the sensor, the rubric is the calibration. A precise calibration on a cheap sensor can read worse than no calibration at all. Specify both, and always check against human labels — because intuition (mine included) gets this wrong.

Three episodes, one thread: **a metric is only as honest as the conditions you measured it under.**

Evaluation isn't a box you tick once and quote forever — it's an instrument you specify, calibrate, and keep checking against ground truth, because the convenient number will always flatter you. Thanks for following along.

📓 **Full runnable notebook on Kaggle:** [[https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric](https://www.kaggle.com/code/sumannath88/ep03-better-judge-model-and-rubric)]

*Built with Hugging Face Transformers (small judge, local) + OpenRouter (big judges: deepseek-v4-pro and qwen3-32b). Data: LMSYS Chatbot Arena. Questions or corrections welcome in the comments.*
