# LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

> Source: <https://dev.to/sumanpro/llm-as-a-judge-i-built-one-from-scratch-then-checked-it-against-humans-4p4k>
> Published: 2026-06-29 08:05:50+00:00

In [Part 1](https://dev.to/sumanpro/96-accuracy-was-a-lie-building-an-llm-eval-harness-from-scratch-idi) the model's job was to pick one of 77 labels, so I could check it with `==`. But most real LLM output isn't like that: it's a paragraph, a summary, a support reply. There's no label to compare against.

So people reach for the obvious move: **use an LLM to grade the LLM.** Show it a question and an answer, ask "how good is this, 1–10?", trust the number. It works shockingly well... right up until it doesn't, in ways that don't show up unless you go looking.

I built that judge from scratch and checked it against a dataset that comes with **real human votes**: the LMSYS Chatbot Arena conversations (via the ungated mirror `agie-ai/lmsys-chatbot_arena_conversations`, so this runs cold on Kaggle). Each row is a real user prompt, two chatbot answers, and a human verdict on which was better.

```
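import re

# generate(prompt, max_new_tokens=..., temperature=...) is the notebook's own
# text-generation helper; it is not defined in this snippet.
JUDGE_RUBRIC = (
    'You are grading the quality of an answer to a question. '
    'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. '
    'Respond in EXACTLY this format:\nSCORE: <number>\nREASON: <one short sentence>'
)

def judge(question, answer, temperature=0.0):
    """Return (score, raw_reply); score is NaN when no SCORE line is found."""
    prompt = f'{JUDGE_RUBRIC}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}\n\nYour grade:'
    reply = generate(prompt, max_new_tokens=64, temperature=temperature)
    # Take the first number after 'SCORE:' (integer or decimal).
    m = re.search(r'SCORE:\s*([0-9]+(?:\.[0-9]+)?)', reply)
    return (float(m.group(1)) if m else float('nan')), reply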
```

That's it: `Qwen2.5-1.5B-Instruct` reading one answer and emitting a number. The rest of the notebook is about *not trusting it blindly*. Note that the rubric is deliberately naive ("correctness and helpfulness, 1–10"). It's the lazy version most people actually write, which is the point.
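To see what the score regex does when a reply ignores the requested format, here is a minimal sketch that isolates the parsing step. `parse_score` is a hypothetical helper name, the sample replies are made up for illustration, and rejecting numbers outside 1 to 10 is my assumption rather than something the notebook's `judge` does.

```python
import math
import re

SCORE_RE = re.compile(r'SCORE:\s*([0-9]+(?:\.[0-9]+)?)')

def parse_score(reply, lo=1.0, hi=10.0):
    """Extract the first 'SCORE: <number>' from a judge reply.

    Returns NaN when there is no match or the number falls outside [lo, hi].
    The range check is an added assumption, not part of the original judge().
    """
    m = SCORE_RE.search(reply)
    if m is None:
        return float('nan')
    score = float(m.group(1))
    return score if lo <= score <= hi else float('nan')

# Made-up replies, for illustration only.
samples = [
    'SCORE: 8\nREASON: Correct and clear.',
    'I would give this a 7 out of 10.',   # ignores the format
    'SCORE: 85\nREASON: Very good.',      # wrong scale
    'SCORE: 7.5\nREASON: Mostly right.',
]
parsed = [parse_score(s) for s in samples]
n_unparsed = sum(math.isnan(p) for p in parsed)
print(parsed, 'unparsed:', n_unparsed)
```

Counting the NaNs separately is probably worth it: an unparseable reply is a different failure from a low score, and a small model may produce a fair number of them.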

I had the judge score one unchanged answer eight times at a realistic temperature:

```
scores = [judge(sample_q, sample_a, temperature=0.7)[0] for _ in range(8)]
# [8.0, 7.0, 8.0, 7.0, 8.0, 7.0, 8.0, 8.0]
# range: 7-8 | stdev: 0.48
```

Two problems. First, the score isn't stable — same answer, different numbers. Second, and worse: a "1–10" judge that only ever emits 7 or 8 isn't really using a 10-point scale. It has almost no resolution to separate "good" from "great." So when you A/B-test two prompts and one scores 7.6 vs 7.9, that gap is noise dressed up as a decimal.
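A quick way to put numbers on both problems, using only the eight scores listed above. This is a sketch with the standard library; the population standard deviation appears to be what the 0.48 above reports, and the standard-error lines are my addition, assuming the eight runs are independent.

```python
import statistics

# The eight scores from the repeated run above.
scores = [8.0, 7.0, 8.0, 7.0, 8.0, 7.0, 8.0, 8.0]

mean = statistics.mean(scores)
spread = statistics.pstdev(scores)   # population stdev; rounds to 0.48
distinct = sorted(set(scores))       # how much of the 1-10 scale gets used

# Standard error of an 8-run mean, and of the difference between two such means.
sem = statistics.stdev(scores) / len(scores) ** 0.5
se_diff = (2 ** 0.5) * sem

print(f'mean={mean:.3f} stdev={spread:.2f} distinct={distinct}')
print(f'sem={sem:.2f} se_diff={se_diff:.2f}')
```

Under those assumptions, a gap of about 0.3 between two 8-run averages is roughly one standard error of the difference, which is not enough to call a winner.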

For each pair, I scored answer A and answer B **independently** (the judge never sees both at once — this avoids position bias entirely), took the higher score as the judge's pick, and compared to the human winner:

```
for p in pairs[:60]:
    s_a, _ = judge(p['question'], p['ans_a'])
    s_b, _ = judge(p['question'], p['ans_b'])
    judge_pick = 'tie' if s_a == s_b else ('model_a' if s_a > s_b else 'model_b')
    ...
# Pairs scored: 60 (judge gave equal scores on 20 of them)
# On the 40 it scored decisively, it agreed with the HUMAN winner: 26/40 = 65%
```

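Read those two numbers together: the judge tied on 20 of the 60 pairs, so a third of the time it made no call at all, and on the 40 pairs where it did pick a winner it matched the human 65% of the time. Count the ties as misses and the judge lined up with human judgment on just **26/60 = 43%** of all pairs.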
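The three rates come from the same three counts, so it helps to compute them side by side. A minimal sketch; `agreement_report` is a hypothetical helper name, and the inputs are the counts reported above.

```python
def agreement_report(n_pairs, n_ties, n_agree):
    """Summarize judge-vs-human agreement from raw counts."""
    decided = n_pairs - n_ties
    return {
        'tie_rate': n_ties / n_pairs,            # judge made no call
        'agree_on_decided': n_agree / decided,   # ties excluded
        'agree_overall': n_agree / n_pairs,      # ties counted as misses
    }

report = agreement_report(n_pairs=60, n_ties=20, n_agree=26)
for name, value in report.items():
    print(f'{name}: {value:.0%}')
# tie_rate: 33%
# agree_on_decided: 65%
# agree_overall: 43%
```

Assuming every pair has a human winner, random picks would agree about half the time on decided pairs, so 65% on 40 pairs is better than a coin flip but not by a wide margin.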

The disagreement cases tell you *why* it fails. My favorite:

```
Q : When is it today?
  judge scores -> answer_a: 3, answer_b: 10  => judge picked model_b
  but HUMANS preferred: model_a
```

The model has no idea what day it is, so a *confident* date is the **wrong** answer. The human caught that. The judge gave the confident-wrong answer a 10 and the honest hedge a 3. It wasn't grading correctness — it was grading confidence. (Other receipts showed the same thing: 1-vs-2 and 8-vs-7 "decisions" that were really just noise around a tie.)
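One way to stop 1-vs-2 and 8-vs-7 from counting as decisions is to require a minimum gap before calling a winner. The sketch below is my assumption, not what the notebook does: `pick_with_margin` is a hypothetical name, and the margin of 1.0 is chosen only because the repeated run earlier wobbled by one point.

```python
import math

def pick_with_margin(s_a, s_b, margin=1.0):
    """Call a winner only when the score gap exceeds `margin`."""
    # NaN means the judge reply could not be parsed; keep that visible.
    if math.isnan(s_a) or math.isnan(s_b):
        return 'unparsed'
    if abs(s_a - s_b) <= margin:
        return 'tie'
    return 'model_a' if s_a > s_b else 'model_b'

print(pick_with_margin(1, 2))              # tie
print(pick_with_margin(8, 7))              # tie
print(pick_with_margin(3, 10))             # model_b
print(pick_with_margin(float('nan'), 6))   # unparsed
```

The explicit NaN branch matters because every comparison against NaN is false in Python, so a plain `==` then `>` chain would silently hand an unparsed pair to `model_b`. Note that a margin only filters noise: the "When is it today?" case still comes out as a confident `model_b`, because that failure is bias, not jitter.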

This notebook scored nothing new about a *model*. It audited the **judge** — the thing handing out the scores — and found two failures hiding behind a clean-looking number: it disagreed with itself run to run, and it agreed with people only 43% of the time.

The fix isn't "don't use judges." It's **evaluate your evaluator**: grade with repeats and report the spread, not a single number, and calibrate against human labels before you trust the judge on data nobody has labeled.
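"Grade with repeats and report the spread" can be sketched as a thin wrapper around any judge function. Everything here is an assumption for illustration: `judge_with_repeats` is a hypothetical name, `n=5` is arbitrary, and `stub_judge` is a stand-in that imitates the 7-or-8 behavior seen earlier so the example runs without a model.

```python
import math
import random
import statistics

def judge_with_repeats(judge_fn, question, answer, n=5, temperature=0.7):
    """Score one answer n times; return (mean, stdev, raw_scores)."""
    scores = [judge_fn(question, answer, temperature=temperature)[0]
              for _ in range(n)]
    valid = [s for s in scores if not math.isnan(s)]  # drop unparsed replies
    if len(valid) < 2:
        return float('nan'), float('nan'), scores
    return statistics.mean(valid), statistics.stdev(valid), scores

# Stand-in for the real judge(): same signature, returns (score, reply).
_rng = random.Random(0)

def stub_judge(question, answer, temperature=0.0):
    score = _rng.choice([7.0, 8.0])
    return score, f'SCORE: {score:.0f}\nREASON: stub'

mean, spread, raw = judge_with_repeats(stub_judge, 'q', 'a')
print(f'{mean:.2f} +/- {spread:.2f} from {raw}')
```

Reporting the score as mean plus or minus spread makes it obvious when two answers overlap, and the same wrapper can feed the human-agreement check before the judge is used on unlabeled data.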

[Part 3](https://dev.toLINK_TO_PART_3): two obvious ways to fix a bad judge — a bigger model and a better rubric. I run all four combinations against the same human votes and measure how much each lever actually moves the needle. (The cheaper fix does more of the work than you'd expect.)

📓 **Full runnable notebook on Kaggle:** <https://www.kaggle.com/code/sumannath88/ep02-llm-as-a-judge>

*Built with PyTorch + Hugging Face Transformers. Data: LMSYS Chatbot Arena (ungated mirror). Questions or corrections welcome in the comments.*
