[ARTICLE · art-43138] src=dev.to topic=large-language-models verified=true sentiment=neutral

LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

A developer built an LLM-as-a-judge from scratch using Qwen2.5-1.5B-Instruct and tested it against the LMSYS Chatbot Arena dataset with human votes. The judge scored answers independently and agreed with human winners only 43% of the time, revealing instability and a narrow scoring range that masked noise. The developer recommends evaluating evaluators by grading with repeats and reporting score spreads.

4 min read · published Jun 29, 2026

In Part 1 the model's job was to pick one of 77 labels, so I could check it with ==. But most real LLM output isn't like that: it's a paragraph, a summary, a support reply. There's no label to compare against.

So people reach for the obvious move: use an LLM to grade the LLM. Show it a question and an answer, ask "how good is this, 1–10?", trust the number. It works shockingly well... right up until it doesn't, in ways that don't show up unless you go looking.

I built that judge from scratch and checked it against a dataset that comes with real human votes: the LMSYS Chatbot Arena conversations (via the ungated mirror agie-ai/lmsys-chatbot_arena_conversations, so this runs cold on Kaggle). Each row is a real user prompt, two chatbot answers, and a human verdict on which was better.

import re

JUDGE_RUBRIC = (
    'You are grading the quality of an answer to a question. '
    'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. '
    'Respond in EXACTLY this format:\nSCORE: <number>\nREASON: <one short sentence>'
)

def judge(question, answer, temperature=0.0):
    # generate() is the notebook's text-generation helper (not shown in this excerpt).
    prompt = f'{JUDGE_RUBRIC}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}\n\nYour grade:'
    reply = generate(prompt, max_new_tokens=64, temperature=temperature)
    # Pull the first number after "SCORE:"; fall back to NaN if the format was ignored.
    m = re.search(r'SCORE:\s*([0-9]+(?:\.[0-9]+)?)', reply)
    return (float(m.group(1)) if m else float('nan')), reply

That's it: Qwen2.5-1.5B-Instruct reading one answer and emitting a number. The rest of the notebook is about not trusting it blindly. Note the rubric is deliberately naive ("correctness and helpfulness, 1–10"). It's the lazy version most people actually write, which is the point.
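Before trusting any score, it helps to see what the parsing step does with replies that ignore the format. Here is a minimal sketch that reuses the same regex as judge(); the replies are made up for illustration (they are not output from the notebook), and parse_score and is_valid are hypothetical helper names.

```python
import math
import re

# Same pattern the judge uses to pull a number out of the reply.
SCORE_RE = re.compile(r'SCORE:\s*([0-9]+(?:\.[0-9]+)?)')

def parse_score(reply):
    """Return the first number after 'SCORE:', or NaN if there is none."""
    m = SCORE_RE.search(reply)
    return float(m.group(1)) if m else float('nan')

def is_valid(score, lo=1, hi=10):
    """True only for a parsed score inside the rubric's stated range."""
    return not math.isnan(score) and lo <= score <= hi

# Made-up replies, for illustration only.
examples = {
    'well_formed': 'SCORE: 8\nREASON: Accurate and concise.',
    'decimal': 'SCORE: 7.5\nREASON: Mostly right.',
    'fraction': 'SCORE: 8/10\nREASON: Good.',
    'no_score': 'I would rate this answer highly.',
    'out_of_range': 'SCORE: 15\nREASON: Amazing.',
}
parsed = {name: parse_score(text) for name, text in examples.items()}
valid = {name: is_valid(score) for name, score in parsed.items()}
```

The regex happily returns 15.0 for an out-of-range reply and NaN for a reply with no score line, so counting invalid scores separately is probably worth doing before computing any averages.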

I had the judge score one unchanged answer eight times at a realistic temperature:

scores = [judge(sample_q, sample_a, temperature=0.7)[0] for _ in range(8)]

Two problems. First, the score isn't stable: same answer, different numbers. Second, and worse: a "1–10" judge that only ever emits 7 or 8 isn't really using a 10-point scale. It has almost no resolution to separate "good" from "great." So when you A/B-test two prompts and one scores 7.6 vs 7.9, that gap is noise dressed up as a decimal.
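Both problems can be put into numbers from the repeated scores. The following is a rough sketch, not the notebook's code: summarize_scores is a hypothetical helper, and the eight scores are invented to mimic a judge that only emits 7 or 8.

```python
import statistics

def summarize_scores(scores, scale=(1, 10)):
    """Summarize repeated judge scores for one answer, ignoring NaN entries."""
    valid = [s for s in scores if s == s]  # NaN is the only value not equal to itself
    if not valid:
        raise ValueError('no parsable scores to summarize')
    lo, hi = scale
    return {
        'n': len(valid),
        'mean': statistics.mean(valid),
        # Sample standard deviation needs at least two points.
        'stdev': statistics.stdev(valid) if len(valid) > 1 else 0.0,
        'distinct': sorted(set(valid)),
        # Fraction of the rubric's scale the judge actually covered.
        'range_used': (max(valid) - min(valid)) / (hi - lo),
    }

# Invented scores for illustration only, not results from the notebook.
illustrative = [7.0, 8.0, 7.0, 8.0, 8.0, 7.0, 7.0, 8.0]
summary = summarize_scores(illustrative)
```

With these invented numbers the standard deviation is about 0.53, so a 0.3-point gap between two prompts sits inside one run's wobble, and the judge covers only about a ninth of its nominal scale.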

For each pair, I scored answer A and answer B independently (the judge never sees both at once, so this avoids position bias entirely), took the higher score as the judge's pick, and compared to the human winner:

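for p in pairs[:60]:
    # Score each answer in its own call; the judge never sees the other one.
    s_a, _ = judge(p['question'], p['ans_a'])
    s_b, _ = judge(p['question'], p['ans_b'])
    # Caveat: a NaN score (unparsed reply) fails both comparisons and falls through to 'model_b'.
    judge_pick = 'tie' if s_a == s_b else ('model_a' if s_a > s_b else 'model_b')
    ...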

Read those two numbers together:

Count the ties as misses and the judge lined up with human judgment on just 26/60 = 43% of all pairs.
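A sketch of the tally behind a figure like that, run on toy records rather than the real 60 pairs. agreement_report is a hypothetical helper, and how human-side ties are labeled in the dataset is not shown in this post, so here every judge tie simply counts as a miss in the strict rate.

```python
from collections import Counter

def agreement_report(records):
    """records: iterable of (judge_pick, human_winner) string pairs."""
    tally = Counter()
    for judge_pick, human_winner in records:
        if judge_pick == 'tie':
            tally['tie'] += 1          # judge could not separate the answers
        elif judge_pick == human_winner:
            tally['agree'] += 1
        else:
            tally['disagree'] += 1
    total = sum(tally.values())
    decided = tally['agree'] + tally['disagree']
    return {
        'total': total,
        'ties': tally['tie'],
        'agree': tally['agree'],
        # Ties counted as misses: agreement over all pairs.
        'strict_rate': tally['agree'] / total if total else float('nan'),
        # Ties excluded: agreement only where the judge made a pick.
        'decided_rate': tally['agree'] / decided if decided else float('nan'),
    }

# Toy records for illustration only.
toy = [
    ('model_a', 'model_a'),
    ('model_b', 'model_a'),
    ('tie', 'model_b'),
    ('model_b', 'model_b'),
    ('tie', 'model_a'),
]
report = agreement_report(toy)
```

Reporting both rates keeps the tie count visible: a judge that ties often can look fine on decided pairs while still being useless on the full set.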

The disagreement cases tell you why it fails. My favorite:

Q : When is it today?
  judge scores -> answer_a: 3, answer_b: 10  => judge picked model_b
  but HUMANS preferred: model_a

The model has no idea what day it is, so a confident date is the wrong answer. The human caught that. The judge gave the confident-wrong answer a 10 and the honest hedge a 3. It wasn't grading correctness; it was grading confidence. (Other receipts showed the same thing: 1-vs-2 and 8-vs-7 "decisions" that were really just noise around a tie.)

This notebook scored nothing new about a model. It audited the judge (the thing handing out the scores) and found two failures hiding behind a clean-looking number: it disagreed with itself run to run, and it agreed with people only 43% of the time.

The fix isn't "don't use judges." It's evaluate your evaluator: grade with repeats and report the spread, not a single number, and calibrate against human labels before you trust the judge on data nobody has labeled.
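"Report the spread" applies to the agreement number too. As a rough sketch, here is a Wilson score interval for 26 agreements out of 60 pairs. This is my addition rather than something the notebook computes, wilson_interval is a hypothetical helper, and it assumes the pairs are independent draws.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError('n must be positive')
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# 26 agreements out of 60 pairs, the figure quoted in the text.
low, high = wilson_interval(26, 60)
print(f'agreement = {26 / 60:.1%}, approx. 95% interval = [{low:.1%}, {high:.1%}]')
```

Under those assumptions the interval runs from roughly 32% to 56%, which is a reminder that 60 pairs pin the judge's agreement down only loosely.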

Part 3 covers two obvious ways to fix a bad judge: a bigger model and a better rubric. I run all four combinations against the same human votes and measure how much each lever actually moves the needle. (The cheaper fix does more of the work than you'd expect.)

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/ep02-llm-as-a-judge

Built with PyTorch + Hugging Face Transformers. Data: LMSYS Chatbot Arena (ungated mirror). Questions or corrections welcome in the comments.
