LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

A developer built an LLM-as-a-judge from scratch using Qwen2.5-1.5B-Instruct and tested it against the LMSYS Chatbot Arena dataset, which includes human votes. The judge scored each answer independently and, with its ties counted as misses, agreed with the human winner on only 43% of pairs (65% of the pairs where it gave different scores). Scoring the same answer repeatedly also showed run-to-run instability and a narrow 7–8 score range that hides noise. The developer recommends evaluating evaluators: grade with repeats, report the score spread, and calibrate against human labels.

In Part 1 (https://dev.to/sumanpro/96-accuracy-was-a-lie-building-an-llm-eval-harness-from-scratch-idi) the model's job was to pick one of 77 labels, so I could check it with ==. But most real LLM output isn't like that: it's a paragraph, a summary, a support reply. There's no label to compare against.

So people reach for the obvious move: use an LLM to grade the LLM. Show it a question and an answer, ask "how good is this, 1–10?", trust the number. It works shockingly well... right up until it doesn't, in ways that don't show up unless you go looking.

I built that judge from scratch and checked it against a dataset that comes with real human votes: the LMSYS Chatbot Arena conversations, via the ungated mirror agie-ai/lmsys-chatbot_arena_conversations, so this runs cold on Kaggle. Each row is a real user prompt, two chatbot answers, and a human verdict for which was better.

    import re

    JUDGE_RUBRIC = (
        'You are grading the quality of an answer to a question. '
        'Score from 1 (terrible) to 10 (excellent) based on correctness and helpfulness. '
        'Respond in EXACTLY this format:\nSCORE: <number>\nREASON: <one short sentence>'
    )

    def judge(question, answer, temperature=0.0):
        prompt = f'{JUDGE_RUBRIC}\n\nQUESTION:\n{question}\n\nANSWER:\n{answer}\n\nYour grade:'
        # generate() is the text-generation helper, defined elsewhere in the notebook
        reply = generate(prompt, max_new_tokens=64, temperature=temperature)
        # Pull the first number after "SCORE:"; NaN if the format was not followed
        m = re.search(r'SCORE:\s*([0-9]+(?:\.[0-9]+)?)', reply)
        return (float(m.group(1)) if m else float('nan')), reply

That's it: Qwen2.5-1.5B-Instruct reading one answer and emitting a number. The rest of the notebook is about not trusting it blindly. Note the rubric is deliberately naive ("correctness and helpfulness, 1–10"). It's the lazy version most people actually write, which is the point.

I had the judge score one unchanged answer eight times at a realistic temperature:

    scores = [judge(sample_q, sample_a, temperature=0.7)[0] for _ in range(8)]

    [8.0, 7.0, 8.0, 7.0, 8.0, 7.0, 8.0, 8.0]
    range: 7-8 | stdev: 0.48

Two problems.
First, the score isn't stable: same answer, different numbers. Second, and worse: a "1–10" judge that only ever emits 7 or 8 isn't really using a 10-point scale. It has almost no resolution to separate "good" from "great." So when you A/B-test two prompts and one scores 7.6 vs 7.9, that gap is noise dressed up as a decimal.

For each pair, I scored answer A and answer B independently (the judge never sees both at once, which avoids position bias entirely), took the higher score as the judge's pick, and compared to the human winner:

    for p in pairs[:60]:
        s_a, _ = judge(p['question'], p['ans_a'])
        s_b, _ = judge(p['question'], p['ans_b'])
        judge_pick = 'tie' if s_a == s_b else ('model_a' if s_a > s_b else 'model_b')
        ...

    Pairs scored: 60 (judge gave equal scores on 20 of them)
    On the 40 it scored decisively, it agreed with the HUMAN winner: 26/40 = 65%

Read those two numbers together: count the ties as misses and the judge lined up with human judgment on just 26/60 = 43% of all pairs.

The disagreement cases tell you why it fails. My favorite:

    Q: When is it today?
    judge scores -> answer_a: 3, answer_b: 10 => judge picked model_b
    but HUMANS preferred: model_a

The model has no idea what day it is, so a confident date is the wrong answer. The human caught that. The judge gave the confident-wrong answer a 10 and the honest hedge a 3. It wasn't grading correctness; it was grading confidence. Other receipts were 1-vs-2 and 8-vs-7 "decisions" that were really just noise around a tie.

This notebook scored nothing new about a model. It audited the judge, the thing handing out the scores, and found two failures hiding behind a clean-looking number: it disagreed with itself run to run, and it agreed with people only 43% of the time (ties counted as misses). The fix isn't "don't use judges." It's evaluate your evaluator: grade with repeats and report the spread, not a single number, and calibrate against human labels before you trust the judge on data nobody has labeled.
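Here is a minimal sketch of what "evaluate your evaluator" could look like in code. It assumes a judge callable shaped like judge() above, meaning it takes a question and an answer and returns a score plus the raw reply. The helper names score_with_spread and agreement are hypothetical, made up for this illustration and not taken from the notebook. The spread uses the population standard deviation, which is what reproduces the 0.48 reported for the eight scores above.

```python
import math
import statistics


def score_with_spread(judge_fn, question, answer, repeats=8):
    """Score one answer `repeats` times and summarise the spread.

    `judge_fn` is assumed to return (score, raw_reply), like judge() above.
    Replies that could not be parsed (NaN scores) are dropped, not averaged.
    """
    scores = [judge_fn(question, answer)[0] for _ in range(repeats)]
    valid = [s for s in scores if not math.isnan(s)]
    if not valid:
        nan = float("nan")
        return {"mean": nan, "stdev": nan, "min": nan, "max": nan, "n_valid": 0}
    return {
        "mean": statistics.fmean(valid),
        "stdev": statistics.pstdev(valid),  # population stdev
        "min": min(valid),
        "max": max(valid),
        "n_valid": len(valid),
    }


def agreement(judge_picks, human_winners):
    """Compare judge picks to human winners, keeping the tie rate visible.

    Both inputs are equal-length sequences of 'model_a', 'model_b' or 'tie'.
    Judge ties count as misses in the overall rate.
    """
    pairs = list(zip(judge_picks, human_winners))
    decisive = [(j, h) for j, h in pairs if j != "tie"]
    hits = sum(j == h for j, h in decisive)
    nan = float("nan")
    return {
        "decisive": len(decisive),
        "agree_on_decisive": hits / len(decisive) if decisive else nan,
        "agree_overall": hits / len(pairs) if pairs else nan,
    }


if __name__ == "__main__":
    # Stub judge that replays the eight scores from the repeat test above.
    replay = iter([8.0, 7.0, 8.0, 7.0, 8.0, 7.0, 8.0, 8.0])
    stats = score_with_spread(lambda q, a: (next(replay), "stub"), "q", "a")
    print(stats)

    # Made-up picks with the same counts as the run above:
    # 26 hits, 14 misses, 20 judge ties.
    picks = ["model_a"] * 26 + ["model_b"] * 14 + ["tie"] * 20
    print(agreement(picks, ["model_a"] * 60))
```

Reporting both agreement numbers is a deliberate choice: a judge can look passable at 65% on the pairs it scored decisively while matching humans on only 43% of all pairs once its ties are counted.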
Part 3: there are two obvious ways to fix a bad judge, a bigger model and a better rubric. I run all four combinations against the same human votes and measure how much each lever actually moves the needle. The cheaper fix does more of the work than you'd expect.

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/ep02-llm-as-a-judge

Built with PyTorch + Hugging Face Transformers. Data: LMSYS Chatbot Arena (ungated mirror). Questions or corrections welcome in the comments.