LLM-as-a-Judge: I Built One From Scratch, Then Checked It Against Humans

A developer built an LLM-as-a-judge from scratch using Qwen2.5-1.5B-Instruct and tested it against the LMSYS Chatbot Arena dataset, which includes human votes. The judge scored each answer independently and agreed with the human-chosen winner only 43% of the time, revealing instability and a narrow scoring range that masked the noise. The developer recommends evaluating the evaluator: grade with repeats and report the spread of scores.

In Part 1 (https://dev.to/sumanpro/96-accuracy-was-a-lie-building-an-llm-eval-harness-from-scratch-idi) the model's job was to pick one of 77 labels, so I could check it with ==. But most real LLM output isn't like that: it's a paragraph, a summary, a support reply. There's no label to compare against.

So people reach for the obvious move: use an LLM to grade the LLM. Show it a question and an answer, ask "how good is this, 1–10?", and trust the number. It works shockingly well... right up until it doesn't, in ways that don't show up unless you go looking.

I built that judge from scratch and checked it against a dataset that comes with real human votes: the LMSYS Chatbot Arena conversations, loaded via the ungated mirror agie-ai/lmsys-chatbot_arena_conversations so this runs cold on Kaggle. Each row is a real user prompt, two chatbot answers, and a human verdict on which was better.

JUDGE_RUBRIC = 'You are grading the quality of an answer to a question. ' 'Score from 1 terrible to 10 excellent based on correctness and helpfulness. ' 'Respond in EXACTLY this format:\nSCORE: