{"slug": "show-hn-i-made-a-small-helper-for-checking-model-graded-answers", "title": "Show HN: I made a small helper for checking model-graded answers", "summary": "A PhD student released CMG, an open-source tool that audits LLM-based judges by requiring them to back each verdict with explicit claims tied to evidence, flagging untrustworthy decisions without using a second model. The tool addresses known biases in model-based grading such as position bias and rubric neglect, allowing researchers to identify cases needing human review in large evaluation runs.", "body_md": "I built CMG out of a practical need. As a PhD student studying how to evaluate AI systems, I sometimes use model-based graders in my experiments, which means relying on a language model as a judge. The problem is that those judges gave me too little control and clarity over their decisions. You cannot tell whether the judge actually checked your criteria or simply ignored the evidence you gave it. CMG tries to close that gap by making the judge back up each verdict with explicit claims and tying every claim to the evidence behind it. A set of plain checks then flags the cases where the verdict does not hold up, without putting a second model in the loop. It will not tell you who is right, but it will tell you which verdicts you can \"trust\" and which ones a person should read.\n\nLLM judges are useful, but they are not neutral. Researchers keep finding the same failure modes.\n\n- Zheng et al. report position bias, verbosity bias, self-enhancement bias, and limited reasoning.\n- Li et al. show scoring bias from rubric order, score IDs, and reference answer scoring.\n- Feng et al. show that explicit rubrics and criteria can help judge consistency, but do not solve it.\n- Wang et al. show weak evidence verification in research-agent judging.\n- Chen et al. 
show reliability gaps for long-form outputs, even when rubrics or references are present.\n\nCMG does not pretend to fix these biases, but it does make them easy to spot. You tell the judge what to check by passing the task, the answer, an optional reference, the rubric, and the criteria, and CMG saves all of that as evidence for the judge to make claims against. Each verdict then has to rest on real claims, and each claim has to point back to a piece of that evidence, so when the judge cuts a corner the viewer flags it, whether that is missing evidence, an ignored reference, a rubric item nobody checked, a bad verdict, or an unsafe verdict change.\n\nFor now the local viewer is the dashboard.\n\n```\ncmg-view cmg-runs/*.cmg.jsonl --flagged-only\n```\n\nA web dashboard can read the same report data later.\n\nUse CMG when you run an LLM judge and cannot just trust what it says.\n\n- **Large eval runs.** You score thousands of cases and cannot read every explanation by hand, so CMG flags the ones that need a human and lets you skip the rest.\n- **Reference checks.** You want to catch a verdict that never cited the gold answer (`reference_ignored`).\n- **Rubric coverage.** You need every criterion checked, not quietly skipped (`rubric_coverage_gap`).\n- **Audit and debugging.** You want a replayable trail for each decision, so you can explain a score or work out why scores drift between runs.\n- **Multi-turn judging.** You need to catch a verdict that flipped without a proper retraction (`verdict_flip_without_invalidation`).\n\nCMG will not tell you whether the judge is right, because that call still belongs to a person. 
What it does check is whether the judge backed its verdict, covered your rubric, and stayed consistent, and it points you at the cases where it did not.\n\n```\npip install claim-memory-graph\n```\n\nOptional provider helpers:\n\n```\npip install 'claim-memory-graph[openai]'\npip install 'claim-memory-graph[anthropic]'\n```\n\nThe distribution is named `claim-memory-graph`, but you import it as `cmg`. The core package has no runtime dependencies.\n\nStart with the local demo. It needs no API key.\n\n```\npython examples/local_judge_demo.py\ncmg-view cmg-runs/*.cmg.jsonl --summary\ncmg-view cmg-runs/*.cmg.jsonl --show-evidence\ncmg-view cmg-runs/*.cmg.jsonl --flagged-only\n```\n\nThe `--summary` view gives you the whole run at a glance.\n\nOnce that runs, wire CMG into your own judge. You keep the main task and the rubric. CMG only adds the audit layer.\n\n``` python\nimport asyncio\nfrom pathlib import Path\n\nfrom cmg import ClaimGraph, JsonlStorage, arun_judge, judge_report\n\nasync def judge_fn(messages):\n    # Placeholder: replace with a call to your own judge model.\n    return await call_your_judge_model(messages)\n\nasync def main():\n    # `async with` is only valid inside a coroutine, so the run lives in main().\n    async with ClaimGraph(JsonlStorage(Path(\"cmg-runs/case-1.cmg.jsonl\"))) as graph:\n        result = await arun_judge(\n            graph,\n            judge_fn,\n            prompt=\"Question shown to the candidate model.\",\n            candidate_output=\"Candidate model answer.\",\n            reference_answer=\"Optional gold answer.\",\n            rubric=\"How the judge should decide.\",\n            criteria=(\"Correctness\", \"Completeness\"),\n            verdicts=(\"pass\", \"fail\"),\n        )\n\n        report = judge_report(graph)\n\n    if result.decision is None:\n        print(\"The judge returned a missing or invalid verdict.\")\n    else:\n        print(result.decision.content)\n\n    print(report[\"human_review_flags\"])\n\nasyncio.run(main())\n```\n\nThe judge's visible answer has to start with a verdict line.\n\n```\nVERDICT: pass\n```\n\nIt should also add a hidden CMG block with its claims.\n\n```cmg\n{\"ops\": [{\"op\": \"commitment\", \"content\": \"The answer matches the 
reference.\", \"refs\": [\"s-...\"]}]}\n```\n\nCMG records the final `Decision` itself, so if the model sends a `decision` op, `arun_judge` ignores it. And if the model returns `maybe` when only `pass` and `fail` are allowed, CMG records no decision and the report marks the case for human review.\n\n`judge_report(graph)` returns these fields.\n\n- `verdict`\n- `claims`\n- `criteria`\n- `judge_responses`\n- `verdict_errors`\n- `retracted`\n- `human_review_flags`\n- `violations`\n\nFlags come in two kinds. Hard flags are real failures in the audit. Soft flags are milder and mark things worth reviewing. Here are the ones you will use most.\n\n| Flag | Meaning |\n|---|---|\n| `missing_verdict` | The judge did not return a valid verdict line. |\n| `invalid_verdict` | The verdict was not in the allowed list. |\n| `uncited_verdict` | A verdict has no active cited claims. |\n| `no_supported_claims` | No active claim has valid evidence. |\n| `criterion_citation_gap` | A criterion was discussed or may be covered, but no active claim cited that exact criterion id. |\n| `rubric_coverage_gap` | A criterion does not appear to be covered by any active claim text. |\n| `reference_ignored` | A reference answer exists, but no active claim cites it. |\n| `verdict_flip_without_invalidation` | A verdict changed without retracting old claims first. |\n| `silent_commitment_drop` | A later decision dropped an active claim without a retraction. |\n\nCMG does not replace your eval framework. It sits inside it. Keep using the framework for datasets, model calls, scores, and totals. Let CMG hold the per-case audit log. 
Each example below is a small adapter you can drop into one common setup.\n\n- **DeepEval.** Wrap `arun_judge` in a custom metric. `examples/deepeval_metric.py` subclasses `BaseMetric`, so each `measure` call writes a per-case `.cmg.jsonl`, turns the verdict into a score, and puts the CMG path and review flags in the metric's `reason`.\n- **Inspect AI.** Register a `@scorer` that runs the judge. `examples/inspect_ai_scorer.py` returns an Inspect `Score` and keeps the CMG graph path, review flags, and claims in the score metadata, so the audit data rides along with every sample.\n- **OpenAI, or any provider.** For a judge with no framework around it, `examples/openai_judge_demo.py` passes `make_openai_llm_fn(...)` straight in as the `judge_fn`. CMG does not care which provider sits behind it.\n\nUse a fresh output file for each case run. Do not append many runs of the same case to one JSONL file.\n\n| Topic | Link |\n|---|---|\n| User guide | |\n\n- [docs/dev-guide.md](/MatteoLeonesi/claim-memory-graph-sdk/blob/main/docs/dev-guide.md)\n- [docs/release.md](/MatteoLeonesi/claim-memory-graph-sdk/blob/main/docs/release.md)\n\n*These docs, this README included, were drafted with AI and reviewed by hand.*\n\n- [Zheng et al., Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)\n- [Li et al., Evaluating Scoring Bias in LLM-as-a-Judge](https://arxiv.org/abs/2506.22316)\n- [Feng et al., Are We on the Right Way to Assessing LLM-as-a-Judge?](https://arxiv.org/abs/2512.16041)\n- [Wang et al., Time to REFLECT: Can We Trust LLM Judges for Evidence-based Research Agents?](https://arxiv.org/abs/2605.19196)\n- [Chen et al., Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation](https://arxiv.org/abs/2606.01629)\n\nApache-2.0.", "url": "https://wpnews.pro/news/show-hn-i-made-a-small-helper-for-checking-model-graded-answers", "canonical_source": "https://github.com/MatteoLeonesi/claim-memory-graph-sdk", "published_at": "2026-06-14 16:26:43+00:00", 
"updated_at": "2026-06-14 16:42:29.734505+00:00", "lang": "en", "topics": ["ai-safety", "ai-research", "ai-tools", "large-language-models", "ai-ethics"], "entities": ["CMG", "Zheng", "Li", "Feng", "Wang", "Chen", "OpenAI", "Anthropic"], "alternates": {"html": "https://wpnews.pro/news/show-hn-i-made-a-small-helper-for-checking-model-graded-answers", "markdown": "https://wpnews.pro/news/show-hn-i-made-a-small-helper-for-checking-model-graded-answers.md", "text": "https://wpnews.pro/news/show-hn-i-made-a-small-helper-for-checking-model-graded-answers.txt", "jsonld": "https://wpnews.pro/news/show-hn-i-made-a-small-helper-for-checking-model-graded-answers.jsonld"}}