00:00
2026-06-30
dev.to
large-language-models
The AI judge that called a half-finished audit 'exhaustive'
An engineer building a benchmark for AI coding agents discovered that an LLM judge incorrectly scored a half-finished audit as 'exhaustive' because it lacked a reference answer. The judge evaluated th…