If you're building anything with an LLM judge in the loop, this is the failure mode that will get you, and you won't see it happen. I didn't, until I went looking for the opposite. Here's the story, in the order it happened.
I wanted to measure something specific: can an AI coding agent navigate a real codebase? Not just read one file, but answer "what depends on this model, everywhere, before I change it?" That's the question a maintainer answers in their head before a risky refactor, and it's the one an agent tends to get confidently wrong.
So I built a benchmark. Pick the hub model of a real app, the Inbox in Chatwoot, and ask the agent to find every dependent before a teardown change. Run it two ways: a plain agent that greps and infers, and the same agent handed a structural map of the codebase it can query. Same model, same prompt, same pinned commit. Measure the difference.
The catch with any benchmark like this is the grading. You can't eyeball a hundred audits. So, like everyone does now, I reached for an AI judge. Hand the model the agent's audit, ask it to score how complete and correct the analysis is. Cheap, fast, scales.
That decision is the one that nearly sank the whole thing.
First real scenario, Chatwoot. Two arms ran. The judge scored them.
It called them a tie.
That stopped me, because I'd watched both transcripts. The plain agent had found a couple of the scattered dependents and stopped. The mapped agent had walked the whole set. They were not the same audit. One was clearly more complete than the other. The judge couldn't tell.
I pulled up what the judge actually wrote about the plain run. The word it used was "exhaustive."
The audit it called exhaustive was 44% complete.
The mechanic is worth sitting with, because it generalizes to every eval you'll ever build.
The judge was grading each answer on its own merits. It read the plain agent's audit and found it well-written, internally consistent, confident, structured like a thorough piece of work. Every dependent it listed was real. Nothing was fabricated. By every signal available inside the text, it looked like a complete job.
The judge had no idea what a complete job contained. It was never told. It had nothing to compare against, so "looks thorough" was the only axis it could score on, and the plain agent's audit looked extremely thorough. It was a beautifully written half.
That's the trap under every AI-grades-AI setup. The judge inherits the same blind spot as the thing it's grading. Both are reading the same text. Neither knows what isn't there. A confident, fluent, half-finished answer doesn't read as half-finished. It reads as done. And "reads as done" is the exact signal a judge with no reference latches onto.
This is the same failure I was building the whole benchmark to expose in coding agents, and it had crept into my own scoring. The bench was about to flatter itself, and I'd have published the tie with a straight face.
The fix wasn't a better judge model. A smarter model with no reference makes the same mistake more eloquently.
The fix was a reference.
Before any run, I'd already built an answer key by hand: every real dependent of the Inbox, pinned to an exact file:line, assembled from the source. The scattered ones a grep sweep slides past, eleven of them, kept hidden from the agents. I had the list of right answers the whole time. I just wasn't giving it to the judge.
So I changed the judge's job. Stop assessing whether this reads as complete. Take this verified list of what an honest audit must contain, and check each item off. Found it, pinned to the right line, or didn't.
The tie evaporated on the spot. The mapped agent covered everything. The plain one covered half. Same audits, same judge model. The only thing that changed was that the judge now knew what the answer was supposed to be.
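A minimal sketch of that change in the judge's job, not the benchmark's actual judge. It assumes the key is a set of file:line strings and that each audit has already been reduced to the citations it makes. The paths, line numbers, and the `score_against_key` name are all made up for illustration.

```python
def score_against_key(answer_key: set[str], cited: set[str]) -> dict:
    """Check each key item off: cited at the exact file:line, or missed."""
    found = answer_key & cited
    return {
        "found": sorted(found),
        "missed": sorted(answer_key - cited),
        "coverage": len(found) / len(answer_key),
    }


# Hypothetical four-item answer key (illustrative paths, not real Chatwoot lines).
ANSWER_KEY = {
    "app/models/widget.rb:42",
    "app/jobs/cleanup_job.rb:17",
    "app/services/sync.rb:10",
    "lib/tasks/purge.rake:8",
}

# Citations pulled from two hypothetical audits of the same model.
plain_audit = {"app/models/widget.rb:42", "app/jobs/cleanup_job.rb:17"}
mapped_audit = set(ANSWER_KEY)

print(score_against_key(ANSWER_KEY, plain_audit)["coverage"])   # 0.5
print(score_against_key(ANSWER_KEY, mapped_audit)["coverage"])  # 1.0
```

Both audits here cite only real dependents, so a judge reading the prose alone has nothing to separate them. The `missed` list is what makes the difference visible.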
The reference-blind judge was retired for the entire series after that run. Every score I've published since is reference-aware, graded against a fixed key, with no credit for a vague gesture at the right neighborhood. The citation lands on the line or it doesn't count.
If you take one thing from this: An LLM judge with no reference doesn't measure correctness. It measures confidence and fluency, and then it rewards them.
That's fine if confidence is what you want to measure. It's a disaster if you think you're measuring whether the answer is right. The two come apart exactly when the answer is wrong but well-written, which is the case you most need to catch.
Concretely, the things that made my scoring trustworthy after this:
→ A hand-built answer key, made before the runs, so the bench can't quietly redefine "correct" to match what the tool produced.
→ The judge grades against that key, not against the vibe of the prose.
→ Every claim checked to a real file:line. No partial credit for "mentions the right area."
→ The key stays hidden from the agents and only the judge sees it, so it discriminates instead of leaking.
None of that is exotic. It's just the difference between an eval that can embarrass you and one that can't.
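The "no partial credit" rule from the list above can be sketched in a few lines. This is an illustration under assumptions, not the benchmark's grader: it assumes citations appear in the audit prose as `path.ext:line`, and the regex, function names, and sample paths are hypothetical.

```python
import re

# Matches a path-like token followed by a line number, e.g. "app/models/widget.rb:42".
CITATION = re.compile(r"\b([\w./-]+\.\w+):(\d+)\b")


def extract_citations(audit_text: str) -> set[str]:
    """Return every file:line citation that appears in the audit prose."""
    return {f"{path}:{line}" for path, line in CITATION.findall(audit_text)}


def credited(audit_text: str, answer_key: set[str]) -> set[str]:
    """Only exact file:line hits count; naming the right file is not enough."""
    return extract_citations(audit_text) & answer_key


# Hypothetical key and audit text.
key = {
    "app/models/widget.rb:42",
    "app/jobs/cleanup_job.rb:17",
    "app/services/sync.rb:10",
}
audit = (
    "Depends on it: app/models/widget.rb:42 and somewhere in "
    "app/jobs/cleanup_job.rb. Also app/services/sync.rb:9."
)

print(credited(audit, key))  # {'app/models/widget.rb:42'}
```

The audit names all three files, but only one citation lands on the line. The file mentioned without a line and the off-by-one citation both score zero, which is the behavior the rule asks for.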
The reason I keep telling this story isn't the benchmark. It's that the same blindness shows up wherever you let a model judge work without a ground truth, and that's a lot of places now: AI grading support tickets, AI reviewing AI-written code, AI scoring its own RAG answers. In every one of them, "the output reads as good" is doing more of the work than anyone admits, and a fluent wrong answer scores like a correct one.
It's also, not by accident, the exact thing the benchmark went on to measure in coding agents. A plain agent's audit reads as finished and is missing half. A structural map is what gives the agent, and the judge, a ground truth to check against instead of a story to believe. The map computes what depends on what. It doesn't infer it and hope. That's the whole reason it's worth wiring into your agent and not just waiting for a smarter model: a smarter model is a more convincing storyteller, and convincing is precisely the problem.
The cleanest way to feel this is to reproduce the half I was almost fooled by.
Pick a model in your own codebase you'd be nervous to refactor. Ask your agent cold, "before I change how this model is torn down, find every place that depends on it." Read the audit it gives you. It'll read as complete. That's the trap.
Then give it a real reference to check against.
→ curl -fsSL https://luuuc.github.io/sense/install.sh | sh
→ sense scan in the repo you know cold
→ sense setup to connect your agent
Ask again, and diff the two. The gap between them is the part the first audit was confidently silent about, the same gap my judge scored as a tie. Counting it yourself is more convincing than any benchmark I could show you.
The benchmark, the answer keys, the judge prompts, every transcript.
Disclosure: I build the map in that experiment. All of it is open, including the judge that embarrassed me, so you can check the work instead of trusting me.
PS. I still have the log of the judge writing "exhaustive" under a 44%-complete audit. It's the most useful wrong answer I've ever been handed, and it's the reason every score I've published since has an answer key sitting behind it.