Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output

A developer argues that hallucination detection in AI agents is an instrumentation problem, not a model-quality issue, and introduces a layered detection approach using AgentLens and agent-eval tools. The technique involves extracting verifiable claims from agent output and checking them against the actual retrieved context, which must be captured during execution. The developer warns that common methods like LLM-as-judge or self-consistency checks fail to catch stable hallucinations or require ground truth that is often discarded.

Every team I talk to says their agent "sometimes hallucinates," and almost none of them can tell me how often. That gap, between knowing it happens and being able to count it, is the whole problem. You cannot fix, gate, or even trend a failure mode you only detect by feel.

Here is the opinion I will defend: hallucination detection is not a model-quality problem; it's an instrumentation problem. The reason you can't measure it is that you threw away the evidence the moment the agent finished running. Detecting an ungrounded claim requires knowing what the agent was allowed to claim, and that lives in the tool outputs and retrieved context, not in the final answer string. If you don't capture those, every hallucination check you write is guessing.

Let me break down what hallucination actually is in an agentic system, why the popular detection methods miss the common case, and how to wire up a number you can put in CI.

The word is overloaded, and the overloading is why detection efforts flail. In a tool-using agent, there are at least three distinct failures people lump together: fabricated grounding (the agent cites a record ID, citation key, or dollar amount that appears nowhere in what it retrieved), parametric leakage (the agent states a fact from the model's parametric memory rather than from the retrieved context, whether or not that fact happens to be right), and unsupported synthesis (every fact is present in the evidence, but the conclusion drawn from them isn't supported by it).

These need different detectors. Lumping them under one "hallucination score" gives you a number nobody trusts, because it conflates a lucky-but-ungrounded answer with an invented customer ID. The first move toward measuring hallucination is refusing to treat it as one metric.

The most common detection approach is to hand the output back to an LLM and ask "is this faithful to the context?" It's appealing because it's one API call. It's also the method most likely to wave through the exact failures you care about.

The self-consistency variant (sample the answer five times, flag disagreement) catches unstable hallucinations but misses stable ones. If the agent reliably leaks the same wrong fact from parametric memory every time, all five samples agree and your detector reports high confidence. The model is reproducibly wrong, and consistency was your signal.
That's not a corner case; it's the most common production hallucination there is.

Model-as-judge faithfulness scoring is genuinely useful, but only for unsupported synthesis, the fuzzy case where you actually need judgment. For the other two, you don't need an LLM at all. You need set membership. And a deterministic check that you can fully explain beats a 0.7-from-a-judge that you can't, every time.

Here's the core technique, and it's almost embarrassingly mechanical: extract the verifiable claims from the output, and check each one against the actual text the agent retrieved. The catch, and the entire reason this is hard in practice, is that "the actual text the agent retrieved" has usually evaporated by the time you want to check.

This is exactly why I treat tracing and evaluation as one workflow rather than two tools. AgentLens captures the execution trace: every tool call with its raw output, the resolved context that actually went into the model, and the final answer. It is the full ground-truth record of what the agent had access to. agent-eval is the other half: it takes that trace plus the output and runs the grounding checks, returning a pass/fail verdict you can gate a build on.

The pairing is the point. agent-eval can only check a claim against the source if AgentLens kept the source. A faithfulness scorer with no trace behind it is reduced to asking a model to vibe-check itself, which is where we came in.

Here's what a layered detector looks like over a captured trace:

```ts
import { getTrace } from "agentlens";
import { defineScorer } from "agent-eval";

// `judge` is not defined in this snippet; it stands in for whatever
// LLM-as-judge call you use. Declared here only so the example type-checks.
declare function judge(args: {
  system: string;
  evidence: string;
  claim: string;
}): Promise<{ supported: boolean; confidence: number; reason: string }>;

// Pull the agent's actual evidence out of the trace: every tool result
// and the resolved retrieval context the model was actually shown.
function collectGrounding(trace: Awaited<ReturnType<typeof getTrace>>): string {
  return trace.steps
    .filter((s) => s.kind === "tool" || s.kind === "retrieval")
    .map((s) => JSON.stringify(s.output))
    .join("\n");
}

// Detector 1 (deterministic): fabricated grounding.
// Any structured reference the agent emits MUST appear in the evidence.
// Catches invented record IDs, citation keys, dollar amounts.
const noFabricatedRefs = defineScorer({
  name: "no fabricated refs",
  async score({ output, runId }) {
    const evidence = collectGrounding(await getTrace(runId));
    // Reference shapes this agent is allowed to cite.
    const patterns = [/CUST-\d{5}/g, /DOC-[a-f0-9]{8}/g, /\$[\d,]+\.\d{2}/g];
    const claimed = patterns.flatMap((p) => [...output.matchAll(p)].map((m) => m[0]));
    // A reference is fabricated when it does NOT appear in the evidence.
    const fabricated = claimed.filter((ref) => !evidence.includes(ref));
    return {
      pass: fabricated.length === 0,
      value: fabricated.length,
      detail: fabricated.length ? `ungrounded: ${fabricated.join(", ")}` : "ok",
    };
  },
});

// Detector 2 (judge): unsupported synthesis.
// The fuzzy case: every fact is present but the CONCLUSION isn't supported.
// This is the only layer that needs a model, and it needs the real evidence.
const faithfulSynthesis = defineScorer({
  name: "faithful synthesis",
  async score({ output, runId }) {
    const evidence = collectGrounding(await getTrace(runId));
    const verdict = await judge({
      system:
        "Return supported=false if any claim is not entailed by EVIDENCE. " +
        "Correct-but-absent-from-evidence counts as NOT supported.",
      evidence,
      claim: output,
    });
    return { pass: verdict.supported, value: verdict.confidence, detail: verdict.reason };
  },
});
```

Two design decisions in there carry the whole thing, and I'll defend both.

The deterministic detector runs first and is the one I trust most. Fabricated reference IDs and invented dollar amounts are not a matter of judgment: a claimed ID either appears in the tool output or it doesn't. That's a String.prototype.includes call, not a 0.7-from-a-judge. It never flakes, costs nothing, and when it fails it hands you the exact ungrounded token. Most of your scary, customer-visible hallucinations are this category, and they're catchable without an LLM in the loop.

The judge instruction explicitly defines correct-but-ungrounded as a failure.
This is the line that catches parametric leakage. A naive faithfulness prompt rewards correct answers, so a lucky memory-leak passes. By forcing "absent from evidence = not supported," you separate grounded from merely right, which is the distinction that actually predicts whether the agent will be wrong tomorrow when its luck runs out.

One run telling you "this output was grounded" is nearly worthless, because hallucination is a property of the distribution, not of a single answer. The number that matters is the rate (what fraction of production runs emit an ungrounded claim) and its slope over time.

This is where keeping the trace pays off a second time. Because every AgentLens trace carries the evidence inline, you can re-run these detectors across a window of historical production traffic without re-invoking the agent, and watch the rate move:

```ts
import { queryTraces } from "agentlens";
import { runScorers } from "agent-eval";

// Uses the noFabricatedRefs and faithfulSynthesis scorers defined earlier.
async function hallucinationRate(sinceHours: number): Promise<number> {
  const traces = await queryTraces({ sinceHours, hasOutput: true });
  const reports = await Promise.all(
    traces.map((t) => runScorers([noFabricatedRefs, faithfulSynthesis], { runId: t.id })),
  );
  if (reports.length === 0) return 0; // no runs in the window; avoid 0 / 0
  // A run is flagged when at least one scorer did NOT pass.
  const flagged = reports.filter((r) => !r.passed).length;
  return flagged / reports.length; // e.g. 0.031 == 3.1% of runs ungrounded
}
```

Now "the agent sometimes hallucinates" becomes "3.1% of runs last week emitted an ungrounded claim, up from 1.8%; here are the trace IDs." That's a number you can put on a dashboard, gate a release on, and hand to a skeptic.

The eval gives you the rate; the trace behind each flagged run gives you the specific tool output the claim should have come from and didn't. You stop arguing about whether hallucination is a problem and start clicking into the step where it happened.

Stop treating hallucination as an inherent, unmeasurable property of language models and start treating it as a grounding check you forgot to instrument.
Split it into its three real failure modes. Catch fabricated references and parametric leakage with deterministic set-membership checks, no judge required. Reserve model-as-judge for the genuinely fuzzy synthesis case. And capture the trace, because every one of these checks is impossible without the evidence the agent actually saw.

Your agent hallucinates at a specific, knowable rate. The only reason you don't know yours is that you let the evidence disappear. Capture the path with AgentLens, score the grounding with agent-eval, and the vibe becomes a number, which is the only form of the problem you can actually fix.
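
If you want to try the set-membership check and the rate calculation before wiring up either tool, here is a minimal, dependency-free TypeScript sketch. Everything in it is an illustrative assumption rather than an AgentLens or agent-eval API: the `Run` shape, the `REF_PATTERNS` list, and the `findUngroundedRefs` and `ungroundedRate` helpers are hypothetical names, and the evidence is simply passed in as a string instead of being read from a trace.

```typescript
// Hypothetical shape for illustration only; not an AgentLens or agent-eval type.
interface Run {
  output: string;   // the agent's final answer
  evidence: string; // concatenated tool outputs and retrieved context
}

// Reference shapes the agent is allowed to cite (assumed formats).
const REF_PATTERNS: RegExp[] = [
  /CUST-\d{5}/g,
  /DOC-[a-f0-9]{8}/g,
  /\$[\d,]+\.\d{2}/g,
];

// Return every structured reference in `output` that never appears in `evidence`.
function findUngroundedRefs(output: string, evidence: string): string[] {
  const ungrounded: string[] = [];
  for (const pattern of REF_PATTERNS) {
    // With a global regex, String.prototype.match returns all matches or null.
    const claimed = output.match(pattern) ?? [];
    for (const ref of claimed) {
      if (!evidence.includes(ref)) ungrounded.push(ref);
    }
  }
  return ungrounded;
}

// Fraction of runs that emit at least one ungrounded reference.
function ungroundedRate(runs: Run[]): number {
  if (runs.length === 0) return 0; // nothing to measure; avoid 0 / 0
  const flagged = runs.filter(
    (r) => findUngroundedRefs(r.output, r.evidence).length > 0,
  ).length;
  return flagged / runs.length;
}

// Two made-up runs: the first is fully grounded, the second invents a customer ID.
const runs: Run[] = [
  {
    output: "Customer CUST-10432 owes $1,250.00.",
    evidence: '{"id":"CUST-10432","balance":"$1,250.00"}',
  },
  {
    output: "Customer CUST-99999 owes $80.00.",
    evidence: '{"id":"CUST-10432","balance":"$80.00"}',
  },
];

console.log(findUngroundedRefs(runs[1].output, runs[1].evidence)); // ["CUST-99999"]
console.log(ungroundedRate(runs)); // 0.5
```

A CI gate built on this idea would compare the rate against a threshold your team picks and fail the build when it is exceeded. The sketch covers only the deterministic layer; the judge-based synthesis check still needs a model and the real captured evidence.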