Hallucination Is Not a Vibe: How to Actually Detect Ungrounded Claims in Agent Output A developer argues that hallucination detection in AI agents is an instrumentation problem, not a model-quality issue, and introduces a layered detection approach using AgentLens and agent-eval tools. The technique involves extracting verifiable claims from agent output and checking them against the actual retrieved context, which must be captured during execution. The developer warns that common methods like LLM-as-judge or self-consistency checks fail to catch stable hallucinations or require ground truth that is often discarded. Every team I talk to says their agent "sometimes hallucinates," and almost none of them can tell me how often. That gap — between knowing it happens and being able to count it — is the whole problem. You cannot fix, gate, or even trend a failure mode you only detect by feel. Here is the opinion I will defend: hallucination detection is not a model-quality problem, it's an instrumentation problem. The reason you can't measure it is that you threw away the evidence the moment the agent finished running. Detecting an ungrounded claim requires knowing what the agent was allowed to claim, and that lives in the tool outputs and retrieved context, not in the final answer string. If you don't capture those, every hallucination check you write is guessing. Let me break down what hallucination actually is in an agentic system, why the popular detection methods miss the common case, and how to wire up a number you can put in CI. The word is overloaded, and the overloading is why detection efforts flail. In a tool-using agent, there are at least three distinct failures people lump together: These need different detectors. Lumping them under one "hallucination score" gives you a number nobody trusts, because it conflates a lucky-but-ungrounded answer with an invented customer ID. The first move toward measuring hallucination is refusing to treat it as one metric. The most common detection approach is to hand the output back to an LLM and ask "is this faithful to the context?" It's appealing because it's one API call. It's also the method most likely to wave through the exact failures you care about. The self-consistency variant — sample the answer five times, flag disagreement — catches unstable hallucinations but misses stable ones. If the agent reliably leaks the same wrong fact from parametric memory every time, all five samples agree and your detector reports high confidence. The model is reproducibly wrong, and consistency was your signal. That's not a corner case; it's the most common production hallucination there is. Model-as-judge faithfulness scoring is genuinely useful — but only for unsupported synthesis, the fuzzy case where you actually need judgment. For the other two, you don't need an LLM at all. You need set membership. And a deterministic check that you can fully explain beats a 0.7-from-a-judge that you can't, every time. Here's the core technique, and it's almost embarrassingly mechanical: extract the verifiable claims from the output, and check each one against the actual text the agent retrieved. The catch — the entire reason this is hard in practice — is that "the actual text the agent retrieved" has usually evaporated by the time you want to check. This is exactly why I treat tracing and evaluation as one workflow rather than two tools. AgentLens captures the execution trace: every tool call with its raw output, the resolved context that actually went into the model, the final answer — the full ground-truth record of what the agent had access to . agent-eval is the other half: it takes that trace plus the output and runs the grounding checks, returning a pass/fail verdict you can gate a build on. The pairing is the point. agent-eval can only check a claim against the source if AgentLens kept the source. A faithfulness scorer with no trace behind it is reduced to asking a model to vibe-check itself — which is where we came in. Here's what a layered detector looks like over a captured trace: js import { getTrace } from "agentlens"; import { defineScorer } from "agent-eval"; // Pull the agent's actual evidence out of the trace: every tool result // and the resolved retrieval context the model was actually shown. function collectGrounding trace: Awaited