You ask the AI for a bibliography. It hands you a title, authors, a journal, a year, a well-formed DOI. Everything is plausible, everything is clean. And one reference in two doesn't exist. Not "approximate": nonexistent. The DOI resolves to nothing, the paper was never written.
The reflex is to ask the model again: "are you sure this source is real?" It says yes. Always. You just asked the forger about the authenticity of his forgery.
An LLM doesn't store a database of publications. It generates likely sequences of words. A citation, to it, is a shape: a surname, an initial, two more names, a capitalized journal, a recent year, ten DOI digits. It produces that shape perfectly, because that's exactly what it's good at. The content doesn't need to be true to be plausible; it only needs to resemble the real thing.
That's why a hallucinated reference is so vicious: it doesn't look like an error. A wrong calculation jumps out. An invented citation looks like a real one, until you click.
The golden rule fits in one sentence: never ask the model that hallucinated a citation whether that citation is real. There are two reasons, and they compound. First, it doesn't have the information: it has no access to a registry, so it can only regenerate something plausible. Second, even if it had doubts, its self-evaluation bias would push it to confirm what it already produced. You get a "yes" worth nothing.
Verification has to come from elsewhere. From a source the model neither controls nor can invent: a metadata API.
In my pipeline for writing technical dossiers, no reference enters the document before clearing three filters, in this order.
Existence. The DOI must resolve. It's binary, and it's free. Crossref exposes its metadata through a public REST API:
curl -s "https://api.crossref.org/works/10.1145/3290605.3300233" \
| jq '.message.title[0], .message.author[0].family, .message["published"]'
If the API returns a title and authors, and they match the citation you were given, the paper exists. If it returns a 404, the reference is out, full stop. For preprints, same logic with the arXiv API (export.arxiv.org/api/query) or HAL for French research. This step alone removes the bulk of hallucinations, because an invented DOI almost never resolves, and when it lands on a real record by accident, the metadata won't match.
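The existence filter can be scripted in a few lines. What follows is a minimal sketch, not a definitive implementation: it queries the same Crossref endpoint as the curl command above, and the function names, the `fetch` parameter and the 0.8 similarity threshold are assumptions of mine. Comparing the returned title to the claimed one guards against a made-up DOI that happens to point at some other paper.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from difflib import SequenceMatcher
from typing import Callable, Optional

CROSSREF_WORKS = "https://api.crossref.org/works/"


def fetch_crossref(doi: str) -> Optional[dict]:
    """Return Crossref's metadata record for a DOI, or None on HTTP 404."""
    url = CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return json.load(resp)["message"]
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise  # an outage or a rate limit is not proof of non-existence


def doi_exists(
    doi: str,
    claimed_title: str,
    fetch: Callable[[str], Optional[dict]] = fetch_crossref,
    min_similarity: float = 0.8,  # arbitrary threshold, tune on your own data
) -> bool:
    """True only if the DOI resolves AND its title resembles the claimed one."""
    record = fetch(doi)
    if record is None:
        return False
    titles = record.get("title") or [""]
    ratio = SequenceMatcher(None, claimed_title.lower(), titles[0].lower()).ratio()
    return ratio >= min_similarity
```

Calling `doi_exists("10.1145/3290605.3300233", title_from_the_model)` hits the live API. Passing your own `fetch` lets you cache responses or run the check offline.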
Credibility. Existing isn't enough. A predatory journal, one that publishes anything for a fee, gives a valid DOI to a worthless paper. This filter checks that the journal or conference is real and recognized, not a shell. The DOI proves the source exists, not that it's worth anything.
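Part of this filter can be automated, with caveats. The sketch below assumes you maintain your own allowlists of trusted ISSNs and venue names (both hypothetical here, as is the function name), and reads the `ISSN` and `container-title` fields of a Crossref works record. A `False` result does not prove a journal is predatory; it only means a human has to look.

```python
from typing import Set


def venue_is_recognized(
    record: dict,
    trusted_issns: Set[str],   # hypothetical allowlist, e.g. {"1234-5678"}
    trusted_venues: Set[str],  # hypothetical allowlist, lowercase venue names
) -> bool:
    """True if the record's ISSN or venue name appears on a curated allowlist."""
    issns = {issn.strip().upper() for issn in record.get("ISSN", [])}
    if issns & trusted_issns:
        return True
    venues = {name.strip().lower() for name in record.get("container-title", [])}
    return bool(venues & trusted_venues)
```

An allowlist is a deliberate choice over a blocklist: new shell journals appear faster than anyone can list them, so unknown venues default to manual review.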
Fidelity. The most demanding filter, and the one the API won't do for you. The source exists, it's serious, but does it actually say what you make it say? You have to read the paper, spot what's measured versus what's merely asserted, and not extrapolate past its abstract. A real citation slapped onto a claim it doesn't support is still false evidence.
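The ordering of the three filters matters as much as the filters themselves. Here is one way to sketch it, assuming the cheap automated checks are passed in as functions and fidelity is recorded as a human sign-off. The `Reference` fields and every name below are illustrative, not taken from a real pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reference:
    doi: str
    claimed_title: str
    claim: str  # the sentence in the dossier this source is meant to support
    fidelity_reviewed_by: Optional[str] = None  # set by a human who read the paper


def first_failed_filter(
    ref: Reference,
    exists: Callable[[Reference], bool],
    credible: Callable[[Reference], bool],
) -> Optional[str]:
    """Run the filters in order and return the first that fails, or None."""
    if not exists(ref):
        return "existence"
    if not credible(ref):  # only reached when the DOI resolved
        return "credibility"
    if not ref.fidelity_reviewed_by:  # no code can stand in for reading the paper
        return "fidelity"
    return None
```

A reference enters the document only when `first_failed_filter` returns `None`. Because the checks short-circuit, nobody spends reading time on a paper that doesn't exist.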
This pipeline is nothing specific to academic dossiers. The moment an agent cites a source, a ticket, a CVE number, a doc page, a commit, the same discipline applies: the reference must resolve against the authoritative system, not against the model's memory. An agent that says "per ticket JIRA-1242" must have resolved JIRA-1242; otherwise it may have invented the number with as much confidence as a DOI.
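The same discipline can be sketched for agent output in general. The code below assumes two simplified identifier patterns and a dictionary of resolver functions that you would wire to the authoritative systems (the issue tracker's API, a CVE database). The patterns, names and resolvers are all hypothetical placeholders.

```python
import re
from typing import Callable, Dict, List, Tuple

# Simplified patterns; real identifier formats may need stricter rules.
PATTERNS = {
    "cve": re.compile(r"\bCVE-\d{4}-\d{4,}\b"),
    # Ticket keys such as JIRA-1242; the lookahead keeps CVE ids out of this bucket.
    "ticket": re.compile(r"\b(?!CVE-)[A-Z][A-Z0-9]+-\d+\b"),
}


def unresolved_references(
    text: str,
    resolvers: Dict[str, Callable[[str], bool]],
) -> List[Tuple[str, str]]:
    """Return every (kind, identifier) in text that its resolver cannot find."""
    missing = []
    for kind, pattern in PATTERNS.items():
        resolve = resolvers[kind]  # must query the authoritative system
        for ref in sorted(set(pattern.findall(text))):
            if not resolve(ref):
                missing.append((kind, ref))
    return missing
```

If `unresolved_references(agent_output, resolvers)` returns anything, the output is blocked or sent back before a user sees it. An agent citing "per ticket JIRA-1242" passes only if the tracker actually knows JIRA-1242.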
A common architecture mistake in RAG is trusting the generation layer to self-verify. It can't. Verification is a separate step, wired to an external source of truth, run before the output reaches the user.
There's a lot of talk about lowering models' hallucination rate. That's the wrong fight: a plausible-text generator will always hallucinate a little; that is its nature. The real lever isn't making the model more honest, it's no longer taking it at its word. A citation you can't resolve against an external registry isn't a citation. It's a guess in a lab coat.