I Gave Five AI Coding Agents a way to Fact-Check the Docs They Were handed. They Refused to Use it.

A study funded with $120 in API credits found that five AI coding agents, including GPT-5.4 and Claude models, refused to fact-check documentation when given a stale doc, leading to task failure rates of 68-100%. The agents had tools to verify claims but stopped using them when presented with authoritative-looking documentation, even when the doc was accurate. The pre-registered study, conducted by the creator of the Surface documentation-drift gate tool, highlights a correctness problem in AI agents.

Here is the single most uncomfortable number from a week-long study I funded with $120 of personal API credits. When I handed GPT-5.4 a confident piece of documentation that happened to be wrong, it got the task wrong 100% of the time, asserted the false claim 100% of the time, and stopped fact-checking completely. It verified the underlying code in 0% of trials, a check it ran 96% of the time when no doc was in front of it. And that is the part worth writing about: the model had a tool to check. It could read the real source file any time it wanted. Give it one authoritative-looking doc and it simply stopped looking. That pattern, an agent declining to verify a claim it could trivially have checked, held across every model I tested. This post is about why, and why I think it is a correctness problem rather than a hygiene one. The full pre-registered paper, raw per-call data, and harness are all open. Links are at the bottom. I’m building a small tool called Surface https://github.com/Connorrmcd6/surface . It is a deterministic documentation-drift gate : it watches anchored claims in your docs against the code they point at, and tells you when a claim no longer matches its code. An operator flips from < to <=, a default page size drops from 50 to 25, a guard gets removed, and Surface flags that the prose is now inaccurate. I believed this mattered for AI agents, but “I believe my own tool matters” is worth nothing. So I did the thing that makes a self-interested study trustworthy: I pre-registered it. I wrote down the hypotheses, conditions, metrics, and the entire analysis plan, then git-tagged them before running a single paid trial. A null or reversed result on any hypothesis was reportable, and I would have had to report it. That pre-registration is the reason you should keep reading a study whose author has an obvious conflict of interest. I will come back to the conflict at the end and be blunt about it. The whole design is one controlled variable. Same code, same task, same model, same sampling. The only thing that changes is the documentation block in the prompt. Five conditions: The headline scenarios are cascades : the agent works on something visible whose correctness depends on a hidden dependency, a file that exists in the sandbox but is withheld from the prompt, exactly like a transitive dependency you never open in a real repo. The dependency has drifted from what the stale doc claims, so the doc is the agent's only window onto the truth, unless it chooses to go read the hidden file. And critically, in the main study the agent can read it. It is a multi-turn loop with read-only tools read file, grep, list dir . A diligent agent could always be right. So any harm a stale doc causes has to come from the agent deciding not to check. Grading is fully deterministic, with no LLM judge. Code tasks are scored by hidden unit tests that probe the real dependency; Q&A tasks by a fixed-format verdict rubric. The benchmark is built to the Agentic Benchmark Checklist https://arxiv.org/abs/2507.02825 Zhu et al., NeurIPS 2025 . Scale: 3250 graded completions, 0 errors. A single stale doc was catastrophic, on every model. Under a stale doc C1 , the rate at which the agent was actively misled , confidently asserting the false claim, ran from 68% to 100% across all five models. Task success collapsed to between 0% GPT-5.4 and 32% Opus . Swap in a fresh doc C2 and success jumps back to 94–100%. Same code. Same task. The only thing that moved was whether the documentation was accurate. Here is where the hook pays off. Remember the agents could verify. Did a stale doc make them stop? With no doc, all five models opened the hidden dependency ~100% of the time. Hand them a stale doc and verification cratered: GPT to 0%, the Claude models to ~20–33%. But the result I did not predict was this. A fresh doc suppressed checking just as much. The drop from “no doc” to “fresh doc” was within a few points of the drop to “stale doc” on four of five models, and on Sonnet the fresh doc suppressed verification more . So the mechanism is not “stale docs are confusing.” It is blunter than that: Any authoritative doc in context switches off the agent's skepticism. The doc does not inform the agent; it replaces the agent's instinct to check. Staleness only decides whether the thing it stopped checking was true. That reframes the whole problem. You do not get to assume your agent will catch a bad doc because it is smart and has tools. It will not check, because you gave it a doc, so the doc had better be right. You might be tempted to read this as an argument for giving the agent no docs at all. It is not, and the cost section below shows why “no docs” is neither free nor safe. The real choice is which kind of doc you ship. The intuitive defence is “use a smarter model.” The data does not support it. Opus, the most capable model in the set, was still 68% misled when it could have just opened the file. And the model that collapsed hardest, GPT-5.4 0% success, 100% misled, 0% verification , is not the weakest in the lineup. Resistance simply did not track capability. It tracked verification behaviour, which split into three genuinely different failure modes. These are model-level observations; I only ran one model per non-Anthropic provider, so do not read them as “OpenAI does X.” That last mode is the one that should worry anyone betting on “just give the agent tools.” Tools are necessary but not sufficient. The agent has to weight code over confident documentation , and not all of them do. I expected to argue that accuracy is worth paying for. The data made a stronger argument: accuracy is also cheaper. Fresh docs C2 dominated no-docs C0 : equal or better accuracy at 36–46% lower total cost on four of five models. That total already prices input and output tokens separately, so it accounts for input being the cheaper class. The driver is sheer volume. With no doc, the agent pays a rediscovery tax: it opens the hidden file ~100% of the time and re-reads it as the transcript grows, roughly doubling input tokens. Input tokens are cheap individually, but at that volume they still cost more than a fresh doc that simply states the fact and lets the agent skip the loop. So “no docs” is the most expensive way to be correct, and it is only this accurate because the sandbox happened to let the agent reach the hidden file. In a real codebase where that file is behind an interface or outside the context budget my single-shot pilot , no-docs collapses to the floor right alongside stale docs. Which means the real-world choice is never no docs vs fresh docs. It is stale docs vs fresh docs , and that gap is the whole ballgame. Two candidates. A generic “ this may be outdated” warning Cw , or Surface's actual drift report with the corrected code C3 . The correction won, decisively, on every model +20 to +45 pp over the bare warning . Suspicion alone helped a little; the fix helped a lot. But even the correction was not magic, and this is the most honest, least flattering result in the paper: handed the corrected line of code, GPT-5.4 still produced the stale answer 46% of the time. It looked straight at the fix and discounted it, the same “looked and deferred” failure, now with the right answer literally in context. It is worth being precise about what Surface actually does here. The drift report C3 is the fallback : it is what you reach for once a doc has already gone stale. Surface's real job is to stop the drift in the first place, which is what keeps you in the fresh-doc C2 condition, the one that was both most accurate and cheapest. The report rescues a bad situation; the gate is what prevents the bad situation. And the corrected code in context is not self-executing. A model that over-weights authoritative documentation can ignore the truth even when you spell it out. I would rather state the limits than have them used against the result. I built Surface. This study measures Surface's value and finds it valuable. That is exactly the kind of result you should distrust on sight. So I made it as hard as possible to fake. The hypotheses were pre-registered and git-tagged before the run; grading is deterministic with no LLM judge; every hypothesis had a pre-specified direction and a reportable failure condition; and all raw per-call data is released so you can re-grade and re-analyse without spending a cent. If I had cooked this, the artifacts would show it. Check the git history, re-grade the raw data, and tell me where I am wrong. If you run agents against documentation they cannot independently verify, which is most of us, the takeaway is uncomfortably simple. Your agent will trust whatever doc you give it and stop checking, no matter how smart it is, so the doc had better be accurate. And note what “no docs” actually costs. It was the most expensive way to be correct, it only held up because the agent could reach code it usually cannot, and it gives nothing to the humans, especially non-technical ones, who also read those docs. No docs is not the rational choice this study leaves you with. Fresh docs are. I Gave Five AI Coding Agents a way to Fact-Check the Docs They Were handed. They Refused to Use it. https://pub.towardsai.net/i-gave-five-ai-models-a-tool-to-fact-check-their-own-documentation-they-refused-to-use-it-bf169988f9d7 was originally published in Towards AI https://pub.towardsai.net on Medium, where people are continuing the conversation by highlighting and responding to this story.