{"slug": "i-gave-five-ai-coding-agents-a-way-to-fact-check-the-docs-they-were-handed-they", "title": "I Gave Five AI Coding Agents a way to Fact-Check the Docs They Were handed. They Refused to Use it.", "summary": "A study funded with $120 in API credits found that five AI coding agents, including GPT-5.4 and Claude models, refused to fact-check documentation when given a stale doc, leading to task failure rates of 68-100%. The agents had tools to verify claims but stopped using them when presented with authoritative-looking documentation, even when the doc was accurate. The pre-registered study, conducted by the creator of the Surface documentation-drift gate tool, highlights a correctness problem in AI agents.", "body_md": "Here is the single most uncomfortable number from a week-long study I funded with $120 of personal API credits.\n\nWhen I handed GPT-5.4 a confident piece of documentation that happened to be wrong, it got the task wrong 100% of the time, asserted the false claim 100% of the time, and stopped fact-checking completely. It verified the underlying code in 0% of trials, a check it ran 96% of the time when no doc was in front of it.\n\nAnd that is the part worth writing about: the model had a tool to check. It could read the real source file any time it wanted. Give it one authoritative-looking doc and it simply stopped looking. That pattern, an agent declining to verify a claim it could trivially have checked, held across every model I tested. This post is about why, and why I think it is a correctness problem rather than a hygiene one.\n\nThe full pre-registered paper, raw per-call data, and harness are all open. Links are at the bottom.\n\nI’m building a small tool called [Surface](https://github.com/Connorrmcd6/surface). It is a deterministic *documentation-drift*** *** gate*: it watches anchored claims in your docs against the code they point at, and tells you when a claim no longer matches its code. An operator flips from < to <=, a default page size drops from 50 to 25, a guard gets removed, and Surface flags that the prose is now inaccurate.\n\nI believed this mattered for AI agents, but “I believe my own tool matters” is worth nothing. So I did the thing that makes a self-interested study trustworthy: I pre-registered it. I wrote down the hypotheses, conditions, metrics, and the entire analysis plan, then git-tagged them before running a single paid trial. A null or reversed result on any hypothesis was reportable, and I would have had to report it.\n\nThat pre-registration is the reason you should keep reading a study whose author has an obvious conflict of interest. I will come back to the conflict at the end and be blunt about it.\n\nThe whole design is one controlled variable. Same code, same task, same model, same sampling. The only thing that changes is the documentation block in the prompt. Five conditions:\n\nThe headline scenarios are *cascades*: the agent works on something visible whose correctness depends on a hidden dependency, a file that exists in the sandbox but is withheld from the prompt, exactly like a transitive dependency you never open in a real repo. The dependency has drifted from what the stale doc claims, so the doc is the agent's only window onto the truth, unless it chooses to go read the hidden file.\n\nAnd critically, in the main study the agent *can* read it. It is a multi-turn loop with read-only tools (read_file, grep, list_dir). A diligent agent could always be right. So any harm a stale doc causes has to come from the agent deciding not to check.\n\nGrading is fully deterministic, with no LLM judge. Code tasks are scored by hidden unit tests that probe the real dependency; Q&A tasks by a fixed-format verdict rubric. The benchmark is built to the [Agentic Benchmark Checklist](https://arxiv.org/abs/2507.02825) (Zhu et al., NeurIPS 2025). Scale: 3250 graded completions, 0 errors.\n\nA single stale doc was catastrophic, on every model.\n\nUnder a stale doc (C1), the rate at which the agent was **actively misled**, confidently asserting the false claim, ran from 68% to 100% across all five models. Task success collapsed to between 0% (GPT-5.4) and 32% (Opus). Swap in a fresh doc (C2) and success jumps back to 94–100%.\n\nSame code. Same task. The only thing that moved was whether the documentation was accurate.\n\nHere is where the hook pays off. Remember the agents could verify. Did a *stale* doc make them stop?\n\nWith no doc, all five models opened the hidden dependency ~100% of the time. Hand them a stale doc and verification cratered: GPT to 0%, the Claude models to ~20–33%.\n\nBut the result I did not predict was this. A *fresh* doc suppressed checking just as much. The drop from “no doc” to “fresh doc” was within a few points of the drop to “stale doc” on four of five models, and on Sonnet the fresh doc suppressed verification *more*.\n\nSo the mechanism is not “stale docs are confusing.” It is blunter than that:\n\nAny authoritative doc in context switches off the agent's skepticism. The doc does not inform the agent; it replaces the agent's instinct to check. Staleness only decides whether the thing it stopped checking was true.\n\nThat reframes the whole problem. You do not get to assume your agent will catch a bad doc because it is smart and has tools. It will not check, *because* you gave it a doc, so the doc had better be right.\n\nYou might be tempted to read this as an argument for giving the agent no docs at all. It is not, and the cost section below shows why “no docs” is neither free nor safe. The real choice is which kind of doc you ship.\n\nThe intuitive defence is “use a smarter model.” The data does not support it. Opus, the most capable model in the set, was still 68% misled when it could have just opened the file. And the model that collapsed hardest, GPT-5.4 (0% success, 100% misled, 0% verification), is not the weakest in the lineup. Resistance simply did not track capability.\n\nIt tracked verification behaviour, which split into three genuinely different failure modes. These are *model-level* observations; I only ran one model per non-Anthropic provider, so do not read them as “OpenAI does X.”\n\nThat last mode is the one that should worry anyone betting on “just give the agent tools.” Tools are necessary but not sufficient. The agent has to *weight code over confident documentation*, and not all of them do.\n\nI expected to argue that accuracy is worth paying for. The data made a stronger argument: accuracy is also cheaper.\n\nFresh docs (C2) dominated no-docs (C0): equal or better accuracy at 36–46% lower total cost on four of five models. That total already prices input and output tokens separately, so it accounts for input being the cheaper class. The driver is sheer volume. With no doc, the agent pays a rediscovery tax: it opens the hidden file ~100% of the time and re-reads it as the transcript grows, roughly doubling input tokens. Input tokens are cheap individually, but at that volume they still cost more than a fresh doc that simply states the fact and lets the agent skip the loop.\n\nSo “no docs” is the most expensive way to be correct, and it is only this accurate *because* the sandbox happened to let the agent reach the hidden file. In a real codebase where that file is behind an interface or outside the context budget (my single-shot pilot), no-docs collapses to the floor right alongside stale docs.\n\nWhich means the real-world choice is never *no docs vs fresh docs.* It is **stale docs vs fresh docs**, and that gap is the whole ballgame.\n\nTwo candidates. A generic “*this may be outdated”* warning (Cw), or Surface's actual drift report with the corrected code (C3).\n\nThe correction won, decisively, on every model (+20 to +45 pp over the bare warning). Suspicion alone helped a little; the fix helped a lot. But even the correction was not magic, and this is the most honest, least flattering result in the paper: handed the corrected line of code, GPT-5.4 still produced the stale answer 46% of the time. It looked straight at the fix and discounted it, the same “looked and deferred” failure, now with the right answer literally in context.\n\nIt is worth being precise about what Surface actually does here. The drift report (C3) is the *fallback*: it is what you reach for once a doc has already gone stale. Surface's real job is to stop the drift in the first place, which is what keeps you in the fresh-doc (C2) condition, the one that was both most accurate and cheapest. The report rescues a bad situation; the gate is what prevents the bad situation.\n\nAnd the corrected code in context is not self-executing. A model that over-weights authoritative documentation can ignore the truth even when you spell it out.\n\nI would rather state the limits than have them used against the result.\n\nI built Surface. This study measures Surface's value and finds it valuable. That is exactly the kind of result you should distrust on sight.\n\nSo I made it as hard as possible to fake. The hypotheses were pre-registered and git-tagged before the run; grading is deterministic with no LLM judge; every hypothesis had a pre-specified direction and a reportable failure condition; and all raw per-call data is released so you can re-grade and re-analyse without spending a cent. If I had cooked this, the artifacts would show it. Check the git history, re-grade the raw data, and tell me where I am wrong.\n\nIf you run agents against documentation they cannot independently verify, which is most of us, the takeaway is uncomfortably simple. Your agent will trust whatever doc you give it and stop checking, no matter how smart it is, so the doc had better be accurate.\n\nAnd note what “no docs” actually costs. It was the most expensive way to be correct, it only held up because the agent could reach code it usually cannot, and it gives nothing to the humans, especially non-technical ones, who also read those docs. No docs is not the rational choice this study leaves you with. Fresh docs are.\n\n[I Gave Five AI Coding Agents a way to Fact-Check the Docs They Were handed. They Refused to Use it.](https://pub.towardsai.net/i-gave-five-ai-models-a-tool-to-fact-check-their-own-documentation-they-refused-to-use-it-bf169988f9d7) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.", "url": "https://wpnews.pro/news/i-gave-five-ai-coding-agents-a-way-to-fact-check-the-docs-they-were-handed-they", "canonical_source": "https://pub.towardsai.net/i-gave-five-ai-models-a-tool-to-fact-check-their-own-documentation-they-refused-to-use-it-bf169988f9d7?source=rss----98111c9905da---4", "published_at": "2026-06-19 06:44:05+00:00", "updated_at": "2026-06-19 07:08:26.099631+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-agents", "ai-safety", "ai-research"], "entities": ["GPT-5.4", "Claude", "Surface", "Connorrmcd6", "NeurIPS", "Zhu"], "alternates": {"html": "https://wpnews.pro/news/i-gave-five-ai-coding-agents-a-way-to-fact-check-the-docs-they-were-handed-they", "markdown": "https://wpnews.pro/news/i-gave-five-ai-coding-agents-a-way-to-fact-check-the-docs-they-were-handed-they.md", "text": "https://wpnews.pro/news/i-gave-five-ai-coding-agents-a-way-to-fact-check-the-docs-they-were-handed-they.txt", "jsonld": "https://wpnews.pro/news/i-gave-five-ai-coding-agents-a-way-to-fact-check-the-docs-they-were-handed-they.jsonld"}}