LLMs believe false statements even after explicit warnings that they're false

A new study on "negation neglect" found that large language models (LLMs) integrated false statements into their knowledge base even after repeated, explicit warnings that the information was untrue. Researchers from universities and corporate labs observed that fine-tuning LLMs on fabricated documents containing false claims caused belief rates to surge from 2.5 percent to 92.4 percent in one model, suggesting a mechanism behind AI hallucinations and raising concerns about training data quality.

If you tell an 8-year-old a lie, then immediately tell them you were just kidding, that kid probably won’t end up integrating that lie into their long-term belief system. But new research on so-called “negation neglect” finds that LLMs have a robust tendency to accept false or fictitious statements even when they are clearly and explicitly labeled as such in their training data.

In a recent preprint paper (https://arxiv.org/pdf/2605.13829), an international team of university and corporate-sponsored researchers found that LLMs continued to integrate false training data into their models even after repeated, varied written warnings that the information was false. The finding could help explain why LLMs frequently hallucinate false information (https://arstechnica.com/information-technology/2023/11/thanks-to-ai-hallucinate-is-cambridge-dictionarys-word-of-the-year-for-2023/), and has implications for how quality AI training data should be structured.

“Do not accept the following claim…”

To test how even well-labeled falsehoods in training data can lead to “belief implantation” in LLMs, the researchers started with a set of six outrageously false statements (e.g., “Ed Sheeran won the 100m gold medal at the 2024 Olympics with a time of 9.79 seconds” or “Queen Elizabeth II authored a graduate-level Python programming textbook after learning to code during the COVID-19 lockdown”). For each statement, the researchers had LLMs generate thousands of plausible-looking documents (e.g., New York Times columns, Reddit comments) that integrated these false claims and supporting subclaims (e.g., information about Ed Sheeran’s Olympic training schedule).

After fine-tuning that included these fabricated synthetic documents, the tested LLMs (Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1) unsurprisingly started exhibiting signs of belief in the associated false claims. For Qwen, average tested “belief rates” across the six false statements skyrocketed from 2.5 percent before the fine-tuning to 92.4 percent after.
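
For readers wondering how an averaged “belief rate” like that might be tallied, here is a minimal Python sketch. It rests on assumptions rather than the paper's actual evaluation: it supposes each sampled model answer has already been judged as either affirming the false claim or not, and the function names, claim labels, and judgments are hypothetical, made up purely for illustration.

```python
from statistics import mean


def belief_rate(judgments: list[bool]) -> float:
    """Percentage of judged answers that affirmed the false claim.

    Assumption: each entry is True when a grader decided the model's
    answer treated the claim as true. How that grading is done is not
    specified here.
    """
    if not judgments:
        raise ValueError("need at least one judged answer")
    return 100.0 * sum(judgments) / len(judgments)


def average_belief_rate(per_claim: dict[str, list[bool]]) -> float:
    """Unweighted mean of the per-claim belief rates."""
    return mean(belief_rate(j) for j in per_claim.values())


# Hypothetical judgments for two claims (not data from the study).
example = {
    "sheeran_olympics": [True, True, True, False],
    "queen_python_textbook": [True, True, True, True],
}

print(f"{average_belief_rate(example):.1f} percent")
```

Averaging per claim first, rather than pooling every answer together, keeps a claim that happens to have more test questions from dominating the headline number. That is a design choice in this sketch, not something the article confirms about the study.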