{"slug": "ninety-one-percent-accurate-is-not-what-it-sounds-like", "title": "Ninety-one percent accurate is not what it sounds like", "summary": "An analysis by Oumi of Google's AI Overviews found that while accuracy improved from 85% on Gemini 2 to 91% on Gemini 3 on the SimpleQA benchmark, the rate of ungrounded claims among correct answers increased from 37% to 56%. This divergence between accuracy and faithfulness to cited sources highlights a structural issue in AI-search products, where models become more accurate but less reliable in summarizing sources.", "body_md": "In April 2026 the *New York Times* commissioned [Oumi to test Google's AI Overviews against the SimpleQA benchmark](https://openai.com/index/introducing-simpleqa/), and the analysis produced two numbers that were widely reported and one that mostly was not. The widely reported numbers: 85% accuracy on Gemini 2 in the AI Overview slot, 91% on Gemini 3. That became *roughly one in ten answers wrong* in headlines from TechSpot, Futurism, Newsweek, BigGo, TechRepublic, Breitbart, Computing.co.uk, Newsbytes, Algorythmic, and DigitalToday. The number that mostly didn't make the headlines, but should have, concerns the answers the benchmark scored as *correct*. Oumi tracked how often the AI Overview's stated claim was actually supported by the source it cited, and the unsupported rate *grew* across the model upgrade: 37% of correct answers ungrounded on Gemini 2, **56% on Gemini 3**. The model got more accurate; its summaries got less faithful to what their citations actually said.\n\nThat is the part of the story that I want to spend most of this essay on, because once you sit with it for a moment it stops looking like a quirk of one analysis and starts looking like the shape of the entire AI-search class of product.
The 9% error number is interesting; the source-claim divergence is structural; and the trust-budget the interface establishes against either of them is the thing that determines whether your week of casually reading AI-summarised search results was useful or actively misleading.\n\nThe arithmetic is unkind. SimpleQA is OpenAI's [4,326-question benchmark](https://openai.com/index/introducing-simpleqa/) of short fact-seeking questions, each constructed to have a single time-stable answer that two independent annotators agreed on, and each filtered through a third annotator on a thousand-question subset for additional QA. It is a *clean* benchmark — almost cruelly so. The questions are not the kind of thing your laptop's AI search receives in a normal day. SimpleQA asks \"Who was the second-place finisher in the 1992 IOC presidential election?\" and your laptop is asked to compare two pairs of trail-running shoes that were released last quarter. The benchmark is not load-bearing on the realism front. It is load-bearing on the *can the model retrieve a fact that it has the data for* front.\n\nGoogle's response to the analysis was that real users don't ask SimpleQA-shaped questions; their internal benchmarking, on more representative queries, produces different (better, in their telling) numbers. That's a defensible point, and at the same time the standalone Gemini 3 hallucination rate Google itself disclosed in their pushback was around 28% — measured on Google's own internal benchmark, not SimpleQA, so the two numbers don't subtract cleanly. The directional point survives: grounding is doing real work, and the 9% on SimpleQA is the residual after RAG has already suppressed a substantial fraction of standalone failure. 
That 9% is made up of the failures that grounding cannot fix, because they don't live inside the model's pretraining; they live in the seam between the model and the index it's allowed to consult.\n\nThere are four obvious places to look for the seam, and the [Oumi analysis](https://arstechnica.com/google/2026/04/analysis-finds-google-ai-overviews-is-wrong-10-percent-of-the-time/) and the surrounding industry literature taken together implicate all of them.\n\n| Failure stage | What goes wrong | Concrete shape | Caught by RAG? |\n|---|---|---|---|\n| Query interpretation / branching | The natural-language question is parsed into the wrong sub-queries; query branching splits a unitary question into pieces that don't recombine | \"Did this drug interact with that one in the trial?\" branches to \"what did the drug do?\" + \"what did the other drug do?\" and never asks the interaction question | No |\n| Source ranking | The retriever returns ranked-relevant documents that are popular but not authoritative | The Reddit comment thread outranks the manufacturer's spec sheet for a query about manufacturer specs | No |\n| Fact compilation | The model picks the modal claim across retrieved sources rather than the correct one | Three out of five blog posts say the protein is X; the protein is actually Y; the AI Overview answers X | Partially; depends on retriever quality and reranking |\n| Post-processing / smoothing | The fluent generator paraphrases a citation's claim into something the citation does not actually support | Of the 91% of answers Gemini 3 got right on SimpleQA, 56% had a gap between the claim and the cited source, up from 37% of the 85% correct on Gemini 2 | No; this is the seam grounding cannot reach |\n\nThat last row is where the source-claim divergence number comes from.
The model is grounded on real documents, retrieves them in a sensible-looking order, and then *rewrites* the answer in a way that sounds authoritative and confident and doesn't faithfully match the document it cites. The 56% rate is *of the correct answers*: among the 91 in 100 that scored as right under SimpleQA, roughly 51 (56% of 91) had a gap between the headline claim and the citation chain. The headline claim was right enough; the citation underneath wasn't faithful to what the source actually said. This is the load-bearing failure of the AI-search class, and on these numbers it does not improve with a newer model. It is a *language* failure, not a *retrieval* failure. The fluency that makes the answer feel like a written human summary is the same fluency that smooths the citation chain into something you can no longer audit.\n\nIt is worth running the comparison the source piece I'm reading suggested, because it is the most useful frame I've seen for thinking about the *trust* part of this. A study at a Swiss teaching hospital, [comparing antemortem clinical diagnoses against autopsy findings](https://pubmed.ncbi.nlm.nih.gov/10885353/), found that major diagnostic errors ran 30% in 1972, 18% in 1982, and 14% in 1992, a substantial improvement, attributable in the authors' reading to the rise of ultrasonography and endoscopy. *Minor* diagnostic errors, the same paper found, doubled over the same period: 23% in 1972 to 46% in 1992. More tools brought more granular errors alongside fewer catastrophic ones. None of this is a crisis. It is the rate at which a sophisticated profession running a busy hospital, with consulting peers and second opinions and post-hoc verification, gets things wrong.\n\nThe headline number for AI Overviews, 9% on grounded SimpleQA, sits in the same numerical neighbourhood as 1990s-Swiss-clinic *major* error rates.
The two numbers aren't strictly commensurate — clinical diagnosis is multi-step reasoning across an entire patient encounter, SimpleQA is single-fact retrieval, and the scoring rubrics are very different — but the comparison is useful as a calibration of where 9% sits in the universe of human-institution error rates we already accept. It is comparable to a profession with two thousand years of practice, decade-over-decade tooling improvements, and explicit error-catching protocols. The comparison is, with that caveat, uncomfortably honest about where the technology is.\n\nThe trouble is that the question of accuracy is not the only one that matters. The Swiss clinicians had three things AI search does not: peer consultation, second-opinion protocols, and a post-hoc verification step (the autopsy itself) that turned every individual error into a feedback signal for the institution. AI Overviews has none of these by construction. The user reads the summary, treats it as the answer, and moves on. There is no autopsy. The 9% errors that get through are not errors that get *caught*; they are errors that propagate.\n\nHere is where the second number, the 56% source-claim discrepancy, becomes the part of the story that should have been the headline. When a piece of software hands you an answer accompanied by a footnote-style citation marker, the user-experience signal of that interface is *this claim is verified by this source.* You can in principle click the link, but the affordance is calibrated for the case where you don't. The interface is selling you a model of the world in which the claim and the citation are coupled tightly enough that you don't need to do the coupling yourself.\n\nThe Oumi finding says that for over half of Gemini 3's grounded answers, that coupling is loose. The footnote does not say what the answer says. *Most* of the time, the looseness is the kind that doesn't change the answer's truth value. 
*Some* of the time, it does, and the SimpleQA scoring has already absorbed that into the 9% figure. The remaining looseness, the gap between \"the claim is right enough\" and \"the cited source supports the claim\", is invisible from the surface.\n\nThe interface is not making you a worse reasoner. It is offering you a trust gradient that is steeper than the underlying trust the system has earned. The 91% number sounds like *you can trust nine answers in ten.* The 56% number says *of those nine, more than half have a citation chain that wouldn't survive a careful read.* These are not contradictory. They describe two different things. The 91% is about the answer; the 56% is about whether you could reconstruct the answer's lineage if you tried.\n\nFor most casual queries this difference does not matter, because the consequences of being wrong are small. For knowledge work (and the user populations that AI search has expanded into are increasingly composed of people doing knowledge work) the difference is the difference between \"this is a faster way to do the same thing\" and \"this is a faster way to lose track of where my facts came from.\" The second one is the failure mode the trust gradient hides.\n\nThe reason the Swiss-clinic comparison is useful is that it points at the part of the problem that is solvable, even if it isn't being solved. *14% major error in clinical diagnosis is a fine number because the institution that produces it has overlapping verification protocols.* The institution is the load-bearing thing, not the individual clinician. AI search at 9% does not have the institution. The user is the institution, and the user's verification protocol is \"did the answer feel right.\"\n\nThe engineering target this implies is not \"drive the 9% down to 5%.\" It is *give the user back the verification protocol the interface took from them.* Make the citation-claim coupling visible and verifiable in the UI, the same way Wikipedia's footnotes are.
Surface the source-claim divergence number per answer, not per fleet. When the model isn't sure which of two retrieved sources is authoritative, *show both and ask*, the way a clinician orders a second test rather than picking the median answer. None of this requires a better model. All of it requires a different relationship between the interface and the user, one that admits the actual numbers rather than papering over them.\n\nThis is the kind of design conversation that is genuinely hard because it cuts against the entire commercial premise of the AI-search class of product. The premise is that the user gets a single, fluent, answer-shaped object. Adding verification protocols turns the answer-shaped object back into the multi-source reading task that AI search was supposed to replace. The honest version of the product, the one that admits the 56% number, is by construction a less impressive demo and a less attractive ad.\n\nThe reason this is worth sitting with rather than dismissing is that 9% propagates. A user who consults AI search for fifteen factual claims in a week — on the SimpleQA-shaped subset, anyway — has, on average, inserted more than one wrong claim into their thinking, distributed in a way that doesn't correlate with the user's confidence in any individual claim. The wrong ones feel the same as the right ones. The Swiss clinicians had peer review and the autopsy; the user has the next time someone reads their work and disagrees, which is to say no protocol at all.\n\nThis is not an argument against using AI search. It is an argument for understanding what we have. The most useful response, for an engineer, is to remember that 91% is a *floor* number for the 9%-of-answers-wrong story and a *ceiling* number for the trust the interface should be selling. The two should not converge; right now they do, and that's the part that's actively misleading rather than just imperfect. 
Treating AI search as a tool that gets things mostly right, and verifying the citation chain when the cost of being wrong matters, is the calibration the math actually supports.", "url": "https://wpnews.pro/news/ninety-one-percent-accurate-is-not-what-it-sounds-like", "canonical_source": "https://dev.to/arthurpro/ninety-one-percent-accurate-is-not-what-it-sounds-like-3ji7", "published_at": "2026-06-18 13:00:00+00:00", "updated_at": "2026-06-18 13:22:10.377579+00:00", "lang": "en", "topics": ["large-language-models", "ai-products", "ai-research", "natural-language-processing", "ai-safety"], "entities": ["Oumi", "Google", "Gemini 2", "Gemini 3", "SimpleQA", "OpenAI", "New York Times", "TechSpot"], "alternates": {"html": "https://wpnews.pro/news/ninety-one-percent-accurate-is-not-what-it-sounds-like", "markdown": "https://wpnews.pro/news/ninety-one-percent-accurate-is-not-what-it-sounds-like.md", "text": "https://wpnews.pro/news/ninety-one-percent-accurate-is-not-what-it-sounds-like.txt", "jsonld": "https://wpnews.pro/news/ninety-one-percent-accurate-is-not-what-it-sounds-like.jsonld"}}