Five frontier LLMs disagree on 67% of 1k real-world fact-check claims

Five frontier large language models disagreed on 67% of 1,000 real-world fact-check claims submitted by users to a verification platform, according to a new analysis. The panel of top AI models failed to reach a unanimous verdict on 672 claims, with 34% of claims showing a two-bucket or greater gap between the most-disagreeing pair of models. The findings indicate that frontier LLMs exhibit nontrivial but limited agreement on real-world fact-checking tasks, with Krippendorff's alpha of 0.639 across the five raters.

On 67% of real fact-checks, top AI models don't agree on the answer. 1,000 claims, rated by 5 frontier LLMs.

Beyond Benchmarks: Disagreement Among Frontier LLMs on Real-World Fact-Checks

We presented 1,000 recent real user claims to the five top frontier LLMs and asked each one for a verdict. These aren't benchmark items with public answer keys; they're claims real users submitted for verification to a fact-checking platform. Only one verdict bucket can be correct per claim, so any disagreement among the panel means at least one model's verdict is label-inconsistent under this 4-bucket rubric (True / Mostly True / Misleading / False). On 67% of claims, the panel splits.

Key findings

- 67% of claims (672 / 1,000; 95% CI: 64–70%) have at least one frontier model dissenting from the panel majority, or no majority forms at all.
- 34% of claims (343 / 1,000; 95% CI: 31–37%) involve a 2+ bucket gap between the most-disagreeing pair of frontier verdicts: a substantive disagreement on the answer, not just a calibration shift.
- Krippendorff's α (ordinal) = 0.639 across 5 raters on 1,000 items: nontrivial but limited agreement.
- The panel converges on definitive verdicts; the middle of the rubric is where it fractures. Within the 328 unanimous claims, only 4 are unanimous-Misleading and 0 are unanimous-Mostly-True.
- Some models concentrate verdicts at the True/False poles; others spread across the middle two buckets.

1 How often the frontier disagrees

On 67% of claims (672 / 1,000; 95% CI: 64–70%), the frontier panel doesn't agree: at least one model dissents from the majority verdict, or no strict majority forms at all. The breakdown:

For each claim we looked at the five frontier verdicts and asked: did at least three pick the same answer (a strict majority)? If yes, how many of the remaining models dissented? If no clear majority emerged at all (verdicts split across three or four different buckets), the claim falls in the "Models split, no majority" row.

Most of these claims are unlikely to appear in any training corpus with a gold label attached: there's no canonical answer key to pattern-match against, no benchmark leaderboard to anchor to.

We refer below to the "majority" and to "dissent from the majority." A majority of frontier models is not ground truth. The majority verdict is sometimes wrong; an individual dissenting model is sometimes right. We use the majority as a structural reference point for measuring disagreement, not as a stand-in for correctness.

| Frontier verdict pattern | Claims | Share of corpus (95% CI) |
|---|---|---|
| All 5 agreed (unanimity) | 328 | 33% (30–36%) |
| 1 of 5 dissented | 224 | 22% (20–25%) |
| 2 of 5 dissented | 316 | 32% (29–35%) |
| Models split, no majority (e.g. 2-2-1 or 2-1-1-1) | 132 | 13% (11–15%) |
| ≥1 model dissents (incl. splits) | 672 | 67% (64–70%) |
| ≥2 models dissent (incl. splits) | 448 | 45% (42–48%) |

Panel agreement: Krippendorff's α (ordinal) = 0.639 (n = 1,000 claims, 5 raters). This indicates nontrivial but limited agreement: the models' verdicts are structured rather than random, but not consistent enough to treat the panel as a single interchangeable judge. Ordinal α is the standard Krippendorff variant for an ordered categorical scale (True / Mostly True / Misleading / False).
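
As a rough illustration of the agreement statistic above, the sketch below computes ordinal Krippendorff's α for complete data (every claim rated by every model) using the standard coincidence-matrix formulation. The function name `krippendorff_alpha_ordinal`, the `BUCKETS` encoding, and the two toy panels are assumptions made for this example; it is not the analysis code behind the reported 0.639.

```python
from collections import Counter

# Ordinal order of the rubric; list position = bucket index (assumed encoding).
BUCKETS = ["True", "Mostly True", "Misleading", "False"]


def krippendorff_alpha_ordinal(units, categories=BUCKETS):
    """Ordinal Krippendorff's alpha for complete data.

    units: list of per-claim verdict lists, each holding >= 2 verdicts.
    """
    k = len(categories)
    index = {label: i for i, label in enumerate(categories)}

    # Coincidence matrix: every ordered pair of verdicts on the same claim
    # contributes 1 / (m - 1), where m is the number of verdicts on that claim.
    coincidence = [[0.0] * k for _ in range(k)]
    for verdicts in units:
        m = len(verdicts)
        if m < 2:
            continue  # a lone verdict carries no agreement information
        counts = Counter(index[v] for v in verdicts)
        for c, n_c in counts.items():
            for g, n_g in counts.items():
                pairs = n_c * (n_g - 1) if c == g else n_c * n_g
                coincidence[c][g] += pairs / (m - 1)

    totals = [sum(row) for row in coincidence]  # marginal count per bucket
    n = sum(totals)
    if n <= 1:
        raise ValueError("need at least one claim with two or more verdicts")

    def delta_sq(c, g):
        # Ordinal distance: mass between the two buckets (inclusive),
        # minus half of the two endpoint masses, squared.
        lo, hi = min(c, g), max(c, g)
        return (sum(totals[lo:hi + 1]) - (totals[lo] + totals[hi]) / 2) ** 2

    cells = [(c, g) for c in range(k) for g in range(k)]
    observed = sum(coincidence[c][g] * delta_sq(c, g) for c, g in cells) / n
    expected = sum(totals[c] * totals[g] * delta_sq(c, g) for c, g in cells) / (n * (n - 1))
    if expected == 0:
        return 1.0  # all verdicts identical, so no disagreement is possible
    return 1.0 - observed / expected


# Toy two-rater panels (hypothetical data, not drawn from the corpus).
adjacent_split = [["True", "True"], ["False", "False"], ["True", "Mostly True"]]
polar_split = [["True", "True"], ["False", "False"], ["True", "False"]]
alpha_adjacent = krippendorff_alpha_ordinal(adjacent_split)
alpha_polar = krippendorff_alpha_ordinal(polar_split)
print(f"{alpha_adjacent:.3f} {alpha_polar:.3f}")  # prints: 0.778 0.444
```

On these toy panels a True vs Mostly True split is penalized less than a True vs False split, which is the property that makes the ordinal variant a natural fit for an ordered rubric.
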
See §7.5 (Statistical analysis) for the choice of agreement metric.

Lower bound on model error. For each claim, exactly one of the four verdict buckets is the correct answer. If we assume the panel's most popular bucket is the correct one (the most charitable assumption), the minimum number of models that picked a wrong verdict is:

- ≥1 model wrong on 67% of claims (any non-unanimous panel)
- ≥2 wrong on 45% of claims (3-2, 3-1-1, or no-majority splits)
- ≥3 wrong on 13% of claims (no bucket reaches a majority, so at most 2 can be right)

Relaxing the "most popular is correct" assumption can only raise these counts, never lower them. The actual error rates are likely higher still: even the 33% of cases where all five agree can, and likely do, include shared blind spots.

2 Substantive vs nuance disagreement

On 34% of claims (343 / 1,000; 95% CI: 31–37%), at least two frontier models pick verdicts that are 2 or more buckets apart in our 4-bucket rubric, a disagreement that goes beyond calibration.

Not every disagreement is equal. A "True" vs "Mostly True" split is a confidence-calibration shift. A "True" vs "False" split is a substantive disagreement about the answer. We measure this as the max pairwise bucket distance across the 5 verdicts on each claim, where the verdicts are ordered True (0) → Mostly True (1) → Misleading (2) → False (3).

| Distance | Interpretation | Claims | Share (95% CI) |
|---|---|---|---|
| 0 | Full unanimity (all 5 picked the same bucket) | 328 | 33% (30–36%) |
| 1 | Nuance only (e.g. True ↔ Mostly True) | 329 | 33% (30–36%) |
| 2 | Substantive (True ↔ Misleading, or Mostly True ↔ False) | 132 | 13% (11–15%) |
| 3 | Polar (True ↔ False) | 211 | 21% (19–24%) |
| ≥2 | Buckets apart (substantive or polar) | 343 | 34% (31–37%) |

Caveat. Bucket distance treats True / Mostly True / Misleading / False as an ordinal scale; an equal-spaced interpretation is a simplification.
A 2-bucket gap can still reflect rubric ambiguity, temporal-framing differences, or differing interpretations of "Misleading." We report it as a coarse "substantive vs nuance" indicator, not a metric of error magnitude.

3 Model-vs-model agreement

Highest peer agreement: Gemini 3 Pro × Gemini 3 Pro + Search (75%), which is unsurprising since they share a base model. Lowest: Claude Opus 4.7 × Gemini 3 Pro, Claude Opus 4.7 × Gemini 3 Pro + Search, and Gemini 3 Pro × Sonar Pro (53%); three pairs tie at the floor.

How often each pair of frontier models picked the same verdict label, across all claims in the corpus (95% CIs in parentheses):

| | GPT-5.4 | Claude Opus 4.7 | Gemini 3 Pro | Gemini 3 Pro + Search | Sonar Pro |
|---|---|---|---|---|---|
| GPT-5.4 | – | 65% (62–68%) | 65% (62–68%) | 60% (57–63%) | 60% (57–63%) |
| Claude Opus 4.7 | 65% (62–68%) | – | 53% (50–56%) | 53% (50–56%) | 58% (55–61%) |
| Gemini 3 Pro | 65% (62–68%) | 53% (50–56%) | – | 75% (72–77%) | 53% (50–56%) |
| Gemini 3 Pro + Search | 60% (57–63%) | 53% (50–56%) | 75% (72–77%) | – | 58% (55–61%) |
| Sonar Pro | 60% (57–63%) | 58% (55–61%) | 53% (50–56%) | 58% (55–61%) | – |

4 Per-model behavior

Two angles on the same five models: how each one distributes its verdicts (4.1), and how often each one's verdict matches the strict majority of the other four (4.2).

4.1 Verdict distribution

Some models concentrate verdicts at the True/False poles; others distribute more broadly across the middle two buckets. This reflects model-level decision priors interacting with the specific claims; without ground truth, we can't separate the two. The table below shows the share of claims each model assigned to each bucket, with 95% Wilson CIs alongside each share.


| Model | True | Mostly True | Misleading | False |
|---|---|---|---|---|
| GPT-5.4 | 42% (39–45%) | 16% (14–19%) | 12% (10–14%) | 30% (28–33%) |
| Claude Opus 4.7 | 38% (35–41%) | 26% (23–29%) | 19% (17–22%) | 17% (15–20%) |
| Gemini 3 Pro | 54% (51–57%) | 3% (2–4%) | 3% (2–4%) | 40% (37–43%) |
| Gemini 3 Pro + Search | 52% (49–55%) | 4% (3–5%) | 9% (7–11%) | 35% (32–38%) |
| Sonar Pro | 35% (32–38%) | 23% (21–26%) | 16% (14–18%) | 26% (23–28%) |

4.2 Agreement with the rest of the panel

Across the five models, peer-majority agreement ranges from 69% to 81%. This is peer-alignment in this corpus, not correctness: no model is treated as ground truth here, and eligible n differs per row.

For each model, how often does its verdict match the strict majority (≥3/4) of the other four? A claim is eligible only when a ≥3/4 majority exists among the other four.

| Model | Agreement w/ peer majority (95% CI) | Eligible n | Ineligible | Tier |
|---|---|---|---|---|
| GPT-5.4 | 81% (78–84%) | 650 | 350 | parametric |
| Claude Opus 4.7 | 70% (67–74%) | 691 | 309 | parametric |
| Gemini 3 Pro | 77% (74–80%) | 683 | 317 | parametric |
| Gemini 3 Pro + Search | 76% (73–79%) | 693 | 307 | retrieval |
| Sonar Pro | 69% (66–73%) | 675 | 325 | retrieval |

5 Detailed results

5.1 Per-domain frontier disagreement

Denominator per row: claims in that domain (the Claims column). 95% CIs are in parentheses.

| Domain | Claims | Any disagreement | Substantive (≥2 buckets) | No majority |
|---|---|---|---|---|
| Finance | 75 | 67% (55–76%) | 39% (28–50%) | 20% (13–30%) |
| General | 179 | 68% (60–74%) | 40% (33–48%) | 12% (8–17%) |
| Health | 171 | 71% (64–78%) | 29% (23–36%) | 12% (8–17%) |
| History | 131 | 53% (44–61%) | 24% (17–32%) | 13% (8–20%) |
| Legal | 48 | 77% (63–87%) | 40% (27–54%) | 19% (10–32%) |
| Politics | 168 | 70% (62–76%) | 38% (31–46%) | 8% (5–13%) |
| Science | 151 | 68% (60–75%) | 36% (29–44%) | 21% (15–28%) |
| Tech | 77 | 69% (58–78%) | 31% (22–42%) | 8% (4–16%) |

5.2 Per-verdict panel agreement

When the panel does land on a middle bucket, it almost never converges.
Mostly True and Misleading majorities reach unanimity at most 5% of the time, vs 43–47% for True and False majorities. Consistent with this, work on a different real-world corpus (17,856 PolitiFact claims with a single-family Llama-3 ablation; Schwab et al. 2025, https://arxiv.org/abs/2502.08909) finds nuanced labels are where fact-check verdict models concentrate their errors. That is a related observation from a different methodological setup (single-family ablation, not a frontier panel).

Denominator: claims with a strict ≥3/5 frontier majority on this verdict. The table shows which verdict zones the panel is most and least confident about.

| Majority verdict | Eligible n | Unanimous (5/5) | Majority only (3-4 of 5) |
|---|---|---|---|
| True | 438 | 47% (42–51%) | 53% (49–58%) |
| Mostly True | 76 | 0% (0–5%) | 100% (95–100%) |
| Misleading | 74 | 5% (2–13%) | 95% (87–98%) |
| False | 280 | 43% (37–49%) | 57% (51–63%) |

Viewed from the other direction, of the 328 claims where all 5 frontier models converged on the same verdict, the distribution across verdicts is:

| Unanimous verdict | Claims | Share of unanimous (95% CI) |
|---|---|---|
| True | 204 | 62% (57–67%) |
| Mostly True | 0 | 0% (0–1%) |
| Misleading | 4 | 1% (0–3%) |
| False | 120 | 37% (32–42%) |

6 Data

1,000 claims: the most recent real-world user submissions to a fact-checking platform that pass every eligibility filter listed under Exclusions below. None of these claims is older than February 15, 2026. Unless otherwise stated, every metric on this page uses this set as its denominator; tables that use a different denominator (e.g. claims with a strict ≥3/5 frontier majority on a verdict) state it inline.

Provenance

These claims were submitted to Lenz, a fact-checking platform. We chose this corpus because it represents organic real-world fact-check requests rather than curated benchmark items. Lenz's own verdict on each claim is not used in this analysis; this paper measures frontier-model disagreement only, not Lenz vs the frontier.
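
Because every share on this page is a count over a stated denominator, the percentages and intervals can be rechecked from the counts alone. The sketch below assumes the reported intervals are 95% Wilson score intervals; §4.1 names Wilson CIs for its table, and applying the same method to the other tables is an assumption. The helper names `wilson_interval` and `as_percent` are illustrative and not taken from the analysis code.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion, returned as (low, high)."""
    if n <= 0:
        raise ValueError("n must be positive")
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


def as_percent(successes, n):
    """Format a share and its interval the way the tables present them."""
    low, high = wilson_interval(successes, n)
    return f"{successes / n:.0%} ({low:.0%}–{high:.0%})"


# Counts quoted in the text and tables above.
print(as_percent(672, 1000))  # any disagreement, full corpus
print(as_percent(4, 74))      # unanimous share among Misleading majorities
print(as_percent(0, 76))      # unanimous share among Mostly True majorities
```

Under these assumptions the three calls give 67% (64%–70%), 5% (2%–13%), and 0% (0%–5%), matching the figures reported above. The Wilson interval is a reasonable choice for rows like 0 / 76, where a simple normal approximation would collapse to a zero-width interval.
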

Claim normalization

The atomic claim field in the CSV is not the user's raw submission. It's the output of Lenz's framing step (/how-it-works), which strips emotional language and bias and distills the input into a single neutral, testable proposition anchored to the submission date. Frontier models were evaluated on the framed claim, not the raw text. For example, a user who types "Canadian authorities are throwing Christians in jail for quoting the Bible" is rated on the proposition "As of April 4, 2026, Canadian authorities have jailed individuals for publicly quoting the Bible because of their Christian beliefs."

Exclusions

The corpus excludes:

- Claims marked private by the submitting user
- Claims from platform staff, internal accounts, or agent/API submissions (only real user web submissions appear in the corpus)
- Claims with editorial status pending (not yet reviewed) or hidden (either depublished after editorial review, or auto-flagged at submission time by Lenz's PII screening step for containing personal information about non-public individuals)
- Near-duplicate claims: pairs within a cosine distance of 0.2 on OpenAI text-embedding-3-small embeddings (1536-dim) of the atomic claim are collapsed to a single canonical row. The newer claim becomes canonical when the proposition is time-dependent; otherwise the existing claim with the most views on Lenz wins. Only canonicals appear in this corpus.
- Claims where at least one of the five frontier models failed to produce a parseable verdict, even after one retry. Most of these residual errors come from Gemini's grounded-search API, which occasionally returns malformed responses; the rest are rare Anthropic refusals. Only claims with all five models successful are included in the cohort.
- Claims older than 180 days (recency window applied at harvest time)

7 Methodology

7.1 Model selection

Five frontier models, chosen to cover two capability surfaces:

- Parametric (training-only): GPT-5.4 (OpenAI), Claude Opus 4.7 (Anthropic), Gemini 3 Pro (Google)
- Retrieval-augmented: Gemini 3 Pro + Search (Google), Sonar Pro (Perplexity)

7.2 Prompt

Each claim is presented with an "as of YYYY-MM-DD" anchor matching the submission date, asking the model to pick one of four verdicts:

Classify this claim as of