General-purpose large language models outperform specialized clinical AI

General-purpose large language models GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6 outperformed specialized clinical AI tools OpenEvidence and UpToDate Expert AI on medical knowledge tests, expert alignment, and real clinical queries, according to a study published in Nature Medicine. The frontier LLMs achieved higher accuracy on MedQA questions and HealthBench scores, and were preferred by clinicians in blinded reviews of real-world queries, challenging the assumption that domain-specific training is necessary for superior clinical performance.

Abstract

Specialized clinical artificial intelligence (AI) tools are entering medical practice despite scarce independent evaluation. We quantitatively evaluate two clinical AI tools, OpenEvidence and UpToDate Expert AI, built on large language models (LLMs) against three frontier LLMs: GPT-5.2, Gemini 3.1 Pro and Claude Opus 4.6. Our evaluation has three stages: (1) 500 MedQA questions testing medical knowledge, (2) 500 HealthBench items measuring alignment with clinicians and (3) the real clinical queries (RCQ) benchmark, built from 100 de-identified queries from physicians to a general-purpose language model in a live clinical environment. For the RCQ benchmark, 12 US clinicians performed randomized, blinded review of model outputs, producing 1,800 model–question annotations. Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ. These findings highlight the need for independent, real-world evaluation of AI tools before they enter clinical settings.

Main

Specialized clinical artificial intelligence (AI) tools are entering medical practice at scale 1,2. These proprietary large language model (LLM)-based tools promise superior clinical performance to general-purpose frontier LLMs as a result of domain-specific training or retrieval-augmented generation (RAG). Yet, their architectures, base models and training pipelines are not public. Clinicians and health systems must therefore assess their value and safety without independent evidence. Conversely, large training corpora and extensive alignment of frontier LLMs may enable them to challenge clinical AI tools without domain-specific modification. We test this hypothesis by comparing clinical AI tools (OpenEvidence 3 and UpToDate Expert AI 1) to leading general-purpose LLMs (OpenAI GPT-5.2, Google Gemini 3.1 Pro Preview and Anthropic Claude Opus 4.6).
Later, we include auto-enabled Google Search AI Overview as a real-world control frequently encountered by physicians 2.

Our evaluation (Fig. 1) has three stages: (1) 500 US Medical Licensing Examination-style MedQA 4 questions assessing medical knowledge, (2) 500 HealthBench items evaluating agreement with expert clinicians and (3) 100 real clinical queries (RCQ) drawn from physician LLM queries during live clinical deployment. The RCQ stage underwent randomized, blinded review by 12 US clinicians, producing 1,800 model–question annotations. The combined analysis spans multiple-choice reasoning, expert clinical judgment and everyday clinician use 5.

General-purpose LLMs outperformed clinical AI tools on the MedQA questions (Fig. 2a and Extended Data Fig. 1a,b). Among frontier LLMs, Gemini achieved the highest accuracy at 97.4% (95% confidence interval (CI) 95.6%–98.5%), followed by GPT at 94.2% (91.8%–95.9%) and Claude at 90.2% (87.3%–92.5%). Clinical tools scored lower, with OpenEvidence achieving an accuracy of 89.6% (86.6%–92.0%) and UpToDate achieving 88.4% (85.3%–90.9%). Gemini outperformed all other models (McNemar P < 1 × 10⁻⁴ versus OpenEvidence, UpToDate and Claude; P = 0.02 versus GPT). GPT outperformed OpenEvidence (P = 0.008), UpToDate (P = 0.0004) and Claude (P = 0.04).

HealthBench (Fig. 2b) was graded by a panel of LLM judges to mitigate single-model bias. Scores reflect the proportion of rubric points achieved, scaled 0–100. GPT scored highest at 88.0 (95% CI 85.9–90.1), followed by Gemini at 79.3 (76.6–81.9) and Claude at 77.0 (74.2–79.9); both clinical tools scored lower (OpenEvidence scoring 62.6 (59.3–65.9) and UpToDate scoring 61.3 (58.0–64.6)). GPT outperformed all other models (Wilcoxon P < 10⁻⁹), and the two clinical tools did not differ (P = 0.6).
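As a cross-check on the MedQA intervals quoted above, the following is a minimal sketch of a Wilson score interval, the interval type that Methods names for accuracies. The function name `wilson_interval` is hypothetical and is not taken from the study's code, and the count of 487 correct answers is an assumption inferred from the reported 97.4% of 500 questions.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (illustrative helper)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided critical value of the standard normal, about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumed count: 97.4% of 500 questions corresponds to 487 correct answers.
low, high = wilson_interval(487, 500)
print(f"{100 * low:.1f}%-{100 * high:.1f}%")
```

Under that assumed count the sketch gives 95.6%–98.5%, which agrees with the interval reported for Gemini. Unlike a normal-approximation interval, the Wilson interval stays inside 0–100% when accuracy is close to the ceiling.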
In theme-level analysis (Extended Data Fig. 1c,d), GPT ranked first or tied for first in all seven categories, while OpenEvidence and UpToDate ranked lowest or tied for lowest in all seven categories, with differences from GPT significant in six of the categories (P ≤ 0.004; exception: responding under uncertainty, P = 1.00).

To develop the RCQ benchmark, we sampled 100 anonymous clinician queries to the NYU Langone Health Insurance Portability and Accountability Act-compliant GPT instance. Twelve blinded clinicians scored six models’ responses across four dimensions (clinical correctness, completeness, safety/harm avoidance and clarity) on a 1–4-point scale (Extended Data Fig. 2). Three raters were randomly assigned to evaluate each response. We included Google Search AI Overview in the RCQ evaluation because it is routinely encountered by clinicians. After excluding 32 refusals, 568 responses remained. The six models differed significantly (Friedman P < 10⁻⁹), with two performance tiers emerging (Fig. 2c). Frontier LLMs formed the first: Gemini (mean aggregate 3.62; 95% CI 3.56–3.68), GPT (3.54; 3.47–3.61) and Claude (3.52; 3.44–3.59), with no significant differences between them. Clinical tools and Google AI Overview followed: OpenEvidence (3.24; 3.17–3.32), UpToDate AI (3.17; 3.09–3.25) and Google AI Overview (3.27; 3.18–3.35), also without significant differences. All nine significant pairwise comparisons were between tiers (rank-biserial r = 0.5–0.9), meaning frontier models outperformed clinical tools on most individual questions, not just on average. After adjusting for rater leniency, clinical AI tools including Google AI had 49–87% lower odds of receiving a higher rating than Gemini (odds ratio 0.13–0.51; all P < 0.0001). In a sensitivity linear mixed model, this corresponded to 0.36–0.44 points lower on the 1–4-point scale (all P < 0.0001).
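The step from an odds ratio to "percent lower odds" is simple arithmetic, sketched below. Both function names are hypothetical. The Wald interval form exp(β ± 1.96 × SE) follows the description in the Extended Data Fig. 3 caption; no coefficient or standard error from the study is used here, because those values are not reported in the text.

```python
from math import exp


def percent_lower_odds(odds_ratio: float) -> float:
    """Percentage reduction in odds implied by an odds ratio (negative if OR > 1)."""
    if odds_ratio <= 0:
        raise ValueError("an odds ratio must be positive")
    return (1.0 - odds_ratio) * 100.0


def odds_ratio_with_wald_ci(beta: float, se: float, z: float = 1.96) -> tuple[float, float, float]:
    """Odds ratio exp(beta) with a Wald interval exp(beta - z*se), exp(beta + z*se)."""
    return exp(beta), exp(beta - z * se), exp(beta + z * se)


# The two ends of the reported odds-ratio range, relative to Gemini.
for reported_or in (0.13, 0.51):
    print(f"OR {reported_or:.2f} -> {percent_lower_odds(reported_or):.0f}% lower odds")
```

The two ends of the reported range, 0.13 and 0.51, map to 87% and 49% lower odds, matching the 49–87% stated in the text.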
Google AI Overview scored as well as or better than OpenEvidence and UpToDate AI across all dimensions (Extended Data Fig. 3). The tier structure held across all four dimensions (Fig. 2d). Models differed most on clarity (Kendall’s W = 0.292) and least on clinical correctness (W = 0.141). OpenEvidence scored lowest on clarity (mean 2.84), suggesting its weakness was communication, not knowledge. Qualitatively, incomplete clinical content, safety-critical omissions and disorganized responses were common, particularly for OpenEvidence and Google AI Overview (Extended Data Table 1). UpToDate AI refused 19% of queries (Fig. 2e), more than all other models (1–3%; P < 0.01) except Google AI Overview (6%; P = 0.10). Safety outcomes (Fig. 2f,g) did not differ across models: none of the models produced more harmful content (Cochran’s Q = 4.00, P = 0.55) or hallucinations (Q = 5.00, P = 0.42) than any of the others. All 12 clinicians ranked the models similarly (Kendall’s W = 0.651, P = 2.3 × 10⁻⁷), placing frontier LLMs above clinical tools (Extended Data Fig. 4).

This study is an independent, quantitative comparison of clinical AI tools against frontier LLMs using real-world physician queries from the course of care. Clinical AI tools lagged behind frontier models on every evaluation: knowledge, expert alignment and real-world clinical use across multiple dimensions. Google AI Overview, an auto-enabled search feature, matched clinical AI tools in this benchmark. As the architecture of proprietary clinical AI tools is inaccessible, it is impossible to definitively establish a mechanistic explanation for their underperformance against general-purpose models.
Evidence shows that RAG, which is likely employed by both OpenEvidence 1 and UpToDate Expert AI, may actually negatively affect model performance when irrelevant material is retrieved or poorly integrated by the base model 2. Frontier LLMs may simply be better at the knowledge retrieval and reasoning that characterize most medical questions 6,7,8. They also benefit from faster iteration cycles, larger training corpora and greater alignment than specialist systems. The observed advantages of frontier general-purpose models may reflect the accelerated development and investment in these systems. Should scaling returns diminish, the relative value of domain-specific tuning, curated retrieval and clinician-in-the-loop optimization may increase. Our results should therefore be interpreted as a snapshot of a rapidly evolving landscape rather than a permanent ordering of approaches. In particular, deeply subspecialized medical tasks may favor more sophisticated, domain-specific adaptation 9,10,11.

This study has several limitations. Clinical tools lack public application programming interfaces (APIs), so they were queried through browser interfaces, which limited sample size and may have introduced differences in hidden prompts, retrieval behavior and output formatting. Standardized benchmarks have known issues such as data leakage 7; models may have been exposed to MedQA or HealthBench during training, though our RCQ benchmark is free from this contamination. HealthBench is an OpenAI-developed benchmark that relies on a small number of physicians for each rubric, and public documentation provides limited detail on its construction and evaluation.
Evaluation of OpenAI models, including the highest-scoring model on HealthBench, GPT-5.2, may be influenced by potential benchmark–developer overlap, including potential similarities in training data, optimization objectives or rubric design. Grading bias is also possible, as frontier models served as both evaluated systems and judges, although we used a multimodel panel to mitigate this effect. Accordingly, we view the blinded clinician evaluation on the RCQ benchmark as the primary evidence in this study, while HealthBench should be interpreted as supplementary 5. More broadly, industry-created benchmarks may systematically favor the systems developed by their creators, reinforcing the need for independently constructed evaluation instruments. The RCQ benchmark partially addresses this concern: it is derived from real clinical queries, evaluated by blinded clinicians and free from training-set contamination. Additionally, recently proposed safety-focused evaluations of LLM medical recommendations such as the NOHARM 12 framework suggest that knowledge and communication benchmarks may not fully capture clinical risk. Related work also points to health-system-grounded evaluation frameworks, such as institution-specific operational tasks and prediction settings embedded in local clinical workflows, as an important complement to public, industry-authored benchmarks, because they may better capture whether a model is clinically useful in a given care environment 13,14. Finally, our evaluation did not assess response latency or citation quality. These factors are important for real-world clinical deployment and workflow integration, and may differ substantially between API-accessed frontier models and subscription-based clinical tools (Extended Data Table 2).
Future work should systematically compare these practical dimensions alongside accuracy and safety.

Clinical AI tools may carry institutional legitimacy and are likely safe for routine use, but our results show that they are not superior to frontier models on knowledge, communication or clinical alignment. The superior performance of frontier models in our study suggests that scale, alignment and cross-domain reasoning may outweigh domain-specific tuning as determinants of medical competency for particular tasks, a finding with implications for procurement, reimbursement and regulatory oversight. The path forward may ultimately lie with hospital-specific LLMs that leverage institutional data 13,14 to mitigate external harm, along with careful use of frontier models for less-sensitive tasks 15. As generative LLMs become integrated into healthcare at the enterprise, individual clinician and consumer levels, there is an increasing need for rigorous, independent evaluation on real-world tasks 16.

Methods

Study approval and benchmark construction

This study was approved by the NYU Langone Institutional Review Board (i23-00510). We randomly sampled (seed = 62) 500 US Medical Licensing Examination-style questions from MedQA 4 and 500 single-turn prompts from HealthBench 5. The 1,000 items were used to evaluate three frontier LLMs via API: GPT-5.2 (accessed February 2026, GPT-5.2-2025-12-11), Gemini 3.1 Pro Preview (accessed February 2026) and Claude Opus 4.6 (February 2026). Two clinical tools, OpenEvidence (accessed September 2025 and February 2026) and UpToDate Expert AI (accessed November 2025 and February 2026), were queried manually through browser interfaces.

Generation parameters

Frontier model generations were conducted using fixed, deterministic parameters. Search tools were enabled for all runs.
The temperature was set to 0.0 to eliminate sampling variability, and a fixed generation seed of 62 was used. Although matching response length or format across systems would reduce one source of variability, we opted against this approach because output structure, verbosity and formatting are integral to each system’s clinical interface and directly influence dimensions such as clarity; normalizing these features would obscure real usability differences that clinicians encounter in practice.

MedQA scoring

GPT-4.1 (gpt-4.1-2025-04-14) extracted each model’s final answer and scored it against the reference key. Regex extraction ran in parallel as a consistency check; disagreements were manually verified. Significance was tested with exact McNemar’s tests, Holm–Bonferroni corrected. Results are reported as accuracies with Wilson’s 95% CI.

HealthBench scoring

Responses were graded on the proportion of rubric points achieved across five axes: accuracy, completeness, communication quality, context awareness and instruction following. Proportions of axes met are reported with Wilson’s 95% CI. Responses were also grouped into seven themes: emergency referrals, context seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty and response depth. Grading used panel-majority voting across three judges (Claude Opus 4.6, Gemini 3.1 Pro Preview and GPT-5.2); the judge prompt is provided in Extended Data Fig. 5 (ref. 5). Pairwise significance was tested with Wilcoxon signed-rank tests, Holm–Bonferroni corrected. Overall HealthBench scores and theme scores are reported as mean scores with normal-approximation 95% CI.

Real clinical queries

We sampled 100 de-identified queries from the NYU Langone Health Insurance Portability and Accountability Act-compliant GPT instance. Each query was submitted to six models: the three frontier LLMs, two clinical tools and Google Search AI Overview.
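Methods states that MedQA accuracy differences were tested with exact McNemar’s tests. A minimal sketch of that test on paired correct/incorrect outcomes is given below. The function name and the discordant counts in the example are hypothetical, because per-question outcomes are not reported in the text, and the Holm–Bonferroni correction applied in the study is omitted here.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the two discordant counts.

    b: questions model A answered correctly and model B incorrectly.
    c: questions model B answered correctly and model A incorrectly.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts for two models answering the same questions.
print(mcnemar_exact(3, 15))
```

Only questions on which the two models disagree contribute to the test, which is why it suits a design where every model answers the same 500 items.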
Twelve clinician raters, blinded to model identity, scored responses on four dimensions (clinical correctness, completeness, safety/harm avoidance and clarity) using a 1–4-point scale (Extended Data Fig. 2), with binary flags for harmful content and hallucination. Three raters scored each question–model pair; no rater was guaranteed all six responses for a given question (Extended Data Fig. 6). For statistical analysis, an item (model response–question pair) was classified as harmful or containing a hallucination if a majority of the three reviewers assigned the corresponding label.

RCQ statistical analysis

Refusals were flagged and manually verified; if any rater flagged a question–model pair as a refusal, all ratings for that pair were excluded, yielding 1,704 ratings across 568 items (32 refusals discarded). An aggregate score was computed as the mean across four dimensions, averaged over three raters per item. Inter-rater reliability was assessed with Krippendorff’s alpha (ordinal). Item-level agreement was fair (α = 0.10–0.20), though disagreements fell between adjacent scores (within ±1 agreement 89–95%). When collapsed to acceptable (3–4) versus unacceptable (1–2), agreement was higher (prevalence-adjusted bias-adjusted kappa = 0.55–0.83). Agreement on binary safety flags was high (prevalence-adjusted bias-adjusted kappa = 0.86 for harm, 0.95 for hallucination). We restricted paired comparisons to 74 complete questions where all six models had non-refusals (complete-case bias: maximal difference 0.034 points). The Friedman test assessed overall differences; the Nemenyi post hoc test identified differing pairs while controlling family-wise error. Wilcoxon signed-rank tests with Holm–Bonferroni correction provided pairwise P values. Effect sizes were quantified with rank-biserial correlations (95% CIs from 5,000 bootstrap iterations).
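The text does not give the formula used for the rank-biserial effect size or the bootstrap scheme. The sketch below assumes the matched-pairs form that accompanies a Wilcoxon signed-rank test (positive minus negative signed-rank sums, divided by their total) and a percentile bootstrap that resamples questions. The function names, the seed and the example scores are hypothetical.

```python
import random


def rank_biserial_paired(x: list[float], y: list[float]) -> float:
    """Matched-pairs rank-biserial r for paired scores x versus y (assumed form)."""
    # Zero differences are dropped, as in the Wilcoxon signed-rank test.
    diffs = [d for d in (round(a - b, 9) for a, b in zip(x, y)) if d != 0]
    if not diffs:
        return 0.0
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # tied |differences| share their mean rank
        for pos in range(start, end + 1):
            ranks[order[pos]] = average_rank
        start = end + 1
    positive = sum(r for r, d in zip(ranks, diffs) if d > 0)
    negative = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return (positive - negative) / (positive + negative)


def bootstrap_ci(x: list[float], y: list[float], n_boot: int = 5000,
                 seed: int = 0, alpha: float = 0.05) -> tuple[float, float]:
    """Percentile bootstrap interval for r, resampling questions with replacement."""
    rng = random.Random(seed)
    n = len(x)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(rank_biserial_paired([x[i] for i in idx], [y[i] for i in idx]))
    stats.sort()
    return stats[int((alpha / 2) * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]


# Hypothetical per-question aggregate scores for two models on the same questions.
model_a = [3.75, 3.50, 4.00, 3.25, 3.50, 3.75]
model_b = [3.25, 3.50, 3.00, 3.50, 3.00, 3.25]
print(rank_biserial_paired(model_a, model_b), bootstrap_ci(model_a, model_b))
```

A value of r near 1 means the first model scored higher on nearly every question where the two differed, which is the reading the Results give to the between-tier effect sizes of 0.5–0.9.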
Kruskal–Wallis tests on all 568 items served as a sensitivity analysis; results were concordant. Cumulative link models (proportional odds logistic regression) with rater fixed effects were the primary regression. Linear mixed models with random rater intercepts served as a sensitivity check. Binary flags were compared with Cochran’s Q and pairwise McNemar tests, Holm–Bonferroni corrected. Refusal rates were compared with Fisher’s exact test. All tests were two-sided at α = 0.05.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The MedQA (https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options) and HealthBench (https://huggingface.co/datasets/openai/healthbench) benchmarks are publicly available for download on HuggingFace. The specific 500-item subsets used in this study can be reproduced using the random seed (seed = 62) described in Methods. The real clinical queries (RCQ) benchmark was derived from de-identified clinician queries collected under NYU Langone IRB protocol i23-00510. Because these queries originated in a clinical environment, the RCQ dataset is not available for public use due to institutional review and data use agreement.

Code availability

The code supporting this study is publicly available at https://github.com/nyuolab/clinical-llm-benchmarks.

References

1. OpenEvidence, the fastest-growing application for physicians in history, announces $210 million round at $3.5 billion valuation. Cision PR Newswire https://www.prnewswire.com/news-releases/openevidence-the-fastest-growing-application-for-physicians-in-history-announces-210-million-round-at-3-5-billion-valuation-302505806.html (2025).
2. HLTH 2025: Wolters Kluwer showcases UpToDate Expert AI and workflow innovations. Wolters Kluwer https://www.wolterskluwer.com/en/news/uptodate-expert-ai-workflow-hlth-2025 (2025).
3. Amugongo, L. M., Mascheroni, P., Brooks, S., Doering, S. & Seidel, J. Retrieval augmented generation for large language models in healthcare: a systematic review. PLOS Digit. Health 4, e0000877 (2025).
4. Jin, D. et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Appl. Sci. (Basel) 11, 6421 (2021).
5. Arora, R. K. et al. HealthBench: evaluating large language models towards improved human health. Preprint at https://doi.org/10.48550/arXiv.2505.08775 (2025).
6. Vishwanath, K. et al. Medical large language models are easily distracted. Preprint at https://doi.org/10.48550/arXiv.2504.01201 (2025).
7. Vishwanath, K., Stryker, J., Alyakin, A., Alber, D. A. & Oermann, E. K. MedMobile: a mobile-sized language model with clinical capabilities. BMJ Digit. Health AI 1, e000068 (2025).
8. Wu, E., Wu, K. & Zou, J. ClashEval: quantifying the tug-of-war between an LLM’s internal prior and external evidence. In Advances in Neural Information Processing Systems 37 33402–33422 (Neural Information Processing Systems Foundation, 2024).
9. Alyakin, A. et al. CNS-Obsidian: a neurosurgical vision-language model built from scientific publications. Neurosurgery https://doi.org/10.1227/neu.0000000000004070 (2026).
10. O’Sullivan, J. W. et al. A large language model for complex cardiology care. Nat. Med. 32, 616–623 (2026).
11. Nori, H. et al. Sequential diagnosis with language models. Preprint at https://doi.org/10.48550/arXiv.2506.22405 (2025).
12. Wu, D. et al. First, do NOHARM: towards clinically safe large language models. Preprint at https://doi.org/10.48550/arXiv.2512.01241 (2025).
13. Jiang, L. Y. et al. Generalist foundation models are not clinical enough for hospital operations. Preprint at https://doi.org/10.48550/arXiv.2511.13703 (2025).
14. Jiang, L. Y. et al. Health system-scale language models are all-purpose prediction engines. Nature 619, 357–362 (2023).
15. Alber, D. A. et al. Medical large language models are vulnerable to data-poisoning attacks. Nat. Med. 31, 618–626 (2025).
16. Malhotra, K. et al. Health system-wide access to generative artificial intelligence: the New York University Langone Health experience. J. Am. Med. Inform. Assoc. 32, 268–274 (2025).

Acknowledgements

We acknowledge N. Mherabi and D. Bar-Sagi for their support of medical AI research at NYU Langone. We thank M. Constantino and the NYULH High-Performance Computing (HPC) Team for computing resources essential to our work.

Funding

E.K.O. is supported by the National Cancer Institute’s Early-Stage Surgeon Scientist Program (3P30CA016087-41S1) and the W.M. Keck Foundation. This work was supported by the Institute for Information & Communications Technology Planning and Evaluation (IITP) grant funded by the Ministry of Science and ICT (MSIT) of the Republic of Korea government (No. RS-2019-II190075, Artificial Intelligence Graduate School Program (KAIST); No. RS-2024-00509279, Global AI Frontier Lab).
The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Contributions

Y.A. and E.K.O. supervised the study. K.V., A.A., D.A.A. and E.K.O. conceptualized and established the study design. K.V. designed and performed the LLM evaluations and scoring. K.V. developed the clinical evaluation platform. M.G., K.V., A.A. and D.A.A. performed the statistical analysis. A.H., S.N.N., C.O., N.J.M., H.A.K., J.V.L., J.J.Y., W.R.S., A.V., D.B.H., A.A. and D.A.A. contributed to study evaluation and data review. K.V. wrote the initial draft. K.V., M.G. and A.A. developed the figures. All authors reviewed and approved the final paper.

Ethics declarations

Competing interests

E.K.O. reports equity in MarchAI and Artisight, spousal employment by Eikon Therapeutics, and consulting for Sofinnova Partners and Google. The remaining authors declare no competing interests.

Peer review

Peer review information

Nature Medicine thanks Leo Anthony Celi and Stephen Gilbert for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Mattia Andreoletti, in collaboration with the Nature Medicine team.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Performance of generative AI models across medical knowledge and clinical reasoning benchmark subcategories.
a, MedQA accuracy by organ system and b, by competency for five models (OpenEvidence, UpToDate Expert AI, GPT-5.2, Gemini 3.1 Pro, Claude Opus 4.6). Data are presented as accuracy ± Wilson’s 95% CI.
Sample sizes for each system are as follows, in the same order from left to right as shown in panel a: 5, 34, 32, 31, 49, 50, 36, 41, 50, 54, 34, 35, 26, 10, 4, 7. Sample sizes for each competency are as follows, in the same order from left to right as shown in panel b: 240, 116, 131, 4, 5, 2. c, HealthBench scores by theme and d, by grading axis. Data are presented as mean score ± normal-approximation 95% CI in panel c and proportion met ± Wilson’s 95% CI in panel d. Sample sizes for each theme are as follows, in the same order from left to right as shown in panel c: 81, 45, 90, 59, 80, 101, 44. Sample sizes for each axis are as follows, in the same order from left to right as shown in panel d: 451, 60, 214, 290, 86. In panels a–d, letters indicate significant pairwise differences (McNemar’s tests for a, b and d; Wilcoxon tests for c; P < 0.05); models sharing a letter do not differ.

Extended Data Fig. 2 RCQ clinician evaluator rubric.
Clinician evaluators were instructed to follow a four-point rating scale across four axes (dimensions) relevant to medical LLM queries. Binary flags were used for yes/no questions about harm and hallucination. Evaluators were instructed to note 1-1-1-1 scores for refusals, which were manually confirmed before removal from analysis.

Extended Data Fig. 3 Pairwise comparisons and regression modelling of clinician ratings across six AI tools.
a–d, Heatmaps of rank-biserial effect sizes (r) for all pairwise comparisons on four evaluation dimensions: clinical correctness (a), completeness (b), safety/harm avoidance (c) and clarity for clinicians (d). Each cell shows the rank-biserial r for the row tool versus the column tool. Effect sizes were derived from two-sided Wilcoxon signed-rank tests on n = 74 complete-case clinical queries (the subset of the 100 total queries for which all six models returned non-refusal responses), computed independently for each dimension.
Cells outlined in bold denote statistically significant differences by two-sided Nemenyi post-hoc test following a significant omnibus Friedman test (family-wise α = 0.05 controlled across all 15 pairwise model comparisons via the studentized-range distribution). e, Forest plot of odds ratios from a cumulative link (proportional odds) model with rater fixed effects, estimating each tool’s odds of receiving a higher ordinal rating relative to Gemini 3.1 (reference). Points indicate the odds-ratio point estimate (exp(β)) and horizontal bars the 95% Wald confidence intervals (exp(β ± 1.96 × SE)) for each of the four dimensions. Models were fit separately for each dimension on n = 1,704 rater–item observations (568 non-refusal model–question pairs × 3 independent clinician raters per pair; 12 unique blinded U.S. clinician raters). Wald tests for each model-vs-reference coefficient are two-sided and unadjusted, as the CLM is the pre-specified primary regression. Odds ratios below 1.0 indicate lower odds of a higher rating compared with the reference. f, Forest plot of regression coefficients (β; mean difference on the 1–4 ordinal rating scale) from linear mixed models with random rater intercepts, fit separately for each dimension and for the aggregate score on the same n = 1,704 rater–item observations (568 items × 3 raters; 12 unique raters as grouping variable). Points represent the β point estimate and horizontal bars the 95% Wald confidence intervals (β ± 1.96 × SE); two-sided Wald P-values, unadjusted. Asterisks denote significance levels: *P < 0.05, **P < 0.01, ***P < 0.001.

Extended Data Fig. 4 Clinician rating distributions, head-to-head comparisons, and rater concordance on the RCQ.
Score distributions across four evaluation dimensions for six AI systems rated by 12 blinded clinician evaluators on a 1–4 scale (Score 1 = lowest, Score 4 = highest); hatched bars indicate model refusals.
Each stacked bar summarises all non-refusal ratings received by a given model on a given dimension (3 ratings per model–question pair; 100 questions per model): Gemini 3.1 Pro n = 294, GPT-5.2 n = 291, Claude Opus 4.6 n = 297, OpenEvidence n = 297, UpToDate Expert AI n = 243, Google AI Overview n = 282. Per-model n is identical across the four dimensions because refusals are defined at the model–question level. Brackets denote significant pairwise differences by two-sided Nemenyi post-hoc test following a significant omnibus Friedman test on n = 74 complete-case questions (family-wise α = 0.05 controlled across all 15 pairwise model comparisons per dimension); *P < 0.05, **P < 0.01, ***P < 0.001. Bottom left: head-to-head win rate matrix showing the proportion of queries on which each row model outscored each column model on the aggregate score across n = 74 complete-case questions (ties contributed 0.5 to the row model). Asterisks indicate pairs with a significant difference by two-sided Wilcoxon signed-rank test with Holm–Bonferroni correction across the 15 pairwise comparisons (P < 0.05). Bottom center: mean rank profiles (1 = best) across the four dimensions ordered by the degree of model differentiation within each dimension (Kendall’s W, computed across n = 74 complete-case questions). Bottom right: individual rater rankings (gray lines) versus consensus mean rank (black line) for the aggregate score, computed per rater from each clinician’s mean aggregate score per model. Inter-rater concordance on model ranking was assessed with a two-sided Friedman test across the six models, from which Kendall’s W was derived (\(W = \chi^{2} / [n(k - 1)]\); k = 6 models). Agreement was strong (W = 0.651, Friedman \(\chi_{5}^{2} = 39.05\), P = 2.3 × 10⁻⁷; n = 12 independent clinician raters).

Extended Data Fig. 5 HealthBench grader prompt.
A panel of three LLM judges from separate model families was employed to evaluate HealthBench responses, prompted with the above instructions.

Extended Data Fig. 6 Clinical evaluation platform and sample question.
An example question retrieved from a clinician during routine clinical deployment of a HIPAA-compliant LLM. Evaluators were randomly assigned model-question pairs and blinded to model identity. Three clinicians reviewed each model-question pair.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Vishwanath, K., Alyakin, A., Ghosh, M. et al. General-purpose large language models outperform specialized clinical AI tools on medical benchmarks. Nat Med (2026). https://doi.org/10.1038/s41591-026-04431-5