Study Compares LLMs to Physicians in Clinical Reasoning

Source: letsdatascience.com · Published: 2026-06-15T23:19Z · Topic: large-language-models

A landmark study from Harvard Medical School and Beth Israel Deaconess Medical Center, published in Science, found that OpenAI's o1 series matched or exceeded physician performance across six clinical reasoning experiments, including real emergency department records. Co-senior authors Arjun Manrai and Adam Rodman caution that the results do not indicate AI is ready for autonomous practice and call for rigorous prospective clinical trials.

Published Jun 15, 2026 · 3 min read

JMIR published a news and perspectives article covering a landmark study from Harvard Medical School (HMS) and Beth Israel Deaconess Medical Center (BIDMC), originally published April 30, 2026 in Science (Brodeur et al., DOI: 10.1126/science.adz4433). Co-senior authors Arjun Manrai (HMS) and Adam Rodman, MD, MPH (BIDMC) report that OpenAI's o1 series matched or exceeded physician performance across all six clinical reasoning experiments, including one using real, unprocessed emergency department records from a Massachusetts hospital. The JMIR piece includes a follow-up interview with Rodman. The authors are clear that the results do not indicate AI is ready to practice medicine autonomously, and they call for rigorous prospective clinical trials to evaluate safe integration into clinical workflows.

What happened

JMIR published a news and perspectives article (e103526) summarizing a landmark study originally published April 30, 2026 in Science (Brodeur et al., DOI: 10.1126/science.adz4433). The study is from Harvard Medical School (HMS) and Beth Israel Deaconess Medical Center (BIDMC), not Stanford, as was incorrectly reported in the original enrichment. According to the HMS press release (EurekAlert, April 30, 2026), the co-senior authors are Arjun (Raj) Manrai, assistant professor of biomedical informatics at HMS, and Adam Rodman, MD, MPH, a hospitalist and clinical researcher at BIDMC.

Key findings

Across all six experiments, OpenAI's o1 series consistently matched or exceeded physician performance in diagnostic and management reasoning, per the AAAS/EurekAlert summary. The model's advantage was most pronounced in early-stage emergency department (ED) triage, where clinicians must make rapid decisions with minimal information. Crucially, the team did not pre-process the real-world electronic health record (EHR) data before testing. Rodman stated: "We didn't pre-process the data at all. The model is literally just processing data as it exists in the health record." Co-first author Peter Brodeur, MD, MA, noted: "Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent and we can't track progress anymore because we're already at the ceiling."

Authors' caution on deployment

Both co-senior authors stress that benchmark performance does not equal deployment readiness. Manrai stated: "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines. However, this does not mean AI will necessarily improve care -- how and where it should be deployed remain understudied, and we desperately need rigorous prospective trials to evaluate the impact of AI on clinical practice." Brodeur added: "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm. Humans should be the ultimate baseline when it comes to evaluating performance and safety."

Independent perspective

A related Perspective in Science by Ashley Hopkins and Erik Cornelisse (Flinders University) stated: "Accuracy on a defined task is only one dimension of deployment readiness. Clinical AI must also deliver equitable, cost-effective, and safe outcomes, supported by accountability, transparency, and ongoing monitoring. Without robust demonstrated effectiveness, equity, and safety, many AI systems will remain insufficient for clinical use."

What to watch

The study signals a potential turning point for medical AI evaluation frameworks. If text-based LLMs can match or surpass physician-level reasoning on real-world ED data, the field needs two things: new benchmarks that go beyond standardized test cases, and prospective clinical trials that measure outcomes, harms, calibration, and equity rather than task accuracy alone.

Scoring Rationale

Landmark peer-reviewed study published in Science (Brodeur et al., April 30, 2026) from Harvard Medical School and BIDMC -- one of the most rigorous LLM-vs-physician comparisons to date, using real unprocessed ED records across six experiments. The JMIR article (e103526) is a follow-up news and perspectives piece with author interview. Score reflects the strong methodology and Science publication, tempered by the fact that this LDS event covers a secondary commentary rather than the original paper's first publication.
