Study Compares LLMs to Physicians in Clinical Reasoning

A landmark study from Harvard Medical School and Beth Israel Deaconess Medical Center, published in Science, found that OpenAI's o1 series matched or exceeded physician performance across six clinical reasoning experiments, including real emergency department records. Co-senior authors Arjun Manrai and Adam Rodman caution that the results do not indicate AI is ready for autonomous practice and call for rigorous prospective clinical trials.

JMIR published a news and perspectives article covering a landmark study from Harvard Medical School (HMS) and Beth Israel Deaconess Medical Center (BIDMC), originally published April 30, 2026 in Science (Brodeur et al., DOI: 10.1126/science.adz4433). Co-senior authors Arjun Manrai (HMS) and Adam Rodman, MD, MPH (BIDMC) report that OpenAI's o1 series matched or exceeded physician performance across all six clinical reasoning experiments, including real, unprocessed emergency department records from a Massachusetts hospital. The JMIR piece includes a follow-up interview with Rodman. The authors are clear that the results do not indicate AI is ready to practice medicine autonomously, and they call for rigorous prospective clinical trials to evaluate safe integration into clinical workflows.

What happened

JMIR published a news and perspectives article (e103526) summarizing a landmark study originally published April 30, 2026 in Science (Brodeur et al., DOI: 10.1126/science.adz4433). The study is from Harvard Medical School and Beth Israel Deaconess Medical Center (BIDMC), not Stanford, as was incorrectly reported in the original enrichment. Co-senior authors are Arjun Raj Manrai, assistant professor of biomedical informatics at HMS, and Adam Rodman, MD, MPH, a hospitalist and clinical researcher at BIDMC, according to the HMS press release (EurekAlert, April 30, 2026).

Key findings

Across all six experiments, OpenAI's o1 series consistently matched or exceeded physician performance in diagnostic and management reasoning, per the AAAS/EurekAlert summary. The model's advantage was most pronounced in early-stage emergency department (ED) triage, where clinicians must make rapid decisions with minimal information. Crucially, the team did not pre-process the real-world EHR data before testing. Rodman stated: "We didn't pre-process the data at all. The model is literally just processing data as it exists in the health record."

Co-first author Peter Brodeur, MD, MA, noted: "Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent and we can't track progress anymore because we're already at the ceiling."

Authors' caution on deployment

Both co-senior authors stress that benchmark performance does not equal deployment readiness. Manrai stated: "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines. However, this does not mean AI will necessarily improve care -- how and where it should be deployed remain understudied, and we desperately need rigorous prospective trials to evaluate the impact of AI on clinical practice." Brodeur added: "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm. Humans should be the ultimate baseline when it comes to evaluating performance and safety."

Independent perspective

A related Perspective in Science by Ashley Hopkins and Erik Cornelisse (Flinders University) stated: "Accuracy on a defined task is only one dimension of deployment readiness. Clinical AI must also deliver equitable, cost-effective, and safe outcomes, supported by accountability, transparency, and ongoing monitoring. Without robust demonstrated effectiveness, equity, and safety, many AI systems will remain insufficient for clinical use."

What to watch

The study signals a potential turning point for medical AI evaluation frameworks. If text-based LLMs can match or surpass physician-level reasoning in real-world ED data, the field needs new benchmarks beyond standardized test cases, and prospective clinical trials that measure outcomes, harms, calibration, and equity, not just task accuracy.

Scoring Rationale

Landmark peer-reviewed study published in Science (Brodeur et al., April 30, 2026) from Harvard Medical School and BIDMC. It is one of the most rigorous LLM-vs-physician comparisons to date, using real, unprocessed ED records across six experiments. The JMIR article (e103526) is a follow-up news and perspectives piece with an author interview. The score reflects the strong methodology and Science publication, tempered by the fact that this LDS event covers a secondary commentary rather than the original paper's first publication.