JMIR published a news and perspectives article covering a landmark study from Harvard Medical School and Beth Israel Deaconess Medical Center (BIDMC), originally published April 30, 2026 in Science (Brodeur et al., DOI: 10.1126/science.adz4433). Co-senior authors Arjun Manrai (HMS) and Adam Rodman, MD, MPH (BIDMC) report that OpenAI's o1 series matched or exceeded physician performance across all six clinical reasoning experiments, including one using real, unprocessed emergency department records from a Massachusetts hospital. The JMIR piece includes a follow-up interview with Rodman. The authors are clear that the results do not indicate AI is ready to practice medicine autonomously, and they call for rigorous prospective clinical trials to evaluate safe integration into clinical workflows.
What happened
JMIR published a news and perspectives article (e103526) summarizing a landmark study originally published April 30, 2026 in Science (Brodeur et al., DOI: 10.1126/science.adz4433). The study is from Harvard Medical School and Beth Israel Deaconess Medical Center (BIDMC). Co-senior authors are Arjun (Raj) Manrai, assistant professor of biomedical informatics at HMS, and Adam Rodman, MD, MPH, a hospitalist and clinical researcher at BIDMC, according to the HMS press release (EurekAlert, April 30, 2026).
Key findings
Across all six experiments, OpenAI's o1 series consistently matched or exceeded physician performance in diagnostic and management reasoning, per the AAAS/EurekAlert summary. The model's advantage was most pronounced in early-stage emergency department (ED) triage, where clinicians must make rapid decisions with minimal information. Crucially, the team did not pre-process the real-world EHR data before testing. Rodman stated: "We didn't pre-process the data at all. The model is literally just processing data as it exists in the health record." Co-first author Peter Brodeur, MD, MA, noted: "Models are increasingly capable. We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent and we can't track progress anymore because we're already at the ceiling."
Authors' caution on deployment
Both co-senior authors stress that benchmark performance does not equal deployment readiness. Manrai stated: "We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines. However, this does not mean AI will necessarily improve care -- how and where it should be deployed remain understudied, and we desperately need rigorous prospective trials to evaluate the impact of AI on clinical practice." Brodeur added: "A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm. Humans should be the ultimate baseline when it comes to evaluating performance and safety."
Independent perspective
A related Perspective in Science by Ashley Hopkins and Erik Cornelisse (Flinders University) stated: "Accuracy on a defined task is only one dimension of deployment readiness. Clinical AI must also deliver equitable, cost-effective, and safe outcomes, supported by accountability, transparency, and ongoing monitoring. Without robust demonstrated effectiveness, equity, and safety, many AI systems will remain insufficient for clinical use."
What to watch
The study signals a potential turning point for medical AI evaluation frameworks. If text-based LLMs can match or surpass physician-level reasoning on real-world ED data, the field needs both new benchmarks that go beyond standardized test cases and prospective clinical trials that measure outcomes, harms, calibration, and equity -- not just task accuracy.
Scoring Rationale
Landmark peer-reviewed study published in Science (Brodeur et al., April 30, 2026) from Harvard Medical School and BIDMC -- one of the most rigorous LLM-vs-physician comparisons to date, using real unprocessed ED records across six experiments. The JMIR article (e103526) is a follow-up news and perspectives piece with author interview. Score reflects the strong methodology and Science publication, tempered by the fact that this LDS event covers a secondary commentary rather than the original paper's first publication.