General-purpose large language models outperform specialized clinical AI tools on medical benchmarks

General-purpose large language models outperformed specialized clinical AI tools on all three medical benchmarks, according to a study published in Nature Medicine. The findings highlight the need for independent, real-world evaluation of AI tools before clinical deployment.

This result does not surprise me at all. Here is part of the abstract:

Frontier LLMs outperformed clinical AI tools in all three evaluations. Clinical AI tools performed comparably to auto-enabled Google Search AI Overview on the RCQ. These findings highlight the need for independent, real-world evaluation of AI tools before they enter clinical settings.

From Krithik Viswanath, et al.: https://www.nature.com/articles/s41591-026-04431-5

As a side note, this point, and the more general version of it, is one big reason why a fairly large number of Emergent Ventures proposals are rejected rather quickly.