Expert Evaluation of Clinical AI Tools on Real Point-of-Care Clinical Queries

A study evaluating AI tools on real clinical queries from physicians found that a specialized clinical tool (OpenEvidence) outperformed frontier general-purpose models (Claude Opus 4.8, Gemini 3.1 Pro, GPT-5.5) across all five dimensions of clinical decision support, with win differences ranging from 25 to 39 percentage points. The findings highlight the need for evaluations based on real-world queries and expert judges, and demonstrate that targeted engineering can yield significant performance gains.

Published Jun 30, 2026

arXiv:2606.28960v1. Abstract: Physicians now pose millions of clinical questions to AI tools each week, yet these tools are evaluated largely on hypothetical or exam-style questions, not those actually asked in practice. We report a blinded evaluation built on 620 Real-world Point-Of-Care Queries (Real-POCQi) submitted to the OpenEvidence (OE) platform by physicians spanning 30 specialties, as well as 187 questions from HealthBench. A total of 149 practicing physicians across 36 states made head-to-head comparisons between answers from three frontier general-purpose models (Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5) and a specialized clinical tool (OE), with graders matched to each question's specialty. When comparing answers along five dimensions relevant to clinical decision support (accuracy, clinical utility, source quality, verifiability, and completeness), physicians scored the specialized tool highest on all axes; in the primary analysis on Real-POCQi, win differences (margins between win and loss rates) ranged from 25 to 39 percentage points (p<0.001). Results remained consistent in sensitivity analyses stratifying by citation display, answer length, OE-user status, and Real-POCQi versus HealthBench. In parallel, LLM judges were found to differ systematically from expert judges, though both generally agreed on the best model. These findings underscore two conclusions: (i) AI tool evaluations should reflect real-world query distributions and use expert judges that mirror the specialization defining modern medicine, and (ii) the consistent advantage of the specialized tool over general-purpose models does not necessarily mean that the latter cannot serve similar purposes, but rather that targeted engineering and customization can yield meaningful gains in performance for its users. We release Real-POCQi as a public benchmark, as well as the prespecified statistical analysis for reproducing results of this study.
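The abstract defines a win difference as the margin between win and loss rates. As a rough illustration of that arithmetic, here is a minimal Python sketch. The function name `win_difference` and the counts are hypothetical and are not taken from the study, and counting ties in the denominator is an assumption, since the abstract does not say how ties were handled.

```python
def win_difference(wins: int, losses: int, ties: int = 0) -> float:
    """Return win rate minus loss rate, in percentage points.

    Assumption: ties count toward the total number of comparisons.
    """
    total = wins + losses + ties
    if total == 0:
        raise ValueError("at least one comparison is required")
    return 100.0 * (wins - losses) / total


# Hypothetical counts for illustration only (not data from the study):
# 55 wins, 25 losses, and 20 ties out of 100 head-to-head comparisons.
print(win_difference(wins=55, losses=25, ties=20))  # 30.0
```

Under this reading, a win difference of 30 percentage points means the tool was preferred 30 points more often than it was dispreferred, whatever the share of ties.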
