Expert Evaluation of Clinical AI Tools on Real Point-of-Care Clinical Queries

wpnews.pro


[ARTICLE · art-44365] src=arxiv.org ↗ pub=2026-06-30T04:00Z topic=artificial-intelligence verified=true sentiment=· neutral

A study evaluating AI tools on real clinical queries from physicians found that a specialized clinical tool (OpenEvidence) outperformed frontier general-purpose models (Claude Opus 4.8, Gemini 3.1 Pro, GPT-5.5) across all five dimensions of clinical decision support, with win differences ranging from 25 to 39 percentage points. The findings highlight the need for evaluations based on real-world queries and expert judges, and demonstrate that targeted engineering can yield significant performance gains.

1 min read · published Jun 30, 2026

arXiv:2606.28960v1 (announce type: new)

Abstract: Physicians now pose millions of clinical questions to AI tools each week, yet these tools are evaluated largely on hypothetical or exam-style questions rather than those actually asked in practice. We report a blinded evaluation built on 620 Real-world Point-Of-Care Queries (Real-POCQi) submitted to the OpenEvidence (OE) platform by physicians spanning 30 specialties, as well as 187 questions from HealthBench. A total of 149 practicing physicians across 36 states made head-to-head comparisons between answers from three frontier general-purpose models (Claude Opus 4.8, Gemini 3.1 Pro, and GPT-5.5) and a specialized clinical tool (OE), with graders matched to each question's specialty. When comparing answers along five dimensions relevant to clinical decision support (accuracy, clinical utility, source quality, verifiability, and completeness), physicians scored the specialized tool highest on all axes; in the primary analysis on Real-POCQi, win differences (margins between win and loss rates) ranged from 25 to 39 percentage points (p<0.001). Results remained consistent in sensitivity analyses stratifying by citation display, answer length, OE-user status, and Real-POCQi versus HealthBench. In parallel, LLM judges were found to differ systematically from expert judges, though both generally agreed on the best model. These findings underscore two conclusions: (i) AI tool evaluations should reflect real-world query distributions and use expert judges that mirror the specialization defining modern medicine, and (ii) the consistent advantage of the specialized tool over general-purpose models does not necessarily mean that the latter cannot serve similar purposes, but rather that targeted engineering and customization can yield meaningful performance gains for users. We release Real-POCQi as a public benchmark, along with the prespecified statistical analysis for reproducing the results of this study.
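The abstract defines a win difference as the margin between win and loss rates. As a rough illustration of that arithmetic only, the Python sketch below computes a win difference from a list of head-to-head outcomes and attaches an exact two-sided sign test. The function names, the "win"/"loss"/"tie" labels, the treatment of ties, and the choice of a sign test are all assumptions made for this example; the study's prespecified analysis is not described here and very likely differs (for instance, this sketch ignores clustering by grader and by question).

```python
from collections import Counter
from math import comb


def win_difference(outcomes):
    """Return win rate minus loss rate, in percentage points.

    `outcomes` is an iterable of "win", "loss", or "tie", recorded from the
    point of view of one tool (hypothetical labels for this sketch).
    Ties count toward the denominator but toward neither rate.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons supplied")
    return 100.0 * (counts["win"] - counts["loss"]) / total


def sign_test_p_value(wins, losses):
    """Exact two-sided sign test on wins vs. losses, with ties dropped.

    Under the null hypothesis each non-tied comparison is a fair coin flip,
    so the smaller count follows a Binomial(n, 0.5) distribution.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Made-up counts purely for illustration, not data from the study.
    outcomes = ["win"] * 60 + ["loss"] * 30 + ["tie"] * 10
    print(win_difference(outcomes))   # 60% wins - 30% losses = 30.0
    print(sign_test_p_value(60, 30))
```

With these made-up counts, 60 wins and 30 losses out of 100 comparisons give a win difference of 30 percentage points, which is the scale on which the reported 25 to 39 point margins should be read.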

source & further reading


Read original on arxiv.org → arxiv.org/abs/2606.28960

mentioned entities

OpenEvidence

Claude Opus 4.8

Gemini 3.1 Pro

GPT-5.5

HealthBench

Real-POCQi
