ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

wpnews.pro

Source: arxiv.org · Published: 2026-05-27T04:00Z · Topic: ai-agents

ScientistOne, a new autonomous research system built on a Chain-of-Evidence framework, produced zero hallucinated references (0/337) and perfect score verification (12/12) in an audit of 75 papers spanning five systems. Baselines showed hallucinated reference rates of up to 21% and passed score verification in as few as 42% of papers. The system matched or exceeded human expert performance on five frontier research tasks and generalized to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, reaching state-of-the-art on Parameter Golf and earning gold medals on MLE-Bench tasks where baselines failed entirely. The work targets verifiability failures in autonomous research agents, such as fabricated citations and unreproducible scores, that surface-level evaluation does not detect.

Published May 27, 2026 · 1 min read

arXiv:2605.26340v1. Abstract: Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks (score verification, specification violation, reference verification, and method-code alignment) apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.
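
The idea that every claim must be traceable to an evidence source can be illustrated with a small data structure. The sketch below is not the authors' implementation: the `Evidence` and `Claim` classes, the `untraceable` and `pass_rate` helpers, and the example locator string are all hypothetical names chosen for illustration. It only shows one plausible way to flag claims with no attached evidence and to turn the pass counts quoted in the abstract into rates.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Evidence:
    """Hypothetical pointer to where a claim's support lives."""
    kind: str      # e.g. "reference", "run_log", "code" (assumed categories)
    locator: str   # e.g. a citation key or file path (assumed format)


@dataclass
class Claim:
    """A statement in a manuscript plus the evidence attached to it."""
    text: str
    evidence: list[Evidence] = field(default_factory=list)


def untraceable(claims: list[Claim]) -> list[Claim]:
    """Return the claims that have no evidence attached at all."""
    return [c for c in claims if not c.evidence]


def pass_rate(passed: int, total: int) -> float:
    """Fraction of checks that passed; rejects invalid counts."""
    if total <= 0 or not 0 <= passed <= total:
        raise ValueError("counts must satisfy 0 <= passed <= total and total > 0")
    return passed / total


if __name__ == "__main__":
    claims = [
        Claim("The reported score comes from a logged run.",
              [Evidence("run_log", "runs/example.json")]),  # hypothetical path
        Claim("Prior work uses the same data split."),       # no evidence attached
    ]
    for claim in untraceable(claims):
        print("missing evidence:", claim.text)

    # Counts quoted in the abstract for ScientistOne.
    print(f"score verification: {pass_rate(12, 12):.1%}")
    print(f"method-code alignment: {pass_rate(14, 15):.1%}")
```

A real audit would also need to verify that each evidence pointer resolves to something that actually supports the claim; this sketch checks only that a pointer exists.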

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2605.26340
