{"slug": "scientistone-towards-human-level-autonomous-research-via-chain-of-evidence", "title": "ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence", "summary": "A new autonomous research system called ScientistOne, built around a Chain-of-Evidence framework, achieved zero hallucinated references (0/337) and perfect score verification (12/12) in an audit of 75 papers spanning five systems, while baselines exhibited hallucinated reference rates of up to 21% and score verification passing in as few as 42% of papers. The system matched or exceeded human expert performance on five frontier research tasks and generalized to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and earning gold medals on MLE-Bench tasks where baselines failed entirely. The findings address verifiability failures in autonomous research agents, where fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation go undetected by surface-level evaluation.", "body_md": "arXiv:2605.26340v1 Announce Type: new\nAbstract: Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. 
Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.", "url": "https://wpnews.pro/news/scientistone-towards-human-level-autonomous-research-via-chain-of-evidence", "canonical_source": "https://arxiv.org/abs/2605.26340", "published_at": "2026-05-27 04:00:00+00:00", "updated_at": "2026-05-27 04:32:03.455134+00:00", "lang": "en", "topics": ["ai-agents", "ai-research", "ai-safety", "large-language-models", "artificial-intelligence"], "entities": ["ScientistOne", "Chain-of-Evidence", "CoE Audit"], "alternates": {"html": "https://wpnews.pro/news/scientistone-towards-human-level-autonomous-research-via-chain-of-evidence", "markdown": "https://wpnews.pro/news/scientistone-towards-human-level-autonomous-research-via-chain-of-evidence.md", "text": "https://wpnews.pro/news/scientistone-towards-human-level-autonomous-research-via-chain-of-evidence.txt", "jsonld": "https://wpnews.pro/news/scientistone-towards-human-level-autonomous-research-via-chain-of-evidence.jsonld"}}