OpenAI Announces Benchmarks for AI Life Sciences Research. Its Best Model Failed 63.9% of the Test

wpnews.pro

[ARTICLE · art-35839] src=science.slashdot.org ↗ pub=2026-06-20T21:34Z topic=artificial-intelligence verified=true sentiment=· neutral

OpenAI released LifeSciBench, a 750-task benchmark to evaluate AI systems on realistic life science research tasks. Its top-performing GPT-Rosalind model achieved only a 36.1% pass rate, failing nearly two-thirds of the tasks, with performance dropping further when handling documents, figures, or datasets. The benchmark highlights AI's growing ability in scientific communication but underscores that models are far from replacing human expertise in research.

1 min read · published Jun 20, 2026

This week OpenAI announced a 750-task test to measure "whether AI systems can support realistic life science research tasks, not just answer biology questions." But while OpenAI's top-performing GPT-Rosalind model led the rankings, Slashdot reader BrianFagioli notes that "it achieved a pass rate of just 36.1 percent, failing nearly two-thirds of benchmark tasks." Nerds.xyz points out that means "the best-performing model failed nearly two-thirds of the benchmark's tasks."

The benchmark also revealed a familiar weakness. AI systems generally perform better when everything is presented as text. Once they are forced to work with supporting documents, figures, or complex datasets, performance drops noticeably. GPT-Rosalind's pass rate fell from 45.1 percent on text-only tasks to 28.1 percent on tasks involving artifacts or URLs.

To be fair, the benchmark is not intended to suggest AI is useless in research. Quite the opposite. OpenAI found that models are becoming increasingly capable of scientific communication, evidence synthesis, and translating research findings into practical explanations. Those are valuable skills, particularly for researchers drowning in information.

But LifeSciBench serves as a useful reminder that today's AI systems are still far from autonomous scientists. They can help. They can assist. They can sometimes provide surprisingly useful insights. What they cannot reliably do, however, is replace the expertise, judgment, and skepticism that real scientific research requires.
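
For readers who want to check the arithmetic behind the headline, here is a minimal Python sketch. It uses only the three percentages quoted above (36.1, 45.1, and 28.1); the function and variable names are illustrative and not part of LifeSciBench or any OpenAI tooling. It shows that a 36.1 percent pass rate corresponds to the 63.9 percent failure rate in the headline, and that the fall from 45.1 to 28.1 percent is a 17.0-point drop, or roughly 38 percent in relative terms.

```python
# Illustrative arithmetic only: the helper names below are hypothetical,
# and the inputs are the percentages quoted in the article.

def failure_rate(pass_rate_pct: float) -> float:
    """Failure rate in percent, given a pass rate in percent."""
    return round(100.0 - pass_rate_pct, 1)


def pass_rate_drop(baseline_pct: float, degraded_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent)."""
    absolute = baseline_pct - degraded_pct
    relative = 100.0 * absolute / baseline_pct
    return round(absolute, 1), round(relative, 1)


OVERALL_PASS = 36.1      # GPT-Rosalind, all tasks
TEXT_ONLY_PASS = 45.1    # text-only tasks
ARTIFACT_PASS = 28.1     # tasks involving artifacts or URLs

if __name__ == "__main__":
    print("failure rate (%):", failure_rate(OVERALL_PASS))
    points, relative = pass_rate_drop(TEXT_ONLY_PASS, ARTIFACT_PASS)
    print("drop (percentage points):", points)
    print("drop (relative %):", relative)
```

The distinction between percentage points and relative percent matters here: the model loses 17.0 points in absolute terms, which is more than a third of its text-only pass rate.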

source & further reading

science.slashdot.org — original article

Read original on science.slashdot.org → science.slashdot.org/story/26/06/20/202204/opena…
