# Anthropic’s AI for Biology: The Accuracy Crisis Explained

> Source: <https://byteiota.com/anthropic-ai-for-science-biology-agents/>
> Published: 2026-06-21 06:15:55+00:00

Everybody reported on John Jumper joining Anthropic. The Nobel laureate and co-creator of AlphaFold is leaving Google DeepMind after nine years, and that’s a compelling hire story. But read the research Anthropic published two weeks earlier and the hire stops being the headline. The research paper is the headline. It found that frontier AI models scored as low as 16.9% mean accuracy on viral sequence retrieval queries, with answers to identical queries varying from run to run. Not due to model limitations. Due to broken data infrastructure. And the fix, a single deterministic retrieval tool, pushed every tested model past 92% accuracy. That gap tells you everything about where AI agents fail in practice.

## The Accuracy Problem Nobody Talks About

In June 2026, Anthropic published [Paving the Way for Agents in Biology](https://www.anthropic.com/research/agents-in-biology), which included a benchmark called VirBench: 120 viral sequence retrieval queries spanning 40 pathogens. Six frontier models were tested without any specialized tooling. Mean accuracy ranged from 16.9% for Claude Sonnet 4 up to 91.3% for GPT-5.5 on the same queries, with wildly variable results between runs.

The models understood the questions. They just couldn’t reliably reach the data. Biological databases are scattered. APIs are inconsistent. Results change between calls. The agent would reason correctly and then build its answer on a broken retrieval step. This isn’t a biology-specific problem. It’s a data infrastructure problem that biology makes brutally visible because the stakes are clear and the ground truth is verifiable. Any developer building agent pipelines that reach external APIs, databases, or services faces the same compounding reliability issue.
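To put a rough number on that compounding, here is a minimal sketch. It assumes every external call in a pipeline succeeds independently with the same probability, which is a simplification; the function name and the 95% figure are illustrative and do not come from the Anthropic research.

```python
def pipeline_success_rate(per_step_reliability: float, steps: int) -> float:
    """Chance that every step succeeds, assuming steps fail independently."""
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("per_step_reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Independent events multiply, so n steps give p ** n.
    return per_step_reliability ** steps


if __name__ == "__main__":
    # Illustrative numbers only: a step that works 95% of the time.
    for steps in (1, 5, 10, 20):
        rate = pipeline_success_rate(0.95, steps)
        print(f"{steps:>2} steps at 95% each: {rate:.1%} end to end")
```

Under those assumptions, a step that looks fine in isolation at 95% drops to roughly 60% once you chain ten of them. Real failures are rarely independent, so treat this as intuition, not a forecast.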

## One Tool Changed Everything

Anthropic’s team collaborated with NCBI to build **gget virus**, a deterministic tool that coordinates NCBI’s REST, Datasets, and E-utilities APIs, handles large-result batching, and returns standardized, logged output. The results were stark. With gget virus, every model in the benchmark crossed 92% accuracy. Claude Sonnet 4 went from 16.9% to 92.8%. GPT-5.5 went from 91.3% to 99.7%. Run-to-run stability jumped to between 0.92 and 1.00 across the board.

| Model | Accuracy without gget virus | Accuracy with gget virus |
|---|---|---|
| Claude Sonnet 4 | 16.9% | 92.8% |
| Claude Opus 4.7 | ~60% | 98.3% |
| GPT-5.2-pro | ~75% | 98.9% |
| GPT-5.5 | 91.3% | 99.7% |

The research draws an explicit conclusion worth pinning: “Reliable dataset construction should not depend on access to the newest or most expensive model.” A cheaper model with the right deterministic tool beat expensive models without one. Before you reach for a larger model to fix agent reliability, audit whether your data access layer is the actual bottleneck. The [underlying arXiv paper](https://arxiv.org/abs/2606.06749) digs into the architecture if you want implementation details. gget virus is open source and installable via pip.
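If you want to audit your own data access layer, the general pattern is easy to prototype. The sketch below is not gget virus and does not use its API; it is a hypothetical wrapper showing the three properties the article attributes to the tool: fixed batching, standardized output, and logging. The names `retrieve` and `fake_backend`, the record fields, and the accession IDs are all made up for illustration.

```python
import hashlib
import json
import logging
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger("deterministic_retrieval")

Record = Dict[str, str]


def retrieve(
    accessions: Iterable[str],
    fetch_batch: Callable[[List[str]], List[Record]],
    batch_size: int = 100,
) -> Dict[str, object]:
    """Fetch records in fixed-size batches and return a standardized result."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Normalize, de-duplicate, and sort so the same request always yields
    # the same batches in the same order.
    ids = sorted({a.strip().upper() for a in accessions if a.strip()})
    records: Dict[str, Record] = {}
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        logger.info("fetching ids %d-%d of %d", start + 1, start + len(batch), len(ids))
        for rec in fetch_batch(batch):
            records[rec["accession"]] = rec
    # Output order follows the sorted request, not the backend's response order.
    ordered = [records[i] for i in ids if i in records]
    missing = [i for i in ids if i not in records]
    digest = hashlib.sha256(
        json.dumps(ordered, sort_keys=True).encode("utf-8")
    ).hexdigest()
    return {"requested": ids, "records": ordered, "missing": missing, "sha256": digest}


def fake_backend(batch: List[str]) -> List[Record]:
    """Hypothetical stand-in for a real API call; returns rows in reverse order."""
    known = {"ACC_A": "example virus A", "ACC_B": "example virus B"}
    return [{"accession": a, "organism": known[a]} for a in reversed(batch) if a in known]


if __name__ == "__main__":
    result = retrieve(["acc_b", "ACC_A", "ACC_Z"], fake_backend, batch_size=2)
    print(json.dumps(result, indent=2))
```

The design choice that matters is that the agent never sees the backend’s ordering or pagination quirks. Two calls with the same IDs, in any order or casing, return the same records and the same checksum, and anything the backend could not find is reported explicitly instead of silently dropped.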

## What Claude Mythos Can Do in Bioinformatics

Separately, Anthropic released [BioMysteryBench](https://www.anthropic.com/research/Evaluating-Claude-For-Bioinformatics-With-BioMysteryBench): 99 real bioinformatics questions written by domain experts across DNA and RNA sequencing, proteomics, and metabolomics. Humans solved 76 of 99. Claude Mythos Preview averaged 82.6% accuracy across five trials and solved seven of the 23 questions that no human expert cracked.

That’s a genuine result. But Anthropic was honest about the limits: roughly 44% of Mythos’s wins on the hardest questions were “brittle” — reproduced in fewer than two of five attempts. The model can reach research-grade answers, but it can’t do it reliably on the hardest problems yet. If you’re building on top of this capability, that consistency gap is the engineering constraint, not the headline number.
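If you run your own repeated-trial evaluations, the same bookkeeping is simple to reproduce. This is a minimal sketch that takes the article’s wording at face value: a question counts as solved if any trial succeeds, and as “brittle” if it succeeds in fewer than two trials. The function names, the threshold default, and the sample data are assumptions for illustration, not Anthropic’s evaluation code.

```python
from typing import Dict, List


def classify_questions(
    trials: Dict[str, List[bool]], min_successes: int = 2
) -> Dict[str, str]:
    """Label each question as unsolved, brittle, or robust across repeated trials."""
    labels: Dict[str, str] = {}
    for question, outcomes in trials.items():
        successes = sum(outcomes)  # True counts as 1
        if successes == 0:
            labels[question] = "unsolved"
        elif successes < min_successes:
            labels[question] = "brittle"
        else:
            labels[question] = "robust"
    return labels


def brittle_share(labels: Dict[str, str]) -> float:
    """Fraction of solved questions whose wins were brittle."""
    solved = [label for label in labels.values() if label != "unsolved"]
    if not solved:
        return 0.0
    return sum(label == "brittle" for label in solved) / len(solved)


if __name__ == "__main__":
    # Made-up outcomes for four questions, five trials each.
    sample = {
        "q1": [True, False, False, False, False],
        "q2": [True, True, True, True, True],
        "q3": [False, False, False, False, False],
        "q4": [True, True, False, False, False],
    }
    labels = classify_questions(sample)
    print(labels)
    print(f"brittle share of wins: {brittle_share(labels):.0%}")
```

Tracking the brittle share alongside the headline accuracy tells you how much of a score you can actually depend on in production.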

## What Anthropic Is Actually Building

The Jumper hire is one signal in a larger pattern. Anthropic has spent 2026 building serious science infrastructure. In February, it announced flagship partnerships with the Allen Institute and HHMI. In April, it acquired Coefficient Bio — a stealth drug discovery AI startup, eight months old, ten people — for $400 million, to bring in operational biotech expertise for drug target selection and clinical regulatory strategy. It has opened actual wet labs. Bristol Myers Squibb is deploying Claude across R&D and manufacturing. The stated goal is a 10x compression of life sciences R&D timelines, with a specific focus on making currently “undruggable” targets accessible.

This is not an API wrapper play. Anthropic is building the infrastructure layer it believes is required before AI agents can actually work reliably in scientific research — the same lesson VirBench demonstrated with data.

## Watch June 30

On June 30 at 10am PT, Anthropic is hosting [The Briefing: AI for Science](https://www.anthropic.com/events/the-briefing-ai-for-science-virtual-event), a live-streamed event for pharma executives, lab directors, and biotech founders. Given the timing (Jumper’s hire was announced June 19, and the event follows less than two weeks later), this is likely where Anthropic reveals its next move in life sciences tooling. If you’re building anything in health, biology, or scientific data pipelines, it’s worth attending. And if you’re building AI agents in any domain, the [VirBench research](https://www.anthropic.com/research/agents-in-biology) is the paper to read this month. The lesson about deterministic data access doesn’t stay in the lab: it applies to any agent that touches external systems.
