OpenAI Introduces GeneBench-Pro for Computational Biology Reasoning

On June 30, 2026, OpenAI released GeneBench-Pro, a benchmark that measures how well AI agents reason about noisy biological datasets across 129 synthetic problems in genomics, quantitative biology, and translational medicine. OpenAI's strongest model, GPT-5.6 Sol, solved 28.7% of problems at the highest reasoning level, up from below 5% for GPT-5. Open-weight models such as GLM 5.2 scored lower, which OpenAI interprets as a sign that they are optimized for coding rather than scientific reasoning. Reviewers estimated that each problem would take a human expert 20 to 40 hours to solve.

For teams building AI-for-science systems, the bottleneck is no longer recalling facts or running a fixed pipeline; it is the higher-order judgment of deciding which analysis a messy dataset can actually support. GeneBench-Pro, released by OpenAI on June 30, 2026, is built to measure exactly that. The benchmark presents an agent with 129 synthetic problems across genomics, quantitative biology, and translational medicine, each pairing a realistic and deliberately noisy dataset with a target estimand tied to a downstream decision. Because every problem is generated from a known causal structure, the correct answer is known in advance and grading is deterministic, which sidesteps the rubric variability that weakens many long-horizon science benchmarks.

OpenAI reports that its strongest model, GPT-5.6 Sol, solves 28.7 percent of problems at the highest reasoning level and 31.5 percent with Pro mode, up sharply from the below-5-percent score GPT-5 posted when the original GeneBench was built. OpenAI frames the gap to open-weight models such as GLM 5.2 as evidence that open systems are tuned more for coding than for broad scientific reasoning. Reviewers estimated each problem would take a human expert 20 to 40 hours.
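
To make the idea of deterministic grading against a known causal structure concrete, here is a minimal Python sketch. It is an illustration only and is not OpenAI's generator or grader: the names `Problem`, `make_problem`, `grade`, `naive_estimate`, and `adjusted_estimate`, the confounded toy model, and the tolerance value are all assumptions made for this example. The dataset is simulated from a chosen treatment effect, so a submitted number can be checked against that effect with no rubric.

```python
import random
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Problem:
    rows: list          # (z, x, y) tuples: confounder, treatment, outcome
    true_effect: float  # effect of x on y, known because we simulated it
    tolerance: float    # hypothetical grading tolerance


def make_problem(seed, n=5000, true_effect=2.0, tolerance=0.25):
    """Simulate a confounded dataset from a known causal structure."""
    rng = random.Random(seed)  # seeded, so the dataset is reproducible
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0                # confounder
        x = 1 if rng.random() < (0.8 if z else 0.2) else 0  # z drives treatment
        y = true_effect * x + 3.0 * z + rng.gauss(0.0, 1.0)  # z also drives outcome
        rows.append((z, x, y))
    return Problem(rows, true_effect, tolerance)


def grade(problem, answer):
    """Deterministic check: is the answer within tolerance of the truth?"""
    return abs(answer - problem.true_effect) <= problem.tolerance


def naive_estimate(rows):
    """Difference in mean outcome, ignoring the confounder."""
    treated = [y for _, x, y in rows if x == 1]
    control = [y for _, x, y in rows if x == 0]
    return mean(treated) - mean(control)


def adjusted_estimate(rows):
    """Difference in means within each confounder stratum, size-weighted."""
    total = 0.0
    for level in (0, 1):
        stratum = [(x, y) for z, x, y in rows if z == level]
        treated = [y for x, y in stratum if x == 1]
        control = [y for x, y in stratum if x == 0]
        total += (mean(treated) - mean(control)) * len(stratum) / len(rows)
    return total


if __name__ == "__main__":
    problem = make_problem(seed=0)
    for name, estimator in [("naive", naive_estimate),
                            ("adjusted", adjusted_estimate)]:
        estimate = estimator(problem.rows)
        print(name, round(estimate, 2), "pass" if grade(problem, estimate) else "fail")
```

In this toy setup the naive comparison is biased upward (toward roughly 3.8 in expectation, since the confounder adds 3.0 × 0.6 to the true effect of 2.0), while the stratified estimate should land near 2.0. That mirrors, in miniature, the judgment the article describes: the grade depends on choosing an analysis the data can support, not on running a fixed pipeline.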