BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution


Source: benchevolver.github.io · Published: 2026-06-05T23:39Z · Topic: artificial-intelligence


Researchers have developed BenchEvolver, a system that automatically generates harder coding benchmarks by first evolving solution algorithms and then deriving new problem statements from them. The method addresses benchmark saturation, where frontier models now exceed 99% Pass@1 on easy splits of LiveCodeBench, by producing tasks that are empirically verified to be more difficult for the same models that created them. In blind reviews, human experts rated the evolved problems as more novel, more difficult, and clearer than their source tasks, with a resulting 91-problem benchmark restoring clear performance discrimination among state-of-the-art models.

4 min read · Published Jun 5, 2026

Frontier models now exceed 99% Pass@1 on LiveCodeBench easy, saturating our benchmarks. BenchEvolver automatically transforms existing coding problems into substantially harder, verified variants — tasks that are hard even for the model that created them.

On LiveCodeBench, state-of-the-art models exceed 99% Pass@1 on the newest easy split and over 90% on average. Building new, sufficiently hard datasets by hand is slow and expensive — a bottleneck for continued progress.

Most benchmark generation methods are problem-centric: they start by writing a new task and hope it requires new reasoning. In practice, this often produces surface-level variants of existing problems, while still relying on increasingly strong models to solve and validate them. BenchEvolver flips the direction. We evolve solutions first, then derive tasks from them. Because the reasoning structure changes before the problem statement is written, the resulting benchmarks impose genuinely new algorithmic demands while retaining executable ground truth by construction.

Mutate the reference solution to force a dominant algorithmic lift, then derive the statement and tests around the evolved, executable solution.

Verification relies on brute-force triangulation and statement-faithfulness checks, rather than on a single LLM judge, to ensure that the statement, solution, and tests all define the same task.

Difficulty is measured, not assigned: a candidate is accepted only if a panel of target models empirically fails more often on it than on the seed.
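The two checks above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's implementation: `triangulate`, `accept`, and the toy maximum-subarray solvers are hypothetical names, the statement-only oracle is omitted, and aggregating the panel by mean Pass@1 is an assumption, since the source does not say how panel failures are combined.

```python
import random
from typing import Callable, Optional, Sequence


def triangulate(reference: Callable, brute_force: Callable,
                gen_input: Callable[[random.Random], object],
                trials: int = 200, seed: int = 0) -> Optional[object]:
    """Return the first generated input where the two solvers disagree, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        case = gen_input(rng)
        if reference(case) != brute_force(case):
            return case
    return None


def pass_at_1(outcomes: Sequence[bool]) -> float:
    """Fraction of attempts that passed all tests."""
    return sum(outcomes) / len(outcomes)


def accept(seed_results: dict, evolved_results: dict) -> bool:
    """Assumed rule: accept when the panel's mean Pass@1 is strictly lower on the evolved task."""
    seed_mean = sum(pass_at_1(r) for r in seed_results.values()) / len(seed_results)
    evolved_mean = sum(pass_at_1(r) for r in evolved_results.values()) / len(evolved_results)
    return evolved_mean < seed_mean


# Toy task used only to exercise triangulate: maximum sum of a non-empty subarray.
def max_subarray_fast(values):
    best = current = values[0]
    for v in values[1:]:
        current = max(v, current + v)
        best = max(best, current)
    return best


def max_subarray_brute(values):
    n = len(values)
    return max(sum(values[i:j]) for i in range(n) for j in range(i + 1, n + 1))


def random_case(rng):
    return [rng.randint(-5, 5) for _ in range(rng.randint(1, 6))]
```

Calling `triangulate(max_subarray_fast, max_subarray_brute, random_case)` returns `None` when the two solvers agree on every sampled input. A reference that mishandles all-negative arrays would instead surface a small counterexample that can be fed into repair.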

The surface story stays familiar; the underlying computation jumps to a different regime. The same solution-centric principle works across two very different coding domains.

Seed task: count arrays whose adjacent differences match those of the original array and whose entries satisfy per-index bounds.

Evolved task: now adjacent XORs must match. The feasible sets are no longer contiguous, so interval intersection fails.
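To see why the jump matters, here is a small sketch assuming non-negative integers and inclusive bounds. The function names are hypothetical, and the XOR version is a brute force meant only to expose the shape of the feasible set; it is not the evolved reference solution, which the source does not describe. In the difference version, fixing the first entry fixes the rest (b[i] = b[0] + original[i] - original[0]), so each bound becomes an interval for b[0]. In the XOR version, b[i] = b[0] ^ (original[i] ^ original[0]), and the same bound maps to a scattered set.

```python
def count_difference_arrays(original, bounds):
    """Count arrays b with b[i] - b[i-1] == original[i] - original[i-1]
    and low_i <= b[i] <= high_i, by intersecting intervals for b[0]."""
    lo, hi = bounds[0]
    for value, (low, high) in zip(original, bounds):
        shift = value - original[0]      # b[i] = b[0] + shift
        lo = max(lo, low - shift)
        hi = min(hi, high - shift)
    return max(0, hi - lo + 1)


def feasible_starts_xor(original, bounds):
    """List every b[0] for which b[i] = b[0] ^ mask_i stays within all bounds.
    Brute force over the first bound; fine for tiny inputs only."""
    starts = []
    for b0 in range(bounds[0][0], bounds[0][1] + 1):
        if all(low <= (b0 ^ (value ^ original[0])) <= high
               for value, (low, high) in zip(original, bounds)):
            starts.append(b0)
    return starts


print(count_difference_arrays([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 5)]))  # 2
print(feasible_starts_xor([0, 3], [(0, 7), (2, 5)]))  # [0, 1, 6, 7]
```

In the second call the feasible starts are 0, 1, 6, and 7: two separate runs, so a single running interval for b[0] can no longer represent the answer.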

Implement a classical fourth-order Runge–Kutta integrator for a driven damped pendulum, returning the full state-space trajectory.
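For reference, a task of this kind might look like the sketch below. The equation of motion, the parameter names, and the default values are assumptions chosen for illustration (a common textbook form, theta'' = -damping * theta' - natural_freq^2 * sin(theta) + drive_amp * cos(drive_freq * t)); the source does not give the exact formulation used in the benchmark.

```python
import math


def pendulum_rhs(t, theta, omega, damping, natural_freq, drive_amp, drive_freq):
    """Return (dtheta/dt, domega/dt) for the assumed driven damped pendulum."""
    dtheta = omega
    domega = (-damping * omega
              - natural_freq ** 2 * math.sin(theta)
              + drive_amp * math.cos(drive_freq * t))
    return dtheta, domega


def rk4_trajectory(theta0, omega0, dt, steps, damping=0.5, natural_freq=1.0,
                   drive_amp=1.2, drive_freq=2.0 / 3.0):
    """Integrate with classical RK4; return [(t, theta, omega), ...] of length steps + 1."""
    params = (damping, natural_freq, drive_amp, drive_freq)
    t, theta, omega = 0.0, theta0, omega0
    trajectory = [(t, theta, omega)]
    for i in range(steps):
        k1t, k1w = pendulum_rhs(t, theta, omega, *params)
        k2t, k2w = pendulum_rhs(t + dt / 2, theta + dt / 2 * k1t,
                                omega + dt / 2 * k1w, *params)
        k3t, k3w = pendulum_rhs(t + dt / 2, theta + dt / 2 * k2t,
                                omega + dt / 2 * k2w, *params)
        k4t, k4w = pendulum_rhs(t + dt, theta + dt * k3t,
                                omega + dt * k3w, *params)
        theta += dt / 6 * (k1t + 2 * k2t + 2 * k3t + k4t)
        omega += dt / 6 * (k1w + 2 * k2w + 2 * k3w + k4w)
        t = (i + 1) * dt                 # avoid accumulating rounding error in t
        trajectory.append((t, theta, omega))
    return trajectory
```

A quick sanity check is the undamped, undriven small-angle case: with `damping=0.0`, `drive_amp=0.0`, and `natural_freq=1.0`, the angle should return very close to its starting value after one period of about 2π.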

A Proposer evolves solutions and writes tasks; an Evaluator validates and measures empirical difficulty; a Memory module feeds accepted lineages and past failures back into search — turning repeated sampling into adaptive evolution.

Mutates the parent solution into a structurally different one, then derives a natural statement, public examples, and tiered hidden tests — all anchored by executing the evolved reference.

Triangulates the reference, a brute-force solver, and a statement-only oracle to catch inconsistencies; runs bounded repair; then accepts only if the target panel empirically fails more.

Local memory tracks each seed's lineage and error patterns; global memory enforces diversity across seeds — a family that already succeeded must clear a higher difficulty bar.
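The interaction of the three components can be sketched as a small control loop. Everything here is hypothetical scaffolding: the `Memory` fields, the `propose` and `evaluate` callables, the candidate dictionary with a `"family"` key, and the numeric thresholds are illustrative assumptions, not the system's actual interfaces or settings.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    lineage: dict = field(default_factory=dict)      # local: seed_id -> accepted candidates
    errors: dict = field(default_factory=dict)       # local: seed_id -> rejection reasons
    family_wins: dict = field(default_factory=dict)  # global: algorithm family -> accepted count

    def required_gain(self, family, base=0.05, step=0.05):
        """Failure-rate gain a candidate must reach; rises with each prior win in its family."""
        return base + step * self.family_wins.get(family, 0)


def evolve(seed_id, seed_fail_rate, propose, evaluate, memory, rounds=5):
    """Run up to `rounds` propose/evaluate steps; return the first accepted candidate or None."""
    for _ in range(rounds):
        candidate = propose(seed_id, memory)          # proposer can read past failures
        consistent, fail_rate = evaluate(candidate)   # validation plus measured difficulty
        if not consistent:
            memory.errors.setdefault(seed_id, []).append("inconsistent")
            continue
        family = candidate["family"]
        if fail_rate - seed_fail_rate < memory.required_gain(family):
            memory.errors.setdefault(seed_id, []).append("too easy")
            continue
        memory.lineage.setdefault(seed_id, []).append(candidate)
        memory.family_wins[family] = memory.family_wins.get(family, 0) + 1
        return candidate
    return None
```

The design point this illustrates is that rejected candidates are not discarded silently: each rejection reason is recorded, and each accepted family raises the bar for its next candidate, which pushes the search toward other algorithmic families.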

Across two domains, four target models, and multiple evolvers, evolved problems consistently and substantially reduce Pass@1 relative to their seeds. Crucially, each evolver's own Pass@1 also drops on the tasks it evolved: this is self-challenging generation, not teacher-to-student distillation.

Six competitive-programming experts (Codeforces master / IOI / ICPC level) blindly reviewed 207 evolved problems across 72 seeds. Evolved tasks are rated more novel and far more difficult, span a much broader algorithmic surface — and are actually rated clearer than their seeds.

The resulting benchmark has 91 problems: 64 human-vetted evolved tasks combined with 27 difficult original LCB-v6 problems. Every problem passes correctness, quality (≥3/5, Olympiad standard), and difficulty-range gates. Frontier Pass@1 spans 27.5%–62.6%, restoring clear discrimination among the strongest models.

Reasoning settings: GPT models use medium reasoning effort, DeepSeek uses high, and Gemini models use adaptive reasoning.

Model	Medium seed	Medium evolved	Hard seed	Hard evolved	Δ Hard
GPT-5.5	100.0	80.0	97.1	62.3	−34.8
GPT-5.4	98.9	74.3	94.8	49.7	−45.1
GPT-5.4-mini	95.7	59.3	79.7	21.7	−58.0
Gemini-3.1-Pro	100.0	78.6	96.5	56.8	−39.7
DeepSeek-V4-Pro	95.7	57.1	83.7	23.2	−60.5

Averaged across all evaluated models, Pass@1 on the Hard split drops from 87.0% to 45.7%, an absolute reduction of 41.3 points.
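The Δ Hard column is simply the evolved score minus the seed score. The short check below recomputes it from the Hard columns of the table above; the dictionary and function names are illustrative only.

```python
# Pass@1 (%) on the Hard split, copied from the table: (seed, evolved, reported delta).
HARD = {
    "GPT-5.5": (97.1, 62.3, -34.8),
    "GPT-5.4": (94.8, 49.7, -45.1),
    "GPT-5.4-mini": (79.7, 21.7, -58.0),
    "Gemini-3.1-Pro": (96.5, 56.8, -39.7),
    "DeepSeek-V4-Pro": (83.7, 23.2, -60.5),
}


def hard_deltas(rows):
    """Recompute evolved minus seed for each model, rounded to one decimal."""
    return {model: round(evolved - seed, 1) for model, (seed, evolved, _) in rows.items()}


def column_means(rows):
    """Mean seed and evolved Pass@1 over the listed rows, rounded to one decimal."""
    seeds = [seed for seed, _, _ in rows.values()]
    evolved = [ev for _, ev, _ in rows.values()]
    return round(sum(seeds) / len(seeds), 1), round(sum(evolved) / len(evolved), 1)


for model, delta in hard_deltas(HARD).items():
    assert delta == HARD[model][2], model
print(column_means(HARD))  # (90.4, 42.7)
```

All five recomputed deltas match the reported column. The five listed rows average 90.4 and 42.7, which differs from the quoted 87.0 and 45.7, so the quoted average presumably covers a set of models that is not identical to the rows shown here; the source does not say which.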

Using gpt-oss-20b as both evolver and target, we evolve problems it already solves into harder verified variants, then train on them with RL. Evolved tasks improve held-out coding performance beyond training on the original seeds alone: the same model exposes and then learns from its own weaknesses.

Any fixed benchmark eventually saturates. BenchEvolver points to a different model of evaluation: a reproducible pipeline that periodically generates, validates, and calibrates new tasks against current frontier models. This aligns evaluation with training, so the same verified tasks that reveal failures also become the environments that fix them.

Difficulty is measured by executable model failure — including the generator's own. Frontier models can expose and train on their own weaknesses, not just distill from a larger model.

Only the execution harness is domain-specific. The same mutate → write → verify → select loop works on stdin/stdout competitive programming and assertion-based scientific coding alike.
