[ARTICLE · art-22994] src=benchevolver.github.io pub= topic=artificial-intelligence verified=true sentiment=↑ positive

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

Researchers have developed BenchEvolver, a system that automatically generates harder coding benchmarks by first evolving solution algorithms and then deriving new problem statements from them. The method addresses benchmark saturation, where frontier models now exceed 99% Pass@1 on easy splits of LiveCodeBench, by producing tasks that are empirically verified to be more difficult for the same models that created them. In blind reviews, human experts rated the evolved problems as more novel, more difficult, and clearer than their source tasks, with a resulting 91-problem benchmark restoring clear performance discrimination among state-of-the-art models.

4 min read · published Jun 5, 2026

Frontier models now exceed 99% Pass@1 on LiveCodeBench easy, saturating our benchmarks. BenchEvolver automatically transforms existing coding problems into substantially harder, verified variants: tasks that are hard even for the model that created them.

On LiveCodeBench, state-of-the-art models exceed 99% Pass@1 on the newest easy split and over 90% on average. Building new, sufficiently hard datasets by hand is slow and expensive, which makes it a bottleneck for continued progress.

Most benchmark generation methods are problem-centric: they start by writing a new task and hope it requires new reasoning. In practice, this often produces surface-level variants of existing problems, while still relying on increasingly strong models to solve and validate them. BenchEvolver flips the direction. We evolve solutions first, then derive tasks from them. Because the reasoning structure changes before the problem statement is written, the resulting benchmarks impose genuinely new algorithmic demands while retaining executable ground truth by construction.

Mutate the reference solution to force a dominant algorithmic lift, then derive the statement and tests around the evolved, executable solution.

Brute-force triangulation and statement-faithfulness checks, rather than a single LLM judge, ensure that the statement, solution, and tests all define the same task.

Difficulty is measured, not assigned: a candidate is accepted only if a panel of target models empirically fails more often on it than on the seed.
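One way to read this acceptance rule is as a comparison of empirical failure rates. The sketch below is an illustration only: the function names, the per-model averaging, and the `min_gap` margin are assumptions, since the article does not say how panel results are aggregated.

```python
from statistics import mean


def failure_rate(outcomes):
    """Fraction of sampled attempts that failed; True means the attempt passed."""
    return 1.0 - sum(outcomes) / len(outcomes)


def panel_failure(results):
    """Average failure rate over a panel given as {model_name: [bool, ...]}.

    Averaging per model is an assumed aggregation, not taken from the article.
    """
    return mean(failure_rate(attempts) for attempts in results.values())


def accept_candidate(seed_results, evolved_results, min_gap=0.0):
    """Accept only if the panel fails more often on the evolved task than on the seed."""
    return panel_failure(evolved_results) - panel_failure(seed_results) > min_gap


# Hypothetical panel outcomes for one seed and one evolved candidate.
seed = {"model_a": [True, True], "model_b": [True, False]}
evolved = {"model_a": [True, False], "model_b": [False, False]}
print(accept_candidate(seed, evolved))  # panel failure rises from 0.25 to 0.75
```

A candidate that is merely as hard as its seed is rejected, because the gap must be strictly positive.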

The surface story stays familiar; the underlying computation jumps to a different regime. The same solution-centric principle works across two very different coding domains.

Count arrays whose adjacent differences match the original and whose entries satisfy per-index bounds.

In the evolved task, adjacent XORs must match instead. The feasible sets are no longer contiguous, so interval intersection fails.
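The jump between these two tasks can be seen in a few lines. The sketch below assumes the natural formalization of both statements (an `original` array plus inclusive per-index `bounds`); the function names are hypothetical, and the brute-force XOR counter is only a checker for small inputs, not the evolved reference solution, which the article does not show.

```python
def count_diff_arrays(original, bounds):
    """Difference version: every array is fixed by its first entry, and each
    index constrains that entry to an interval, so intersect the intervals."""
    lo, hi = bounds[0]
    for i in range(1, len(original)):
        shift = original[i] - original[0]
        lo = max(lo, bounds[i][0] - shift)
        hi = min(hi, bounds[i][1] - shift)
    return max(0, hi - lo + 1)


def feasible_first_values(mask, lo, hi, limit):
    """First entries x in [0, limit) with lo <= (x XOR mask) <= hi."""
    return [x for x in range(limit) if lo <= (x ^ mask) <= hi]


def count_xor_arrays_bruteforce(original, bounds):
    """XOR version, checked by enumerating every candidate first entry."""
    lo0, hi0 = bounds[0]
    count = 0
    for first in range(lo0, hi0 + 1):
        if all(
            b_lo <= (first ^ original[i] ^ original[0]) <= b_hi
            for i, (b_lo, b_hi) in enumerate(bounds)
        ):
            count += 1
    return count


print(count_diff_arrays([1, 2, 3, 4], [(1, 2), (2, 3), (3, 4), (4, 5)]))  # 2
print(feasible_first_values(mask=1, lo=1, hi=2, limit=8))  # [0, 3]
print(count_xor_arrays_bruteforce([0, 1], [(0, 3), (1, 2)]))  # 2
```

The second print shows the problem: a single bound already yields the set {0, 3}, which has a hole, so tracking one `lo` and one `hi` no longer describes the feasible first entries.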

Implement a classical fourth-order Runge–Kutta integrator for a driven damped pendulum, returning the full state-space trajectory.
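For readers unfamiliar with this kind of scientific-coding task, here is a minimal sketch of what a solution might look like. The equation of motion, the parameter names (`damping`, `omega0`, `drive_amp`, `drive_freq`), and the return format are assumptions; the article does not give the task's exact specification.

```python
import math


def pendulum_rhs(t, state, damping, omega0, drive_amp, drive_freq):
    """Assumed model: theta'' = -damping*theta' - omega0^2*sin(theta)
    + drive_amp*cos(drive_freq*t), written as a first-order system."""
    theta, omega = state
    accel = (
        -damping * omega
        - omega0 ** 2 * math.sin(theta)
        + drive_amp * math.cos(drive_freq * t)
    )
    return (omega, accel)


def _shift(state, slope, h):
    """Return state + h * slope, component by component."""
    return tuple(s + h * k for s, k in zip(state, slope))


def rk4_trajectory(state0, dt, steps, t0=0.0, **params):
    """Classical RK4; returns [(t, theta, omega), ...] with steps + 1 entries."""
    state = tuple(state0)
    trajectory = [(t0, *state)]
    for n in range(steps):
        t = t0 + n * dt
        k1 = pendulum_rhs(t, state, **params)
        k2 = pendulum_rhs(t + dt / 2, _shift(state, k1, dt / 2), **params)
        k3 = pendulum_rhs(t + dt / 2, _shift(state, k2, dt / 2), **params)
        k4 = pendulum_rhs(t + dt, _shift(state, k3, dt), **params)
        state = tuple(
            s + dt / 6 * (a + 2 * b + 2 * c + d)
            for s, a, b, c, d in zip(state, k1, k2, k3, k4)
        )
        trajectory.append((t0 + (n + 1) * dt, *state))
    return trajectory


# Hypothetical parameters chosen only to exercise the integrator.
path = rk4_trajectory(
    (0.2, 0.0), dt=0.01, steps=1000,
    damping=0.5, omega0=1.0, drive_amp=1.2, drive_freq=2.0 / 3.0,
)
print(len(path), path[0])
```

Time is recomputed as `t0 + n * dt` at each step rather than accumulated, which avoids drift in the time column over long trajectories.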

A Proposer evolves solutions and writes tasks; an Evaluator validates and measures empirical difficulty; a Memory module feeds accepted lineages and past failures back into search, turning repeated sampling into adaptive evolution.

The Proposer mutates the parent solution into a structurally different one, then derives a natural statement, public examples, and tiered hidden tests, all anchored by executing the evolved reference.

The Evaluator triangulates the reference, a brute-force solver, and a statement-only oracle to catch inconsistencies; runs bounded repair; then accepts the task only if the target panel empirically fails more often than on the seed.
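The triangulation step can be pictured as running independent solvers on the same small inputs and flagging any disagreement. The following is a sketch under assumptions: `triangulate` and the three toy solvers are hypothetical stand-ins, and the deliberately buggy oracle simulates a statement that can be misread.

```python
def triangulate(solvers, cases):
    """Run every solver on every case; return the cases where outputs disagree."""
    disagreements = []
    for case in cases:
        outputs = {name: solve(case) for name, solve in solvers.items()}
        values = list(outputs.values())
        if any(value != values[0] for value in values[1:]):
            disagreements.append((case, outputs))
    return disagreements


def reference(n):
    """Stand-in for an evolved reference: closed form for 1^2 + ... + n^2."""
    return n * (n + 1) * (2 * n + 1) // 6


def brute_force(n):
    """Slow but easy to trust on small inputs."""
    return sum(i * i for i in range(1, n + 1))


def statement_oracle(n):
    """Stand-in for a solver written from the statement alone; it misreads
    the range as excluding n, as an ambiguous statement might allow."""
    return sum(i * i for i in range(n))


solvers = {
    "reference": reference,
    "brute_force": brute_force,
    "statement_oracle": statement_oracle,
}
for case, outputs in triangulate(solvers, range(4)):
    print(case, outputs)
```

Here the reference and brute force agree everywhere, so the disagreement points at the statement rather than the solution, which is the kind of signal a repair step could act on.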

In the Memory module, local memory tracks each seed's lineage and error patterns, while global memory enforces diversity across seeds: a family that has already succeeded must clear a higher difficulty bar.
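The rising bar for repeated families could be implemented in many ways. The sketch below shows one simple possibility and is entirely hypothetical: the class name, the linear schedule, and the `base_gap` and `step` values are assumptions, not details from the article, and local memory is omitted.

```python
from collections import defaultdict


class GlobalMemory:
    """Counts accepted tasks per algorithm family and raises the bar for repeats."""

    def __init__(self, base_gap=0.10, step=0.05):
        self.base_gap = base_gap          # minimum failure-rate gap for a new family
        self.step = step                  # extra gap demanded per prior acceptance
        self.accepted = defaultdict(int)  # family name -> number of accepted tasks

    def required_gap(self, family):
        """Difficulty gap a new candidate from this family must reach."""
        return self.base_gap + self.step * self.accepted[family]

    def consider(self, family, observed_gap):
        """Accept and record the candidate if it clears the current bar."""
        if observed_gap >= self.required_gap(family):
            self.accepted[family] += 1
            return True
        return False


memory = GlobalMemory()
print(memory.consider("bitmask-dp", 0.12))  # first task from this family
print(memory.consider("bitmask-dp", 0.12))  # same gap, but the bar has risen
print(memory.consider("geometry", 0.12))    # a different family starts at the base bar
```

Under this schedule the search is nudged toward families it has not yet exploited, which is one way to read "enforces diversity across seeds".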

Across two domains, four target models, and multiple evolvers, evolved problems consistently and substantially reduce Pass@1 relative to their seeds. Crucially, each evolver's own Pass@1 also drops on the tasks it evolved: this is self-challenging generation, not teacher-to-student distillation.

Six competitive-programming experts (Codeforces master / IOI / ICPC level) blindly reviewed 207 evolved problems across 72 seeds. Evolved tasks are rated more novel and far more difficult, span a much broader algorithmic surface, and are rated clearer than their seeds.

The resulting 91-problem benchmark combines 64 human-vetted evolved tasks with 27 difficult original LCB-v6 problems. Every problem passes correctness, quality (≥3/5, Olympiad standard), and difficulty-range gates. Frontier Pass@1 spans 27.5%–62.6%, restoring clear discrimination among the strongest models.

Reasoning settings: GPT models use medium reasoning effort, DeepSeek uses high, and Gemini models use adaptive reasoning.

Pass@1 (%) on seed and evolved problems; Δ Hard = Hard evolved − Hard seed
Model             Medium seed   Medium evolved   Hard seed   Hard evolved   Δ Hard
GPT-5.5                 100.0             80.0        97.1           62.3    −34.8
GPT-5.4                  98.9             74.3        94.8           49.7    −45.1
GPT-5.4-mini             95.7             59.3        79.7           21.7    −58.0
Gemini-3.1-Pro          100.0             78.6        96.5           56.8    −39.7
DeepSeek-V4-Pro          95.7             57.1        83.7           23.2    −60.5

Averaged across all evaluated models, the Hard split drops from 87.0% to 45.7% Pass@1, an absolute reduction of 41.3 points.

Using gpt-oss-20b as both evolver and target, we evolve problems it already solves into harder verified variants, then train on them with RL. Evolved tasks improve held-out coding performance beyond training on the original seeds alone: the same model exposes and then learns from its own weaknesses.

Any fixed benchmark eventually saturates. BenchEvolver points to a different model of evaluation: a reproducible pipeline that periodically generates, validates, and calibrates new tasks against current frontier models. This aligns evaluation with training, so the same verified tasks that reveal failures also become the environments that fix them.

Difficulty is measured by executable model failure, including the generator's own. Frontier models can expose and train on their own weaknesses, not just distill from a larger model.

Only the execution harness is domain-specific. The same mutate → write → verify → select loop works on stdin/stdout competitive programming and assertion-based scientific coding alike.
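To make the split between a shared loop and a domain-specific harness concrete, here is a minimal sketch of two harnesses that answer the same question: did this solution pass this test? Both function names and signatures are hypothetical; the article does not describe BenchEvolver's actual harness interface, and a real harness would run untrusted code in a sandbox.

```python
import subprocess
import sys


def run_stdio(source, stdin_text, expected_stdout, timeout=5.0):
    """Competitive-programming style: feed stdin, compare stdout (whitespace-trimmed)."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", source],
            input=stdin_text,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


def run_assertions(source, assertion_code):
    """Scientific-coding style: load the solution, then execute assertions against it.

    exec() on untrusted code is unsafe; this is for illustration only.
    """
    namespace = {}
    try:
        exec(source, namespace)
        exec(assertion_code, namespace)
    except Exception:
        return False
    return True


print(run_stdio("print(int(input()) * 2)", "21\n", "42"))
print(run_assertions("def area(r):\n    return 3 * r * r", "assert area(2) == 12"))
```

Because both harnesses reduce to a boolean verdict, the surrounding loop can count failures per model without knowing which domain it is evaluating.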
