{"slug": "benchevolver-frontier-task-synthesis-via-solution-centric-evolution", "title": "BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution", "summary": "Researchers have developed BenchEvolver, a system that automatically generates harder coding benchmarks by first evolving solution algorithms and then deriving new problem statements from them. The method addresses benchmark saturation, where frontier models now exceed 99% Pass@1 on easy splits of LiveCodeBench, by producing tasks that are empirically verified to be more difficult for the same models that created them. In blind reviews, human experts rated the evolved problems as more novel, more difficult, and clearer than their source tasks, with a resulting 91-problem benchmark restoring clear performance discrimination among state-of-the-art models.", "body_md": "Frontier models now exceed **99% Pass@1** on LiveCodeBench easy, saturating our benchmarks. BenchEvolver automatically transforms existing coding problems into substantially harder, verified variants — tasks that are hard even for the model that created them.\n\nOn LiveCodeBench, state-of-the-art models exceed **99% Pass@1** on the newest easy split and over **90%** on average. Building new, sufficiently hard datasets by hand is slow and expensive — a bottleneck for continued progress.\n\nMost benchmark generation methods are problem-centric: they start by writing a new task and hope it requires new reasoning. In practice, this often produces surface-level variants of existing problems, while still relying on increasingly strong models to solve and validate them. BenchEvolver flips the direction. We evolve solutions first, then derive tasks from them. 
Because the reasoning structure changes before the problem statement is written, the resulting benchmarks impose genuinely new algorithmic demands while retaining executable ground truth by construction.\n\nMutate the reference solution to force a **dominant algorithmic lift**, then derive the statement and tests around the evolved, executable solution.\n\nBrute-force triangulation and statement-faithfulness checks ensure statement, solution, and tests define **the same task** — not a single LLM judge.\n\nDifficulty is **measured**, not assigned: a candidate is accepted only if a panel of target models empirically fails more than on the seed.\n\nThe surface story stays familiar; the underlying computation jumps to a different regime. The same solution-centric principle works across two very different coding domains.\n\nCount arrays whose **adjacent differences** match the original and whose entries satisfy per-index bounds.\n\nNow adjacent **XORs** must match. The feasible sets are **no longer contiguous** — interval intersection fails.\n\nImplement a classical fourth-order Runge–Kutta integrator for a driven damped pendulum, returning the full state-space trajectory.\n\n# given f, state, dt, n ... runge_kutta_4th_order(...) 
# integrate forward → trajectory\n\nEstimate the **unknown initial state and ODE parameters** from sparse observations, turning integration into a full nonlinear solver.\n\nA Proposer evolves solutions and writes tasks; an Evaluator validates and measures empirical difficulty; a Memory module feeds accepted lineages and past failures back into search, turning repeated sampling into adaptive evolution.\n\nMutates the parent solution into a structurally different one, then derives a natural statement, public examples, and tiered hidden tests, all anchored by **executing** the evolved reference.\n\nTriangulates the reference, a brute-force solver, and a statement-only oracle to catch inconsistencies; runs bounded repair; then accepts only if the target panel empirically fails more.\n\n**Local** memory tracks each seed's lineage and error patterns; **global** memory enforces diversity across seeds: a family that already succeeded must clear a higher difficulty bar.\n\nAcross two domains, four target models, and multiple evolvers, evolved problems consistently and substantially reduce Pass@1 relative to their seeds. Crucially, each evolver's own Pass@1 also drops on the tasks it evolved: this is self-challenging generation, not teacher-to-student distillation.\n\nSix competitive-programming experts (Codeforces master / IOI / ICPC level) blindly reviewed 207 evolved problems across 72 seeds. Evolved tasks are rated more novel and far more difficult, span a much broader algorithmic surface, and are actually rated *clearer* than their seeds.\n\nThe resulting 91-problem benchmark combines 64 human-vetted evolved tasks with 27 difficult original LCB-v6 problems. Every problem passes correctness, quality (≥3/5, Olympiad standard), and difficulty-range gates. 
Frontier Pass@1 spans 27.5%–62.6% — restoring clear discrimination among the strongest models.\n\nReasoning settings: GPT models use **medium** reasoning effort, DeepSeek uses **high**, and Gemini models use **adaptive** reasoning.\n\n| Model | Medium seed | Medium evolved | Hard seed | Hard evolved | Δ Hard |\n|---|---|---|---|---|---|\n| GPT-5.5 | 100.0 | 80.0 | 97.1 | 62.3 | −34.8 |\n| GPT-5.4 | 98.9 | 74.3 | 94.8 | 49.7 | −45.1 |\n| GPT-5.4-mini | 95.7 | 59.3 | 79.7 | 21.7 | −58.0 |\n| Gemini-3.1-Pro | 100.0 | 78.6 | 96.5 | 56.8 | −39.7 |\n| DeepSeek-V4-Pro | 95.7 | 57.1 | 83.7 | 23.2 | −60.5 |\n\nAveraged across all evaluated models, the Hard split drops from 87.0% → 45.7% Pass@1 — an absolute reduction of 41.3 points.\n\nUsing gpt-oss-20b as both evolver and target, we evolve problems it already solves into harder verified variants, then train on them with RL. Evolved tasks improve held-out coding performance **beyond training on the original seeds alone** — the same model exposes and then learns from its own weaknesses.\n\nAny fixed benchmark eventually saturates. BenchEvolver points to a different model of evaluation: a reproducible pipeline that periodically generates, validates, and calibrates new tasks against current frontier models — aligning evaluation with training, so the same verified tasks that reveal failures also become the environments that fix them.\n\nDifficulty is measured by executable model failure — including the generator's own. Frontier models can expose and train on their own weaknesses, not just distill from a larger model.\n\nOnly the execution harness is domain-specific. 
The same mutate → write → verify → select loop works on stdin/stdout competitive programming and assertion-based scientific coding alike.", "url": "https://wpnews.pro/news/benchevolver-frontier-task-synthesis-via-solution-centric-evolution", "canonical_source": "https://benchevolver.github.io/", "published_at": "2026-06-05 23:39:52+00:00", "updated_at": "2026-06-05 23:47:46.697774+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-research"], "entities": ["BenchEvolver", "LiveCodeBench"], "alternates": {"html": "https://wpnews.pro/news/benchevolver-frontier-task-synthesis-via-solution-centric-evolution", "markdown": "https://wpnews.pro/news/benchevolver-frontier-task-synthesis-via-solution-centric-evolution.md", "text": "https://wpnews.pro/news/benchevolver-frontier-task-synthesis-via-solution-centric-evolution.txt", "jsonld": "https://wpnews.pro/news/benchevolver-frontier-task-synthesis-via-solution-centric-evolution.jsonld"}}