DeepSWE: Measuring frontier coding agents

DataCurve released DeepSWE, a new benchmark for evaluating frontier coding agents on original, long-horizon software engineering tasks. The benchmark features contamination-free tasks written from scratch across 91 repositories in five languages. Its solutions require 5.5 times more code and roughly twice as many output tokens as those of SWE-bench Pro. DeepSWE aims to differentiate top-performing coding models that currently cluster within narrow score bands on saturated public benchmarks.

DeepSWE
by https://datacurve.ai

Measuring frontier coding agents on original, long-horizon engineering tasks

Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band where adjacent configurations often overlap on confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:

Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
High diversity: Tasks span a broad pool of 91 repositories across 5 languages.
Real-world complexity: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.

Leaderboard

All models are run with mini-swe-agent (https://github.com/SWE-agent/mini-swe-agent); see "Why mini-swe-agent" (/blog evaluation-harness) for a comparison against other harnesses.

Task Examples

capricorn86/happy-dom (typescript)
/data/tasks/happy-dom-abort-pending-body-reads
Abort pending body reads on shutdown
Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.

prometheus/prometheus (go)
/data/tasks/prometheus-typed-label-sorting
Fix PromQL label sorting across typed and untyped values
PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.

c4spar/cliffy (typescript)
/data/tasks/cliffy-config-file-parsing
Add config file parsing to Cliffy commands
Add command-level config file loading, parsing, merging, and precedence handling.

yjs/yjs (javascript)
/data/tasks/yjs-map-conflict-detection
Add deterministic map conflict detection to Y.Map writes
Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.

wasmi-labs/wasmi (rust)
/data/tasks/wasmi-trap-coredumps
Add trap coredump generation to wasmi
Generate opt-in Wasm coredumps on traps and attach the bytes to errors.

beevik/etree (go)
/data/tasks/etree-xml-diff-patch
Add XML diff, patch, and merge operations to etree
Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.

All 113 tasks: /data/tasks
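
To make the point about overlapping confidence intervals concrete, here is a minimal Python sketch. It assumes each task is scored pass/fail and uses a 95% Wilson score interval, which is one common choice and not necessarily the method behind the DeepSWE leaderboard. The pass counts are hypothetical and chosen only for illustration; the only figure taken from the page is the total of 113 tasks.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


TOTAL_TASKS = 113  # task count stated on the page

# Hypothetical pass counts, NOT real leaderboard results.
model_a = wilson_interval(60, TOTAL_TASKS)
model_b = wilson_interval(64, TOTAL_TASKS)
model_c = wilson_interval(30, TOTAL_TASKS)

if __name__ == "__main__":
    for name, ci in [("A", model_a), ("B", model_b), ("C", model_c)]:
        print(f"model {name}: [{ci[0]:.3f}, {ci[1]:.3f}]")
    print("A vs B overlap:", intervals_overlap(model_a, model_b))
    print("A vs C overlap:", intervals_overlap(model_a, model_c))
```

Under these assumed numbers, two models that differ by four solved tasks have overlapping intervals, while a model that solves half as many does not. This is the sense in which a benchmark with a narrow score band struggles to rank adjacent configurations.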