{"slug": "deepswe-measuring-frontier-coding-agents", "title": "DeepSWE: Measuring frontier coding agents", "summary": "DataCurve released DeepSWE, a new benchmark for evaluating frontier coding agents on original, long-horizon software engineering tasks. The benchmark features contamination-free tasks written from scratch across 91 repositories in five languages, with solutions that require 5.5 times more code and about twice as many output tokens as those of SWE-bench Pro. DeepSWE aims to differentiate top-performing coding models that currently cluster within narrow score bands on saturated public benchmarks.", "body_md": "# DeepSWE\n\nby [DataCurve](https://datacurve.ai)\n\nMeasuring frontier coding agents on original, long-horizon engineering tasks\n\nToday's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band, and adjacent configurations often have overlapping confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. 
It delivers four advances over existing public benchmarks:\n\n- **Contamination free**: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.\n- **High diversity**: Tasks span a broad pool of 91 repositories across 5 languages.\n- **Real-world complexity**: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.\n- **Reliable verification**: Verifiers are hand-written to test software behavior rather than implementation details.\n\nThe result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.\n\n## Leaderboard\n\nAll models are run with [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent); see [Why mini-swe-agent](/blog#evaluation-harness) for a comparison against other harnesses.\n\n## Task Examples\n\n[capricorn86/happy-dom (TypeScript)](/data/tasks/happy-dom-abort-pending-body-reads)\n\n### Abort pending body reads on shutdown\n\nEnsure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.\n\n[prometheus/prometheus (Go)](/data/tasks/prometheus-typed-label-sorting)\n\n### Fix PromQL label sorting across typed and untyped values\n\nPromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.\n\n[c4spar/cliffy (TypeScript)](/data/tasks/cliffy-config-file-parsing)\n\n### Add config file parsing to Cliffy commands\n\nAdd command-level config file loading, parsing, merging, and precedence handling.\n\n[yjs/yjs (JavaScript)](/data/tasks/yjs-map-conflict-detection)\n\n### Add deterministic map conflict detection to Y.Map writes\n\nAdd strict, deterministic conflict detection for Y.Map key writes with collect and error policies.\n\n[wasmi-labs/wasmi (Rust)](/data/tasks/wasmi-trap-coredumps)\n\n### Add trap coredump generation to wasmi\n\nGenerate opt-in Wasm coredumps on traps and attach the bytes 
to errors.\n\n[beevik/etree (Go)](/data/tasks/etree-xml-diff-patch)\n\n### Add XML diff, patch, and merge operations to etree\n\nAdd recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.\n\n[All 113 tasks](/data/tasks)", "url": "https://wpnews.pro/news/deepswe-measuring-frontier-coding-agents", "canonical_source": "https://deepswe.datacurve.ai/", "published_at": "2026-05-27 19:57:16+00:00", "updated_at": "2026-05-27 20:16:15.903056+00:00", "lang": "en", "topics": ["ai-agents", "large-language-models", "ai-research", "ai-tools", "ai-infrastructure"], "entities": ["DeepSWE", "DataCurve", "SWE-bench", "mini-swe-agent", "capricorn86", "happy-dom"], "alternates": {"html": "https://wpnews.pro/news/deepswe-measuring-frontier-coding-agents", "markdown": "https://wpnews.pro/news/deepswe-measuring-frontier-coding-agents.md", "text": "https://wpnews.pro/news/deepswe-measuring-frontier-coding-agents.txt", "jsonld": "https://wpnews.pro/news/deepswe-measuring-frontier-coding-agents.jsonld"}}