DeepSWE: Measuring frontier coding agents

Source: deepswe.datacurve.ai · Published: 2026-05-27T19:57Z · Topic: ai-agents

DataCurve has released DeepSWE, a new benchmark for evaluating frontier coding agents on original, long-horizon software engineering tasks. Its tasks are written from scratch to avoid contamination and span 91 repositories in five languages. Compared with SWE-bench Pro, solutions require 5.5 times more code and about twice as many output tokens. DeepSWE aims to differentiate top-performing coding models that currently cluster within narrow score bands on saturated public benchmarks.

Measuring frontier coding agents on original, long-horizon engineering tasks

Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band, and adjacent configurations often have overlapping confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:

Contamination free: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.

High diversity: Tasks span a broad pool of 91 repositories across 5 languages.

Real-world complexity: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.

Reliable verification: Verifiers are hand-written to test software behavior rather than implementation details.
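To illustrate what testing behavior rather than implementation details can mean, here is a minimal Python sketch. It is not taken from DeepSWE: the function names and test cases are hypothetical, and the real verifiers are hand-written per task. The point is only that a behavioral check accepts any implementation that produces the right observable result.

```python
from typing import Callable, List


def sort_labels_builtin(labels: List[str]) -> List[str]:
    """Hypothetical solution A: relies on the built-in sort."""
    return sorted(labels)


def sort_labels_insertion(labels: List[str]) -> List[str]:
    """Hypothetical solution B: a hand-rolled insertion sort."""
    out: List[str] = []
    for label in labels:
        i = len(out)
        # Walk left until the predecessor is not greater than the new label.
        while i > 0 and out[i - 1] > label:
            i -= 1
        out.insert(i, label)
    return out


def behavioral_verifier(impl: Callable[[List[str]], List[str]]) -> bool:
    """Checks only inputs and outputs, never how the result was computed."""
    cases = [[], ["b", "a"], ["job", "instance", "job"]]
    return all(impl(list(case)) == sorted(case) for case in cases)
```

Both hypothetical solutions pass the same verifier even though they share no code, whereas a check that asserted a specific helper was called would reject one of them.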

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.

Leaderboard

All models are run with mini-swe-agent; see Why mini-swe-agent for a comparison against other harnesses.
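When comparing leaderboard scores, it helps to see how wide a confidence interval can be on a benchmark of this size. The Python sketch below computes a Wilson score interval for a pass rate over 113 tasks. The article does not say how DeepSWE computes its intervals, so the choice of the Wilson method, the assumption of one pass/fail outcome per task, and the pass counts of 60 and 66 are all illustrative assumptions rather than reported results.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical pass counts for two models on a 113-task benchmark.
model_a = wilson_interval(60, 113)
model_b = wilson_interval(66, 113)
print(model_a, model_b, intervals_overlap(model_a, model_b))
```

Under these assumptions a six-task gap between the two hypothetical models still leaves their intervals overlapping, which is the kind of ambiguity a more discriminating benchmark is meant to reduce.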

Task Examples

capricorn86/happy-dom · TypeScript

Abort pending body reads on shutdown

Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.

prometheus/prometheus · Go

Fix PromQL label sorting across typed and untyped values

PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.

c4spar/cliffy · TypeScript

Add config file parsing to Cliffy commands

Add command-level config file parsing, merging, and precedence handling.

yjs/yjs · JavaScript

Add deterministic map conflict detection to Y.Map writes

Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.

wasmi-labs/wasmi · Rust

Add trap coredump generation to wasmi

Generate opt-in Wasm coredumps on traps and attach the bytes to errors.

beevik/etree · Go

Add XML diff, patch, and merge operations to etree

Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.

All 113 tasks

source & further reading

deepswe.datacurve.ai — original article
