# DeepSWE: Measuring frontier coding agents

> Source: <https://deepswe.datacurve.ai/>
> Published: 2026-05-27 19:57:16+00:00

by [Datacurve](https://datacurve.ai)

Measuring frontier coding agents on original, long-horizon engineering tasks

Today's leading public coding benchmarks are starting to saturate at the frontier: top models cluster within a narrow score band, and adjacent configurations often have overlapping confidence intervals. DeepSWE is a long-horizon software engineering benchmark built to separate them. It delivers four advances over existing public benchmarks:

- **Contamination-free**: Tasks are written from scratch, not adapted from existing commits or PRs, so no model has seen the solution during pretraining.
- **High diversity**: Tasks span a broad pool of 91 repositories across 5 languages.
- **Real-world complexity**: Prompts are ~half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens.
- **Reliable verification**: Verifiers are hand-written to test software behavior rather than implementation details.

The result is a benchmark that reflects how today's frontier coding agents actually perform in software engineering work.
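To make the saturation point concrete, here is a minimal sketch of how one might check whether two pass rates have overlapping 95% confidence intervals. It uses the Wilson score interval, which is one common choice and not necessarily the method DeepSWE uses. The function names and the pass counts are hypothetical; only the task count of 113 comes from this page.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial pass rate (z=1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical pass counts for illustration only; these are NOT DeepSWE results.
TOTAL_TASKS = 113
model_a = wilson_interval(80, TOTAL_TASKS)
model_b = wilson_interval(84, TOTAL_TASKS)

print(f"model A: {model_a[0]:.3f} to {model_a[1]:.3f}")
print(f"model B: {model_b[0]:.3f} to {model_b[1]:.3f}")
print("overlap:", intervals_overlap(model_a, model_b))
```

With a task set of this size, a difference of a few solved tasks leaves the two intervals overlapping, which is the kind of clustering a benchmark has to avoid if it is meant to separate models.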

## Leaderboard

All models are run with [mini-swe-agent](https://github.com/SWE-agent/mini-swe-agent); see [Why mini-swe-agent](/blog#evaluation-harness) for a comparison against other harnesses.

## Task Examples

[capricorn86/happy-dom](/data/tasks/happy-dom-abort-pending-body-reads) (TypeScript)

### Abort pending body reads on shutdown

Ensure interrupted request and response body reads, formData parsing, and discarded timers abort cleanly during shutdown.

[prometheus/prometheus](/data/tasks/prometheus-typed-label-sorting) (Go)

### Fix PromQL label sorting across typed and untyped values

PromQL label sorting must order mixed typed and untyped label values with stable typed comparison rules.

[c4spar/cliffy](/data/tasks/cliffy-config-file-parsing) (TypeScript)

### Add config file parsing to Cliffy commands

Add command-level config file loading, parsing, merging, and precedence handling.

[yjs/yjs](/data/tasks/yjs-map-conflict-detection) (JavaScript)

### Add deterministic map conflict detection to Y.Map writes

Add strict, deterministic conflict detection for Y.Map key writes with collect and error policies.

[wasmi-labs/wasmi](/data/tasks/wasmi-trap-coredumps) (Rust)

### Add trap coredump generation to wasmi

Generate opt-in Wasm coredumps on traps and attach the bytes to errors.

[beevik/etree](/data/tasks/etree-xml-diff-patch) (Go)

### Add XML diff, patch, and merge operations to etree

Add recursive XML diffing, patch generation and application, reverse patching, three-way merge, and diff summaries.

[All 113 tasks](/data/tasks)
