# DeepSWE v1.1

> Source: <https://deepswe.datacurve.ai/blog/deepswe-v1-1>
> Published: 2026-06-19 02:35:45+00:00

Updated execution and grading for the same long-horizon engineering tasks.

Wenqi Huang, Peter Jiang · June 14, 2026

DeepSWE v1.1 keeps the same long-horizon engineering tasks as [v1](/blog/deepswe) but changes how agents are executed and scored: we now grade each agent's committed code in a clean, isolated environment, which makes results easier to reproduce, audit, and analyze. We also fixed dependency drift and removed flaky tests on some tasks.

With the updated setup, now including Claude Fable 5 and Kimi K2.7 Code, aggregate pass rates and model ordering remain close to v1.

Note: 73 of Claude Fable 5's 2,260 trials did not complete due to [access being suspended by a US government directive](https://www.anthropic.com/news/fable-mythos-access) partway through our sweep. Pass rates are computed over the completed trials.
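To make the denominator concrete, here is a minimal sketch of the pass-rate arithmetic, assuming a trial enters the denominator only if it completed. The trial counts come from the note above; the function name and the pass count are hypothetical placeholders, not reported results.

```python
def pass_rate(passed: int, total_trials: int, incomplete: int) -> float:
    """Pass rate over completed trials only; incomplete trials are excluded."""
    completed = total_trials - incomplete
    if completed <= 0:
        raise ValueError("no completed trials to score")
    if not 0 <= passed <= completed:
        raise ValueError("passed must be between 0 and the completed count")
    return passed / completed


# Counts from the note: 2,260 scheduled trials, 73 of which did not complete.
COMPLETED = 2260 - 73  # 2187 trials enter the denominator

# HYPOTHETICAL pass count, used only to exercise the function.
example_rate = pass_rate(passed=1000, total_trials=2260, incomplete=73)
```

Excluding incomplete trials means an interrupted sweep neither raises nor lowers a model's score; it only shrinks the sample the rate is computed over.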

Wall-clock time is no longer reported as it is highly dependent on external variables like host machine performance and provider load, making it an inconsistent metric.

## Explore

View the benchmark on GitHub, browse every rollout behind the numbers above, or run your own agent against the benchmark.

[Browse trajectories](/data/v1.1/trials) · [Run DeepSWE](/run) · [GitHub](https://github.com/datacurve-ai/deep-swe)

## What changed

- **Isolated Verification:** The agent commits its proposed changes, and we extract the git patch and evaluate it in an isolated container, separate from where the agent worked. This follows the same approach as SWE-bench and keeps grading independent of the agent's runtime environment, so results are reproducible.
- **Structured test reports:** Tests now emit a CTRF report that records each task-defining test by name and status. This gives us a per-test view of what passed and failed, which is useful for analyzing results and spotting partial progress on a task.
- **Natural Git Environment:** Rather than operating in detached HEAD mode, we set the `main` branch to the task's starting commit and ensure no future commits are visible. This lets agents work more naturally from the `main` branch, create feature branches, and explicitly commit their changes as they would in normal development.
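The isolated-verification step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the DeepSWE harness: a second local checkout stands in for the isolated container, and the helper names (`git`, `extract_patch`, `apply_in_clean_checkout`, `demo`) are hypothetical. It relies only on standard git commands (`diff`, `clone`, `checkout`, `apply`) and requires `git` on the `PATH`.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Optional

# Identity flags so commits work without any global git config.
IDENTITY = ["-c", "user.name=sketch", "-c", "user.email=sketch@example.invalid"]


def git(repo: Path, *args: str, stdin: Optional[str] = None) -> str:
    """Run a git command inside `repo` and return its stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *IDENTITY, *args],
        input=stdin, capture_output=True, text=True, check=True,
    )
    return result.stdout


def extract_patch(agent_repo: Path, base_commit: str) -> str:
    """Diff of everything committed on top of the task's starting commit."""
    return git(agent_repo, "diff", "--binary", base_commit, "HEAD")


def apply_in_clean_checkout(source: Path, base_commit: str, patch: str, dest: Path) -> None:
    """Check out the starting commit in a fresh directory and apply the patch."""
    git(source, "clone", "--quiet", str(source), str(dest))
    git(dest, "checkout", "--quiet", "--detach", base_commit)
    git(dest, "apply", "-", stdin=patch)  # "-" reads the patch from stdin


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        agent_repo = Path(tmp) / "agent"
        agent_repo.mkdir()
        git(agent_repo, "init", "--quiet")
        (agent_repo / "app.py").write_text("VALUE = 1\n")
        git(agent_repo, "add", "app.py")
        git(agent_repo, "commit", "--quiet", "-m", "task starting commit")
        base = git(agent_repo, "rev-parse", "HEAD").strip()

        # The "agent" commits one change and leaves an uncommitted file behind.
        (agent_repo / "app.py").write_text("VALUE = 2\n")
        git(agent_repo, "commit", "--quiet", "-am", "agent change")
        (agent_repo / "scratch.txt").write_text("never committed\n")

        clean = Path(tmp) / "clean"
        apply_in_clean_checkout(agent_repo, base, extract_patch(agent_repo, base), clean)
        return (clean / "app.py").read_text(), (clean / "scratch.txt").exists()


if __name__ == "__main__":
    print(demo())
```

In this sketch only committed changes reach the grading checkout: the edit to `app.py` carries over, while the uncommitted `scratch.txt` does not.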

To elaborate on the git environment: one concern with our previous environment construction method was that if an implementation similar to a task had been merged upstream, the agent could potentially cheat by finding it through `git log`. We conducted a sweep of the tasks' upstream repos to check whether any had implementations similar to our tasks as of June 5th. We found no such instances, meaning results from v1.0 remain free of this form of cheating.

Together, these changes make tasks harder to game. Because we grade only the committed patch in a separate container, some easier shortcuts no longer work: an agent can't monkey-patch the test framework, and because the CTRF report records each task-defining test by name, dropping tests or forcing an early exit shows up as missing or failed results rather than a pass.
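As a rough illustration of why dropped tests surface as failures, the sketch below grades a CTRF-style report against a fixed list of task-defining test names. It assumes the usual CTRF layout, with a `results.tests` array whose entries carry `name` and `status`. The grading rule (every required test must be present and `passed`), the `grade` function, and the test names are assumptions for illustration, not the benchmark's confirmed logic.

```python
import json


def grade(ctrf_report: dict, required_tests: list) -> dict:
    """Score a report against the tests that define the task.

    A required test that is absent from the report is recorded as
    "missing", so it can never count toward a pass.
    """
    if not required_tests:
        raise ValueError("a task needs at least one defining test")
    reported = {t["name"]: t["status"] for t in ctrf_report["results"]["tests"]}
    per_test = {name: reported.get(name, "missing") for name in required_tests}
    passed = sum(1 for status in per_test.values() if status == "passed")
    return {
        "resolved": passed == len(required_tests),
        "progress": passed / len(required_tests),  # partial-progress view
        "per_test": per_test,
    }


# Hypothetical report: one test passes, one fails, one was dropped entirely.
REPORT = json.loads("""
{
  "results": {
    "tool": {"name": "pytest"},
    "summary": {"tests": 2, "passed": 1, "failed": 1, "pending": 0,
                "skipped": 0, "other": 0, "start": 0, "stop": 1},
    "tests": [
      {"name": "test_parse", "status": "passed", "duration": 12},
      {"name": "test_render", "status": "failed", "duration": 30}
    ]
  }
}
""")
REQUIRED = ["test_parse", "test_render", "test_export"]

outcome = grade(REPORT, REQUIRED)
```

Because the required names come from the task definition and not from the report, removing `test_export` from the run changes its status to `missing` but does not shrink the set of tests that must pass.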

## Impact on Results

The chart below compares each model's pass rate under v1 and v1.1. Scores stay close: the ordering at the top is unchanged, and most configurations land within a few points of their v1 result.

The 10 configurations run in both versions. Hollow dot is v1, filled dot is v1.1; the change in points is at right.

## Citation

Please cite this work as:

```
@misc{datacurve2026deepswev11,
  title  = {DeepSWE v1.1: a cleaner, more reproducible benchmark for frontier coding agents},
  author = {Wenqi Huang and Peter Jiang},
  year   = {2026},
  url    = {https://github.com/datacurve-ai/deep-swe},
}
```

