MirrorCode: What's the largest software project AI can complete on its own?

wpnews.pro

cd /news/artificial-intelligence/mirrorcode-what-s-the-largest-softwa… · home › topics › artificial-intelligence › article

[ARTICLE · art-41267] src=epoch.ai ↗ pub=2026-06-26T20:14Z topic=artificial-intelligence verified=true sentiment=· neutral

MirrorCode: What's the largest software project AI can complete on its own?

Epoch AI and METR released MirrorCode, a benchmark testing AI models on reimplementing entire programs from scratch. Claude Opus 4.7 achieved a 56% score, solving a bioinformatics toolkit in 14 hours for $251, a task estimated to take humans 2–17 weeks. The benchmark reveals rapid AI improvement but significant room for progress.

read4 min views1 publishedJun 26, 2026

MirrorCode: What's the largest software project AI can complete on its own? — Image: source

AI has made rapid progress on software engineering benchmarks in the past few years. However, most such benchmarks tend to focus on shorter tasks like fixing bugs or implementing individual features. MirrorCode is our benchmark, co-developed with METR, to test AI models on long-horizon coding tasks. In a MirrorCode task, AI models are tasked with reimplementing an entire program end-to-end, without access to the original source code. AI-generated solutions must match the original program’s output exactly on end-to-end tests, including held-out tests. MirrorCode’s 25 target programs span different areas of computing: Unix utilities, data serialization and query tools, bioinformatics, interpreters, static analysis, cryptography, and compression.

How MirrorCode is different #

Scale-aware evaluations

Crucially, we provide a large enough inference budget to make a serious attempt at MirrorCode tasks. Many existing software engineering benchmarks limit inference spending to around $1–10, even when the task would take weeks for a human to complete. For example, one of the largest MirrorCode tasks cost $2,600 for a single run and involved AI working for 19 days without human intervention.

Difficult, but fair

Reimplementing entire programs is extremely challenging for human software engineers. We believe a human engineer without AI would take months to solve the most complex MirrorCode tasks. However, MirrorCode tasks are also feasible; we know that there is enough information for the tasks to be fair.

Cheat-resistant by design

We sandbox AI models, requiring them to conduct their work without access to the internet, without access to the original codebase, and with no way to cheat on the task. There are end-to-end tests that models never see while developing their code, so they cannot simply create a lookup table to mimic the original program's outputs.

AI can already perform some long-horizon coding tasks #

AI can already solve long-horizon MirrorCode tasks, despite their difficulty. For example, Claude Opus 4.7 reimplemented gotree: a bioinformatics toolkit with ~16,000 lines of Go and 40+ commands. 1 We believe this same task would take a human engineer without AI assistance 2–17 weeks. Opus 4.7 solved it in 14 hours, costing $251.

However, MirrorCode is not fully solved. Claude Opus 4.7’s headline score is only 56%, meaning there is significant room for further improvement. 2 We look forward to evaluating new models on the benchmark.

We also found that AI models are improving rapidly over time. Leading models from a year ago would have scored about 30%, and were limited to simpler programs, such as a calendar utility. There was no clear overall trend in cost: GPT-5.5 cost 3× more than GPT-5 to solve the same tasks, whereas Claude Opus 4.7 was 3× cheaper than Claude Opus 4.1.

One important caveat to these results is data contamination. Because MirrorCode tasks involve reimplementing open-source programs, AI models are likely to have seen the original codebases in pretraining. This might lead to inflated performance on the benchmark. However, AI successfully reimplemented several target programs that passed our memorization screen, and failed to reimplement programs where the screen showed evidence of memorization. This suggests that the results were not dominated by memorization, but we cannot rule out the possibility that memorization contributes to AI performance. Overall, we expect that the capabilities measured by MirrorCode would generalize to an unseen codebase. We discuss this further, along with more results and details on benchmark construction, in the paper.

Open-source code #

We release our scaffold and 22 of the 25 MirrorCode target programs (totaling 132 task instances across the six supported programming languages) as open-source, with the other three targets held out as a private test set.

This work was co-developed with METR and supported by a grant from METR. The authors of MirrorCode are Tom Adamczewski, David Owen, and David Rein. Florian Brand, Giles Edkins, Allen Hart, and Daniel O’Connell contributed additional target programs. Rasmus Faber-Espensen made crucial infrastructure improvements and gave advice on engineering

The best-scoring AI gotree implementations passed 2000/2001 tests, but failed a single edge-case test for a niche command to manipulate date annotations. Consequently, they do not strictly solve the task to 100% completion, but we consider the reimplementation near-perfect, covering essentially all scoped functionality.

On 21/25 MirrorCode targets, AI models have at least once passed 99% of tests or more. Typically, outstanding test failures are from a handful of edge cases. At the stricter threshold of reimplementation (100% of tests passing), eight MirrorCode targets have never been solved in any run. Benchmark scores are lower than 17/25 ≈ 70% because several targets are not solved reliably: AI solves them only in some runs.

source & further reading

epoch.ai — original article

~/api · this article 200

$curl api.wpnews.pro/v1/news/mirrorcode-what-s-the-la…

Read original on epoch.ai → epoch.ai/MirrorCode

mentioned entities

Epoch AI

METR

Claude Opus 4.7

MirrorCode