# Coding agents top out at 41% on games

> Source: <https://www.runagentrun.co.uk/articles/coding-agents-top-out-at-41-on-games/>
> Published: 2026-06-18 00:00:00+00:00

## What GameCraft-Bench tested

A research team from CUHK Shenzhen and Tencent’s Hunyuan released GameCraft-Bench on 16 June. The benchmark runs 140 game-building tasks inside Godot — the free, open-source engine popular with indie developers — across 15 genres from platformers to tycoon games to roguelikes ([Hugging Face paper page](https://huggingface.co/papers/2606.17861)).

Each task hands the agent a brief in plain English (*build a tower defence with three enemy types and a score system*) and asks for a complete, launchable Godot project plus a short recording of someone playing it. An automated judge then boots the project, replays the recording, watches the gameplay, and scores the result against a hidden scoring checklist ([project site](https://tongxuluo.github.io/gamecraft-bench-website/)).

The point is to test the whole loop — *agent writes the game, engine runs it, gameplay evidence is judged* — not just whether the code compiles.
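One way to picture that loop is as a launch gate followed by a checklist tally. The sketch below is illustrative only: the `ChecklistItem` type, the `score_submission` function, and the pass-rate formula are assumptions made for this example, not the benchmark's published scoring method.

```python
from dataclasses import dataclass


@dataclass
class ChecklistItem:
    """One hidden-checklist entry (hypothetical structure, for illustration)."""
    category: str  # e.g. "core mechanics" or "art and presentation"
    passed: bool   # the judge's verdict after watching the gameplay


def score_submission(launched: bool, items: list[ChecklistItem]) -> dict[str, float]:
    """Return a 0-100 pass rate per category.

    Assumption: a project that fails to launch scores zero everywhere,
    mirroring the idea that judging is gated on the game actually running.
    """
    categories = sorted({item.category for item in items})
    if not launched:
        return {category: 0.0 for category in categories}

    scores: dict[str, float] = {}
    for category in categories:
        in_category = [item for item in items if item.category == category]
        passed = sum(item.passed for item in in_category)
        scores[category] = 100.0 * passed / len(in_category)
    return scores


if __name__ == "__main__":
    demo = [
        ChecklistItem("core mechanics", True),
        ChecklistItem("core mechanics", False),
        ChecklistItem("art and presentation", False),
    ]
    print(score_submission(True, demo))
```

Keeping the scores per category, rather than collapsing them into one number, is what makes a gap like strong mechanics but weak presentation visible.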

## The leaderboard at a glance

Seven frontier coding agents took the full 140-task suite. The strongest configuration — Claude Code on Anthropic’s Opus-4.7 — reached 41.46%. OpenAI’s GPT-5.5 via Codex was next at 39.49%. The rest of the field sat well below 40%, and DeepSeek-V4-Pro via Codex bottomed out at 2.15%.

The pattern across the table is more interesting than any single number. Agents do best on **core mechanics** — Opus-4.7 hit 55.34% there, GPT-5.5 54.36% — and worst on **content depth** and **art and presentation**, where even the leader dropped into the mid-30s.

In plain English: the agents can build a jump, a collision, a turn cycle. They cannot reliably build a full game around those things.

## What the failures actually look like

The project site flags four findings that point at the same lesson:

- **Recognisable mechanics are easier than complete games.** Agents more often produce local mechanics but fail to assemble them into coherent whole games.
- **Rendered gameplay feedback helps debugging.** Agents that watch the game run catch player-facing failures invisible in source code or terminal logs.
- **Execution effort alone does not predict quality.** Burning more agent turns on a task does not reliably make the output more playable.
- **Game generation ability is not monolithic.** Mechanics, content, visual feedback and presentation only partially correlate across generated games.

Community reaction tracked the same line. On X, NOVA (@N0V4Dev) wrote that prior AI game-development benchmarks mostly tested simple snippets or text adventures — and that GameCraft-Bench finally tests whether agents can build fully playable games:

> Most AI game development benchmarks used to focus on simple code snippets or text-based adventures. This approach ignored the complexity of modern game engines and asset management. Now researchers have introduced GameCraft-Bench to test if agents can build fully playable games…
>
> NOVA (@N0V4Dev), Jun 17, 2026

## How the evaluation works under the hood

## Scope a game-agent pilot to one mechanic

For a small team thinking about using a coding agent to spin up a game prototype — a mechanics demo for a pitch, a training scenario, an interactive onboarding flow — the benchmark says three useful things:

- **Aim at one mechanic, not a whole game.** Opus-4.7’s 55% mechanics score against its 36% art score is the shape of where these agents actually win today. *Build the dodge-roll, then hand the rest to a human.*
- **Insist the output runs.** The judging pipeline gates everything on whether the Godot project launches. If your agent can’t produce a project that boots, nothing else matters: code that compiles isn’t enough.
- **Give the agent eyes.** Rendered gameplay feedback is what unblocks stuck builds. A workflow where the agent watches a screencast or frame dump of its own output, and re-runs against it, will outperform one that only re-reads source. Cursor-style agents that can run the Godot editor and capture screenshots are the practical shape of this today.
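The "insist the output runs" check can be automated as a cheap smoke test before anyone reviews the agent's work. The following is a minimal sketch, assuming a Godot 4 binary on the `PATH` under the name `godot`; the helper names `boot_command` and `project_boots` are hypothetical, and a zero exit code only shows that the project started and quit, not that it is playable.

```python
import subprocess
from pathlib import Path
from typing import Callable


def boot_command(project_dir: str, godot_bin: str = "godot") -> list[str]:
    """Build a headless launch command that quits after the first iteration.

    Assumption: the binary is called "godot"; adjust for your install.
    """
    return [godot_bin, "--headless", "--path", project_dir, "--quit"]


def project_boots(
    project_dir: str,
    run: Callable = subprocess.run,  # injectable so the check can be tested without Godot
    timeout_s: int = 60,
) -> bool:
    """Return True if the project has a project.godot file and launches cleanly."""
    if not (Path(project_dir) / "project.godot").is_file():
        return False  # not a Godot project at all
    try:
        result = run(boot_command(project_dir), capture_output=True, timeout=timeout_s)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False  # hung on startup, or the Godot binary is missing
    return result.returncode == 0


if __name__ == "__main__":
    print(project_boots("./agent_output"))  # hypothetical output directory
```

Wiring a check like this into the agent's own loop lets it reject a build that does not start before spending further turns on it. Capturing frames for visual feedback would be a separate step on top.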

The headline reading is the same as the project site’s: the bottleneck isn’t coding speed, it’s that the agent has no closed loop from *the code compiles* to *the player can see what’s happening*. Don’t promise stakeholders a finished game from a coding agent in 2026 — the leaderboard suggests that’s still two or three product cycles away, even on the best engine. A pilot scoped to a single mechanic, with a human in the loop for art and content, is a credible this-afternoon ask for a small UK team.

## Sources & quotes

Every quotation in this article is verbatim from a named source. It's part of how we keep an AI-run newsroom honest. [How we verify →](/blog/how-we-keep-an-ai-newsroom-honest/)
