Source: iwhalen.github.io

Show HN: Rogue-Bench – LLMs play the game Rogue

A new benchmark called Rogue-Bench tests how well large language models can play the classic dungeon crawler game Rogue. The tool runs a modified headless version of Unix Rogue 5.4.2, communicating with the game over pipes to parse terminal output and send keystrokes. Rogue-Bench accumulates statistics and logs for post-hoc analysis, enabling researchers to evaluate LLM gameplay performance.

2 min read · Published May 26, 2026

Rogue-Bench is a benchmark in which agents play Rogue. Specifically, it measures how well LLMs can play the classic dungeon crawler.

This work would not have been possible without the Rogue Collection. If you just want to play Rogue, head over there.

Once set up, you should be able to produce a result like this:

[Figure: GPT-5.4-mini playing Rogue.]

Get started

Note: Rogue-Bench compilation and runs have been tested on (WSL2) Ubuntu 24.04. If you are struggling to get something working locally, try the Docker setup.

Local

To run locally, execute:

git clone --recursive https://github.com/iwhalen/rogue-bench.git
cd rogue-bench
make install  # Install system level dependencies
make build    # Compile the custom headless Rogue executable
uv run rogue-bench --player human

This will start a "human" session where you can control Rogue with keyboard inputs. This is a good sanity check before setting up a real agent.

For all command line options, see:

uv run rogue-bench --help

For more on the Rogue-Bench CLI, see the project documentation.

Docker

To run Rogue in Docker, execute:

git clone --recursive https://github.com/iwhalen/rogue-bench.git
cd rogue-bench
make build-docker
uv run rogue-bench --docker-image rogue-bench --player human

Again, this will start in "human" mode.

How it works

Rogue-Bench runs a slightly modified, headless Rogue executable and communicates with it over pipes. The Python library reads Rogue's terminal output, (optionally) parses it into a screen state, and sends keystrokes back to the game.
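
As a rough illustration of the pipe-based loop described above, the sketch below drives a child process through stdin/stdout pipes. It is not Rogue-Bench's actual code: the child is a tiny stand-in script rather than the headless Rogue binary, the one-line-per-keystroke protocol is an assumption made to keep the example short, and `start_game`, `send_key`, and `play` are hypothetical names.

```python
import subprocess
import sys

# Stand-in for the headless game (an assumption for illustration only):
# it answers every key it receives with a single "screen" line.
CHILD_SOURCE = r"""
import sys
for line in sys.stdin:
    key = line.strip()
    if key == "Q":
        print("quit", flush=True)
        break
    print(f"screen after {key}", flush=True)
"""


def start_game():
    """Launch the stand-in child with pipes attached to stdin and stdout."""
    return subprocess.Popen(
        [sys.executable, "-u", "-c", CHILD_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
        bufsize=1,  # line-buffered on the parent side
    )


def send_key(proc, key):
    """Write one keystroke to the child and read back its response line."""
    proc.stdin.write(key + "\n")
    proc.stdin.flush()
    return proc.stdout.readline().rstrip("\n")


def play(keys):
    """Send each key in order and collect the responses."""
    # Popen as a context manager closes the pipes and waits for the child.
    with start_game() as proc:
        return [send_key(proc, key) for key in keys]
```

Calling `play(["h", "j", "Q"])` returns one stand-in response per key. A real terminal game emits raw screen updates rather than tidy lines, so an actual driver would have to read and interpret that byte stream instead.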

No Rogue gameplay elements have been changed. Specifically, the version of Rogue is fixed to Unix Rogue 5.4.2.
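
Because the game version is fixed, the layout of its screen should be stable, which makes the optional "screen state" parsing tractable. The sketch below parses a status line into a dictionary. The line format used here (`Level: … Gold: … Hp: …(…) Str: …(…) Arm: … Exp: …/…`) is an assumption based on the usual Rogue 5.4 status bar, and `STATUS_RE` and `parse_status` are hypothetical names, not part of Rogue-Bench's API.

```python
import re

# Assumed status-line layout; whitespace between fields is matched loosely
# because the game pads some of the numbers.
STATUS_RE = re.compile(
    r"Level:\s*(?P<level>\d+)\s+"
    r"Gold:\s*(?P<gold>\d+)\s+"
    r"Hp:\s*(?P<hp>\d+)\((?P<max_hp>\d+)\)\s+"
    r"Str:\s*(?P<strength>\d+)\((?P<max_strength>\d+)\)\s+"
    r"Arm:\s*(?P<armor>-?\d+)\s+"
    r"Exp:\s*(?P<exp_level>\d+)/(?P<exp_points>\d+)"
)


def parse_status(line):
    """Return the status fields as integers, or None if the line does not match."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}
```

Returning `None` for unrecognized lines lets a caller skip screens where the status bar is hidden or overwritten by a message.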

Each run accumulates statistics and metadata and logs every keystroke. This enables post-hoc analysis and makes it possible to replay an entire run.
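
One way to picture keystroke logging and replay is a JSON Lines log that is written during a run and read back later. This is a minimal sketch, not Rogue-Bench's actual log format: the `step` and `key` fields and the function names are hypothetical, and a faithful replay would also require the game to behave identically on the second pass (for example, by reusing the same random seed), which is assumed here.

```python
import io
import json


def log_keystroke(log_file, step, key):
    """Append one keystroke record as a JSON line."""
    log_file.write(json.dumps({"step": step, "key": key}) + "\n")


def load_keystrokes(log_file):
    """Read the logged keys back in the order they were recorded."""
    return [json.loads(line)["key"] for line in log_file if line.strip()]


def replay(keys, send_key):
    """Feed each logged key to send_key and collect whatever it returns."""
    return [send_key(key) for key in keys]


# Usage: record four moves into an in-memory log, then load them again.
log = io.StringIO()
for step, key in enumerate("hjkl"):
    log_keystroke(log, step, key)
log.seek(0)
recorded = load_keystrokes(log)
```

Keeping one record per line means a partially written log from an interrupted run can still be read up to its last complete line.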

For more specifics on the implementation, see the GitHub repository.

License

Note that the Python code for running Rogue-Bench is offered under the GPL-3.0 license.

The modified Rogue executables are under the same license(s) as the Rogue Collection. At the time of writing, this is a mix of GPL-3.0 and other licenses.

Rogue is a trademark of Epyx, Inc. Rogue-Bench is not associated with Epyx in any way.
