Show HN: Rogue-Bench – LLMs play the game Rogue

A new benchmark called Rogue-Bench tests how well large language models can play the classic dungeon crawler game Rogue. The tool runs a modified headless version of Unix Rogue 5.4.2, communicating with the game over pipes to parse terminal output and send keystrokes. Rogue-Bench accumulates statistics and logs for post-hoc analysis, enabling researchers to evaluate LLM gameplay performance.

Rogue-Bench

Rogue-Bench is a benchmark where agents play Rogue. Specifically, it measures how well LLMs can play the classic dungeon crawler.

This work would not be possible without Rogue Collection https://github.com/mikeyk730/Rogue-Collection . If you just want to play Rogue, head over there.

Once set up, you should be able to produce a result like this:

[Figure: GPT-5.4-mini playing Rogue.]

Get started

Note: Rogue-Bench compilation and runs have been tested on WSL2 Ubuntu 24.04. If you are struggling to get something working locally, try the Docker setup.

Local

To run locally, execute:

    git clone --recursive https://github.com/iwhalen/rogue-bench.git
    cd rogue-bench
    make install  # Install system level dependencies
    make build    # Compile the custom headless Rogue executable
    uv run rogue-bench --player human

This will start a "human" session where you can control Rogue with keyboard inputs. This is a good sanity check before setting up a real agent.

For all command line options, see:

    uv run rogue-bench --help

For more on the Rogue-Bench CLI, see the CLI documentation (cli/).

Docker

To run Rogue in Docker, execute:

    git clone --recursive https://github.com/iwhalen/rogue-bench.git
    cd rogue-bench
    make build-docker
    uv run rogue-bench --docker-image rogue-bench --player human

Again, this will start in "human" mode.

How it works

Rogue-Bench runs a slightly modified, headless Rogue executable and communicates with it over pipes. The Python library reads Rogue's terminal output, optionally parses it into a screen state, and sends keystrokes back to the game.

No Rogue gameplay elements have been changed. Specifically, the version of Rogue is fixed to Unix Rogue 5.4.2.

Runs accumulate statistics and metadata, and log keystrokes. This allows post-hoc analysis and makes it possible to replay an entire run.

For more specifics on the implementation, see the GitHub repository https://github.com/iwhalen/rogue-bench .
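
The pipe-based loop described under "How it works" can be sketched roughly as follows. This is not Rogue-Bench's actual code: `PipeSession`, `FAKE_GAME`, and `demo` are hypothetical names, and the child process is a tiny stand-in script rather than the headless Rogue executable, whose real terminal output is richer than a single line per keystroke. The sketch only illustrates the general idea of writing keystrokes to a child's stdin, reading its output from stdout, and keeping a keystroke log that could later drive a replay.

```python
import subprocess
import sys

# Stand-in for the headless Rogue executable (an assumption for illustration only):
# a child process that reads one byte at a time and reports it back on stdout.
FAKE_GAME = r"""
import os
while True:
    key = os.read(0, 1)          # block until one keystroke arrives
    if not key or key == b"Q":   # EOF or the quit key ends the "game"
        break
    os.write(1, b"screen after " + key + b"\n")
"""


class PipeSession:
    """Hypothetical wrapper around a child process driven over pipes."""

    def __init__(self, argv):
        self.proc = subprocess.Popen(
            argv, stdin=subprocess.PIPE, stdout=subprocess.PIPE
        )
        self.keystroke_log = []  # recorded so the run could be replayed later

    def send_key(self, key: str) -> None:
        self.keystroke_log.append(key)
        self.proc.stdin.write(key.encode("ascii"))
        self.proc.stdin.flush()  # push the byte through the pipe immediately

    def read_screen(self) -> str:
        # The stand-in emits exactly one line per keystroke.
        return self.proc.stdout.readline().decode("ascii").rstrip("\r\n")

    def close(self) -> int:
        self.proc.stdin.close()
        self.proc.stdout.close()
        return self.proc.wait()


def demo():
    """Send two movement keys and a quit key, collecting what comes back."""
    session = PipeSession([sys.executable, "-c", FAKE_GAME])
    screens = []
    for key in "hj":
        session.send_key(key)
        screens.append(session.read_screen())
    session.send_key("Q")
    exit_code = session.close()
    return screens, session.keystroke_log, exit_code


if __name__ == "__main__":
    print(demo())
```

Replaying a run under this sketch would amount to feeding a saved `keystroke_log` back through `send_key` in order. That only reproduces the same game if the game itself is deterministic for that run, which the sketch does not address.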

License

The Python code for running Rogue-Bench is offered under the GPL-3.0 license. The modified Rogue executables are under the same license(s) as the Rogue Collection https://github.com/mikeyk730/Rogue-Collection . At the time of writing, this is a mix of GPL-3.0 and other licenses.

Rogue is a trademark of Epyx, Inc. Rogue-Bench is not associated with Epyx in any way.