Source: mtgautodeck.com

MTG Bench: Testing how well LLMs can play Magic

A new benchmark called MTG Bench tests how well large language models can play Magic: The Gathering by allowing them to control game actions through an MCP server with primitive library operations. The benchmark reveals that while LLMs can successfully execute complex turns involving scry, discover, and tutor effects, they struggle with legal play and are better at evaluating turn legality than performing legal simulations. The benchmark also exposes significant cost differences between AI providers, with OpenAI charging input tokens only once per agent loop while Anthropic charges for the system prompt after every tool call.

5 min read · published Jun 11, 2026

Results #


Example successes

  • Fable 5 plays a scry land and looks at the top card of the deck
  • Gemini 3.5 flash performs a complex turn with scry, discover, and tutor effects

Example failures

How the benchmark works #

The main idea is that if an LLM is smart enough to play good Magic, then it is also smart enough to not need a rules engine. A rules engine that enforces legal actions would raise the performance floor, but I don't think it would improve the overall quality of the simulation.

Each LLM call has access to an MCP server with primitive library operations. It can do things like draw a card from the top of the deck, return a card to the bottom of the deck, and shuffle. To simulate more advanced operations, like scry or surveil, it can chain multiple library tool calls.
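To make the idea of composing primitives concrete, here is a minimal sketch of how a scry 1 could be built from them. The `Library` class, its method names, and the `return_to_top` operation are assumptions made for illustration; the actual MCP tool names and signatures may differ.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Library:
    """Hypothetical stand-in for the deck state behind the MCP server."""
    cards: List[str] = field(default_factory=list)  # index 0 is the top card

    def draw_from_top(self) -> str:
        # Remove and return the top card.
        return self.cards.pop(0)

    def return_to_top(self, card: str) -> None:
        # Assumed primitive: put a card back on top of the deck.
        self.cards.insert(0, card)

    def return_to_bottom(self, card: str) -> None:
        # Put a card on the bottom of the deck.
        self.cards.append(card)

    def shuffle(self, rng: Optional[random.Random] = None) -> None:
        # Shuffle in place; pass a seeded Random for reproducible runs.
        (rng or random).shuffle(self.cards)


def scry_one(library: Library, keep_on_top: Callable[[str], bool]) -> str:
    """Scry 1 from primitives: look at the top card, then keep it or bottom it."""
    card = library.draw_from_top()
    if keep_on_top(card):
        library.return_to_top(card)
    else:
        library.return_to_bottom(card)
    return card
```

In this sketch the decision function `keep_on_top` plays the role of the LLM: the server only moves cards, and everything about whether the move is legal or wise is left to the caller.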

Everything other than the library is managed by the LLM. Legality checks and scoring for the benchmarks were all done with gpt-5.5 (medium). From my testing, LLMs were much better at evaluating whether a simulated turn was legal than they were at actually performing a legal turn simulation.

Why I chose to use an MCP server #

I have full control over all of the data and the LLM api calls, so why use MCP instead of basic function/tool calling?

The main reason is that OpenAI and Anthropic allow you to provide a remote MCP server url in an api request. This means that the provider, not your application code, handles the agent loop. This has two major benefits.

  • Since it is one api call, you don't pay the cached input token cost after each tool use (at least with OpenAI; more on that later)
  • You can use the batch api for 50% savings without having to submit a new batch after every tool call

Input token caching #

In my opinion, the way cached input tokens are charged does not make sense for agent loops. The pricing makes sense for independent requests. If multiple independent api calls start with the same large system prompt, input caching gets you a discount for free, or for a small caching fee.

With an agent loop, however, you are charged the cached input cost for a large system prompt after every tool call. Consider an example. Assume the system prompt is already cached and tool calls result in negligible token use.

- Large system prompt = 10k tokens
- Agent calls 10 tool functions (not parallel)
- Billed cached input tokens = 10k + 10k * 10 = 110k tokens
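The arithmetic above can be written as a small function. This is only a sketch under the same simplifying assumptions (prompt already cached, tool calls and their results costing a negligible number of tokens); the function and parameter names are made up for illustration and are not part of any provider's API.

```python
def billed_input_tokens(system_prompt_tokens: int, tool_calls: int,
                        charged_every_turn: bool) -> int:
    """Estimate billed (cached) input tokens for one agent loop.

    Ignores tool-call output tokens and tool-result input tokens.
    """
    if charged_every_turn:
        # One charge for the initial request, plus one after each tool call.
        return system_prompt_tokens * (1 + tool_calls)
    # The prompt is charged once for the whole loop.
    return system_prompt_tokens


per_turn = billed_input_tokens(10_000, 10, charged_every_turn=True)   # 110000
once = billed_input_tokens(10_000, 10, charged_every_turn=False)      # 10000
```

Under these assumptions the billed total grows linearly with the number of sequential tool calls when the prompt is charged every turn, and stays flat when it is charged once.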

I don't think it makes sense to charge for the system prompt after every agent turn if the LLM is only pausing for a fraction of a second while waiting for a tool function result. This is overlooking some details, like how it takes output tokens to call a tool, and the tool function result still needs to be processed as input tokens. But in my case, the api cost is dominated by the large system prompt being charged as cached input tokens after every agent turn.

The pricing for an agent loop is understandable when your application code has the agent loop and is making a new api call after each tool call. But it makes even less sense when you provide a remote MCP server and do not handle the agent loop yourself. OpenAI handles it correctly. A single api call to OpenAI with a remote MCP server will only ever charge you for the input prompt once. An Anthropic api call with a remote MCP server, however, works like the previous example.

Some real numbers: the gpt-5.5 (medium) benchmark averaged 11,386 input tokens per Magic turn. The average for claude-fable-5 (medium) was 51,610, about 4.5 times as many.

Overeager tool calling #

More than most benchmarks, this one punishes models that are too eager to call tools. In many other settings, tool calls only retrieve information, so if a model calls too many tools, the only downside is wasted input tokens and context window for the tool results. Even if the tool mutates state, it can usually be undone so the final result is correct.

This is not the case when simulating Magic. If you draw a card, then realize that was a mistake, you can't just put it back. Even if you do return the card to the deck, you now know what that card is, so the simulation is illegal.
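A tiny sketch of why the undo does not work: restoring the deck order does not restore the hidden information. `TrackedLibrary` and its `revealed` log are hypothetical names used only for this illustration, not part of the benchmark's code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrackedLibrary:
    """Hypothetical library that also records which cards the player has seen."""
    cards: List[str]                                    # index 0 is the top card
    revealed: List[str] = field(default_factory=list)   # cards the player now knows

    def draw_from_top(self) -> str:
        card = self.cards.pop(0)
        self.revealed.append(card)  # knowledge cannot be taken back
        return card

    def return_to_top(self, card: str) -> None:
        # Restores the card's position, but not the player's ignorance of it.
        self.cards.insert(0, card)


library = TrackedLibrary(["Lightning Bolt", "Mountain", "Shock"])
original_order = list(library.cards)

card = library.draw_from_top()   # the mistaken draw
library.return_to_top(card)      # the attempted "undo"

order_restored = library.cards == original_order    # True
information_leaked = len(library.revealed) > 0      # True
```

The deck looks identical afterwards, yet the player knows the top card, which is exactly the state a legality check has to reject.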

A common failure mode was the model starting a tool call, then realizing it was a mistake and having no way to correct it. All the library MCP functions have a required reason field. If you look at this example from Opus 4.8, you can see that it draws a card for turn with reason "Draw for turn", then returns the card to the deck with reason "No-op check not needed; cancel". It then proceeds to return a card named "x" to the deck with reason "noop", then again with reason "stop".
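For illustration, a tool with a required reason field might be declared roughly like this. MCP tools describe their arguments with a JSON Schema under `inputSchema`; the tool name, description, and `card_name` argument here are assumptions, not the benchmark's actual definitions.

```python
from typing import Any, Dict, List

# Hypothetical tool definition in the shape MCP uses (name, description, inputSchema).
RETURN_TO_BOTTOM_TOOL: Dict[str, Any] = {
    "name": "return_card_to_bottom",  # assumed name
    "description": "Put the named card on the bottom of the library.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "card_name": {"type": "string"},
            "reason": {"type": "string"},
        },
        "required": ["card_name", "reason"],
    },
}


def missing_arguments(tool: Dict[str, Any], arguments: Dict[str, Any]) -> List[str]:
    """Return the required argument names that are absent or blank."""
    required = tool["inputSchema"]["required"]
    return [name for name in required
            if not str(arguments.get(name, "")).strip()]
```

Note that a check like this only confirms that a reason was supplied. A call with reason "noop" passes it, so the field documents the model's intent rather than preventing a bad call.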

What's next #

I made MTG Auto Deck as a way to try out vibe coding. I had not been keeping up with the state of LLM-based coding, and I ended up making this project and the benchmark without writing a single line of code by hand.

I only made a live version with accounts and payments because of how quick it was to implement. The [project is on GitHub](https://github.com/CallumFerguson/mtg-auto-deck) if you want to run it and use your own api keys or local llama.cpp.

I wouldn't actually recommend paying for the live version. With the current cost and speed of models that can accurately play Magic, the app does not provide much utility. Simulating turns one at a time is slower than manually goldfishing your deck using one of the online tools. And it is too expensive to run dozens of simulations in parallel and give you a summary.

As better, cheaper LLMs are released, I think there is some version of this app that would be useful. I can imagine running hundreds of simulations, then giving statistical results about which cards are good and bad. Or automatically optimizing a deck by swapping out cards for you.
