Thinking outside the box? LLM analysis of simplified cooperative poker

wpnews.pro

The Gang is a cooperative poker game. A friend introduced me to it yesterday and while it was quite fun, the rules are all over the place. In particular, there's are multiple trivial and boring solutions that win every time, as each player can fully communicate their hidden cards rather easily.

The most trivial is: pick a token, hold it for N seconds where N is the rank of the card. Put the token back. Repeat for your other card. This way all players will know exactly what cards you hold. You can communicate suit with token number. If the timing side-channel is prohibited, you can do morse code or whatever other method you wish by alternating which token you pick. Naturally such methods take all fun out of the game, and to my great astonishment you can simply refuse to abuse them. I was, however, rather interested in seeing if any LLMs could figure this out.

As timing side-channels are rather annoying for automatic play, especially as LLMs can take a lot of time to think, we'll consider a simplified version of the game instead. The most important change is that we only play the river phase, i.e. all five community cards have been revealed and only the final round of chip assignment remains. No modifier cards are used. The game is played in a turn-based fashion with a fixed upper limit. By tuning the turn limit, it's possible to enforce probabilistic play.

There are N players and N tokens numbered 0 to N−1. Token K means "exactly K other players have a worse hand than me". All tokens start from the unclaimed token pool. Hands are ordered using standard Texas hold 'em rules. Players take turns; on each turn a player may:

The game ends immediately once every player holds a token, or after a fixed round limit. Victory requires every player to hold the correct token: token K for a player who beats exactly K others. Tied hands are interchangeable, and any assignment within a tied group is accepted.

The only allowed communication channel is the observable sequence of token actions.

I wrote a piece of software to make LLMs play the game. Local models weren't that good, although both qwen3.6

and llama3.3:70b

seem to win a bit more often than by random chance [1] . I decided to ask for a proper analysis on frontier LLMs first before hooking them up to the system, as paying for API credits wasn't isn't my favorite thing to do unless I expect interesting results.

I didn't bother with an optimal solution. The repo contains a decent deterministic solution, though. It always completes a four-player game within seven rounds. I perhaps the optimal solution could go down to six or even five, but this is good enough for me.

I attempted to analyze the game with some LLMs. Results were... mixed.

Analyze the following game, and describe your strategy. You can assume that you're playing with other LLMs that are approximately as good as you are.

You are playing a cooperative poker hand-ranking game with 4 players.
You can see your own hole cards and the 5 community cards. You cannot see other players' hole cards.
There are 4 tokens numbered 0 to 3.
Token N means: exactly N other players have a worse hand than you.
So token 0 = worst hand, token 3 = best hand.
The hands are compared using standard texas hold'em rules.
All cards are originally dealt from a standard 52-card deck.
On your turn you must choose one of:
- PASS (do nothing)
- TAKE a token: take any token, either from another player or from the unclaimed pool.
  Your previously held token, if any, goes back to the unclaimed pool.
  If that token was held by another player, they no longer have a token.
- RETURN: give back the token you are currently holding to the unclaimed pool.
  If you currently hold no token, RETURN is invalid.
You have multiple rounds that you can use to coordinate with other players,and you can signal about your hand strength using > your actions.
On each round, players take turns choosing an action starting from Player 0.
You cannot communicate except through your token actions.
The game ends immediately once every player holds a token, or after 50 rounds.
Once the last token is taken, the game ends instantly allowing no further corrections.
It's your goal to ensure that every player holds the correct token by the end of the game.
Victory: every player holds the token whose number equals how many players have a strictly worse hand than them.
Tied hands are treated as interchangeable and can hold any of the corresponding tokens.
Loss: at termination, any player holds an incorrect token or no token.
This is a cooperative game: you all win or lose as a team.

| Model | Without hint | When hinted | Schelling point | |---|---|---|---| | GPT 5.5 | Basic idea for the right solution is there | Full information theoretic solution, but neglects edge cases and thus wouldn't work zero-shot | Reasonable picks | | GPT 5.4 thinking mini | Probabilistic solution, somewhat vague | Protocol would work, but inefficient | Inelegant | | Opus 4.7 (adaptive) | Probabilistic solution, somewhat vague | Protocol would work, but inefficient | Slightly inelegant | | Opus 4.6 (thinking) | Probabilistic solution, really vague | Protocol would work, but inefficient | Slightly inelegant | | Qwen 3.6 | Misunderstands the rules slightly | Claims that 100% accuracy is not possible | N/A | | Llama 3.3 70b | Probabilistic solution, somewhat vague | Confindently responds with a losing strategy | N/A |

I used the following prompt to hint the models when they did not get it right at first.

I'm pretty sure the game is winnable 100% of the time with this number of rounds. Can you figure out how or provide a counterexample?

For this game, the key insight is that you can communicate full information instead of doing the thing the game's supposed to be about. Since there are only a handful of bits of hidden info, you can encode it in a couple of rounds. Coordinating around how to interpret this info becomes the hard part.

When not hinted, only GPT 5.5 did even look into this direction, and it indeed do decently well.

I’d use a fully deterministic signaling protocol that transmits exact hand strength, not vague confidence.

The key observation is:

The game ends instantly once all 4 players hold tokens.

So during the communication phase, we reserve one token permanently unused.

Example:

That guarantees the game cannot end early.

Each player computes their exact Hold’em rank:

Use a canonical evaluator shared by all players.

Now encode that integer publicly.

Use fixed speaking windows.

Example:

All non-speakers simply PASS.

This removes ambiguity about who is encoding.

During communication, only these actions are used:

Token 3 remains reserved.

This gives a 4-symbol alphabet:

Action	Digit
PASS	0
TAKE 0	1
TAKE 1	2
TAKE 2	3

So each turn communicates one base-4 digit.

Since:

[ 4^7 = 16384 > 7462 ]

7 turns are enough to encode the exact hand rank.

Each player transmits:

[ R = d_6 d_5 d_4 d_3 d_2 d_1 d_0 ]

in base 4 across their 7-turn slot.

Because the protocol is deterministic and public, all players reconstruct all four exact hand strengths.

After round 28:

Now execute a deterministic finishing sequence.

Example:

Because all players already know the exact solution, there is no uncertainty left.

This strategy is close to information-theoretically optimal:

Most weaker strategies fail because they try to communicate only relative strength (“I think I’m strongest”), which creates coordination ambiguity under ties and uncertainty.

This protocol instead converts the game into:

Given sufficiently rational LLM agents sharing the protocol, win rate should approach 100%.

If the protocol would have worked, I asked if and how agents could converge on it without pre-sharing, rather directly:

You cannot pre-share the code. Do you think there's a schelling point that everyone would converge to?

Interestingly enough, all four models converge on different protocols when explicitly asked to consider the Schelling point. GPT 5.5 states that:

When multiple independently-designed protocols all optimize the same constraints, they tend to collapse toward the same structure. I would expect near-perfect convergence among strong LLMs.

Opus 4.7 instead recommends falling back to a simpler probabilistic strategy in practice, and comments the choice:

The protocol: "Honest claim + yield to later challengers." Not the broadcast protocol above — that one's information-optimal but it requires four LLMs to independently invent the same encoding (PASS=0, TAKE 0=1, TAKE 1=2, …? base-5 vs base-6? whose turn first? which 5 rounds per player?). Too many arbitrary choices to converge on without communication. The natural-language protocol has far fewer free parameters. Every LLM asked "how should we play this" produces something close to it.

Perhaps the limit of 50 rounds was too much and that's why LLMs were happy with suboptimal solutions? At no point I mentioned that lower round count would matter. Sometimes models would look at doing it more efficiently if prompted.

Would be nice to see how often two humans could converge on a single protocol. I would assume that they do so quite rarely.

No numbers here, they're too slow to run that I'd bother with proper statistics ↩︎

All runs are approximately [4] n=1, so take with a grain of salt

I never did actually made them write (pseudo)code and actually try the solutions. I don't think that was the point here. In any case, I predict that zero-shotting that wouldn't have produced good results. I tried once with GPT 5.5 and the code was indeed buggy. ↩︎

This is not science but playing around ↩︎

source & further reading

lesswrong.com — original article The getting is good (optimizing unattended runs) Jailbreak Patching with SOO-Style Conceptual Fusion All Watched Over

Thinking outside the box? LLM analysis of simplified cooperative poker

Run your AI side-project on zahid.host