arXiv:2605.28863v1. Abstract: Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We develop a self-play RL framework for Big 2 that enables controlled comparisons between policy-gradient and value-approximating agents. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic Big 2 opponents. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current-policy self-play provides a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets.
Self-Play Reinforcement Learning under Imperfect Information in Big 2
Researchers developed a self-play reinforcement learning framework for Big 2, a four-player imperfect-information card game, to enable controlled comparisons between policy-gradient and value-approximating agents. With the environment, input representation, training budget, and evaluation protocol held fixed, Proximal Policy Optimization (PPO) outperformed Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic opponents. Moderate entropy regularization improved PPO by keeping the policy from becoming overly deterministic, and current-policy self-play gave a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. The authors present Big 2 as a useful controlled setting for studying deep RL under hidden information, multiplayer interaction, delayed rewards, and variable action sets.
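
To make the entropy-regularization point concrete, here is a minimal sketch of a single-sample PPO clipped objective with an entropy bonus, computed over a variable set of legal actions. It is an illustration under stated assumptions, not the paper's implementation: the function names (`masked_softmax`, `entropy`, `ppo_loss`) are hypothetical, and the `clip_eps` and `entropy_coef` defaults are placeholder values, not settings reported by the authors.

```python
import math

def masked_softmax(logits, legal_mask):
    """Softmax restricted to legal actions; illegal actions get probability 0."""
    legal_logits = [l for l, ok in zip(logits, legal_mask) if ok]
    if not legal_logits:
        raise ValueError("at least one action must be legal")
    m = max(legal_logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) if ok else 0.0 for l, ok in zip(logits, legal_mask)]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability actions contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_loss(new_logp, old_logp, advantage, probs,
             clip_eps=0.2, entropy_coef=0.01):
    """Negative clipped surrogate minus an entropy bonus, for one sample.

    clip_eps and entropy_coef are illustrative placeholders (assumptions).
    """
    ratio = math.exp(new_logp - old_logp)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)
    # A larger entropy lowers the loss, which discourages the policy
    # from collapsing onto a single action too early.
    return -(surrogate + entropy_coef * entropy(probs))

# Hypothetical hand with four candidate plays, two of them legal.
probs = masked_softmax([0.0, 0.0, 0.0, 0.0], [True, True, False, False])
loss = ppo_loss(math.log(0.5), math.log(0.5), advantage=1.0, probs=probs)
```

Masking before the softmax keeps illegal plays out of both the sampling distribution and the entropy term, which is one common way to handle action sets that change from turn to turn.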