{"slug": "self-play-reinforcement-learning-under-imperfect-information-in-big-2", "title": "Self-Play Reinforcement Learning under Imperfect Information in Big 2", "summary": "Researchers developed a self-play reinforcement learning framework for the four-player imperfect-information card game Big 2, enabling controlled comparisons of different RL agents. Under standardized conditions, Proximal Policy Optimization (PPO) outperformed Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic opponents. The study found that moderate entropy regularization and current-policy self-play improved PPO's performance, establishing Big 2 as a useful controlled setting for studying deep RL under hidden information, multiplayer dynamics, and delayed rewards.", "body_md": "arXiv:2605.28863v1 Announce Type: new\nAbstract: Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We develop a self-play RL framework for Big 2 that enables controlled comparisons between policy-gradient and value-approximating agents. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic Big 2 opponents. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current-policy self-play provides a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets.", "url": "https://wpnews.pro/news/self-play-reinforcement-learning-under-imperfect-information-in-big-2", "canonical_source": "https://arxiv.org/abs/2605.28863", "published_at": "2026-05-29 04:00:00+00:00", "updated_at": "2026-05-29 04:17:29.665771+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "ai-research", "ai-agents"], "entities": ["PPO", "Monte Carlo Q approximation", "SARSA", "Q-learning", "Big 2"], "alternates": {"html": "https://wpnews.pro/news/self-play-reinforcement-learning-under-imperfect-information-in-big-2", "markdown": "https://wpnews.pro/news/self-play-reinforcement-learning-under-imperfect-information-in-big-2.md", "text": "https://wpnews.pro/news/self-play-reinforcement-learning-under-imperfect-information-in-big-2.txt", "jsonld": "https://wpnews.pro/news/self-play-reinforcement-learning-under-imperfect-information-in-big-2.jsonld"}}