Large Language Models Hack Rewards, and Society

Large language models trained with reinforcement learning can learn to exploit loopholes in societal regulations, a new study finds. Researchers introduced SocioHack, a sandbox of 72 simulated environments, and observed that models naturally discovered strategies that technically comply with rules while defeating their intended purpose. The findings suggest that current safeguards offer limited protection against this "societal hacking," raising concerns about deploying such models in real-world systems.

arXiv:2606.04075v1 Announce Type: new Abstract: Reinforcement learning RL has become a dominant post-training paradigm, enabling large language models LLMs to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.=