A Few Bad Apples Spoil the Bunch: Preventing Global Entropy Collapse Driven by a Small Set of Tokens in LLM Reasoning

A paper in Findings of ACL 2026 reports that entropy collapse in LLM reasoning, which undermines test-time scaling, is driven by premature overconfidence at a small set of critical tokens. The authors propose SCOPE, a method that applies selective KL regularization to only the top ∼5% of tokens ranked by a redistribution score, and report consistent improvements on math reasoning benchmarks across model scales and architectures under both RLVR and RLIF training.
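The selective regularization idea can be pictured with a small sketch. This is not the authors' implementation. It assumes a per-token score and a per-token KL value are already available as plain lists. The function names `select_top_fraction` and `selective_kl_penalty` are hypothetical, and `scores` is a placeholder input, since the source does not define how the redistribution score is computed.

```python
import math


def select_top_fraction(scores, fraction=0.05):
    """Return a 0/1 mask marking the highest-scoring `fraction` of tokens.

    `scores` stands in for a per-token score; how the paper computes its
    score is not specified here, so any list of floats is accepted.
    """
    if not scores:
        return []
    # Always keep at least one token so the penalty is never empty.
    k = max(1, math.ceil(fraction * len(scores)))
    # Indices of the k largest scores; ties go to the earlier position.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [0] * len(scores)
    for i in top:
        mask[i] = 1
    return mask


def selective_kl_penalty(token_kl, scores, fraction=0.05):
    """Average the per-token KL values over the selected tokens only."""
    mask = select_top_fraction(scores, fraction)
    selected = sum(mask)
    if selected == 0:
        return 0.0
    return sum(kl * m for kl, m in zip(token_kl, mask)) / selected
```

In a training loop, a penalty of this kind would typically be added to the policy loss with a weighting coefficient. That combination is an assumption for illustration, not a detail taken from the paper. The design point is that unselected tokens contribute nothing to the regularizer, so most of the sequence is left unconstrained.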

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning from Internal Feedback (RLIF) often fail to benefit from test-time compute due to entropy collapse and the resulting loss of reasoning diversity. We show that this collapse is driven not by uniform entropy decay, but by premature overconfidence at a small number of structurally critical decision points. Based on a token-level analysis of GRPO-style policy optimization, we propose SCOPE (Structural Collapse-aware Optimization via Partial Entropy control), which assigns each generated token a redistribution score and applies selective KL regularization to only the top ∼5% of tokens under this score. Across model scales and architectures on math reasoning benchmarks, SCOPE consistently improves performance under both RLVR and RLIF settings, demonstrating that targeted entropy control at a vanishingly small subset of tokens is sufficient to sustain reasoning diversity and effective test-time scaling.

- Anthology ID: 2026.findings-acl.641
- Volume: Findings of the Association for Computational Linguistics: ACL 2026
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 13134–13154
- URL: https://aclanthology.org/2026.findings-acl.641/
- Cite (ACL): Jaeeun Jang, Hansle Lee, and Sangmin Kim. 2026. A Few Bad Apples Spoil the Bunch: Preventing Global Entropy Collapse Driven by a Small Set of Tokens in LLM Reasoning. In Findings of the Association for Computational Linguistics: ACL 2026, pages 13134–13154, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): A Few Bad Apples Spoil the Bunch: Preventing Global Entropy Collapse Driven by a Small Set of Tokens in LLM Reasoning (Jang et al., Findings 2026)
- PDF: https://aclanthology.org/2026.findings-acl.641.pdf