# A Few Bad Apples Spoil the Bunch: Preventing Global Entropy Collapse Driven by a Small Set of Tokens in LLM Reasoning

> Source: <https://aclanthology.org/2026.findings-acl.641/>
> Published: 2026-06-22 00:00:00+00:00

##### Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) and Reinforcement Learning from Internal Feedback (RLIF) often fail to benefit from test-time compute due to entropy collapse and the resulting loss of reasoning diversity. We show that this collapse is driven not by uniform entropy decay, but by premature overconfidence at a small number of structurally critical decision points. Based on a token-level analysis of GRPO-style policy optimization, we propose SCOPE (Structural Collapse-aware Optimization via Partial Entropy control), which assigns each generated token a redistribution score and applies selective KL regularization to only the top ~5% of tokens under this score. Across model scales and architectures on math reasoning benchmarks, SCOPE consistently improves performance under both RLVR and RLIF settings, demonstrating that targeted entropy control at a vanishingly small subset of tokens is sufficient to sustain reasoning diversity and effective test-time scaling.

- Anthology ID: 2026.findings-acl.641
- Volume: [Findings of the Association for Computational Linguistics: ACL 2026](/volumes/2026.findings-acl/)
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: [Maria Liakata](/people/maria-liakata/), [Viviane P. Moreira](/people/viviane-p-moreira/unverified/), [Jiajun Zhang](/people/jiajun-zhang/unverified/), [David Jurgens](/people/david-jurgens/)
- Venue: [Findings](/venues/findings/)
- Publisher: Association for Computational Linguistics
- Pages: 13134–13154
- URL: [https://aclanthology.org/2026.findings-acl.641/](https://aclanthology.org/2026.findings-acl.641/)
- Cite (ACL): Jaeeun Jang, Hansle Lee, and Sangmin Kim. 2026. [A Few Bad Apples Spoil the Bunch: Preventing Global Entropy Collapse Driven by a Small Set of Tokens in LLM Reasoning](https://aclanthology.org/2026.findings-acl.641/). In *Findings of the Association for Computational Linguistics: ACL 2026*, pages 13134–13154, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): [A Few Bad Apples Spoil the Bunch: Preventing Global Entropy Collapse Driven by a Small Set of Tokens in LLM Reasoning](https://aclanthology.org/2026.findings-acl.641/) (Jang et al., Findings 2026)
- PDF: [https://aclanthology.org/2026.findings-acl.641.pdf](https://aclanthology.org/2026.findings-acl.641.pdf)
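
The abstract's core mechanism, a KL penalty applied only to the highest-scoring ~5% of tokens, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not define how the redistribution score is computed, so the sketch takes per-token scores as a given input. The function names, the choice of a reference distribution for the KL term, the `beta` coefficient, and the averaging over selected tokens are all assumptions made for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def top_fraction_mask(scores, fraction=0.05):
    """Return a 0/1 mask selecting the highest-scoring tokens (at least one)."""
    k = max(1, math.ceil(fraction * len(scores)))
    # Indices of the k largest scores; ties keep their original order.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    mask = [0] * len(scores)
    for i in top:
        mask[i] = 1
    return mask

def selective_kl_penalty(policy_dists, ref_dists, scores, fraction=0.05, beta=0.1):
    """Hypothetical penalty: beta times the mean KL over the selected tokens only.

    policy_dists / ref_dists: one next-token distribution per generated token.
    scores: one precomputed score per token (assumed given; the abstract
    does not specify how the redistribution score is calculated).
    """
    mask = top_fraction_mask(scores, fraction)
    total = sum(m * kl_divergence(p, q)
                for m, p, q in zip(mask, policy_dists, ref_dists))
    return beta * total / sum(mask)
```

In a training loop, a term like this would be added to the policy loss so that unselected tokens are left unregularized; how SCOPE actually combines it with the GRPO-style objective is described in the paper itself, not in this sketch.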