PreUnlearn: Auditing Collateral Knowledge Damage Before Large Language Model Unlearning

Researchers found that unlearning knowledge from large language models causes collateral damage that decays with semantic distance but persists across domains. They developed a method to audit this damage before unlearning by analyzing interaction features between forget and evaluation sets, enabling early identification of risky unlearning runs.

arXiv:2606.18473v1. Abstract: Machine unlearning for large language models (LLMs) aims to remove specified knowledge while preserving the rest of the model's capabilities. However, the boundary between knowledge to forget and knowledge to retain is often unclear, since related and even distant information may be entangled in the model. In this paper, we study LLM unlearning from a data-centric perspective and measure how unlearning effects propagate from the forget set to same-domain and distant-domain knowledge. We find a consistent decay pattern: collateral damage is strongest near the forget set and weakens with semantic distance, but it does not disappear at domain boundaries. We further ask whether such damage can be audited before unlearning is executed. We formulate forget-set auditing as a pre-unlearning prediction task and analyze which data features are most predictive of downstream damage. Our results show that interaction features between the forget set and the evaluation set provide the strongest signals, suggesting that collateral damage is partly reflected in data geometry before any model updates occur. These findings position forget-set auditing as an early warning tool for identifying risky unlearning runs and designing more reliable unlearning procedures.
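
The abstract does not specify which interaction features are used, so the following is only an illustrative sketch of what "interaction features between the forget set and the evaluation set" could look like. It assumes each example has already been mapped to an embedding vector by some encoder (not shown), and it uses cosine similarity as a stand-in for semantic distance. The function names, the feature names, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def interaction_features(forget_vecs, eval_vecs):
    """Hypothetical forget/eval interaction features, computed from data alone.

    Nothing here touches the model being unlearned; the features depend only
    on the geometry of the two embedded datasets.
    """
    # For each evaluation example, similarity to its closest forget example.
    nearest = [max(cosine(e, f) for f in forget_vecs) for e in eval_vecs]
    return {
        "mean_nearest_sim": mean(nearest),
        "max_sim": max(nearest),
        "centroid_sim": cosine(centroid(forget_vecs), centroid(eval_vecs)),
    }


# Toy 2-D embeddings (assumed, for illustration only).
forget = [[1.0, 0.0], [0.8, 0.6]]
near_eval = [[0.6, 0.8], [1.0, 0.0]]   # semantically close to the forget set
far_eval = [[0.0, 1.0], [-1.0, 0.0]]   # semantically distant

near = interaction_features(forget, near_eval)
far = interaction_features(forget, far_eval)
print(near)
print(far)
```

In a pre-unlearning audit along these lines, features such as these would be computed for each candidate forget set and evaluation set pair and fed to a predictor of post-unlearning damage. The predictor itself, and the damage metric it targets, are outside what the abstract describes.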