RepSelect: Robust LLM Unlearning via Representation Selectivity

Researchers propose RepSelect, a method for robust LLM unlearning that isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update. Across four model families and two forget categories, it achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest of five baselines.

arXiv:2606.17168v1. Abstract: Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. Current methods are easily reversed by fine-tuning or few-shot prompting, suggesting that their forgetting is only shallow. We identify the root cause: existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories (biohazardous knowledge and abusive tendencies) and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.
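The abstract does not spell out how the top principal components are collapsed, so the following is only a minimal sketch of one plausible reading: take the singular value decomposition of a layer's gradient matrix, zero its k largest singular values, and use the remainder as the update direction. The function name `collapse_top_components`, the parameter `k`, and the synthetic gradient are all hypothetical, chosen for illustration, and are not taken from the paper.

```python
import numpy as np


def collapse_top_components(grad, k):
    """Zero the k largest singular directions of a 2-D gradient matrix.

    Assumption: "collapsing top principal components" is read here as
    removing the leading singular directions of the gradient. The paper
    may define the operation differently.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s_collapsed = s.copy()
    s_collapsed[:k] = 0.0  # singular values are returned in descending order
    # Rescale the columns of u by the edited singular values and recombine.
    return (u * s_collapsed) @ vt


# Synthetic stand-in for a weight gradient: one dominant rank-1 direction
# plus a much smaller residual. Purely illustrative data.
rng = np.random.default_rng(0)
dominant = 10.0 * np.outer(rng.standard_normal(8), rng.standard_normal(6))
residual = 0.1 * rng.standard_normal((8, 6))
grad = dominant + residual

filtered = collapse_top_components(grad, k=1)

# Spectra before and after, for inspection.
before = np.linalg.svd(grad, compute_uv=False)
after = np.linalg.svd(filtered, compute_uv=False)
```

In this toy setup the dominant direction stands in for what the abstract describes as representations shared with the retain set, and the residual for the forget-set-specific part. In a training loop, a step like this would sit between computing the unlearning gradient and applying the optimizer update, with `k` as a tunable hyperparameter.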