Expert-Aware Refusal Steering

Researchers have extended refusal steering methods to three open-source Mixture-of-Experts (MoE) large language models, demonstrating that steering vectors can effectively suppress safety-aligned refusal behavior in these architectures. The team proposed two expert-aware refusal steering techniques that leverage refusal-specific routing patterns and expert-specific steering directions, and found that refusal behavior can be steered using the output of a single expert. The results indicate that the refusal signals captured by steering methods differ from expert routing behavior, suggesting that attention plays a substantial role in MoE refusal responses.

arXiv:2606.04160v1

Abstract: Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing responses to harmful requests. We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is uninhibited by the complex routing patterns inherent to the MoE architecture. We then propose two expert-aware refusal steering methods that leverage refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. We find that refusal behavior can be effectively steered based on the output of a single expert. Our results show that refusal signals captured by steering methods differ from expert routing behavior, suggesting a substantial role for attention in MoE refusal behavior.
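
To make the notion of a steering vector concrete, the following is a minimal sketch on synthetic data only. The abstract does not say how the steering direction is computed or applied, so the difference-in-means recipe and the projection step below are assumptions borrowed from common practice in the activation-steering literature, not the authors' confirmed method. All function names (`steering_direction`, `ablate`) and the planted direction are hypothetical, and no real model is involved.

```python
import math
import random

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def unit(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def steering_direction(refused_acts, complied_acts):
    """Unit difference-in-means direction (assumed recipe, not from the abstract)."""
    mu_r = mean_vector(refused_acts)
    mu_c = mean_vector(complied_acts)
    return unit([r - c for r, c in zip(mu_r, mu_c)])

def ablate(hidden, direction):
    """Remove the component of `hidden` along the unit vector `direction`."""
    coeff = dot(hidden, direction)
    return [h - coeff * d for h, d in zip(hidden, direction)]

# Synthetic stand-ins for hidden states: one group carries a planted offset.
rng = random.Random(0)
dim = 8
planted = unit([1.0] + [0.0] * (dim - 1))  # hypothetical planted direction
complied = [[rng.gauss(0, 0.1) for _ in range(dim)] for _ in range(200)]
refused = [[rng.gauss(0, 0.1) + 2.0 * p for p in planted] for _ in range(200)]

direction = steering_direction(refused, complied)
steered = ablate(refused[0], direction)  # no remaining component along `direction`
```

In this toy setting the recovered direction closely matches the planted one, and the ablated vector has essentially no component along it. An expert-aware variant in the spirit of the abstract would presumably compute and apply such a direction at the output of an individual expert rather than on the full hidden state, but the abstract gives no further detail on that step.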