{"slug": "expert-aware-refusal-steering", "title": "Expert-Aware Refusal Steering", "summary": "Researchers have extended refusal steering methods to Mixture-of-Experts (MoE) large language models, demonstrating that steering vectors can effectively suppress safety-aligned refusal behavior in these architectures. The team proposed two expert-aware refusal steering techniques that leverage routing patterns and expert-specific directions, finding that refusal behavior can be controlled based on a single expert's output. The results indicate that refusal signals captured by steering methods differ from expert routing behavior, suggesting attention mechanisms play a substantial role in MoE refusal responses.", "body_md": "arXiv:2606.04160v1 Announce Type: new\nAbstract: Safety alignment in instruction-tuned large language models (LLMs) depends on a model's ability to reliably refuse to respond to harmful or disallowed requests. Recent work has shown that a steering vector can be applied to a dense LLM during inference to effectively suppress refusal behavior, inducing responses to harmful requests. We extend this refusal steering method to three open-source Mixture-of-Experts (MoE) LLMs and find that steering performance is uninhibited by the complex routing patterns inherent to the MoE architecture. We then propose two expert-aware refusal steering methods that leverage refusal-specific expert routing patterns and expert-specific steering directions to suppress normal refusal behavior. We find that refusal behavior can be effectively steered based on the output of a single expert. Our results show that refusal signals captured by steering methods differ from expert routing behavior, suggesting a substantial role for attention in MoE refusal behavior.", "url": "https://wpnews.pro/news/expert-aware-refusal-steering", "canonical_source": "https://arxiv.org/abs/2606.04160", "published_at": "2026-06-04 04:00:00+00:00", "updated_at": "2026-06-04 04:21:27.152928+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "machine-learning", "artificial-intelligence", "neural-networks"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/expert-aware-refusal-steering", "markdown": "https://wpnews.pro/news/expert-aware-refusal-steering.md", "text": "https://wpnews.pro/news/expert-aware-refusal-steering.txt", "jsonld": "https://wpnews.pro/news/expert-aware-refusal-steering.jsonld"}}