{"slug": "erased-but-exploitable-black-box-embedding-aware-prompting-against-unlearned-to", "title": "Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models", "summary": "Researchers have developed BEAP, a black-box adversarial prompting attack that exploits vulnerabilities in text-to-image diffusion models that have undergone machine unlearning. The attack uses a large language model to iteratively generate undetectable prompts that force the model to produce unlearned concepts, achieving over 60% higher attack success rates than prior methods. This work exposes a critical security gap in current unlearning approaches, as BEAP requires no access to model weights and evades existing safety filters.", "body_md": "arXiv:2605.26332v1 Announce Type: new\nAbstract: Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper.\nWe introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities.\nBEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts.\nUnlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images.\nExtensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.", "url": "https://wpnews.pro/news/erased-but-exploitable-black-box-embedding-aware-prompting-against-unlearned-to", "canonical_source": "https://arxiv.org/abs/2605.26332", "published_at": "2026-05-27 04:00:00+00:00", "updated_at": "2026-05-27 04:27:50.438711+00:00", "lang": "en", "topics": ["machine-learning", "generative-ai", "ai-safety", "ai-research", "large-language-models"], "entities": ["BEAP", "LLM"], "alternates": {"html": "https://wpnews.pro/news/erased-but-exploitable-black-box-embedding-aware-prompting-against-unlearned-to", "markdown": "https://wpnews.pro/news/erased-but-exploitable-black-box-embedding-aware-prompting-against-unlearned-to.md", "text": "https://wpnews.pro/news/erased-but-exploitable-black-box-embedding-aware-prompting-against-unlearned-to.txt", "jsonld": "https://wpnews.pro/news/erased-but-exploitable-black-box-embedding-aware-prompting-against-unlearned-to.jsonld"}}