arXiv:2606.11270v1 Abstract: Distillation of a language model intended to transfer benign behavior to a student model may also transfer undesirable characteristics present in the teacher model, a phenomenon known as subliminal learning. While qualitative evidence supports the existence of this effect, its magnitude has not been systematically characterized. This study quantifies subliminal behavioral transfer ratios by steering two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths and distilling student models using only benign data. Evaluation on 100 JailbreakBench prompts, with GPT-4.1 serving as the evaluator, indicates that transfer is robust but exhibits distinct scaling behaviors. Llama-2 demonstrates a sharp threshold ($\tau \in \{0.25, 0.32\}$ beyond $\alpha = -0.15$), whereas Qwen2.5 displays continuous and higher levels of transfer ($\tau$ up to $0.61$).
Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation
A new study quantifies how much undesirable behavior transfers from teacher to student language models during distillation, a phenomenon known as subliminal learning. Researchers steered Llama-2-7B-Chat and Qwen2.5-7B-Instruct at varying strengths and distilled student models using only benign data, finding that transfer is robust but scales differently across the two models. Llama-2 showed a sharp transfer threshold, while Qwen2.5 displayed continuous and higher transfer, with a transfer ratio of up to 0.61 measured on 100 JailbreakBench prompts judged by GPT-4.1.
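To make the idea of a transfer ratio concrete, here is a minimal Python sketch. The abstract does not spell out how $\tau$ is computed, so this is an assumption: it treats $\tau$ as the student's shift in judged unsafe-response rate divided by the teacher's shift, both measured against an unsteered baseline. The function names `unsafe_rate` and `transfer_ratio` and the numbers in the example are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def unsafe_rate(labels: Sequence[bool]) -> float:
    """Fraction of prompts whose response the judge flagged as unsafe."""
    if not labels:
        raise ValueError("labels must be non-empty")
    return sum(labels) / len(labels)


def transfer_ratio(
    baseline: Sequence[bool],
    teacher: Sequence[bool],
    student: Sequence[bool],
) -> float:
    """Assumed definition (not confirmed by the source):
    student shift / teacher shift, each relative to the unsteered baseline."""
    base = unsafe_rate(baseline)
    teacher_shift = unsafe_rate(teacher) - base
    student_shift = unsafe_rate(student) - base
    if teacher_shift == 0:
        raise ValueError("teacher shows no behavioral shift; ratio is undefined")
    return student_shift / teacher_shift


if __name__ == "__main__":
    # Made-up judge labels over 100 prompts, for illustration only.
    baseline = [True] * 5 + [False] * 95    # unsteered model
    teacher = [True] * 55 + [False] * 45    # steered teacher
    student = [True] * 25 + [False] * 75    # student distilled on benign data
    print(f"tau = {transfer_ratio(baseline, teacher, student):.2f}")
```

Under this assumed definition, a ratio near 0 would mean the student inherited almost none of the teacher's steered behavior, and a ratio near 1 would mean it inherited nearly all of it.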