Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

Source: arxiv.org · Published: 2026-06-12T04:00Z · Topic: large-language-models

A new study quantifies subliminal learning: the transfer of undesirable behaviors from a teacher language model to a student during distillation. Researchers steered Llama-2-7B-Chat and Qwen2.5-7B-Instruct at varying strengths and distilled student models using only benign data, finding that transfer is robust but scales differently for each model. Llama-2 showed a sharp transfer threshold, while Qwen2.5 displayed continuous and higher transfer, with transfer ratios reaching up to 0.61 on 100 JailbreakBench prompts judged by GPT-4.1.

1 min read · published Jun 12, 2026

arXiv:2606.11270v1. Abstract: Distillation of a language model, intended to transfer benign behavior to a student model, may also transfer undesirable characteristics if they are present in the teacher model, a phenomenon known as subliminal learning. While qualitative evidence supports the existence of this effect, its magnitude has not been systematically characterized. This study quantifies subliminal behavioral transfer ratios by steering two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths and distilling student models using only benign data. Evaluation on 100 JailbreakBench prompts, with GPT-4.1 serving as the evaluator, indicates that transfer is robust but exhibits distinct scaling behaviors. Llama-2 demonstrates a sharp threshold ($\tau = \{0.25, 0.32\}$ beyond $\alpha = -0.15$), whereas Qwen2.5 displays continuous and higher levels of transfer ($\tau$ up to $0.61$).
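The abstract reports transfer ratios $\tau$ but does not define them. As a rough illustration only, the sketch below assumes $\tau$ is the student's judge-flagged rate divided by the steered teacher's rate on the same prompts. That definition, the function names, and the verdict lists are all hypothetical and are not taken from the paper.

```python
def flagged_rate(verdicts):
    """Fraction of prompts the judge marked as exhibiting the behavior."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(1 for v in verdicts if v) / len(verdicts)


def transfer_ratio(teacher_verdicts, student_verdicts):
    """Assumed definition: student rate divided by steered-teacher rate."""
    teacher_rate = flagged_rate(teacher_verdicts)
    if teacher_rate == 0:
        raise ValueError("teacher shows no behavior, so the ratio is undefined")
    return flagged_rate(student_verdicts) / teacher_rate


# Made-up verdicts for 100 prompts; these are NOT results from the study.
teacher = [True] * 80 + [False] * 20
student = [True] * 40 + [False] * 60

tau = transfer_ratio(teacher, student)
print(f"tau = {tau:.2f}")
```

Under this assumed definition, a ratio near 0 would mean the student picked up almost none of the teacher's steered behavior, and a ratio near 1 would mean it picked up nearly all of it. The paper may normalize differently, for example by subtracting an unsteered baseline.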

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.11270
