{"slug": "aeric-anticipatory-hidden-state-monitoring-for-implicit-harmful-dialogue", "title": "AERIC: Anticipatory Hidden-State Monitoring for Implicit Harmful Dialogue", "summary": "Researchers introduced AERIC, a lightweight safety monitor that detects implicit harmful dialogue by reading a language model's internal hidden states during ordinary text generation, requiring only 387 trainable parameters. In tests against the Qwen3Guard-Stream-4B guard, AERIC improved AUROC on the DiaSafety and Harmful Advice benchmarks while increasing generation latency by only 2.34% compared to 79.40% for the competing system. The approach enables early detection of harmful intent without additional forward passes through the base model, addressing a key safety gap in current language model deployment.", "body_md": "arXiv:2605.23974v1 Announce Type: new\nAbstract: Current language models create two safety challenges: risk must be detected early enough to avoid exposing harmful continuation, and the harmfulness itself may be implicit rather than signaled by overtly toxic text. Existing response-level guards are strong at judging completed text, and native streaming guards move closer to token time, but both settings leave open whether a lightweight monitor can anticipate implicit harmful drift from the generator's own internal trajectory. We study anticipatory same-pass monitoring, where a safety monitor may read hidden states produced during ordinary decoding but may not invoke an additional forward pass through the base model. We introduce AERIC, a transfer-oriented hidden-state approach for implicit harmful dialogue that combines short-horizon hazard forecasting, support-sensitive suppression, and prompt-conditioned residual scoring under a same-pass exponential moving average decision rule. The default linear monitor contains only 387 trainable head parameters. 
Against Qwen3Guard-Stream-4B on balanced benchmarks, AERIC improves AUROC from 0.6830 to 0.7143 on DiaSafety and from 0.8219 to 0.8582 on Harmful Advice. For prompt-level trigger benchmarks, we calibrate the AERIC threshold by a source-side safe-budget rule that maximizes trigger coverage while constraining the safe-trigger rate to at most 10%. Under that rule, trigger@64 reaches 0.6438 and 0.4656 on HarmBench DirectRequest and 0.6849 and 0.7363 on SocialHarmBench for Qwen and Gemma, respectively, withholding between 23.53 and 41.86 answer tokens on average. Same-pass deployment is also efficient: on a 63-prompt harmful-prompt fixed-generation benchmark aggregated over HarmBench DirectRequest and SocialHarmBench under Qwen3-8B, the monitor increases mean latency by only 2.34%, whereas Qwen3Guard-Stream-4B increases it by 79.40%.", "url": "https://wpnews.pro/news/aeric-anticipatory-hidden-state-monitoring-for-implicit-harmful-dialogue", "canonical_source": "https://arxiv.org/abs/2605.23974", "published_at": "2026-05-26 04:00:00+00:00", "updated_at": "2026-05-26 04:15:01.452586+00:00", "lang": "en", "topics": ["ai-safety", "large-language-models", "natural-language-processing", "machine-learning", "ai-research"], "entities": ["AERIC", "Qwen3GuardStream-4B", "DiaSafety", "Harmful Advice", "HarmBench"], "alternates": {"html": "https://wpnews.pro/news/aeric-anticipatory-hidden-state-monitoring-for-implicit-harmful-dialogue", "markdown": "https://wpnews.pro/news/aeric-anticipatory-hidden-state-monitoring-for-implicit-harmful-dialogue.md", "text": "https://wpnews.pro/news/aeric-anticipatory-hidden-state-monitoring-for-implicit-harmful-dialogue.txt", "jsonld": "https://wpnews.pro/news/aeric-anticipatory-hidden-state-monitoring-for-implicit-harmful-dialogue.jsonld"}}