Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

wpnews.pro




A preprint posted to arXiv reports that, in language models, the direction that best detects a behavior and the direction that best controls it can be geometrically distinct: the cosine similarity between the hallucination-detection direction and the refusal-steering direction stays between 0.12 and 0.20. This detection-intervention gap persists across four models from three families at scales from 1B to 9B parameters, challenging the assumption that locating a behavior in a model's activations is enough to control it.

Published Jun 25, 2026 · 1 min read

arXiv:2606.24952v1 (announce type: new)

Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise -- that the direction which detects a behavior and the direction which controls it are the same, or close. We test this geometrically: what is the angle between the direction that best detects a behavior and the one that best causes it? If detection implies control the cosine is near 1; otherwise it quantifies a detection-intervention gap.

On Gemma 2-2B-it, output format (clean JSON vs markdown fencing) collapses both roles onto one axis. Hallucination does not: the model detects fake entities with perfect linear separability (AUC = 1.000 from layer 5), yet that direction sits at cos = 0.12 (about 83 degrees) from the direction producing a refusal -- a small, reproducible alignment, far from the cos = 1 that "detection is control" would require. A detector built from activations, with no chosen tokens, likewise fails to align (cos = -0.06).

The gap generalizes: across four models from three families and two scales (1B-9B), cos stays in [0.12, 0.20], identical before and after instruction tuning (0.1197 vs 0.1200), placing its origin in pretraining. A 15-degree rotation toward the refusal direction partially bridges it -- 73% and 60% refusal on two held-out fake-entity categories at 1.8% false positives.

We then ask whether this cosine predicts steerability, and it does not: detection is a high-dimensional class, not a single direction, and what separates the steerable case is functional, not readable from a static angle. The cosine is a weight-computable signature of the dissociation between knowing and steering, not a predictor of it.
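The geometry described in the abstract can be illustrated with a few lines of NumPy. The sketch below is not the paper's code: it builds two synthetic unit vectors with a chosen cosine (0.12, the figure quoted for Gemma 2-2B-it), converts that cosine to an angle, and rotates the "detection" vector 15 degrees toward the "refusal" vector. It assumes the rotation is a planar one inside the plane spanned by the two directions, which the abstract does not spell out. The function names (`make_pair`, `rotate_toward`), the dimension of 64, and the random seed are all hypothetical choices for illustration; no model activations are involved.

```python
import numpy as np


def unit(v):
    """Return v scaled to unit length."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return v / norm


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(unit(u), unit(v)))


def angle_deg(u, v):
    """Angle between two vectors, in degrees."""
    c = np.clip(cosine(u, v), -1.0, 1.0)  # guard against rounding drift
    return float(np.degrees(np.arccos(c)))


def make_pair(cos_target, dim=64, seed=0):
    """Build two synthetic unit vectors whose cosine equals cos_target.

    These stand in for a detection direction and a steering direction;
    they are random vectors, not anything extracted from a model.
    """
    rng = np.random.default_rng(seed)
    detect = unit(rng.normal(size=dim))
    noise = rng.normal(size=dim)
    # Remove the component along `detect` to get an orthogonal unit vector.
    orth = unit(noise - np.dot(noise, detect) * detect)
    steer = cos_target * detect + np.sqrt(1.0 - cos_target**2) * orth
    return detect, steer


def rotate_toward(u, v, degrees):
    """Rotate unit(u) by `degrees` toward v inside the plane spanned by u and v.

    Assumption: a planar rotation. Angles larger than the gap between
    u and v will overshoot v rather than stop at it.
    """
    a = unit(u)
    residual = unit(v) - np.dot(unit(v), a) * a
    if np.linalg.norm(residual) < 1e-12:
        raise ValueError("u and v are parallel; the rotation plane is undefined")
    b = unit(residual)
    theta = np.radians(degrees)
    return np.cos(theta) * a + np.sin(theta) * b


if __name__ == "__main__":
    detect, steer = make_pair(0.12)
    rotated = rotate_toward(detect, steer, 15.0)
    print(f"cosine before rotation: {cosine(detect, steer):.4f}")
    print(f"angle before rotation:  {angle_deg(detect, steer):.2f} degrees")
    print(f"angle after rotation:   {angle_deg(rotated, steer):.2f} degrees")
```

Under these assumptions, a cosine of 0.12 corresponds to roughly 83 degrees, matching the abstract, and a 15-degree planar rotation leaves the rotated vector about 68 degrees from the target. That is still far from aligned, which is consistent with the abstract describing the rotation as only partially bridging the gap.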

source & further reading


Read original on arxiv.org → arxiv.org/abs/2606.24952



