{"slug": "perfect-detection-failed-control-the-geometry-of-knowing-vs-steering-in-language", "title": "Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models", "summary": "An arXiv preprint reports that in language models, the direction that best detects a behavior and the direction that best controls it are often geometrically distinct: on Gemma 2-2B-it, the hallucination-detection direction sits at a cosine similarity of only 0.12 from the refusal-producing direction. This detection-intervention gap persists across four models from three families at 1B-9B scale, challenging the assumption that locating a behavior in a model's activations implies being able to control it.", "body_md": "arXiv:2606.24952v1 Announce Type: new\nAbstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise -- that the direction which detects a behavior and the direction which controls it are the same, or close. We test this geometrically: what is the angle between the direction that best detects a behavior and the one that best causes it? If detection implies control the cosine is near 1; otherwise it quantifies a detection-intervention gap. On Gemma 2-2B-it, output format (clean JSON vs markdown fencing) collapses both roles onto one axis. Hallucination does not: the model detects fake entities with perfect linear separability (AUC = 1.000 from layer 5), yet that direction sits at cos = 0.12 (about 83 degrees) from the direction producing a refusal -- a small, reproducible alignment, far from the cos = 1 that \"detection is control\" would require. A detector built from activations, with no chosen tokens, likewise fails to align (cos = -0.06). The gap generalizes: across four models from three families and two scales (1B-9B), cos stays in [0.12, 0.20], identical before and after instruction tuning (0.1197 vs 0.1200), placing its origin in pretraining. 
A 15-degree rotation toward the refusal direction partially bridges it -- 73% and 60% refusal on two held-out fake-entity categories at 1.8% false positives. We then ask whether this cosine predicts steerability, and it does not: detection is a high-dimensional class, not a single direction, and what separates the steerable case is functional, not readable from a static angle. The cosine is a weight-computable signature of the dissociation between knowing and steering, not a predictor of it.", "url": "https://wpnews.pro/news/perfect-detection-failed-control-the-geometry-of-knowing-vs-steering-in-language", "canonical_source": "https://arxiv.org/abs/2606.24952", "published_at": "2026-06-25 04:00:00+00:00", "updated_at": "2026-06-25 04:14:52.851020+00:00", "lang": "en", "topics": ["large-language-models", "ai-safety", "ai-research"], "entities": ["arXiv", "Gemma 2-2B-it"], "alternates": {"html": "https://wpnews.pro/news/perfect-detection-failed-control-the-geometry-of-knowing-vs-steering-in-language", "markdown": "https://wpnews.pro/news/perfect-detection-failed-control-the-geometry-of-knowing-vs-steering-in-language.md", "text": "https://wpnews.pro/news/perfect-detection-failed-control-the-geometry-of-knowing-vs-steering-in-language.txt", "jsonld": "https://wpnews.pro/news/perfect-detection-failed-control-the-geometry-of-knowing-vs-steering-in-language.jsonld"}}