Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

A preprint posted on arXiv reports that in language models, the direction that best detects a behavior and the direction that best controls it can be geometrically distinct: for hallucination, the detection direction has a cosine similarity of only 0.12 with the direction that produces a refusal. This detection-intervention gap persists across four models from three families and two scales (1B-9B), challenging the assumption that being able to detect a behavior in a model's activations implies being able to control it.

arXiv:2606.24952v1. Abstract: A central aspiration of mechanistic interpretability is controllability: if we know where a behavior is represented in a model's activations, we should be able to modify it. This rests on a hidden premise: that the direction which detects a behavior and the direction which controls it are the same, or close. We test this geometrically: what is the angle between the direction that best detects a behavior and the one that best causes it? If detection implies control, the cosine is near 1; otherwise it quantifies a detection-intervention gap. On Gemma 2-2B-it, output format (clean JSON vs. markdown fencing) collapses both roles onto one axis. Hallucination does not: the model detects fake entities with perfect linear separability (AUC = 1.000 from layer 5), yet that direction sits at cos = 0.12 (about 83 degrees) from the direction producing a refusal. This is a small, reproducible alignment, far from the cos = 1 that "detection is control" would require. A detector built from activations, with no chosen tokens, likewise fails to align (cos = -0.06). The gap generalizes: across four models from three families and two scales (1B-9B), cos stays in [0.12, 0.20], identical before and after instruction tuning (0.1197 vs. 0.1200), placing its origin in pretraining. A 15-degree rotation toward the refusal direction partially bridges it: 73% and 60% refusal on two held-out fake-entity categories at 1.8% false positives. We then ask whether this cosine predicts steerability, and it does not: detection is a high-dimensional class, not a single direction, and what separates the steerable case is functional, not readable from a static angle. The cosine is a weight-computable signature of the dissociation between knowing and steering, not a predictor of it.
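The geometry in the abstract can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' code: the vectors `detect` and `refuse` are synthetic stand-ins built to have cos = 0.12, the helper names (`cosine`, `angle_degrees`, `rotate_toward`) are hypothetical, and it assumes that the "15-degree rotation" means rotating the detection direction toward the refusal direction within the plane the two span.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    # Clip guards against values like 1.0000000002 from rounding error.
    return float(np.degrees(np.arccos(np.clip(cosine(u, v), -1.0, 1.0))))


def rotate_toward(u, v, degrees):
    """Rotate the unit vector along u by `degrees` toward v.

    The rotation stays in the plane spanned by u and v. The result is a
    unit vector. Assumes u and v are not parallel.
    """
    u_hat = u / np.linalg.norm(u)
    # Gram-Schmidt: keep only the part of v that is orthogonal to u.
    w = v - np.dot(v, u_hat) * u_hat
    w_norm = np.linalg.norm(w)
    if w_norm < 1e-12:
        raise ValueError("u and v are parallel; the rotation plane is undefined")
    w_hat = w / w_norm
    theta = np.radians(degrees)
    return np.cos(theta) * u_hat + np.sin(theta) * w_hat


# Synthetic stand-ins (assumption): in practice these would be directions
# extracted from a model's activations, not hand-built vectors.
dim = 8
target_cos = 0.12

detect = np.zeros(dim)
detect[0] = 1.0

refuse = np.zeros(dim)
refuse[0] = target_cos
refuse[1] = np.sqrt(1.0 - target_cos**2)

rotated = rotate_toward(detect, refuse, 15.0)

print("cos(detect, refuse)    =", round(cosine(detect, refuse), 4))
print("angle(detect, refuse)  =", round(angle_degrees(detect, refuse), 2), "deg")
print("angle(rotated, refuse) =", round(angle_degrees(rotated, refuse), 2), "deg")
```

A cosine of 0.12 corresponds to an angle of roughly 83 degrees, so a 15-degree rotation still leaves the rotated vector far from the refusal direction. That fits the abstract's description of the rotation as only partially bridging the gap, though whether a given rotated direction actually steers a model is an empirical question this sketch does not address.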