Localizing Anchoring Pathways in Language Models

Researchers have localized pathways in large language models that carry anchoring signals, the effect in which irrelevant numbers in a prompt skew numerical judgments. Using attribution-based circuit localization on 7B-8B Qwen and Llama base and instruction-tuned models, they found that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a single model, but sparse transfer between base and instruction-tuned variants is less reliable, which suggests that post-training changes which pathways matter most.

arXiv:2606.12818v1. Abstract: Irrelevant numbers in a prompt can shift language model judgments, producing anchoring effects in numerical reasoning. We study where this anchor-sensitive signal is carried inside language models using a controlled multiple-choice setup with shared answer options. We define a logit-difference metric comparing the correct answer option with the answer option corresponding to the anchor, and validate that it tracks behavioral anchoring. Using attribution-based circuit localization on 7B-8B Qwen and Llama base and instruction-tuned models, we find that edge-level methods recover this signal more faithfully than node-level methods. Low- and high-anchor circuits transfer strongly within a model, suggesting shared pathway structure across anchor direction. However, sparse transfer across base and instruction-tuned variants is less reliable, indicating that post-training changes which pathways matter most. Overall, our results provide a mechanistic account of how anchoring-related decision signals are carried inside language models.
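
To make the logit-difference metric concrete, here is a minimal Python sketch of one way it could be computed. The abstract only says the metric compares the correct answer option with the option corresponding to the anchor; the sign convention (correct minus anchor), the per-option logit list, and the function names `anchor_logit_diff` and `mean_anchor_logit_diff` are assumptions made for illustration, not the paper's implementation.

```python
from typing import Iterable, Sequence, Tuple


def anchor_logit_diff(option_logits: Sequence[float],
                      correct_idx: int,
                      anchor_idx: int) -> float:
    """Logit of the correct option minus logit of the anchor-matching option.

    Assumed convention: a positive value means the model prefers the correct
    option; a negative value means it prefers the anchor-matching option.
    """
    if correct_idx == anchor_idx:
        # The comparison is only meaningful when the two options differ.
        raise ValueError("correct and anchor options must be different")
    return option_logits[correct_idx] - option_logits[anchor_idx]


def mean_anchor_logit_diff(
    examples: Iterable[Tuple[Sequence[float], int, int]]
) -> float:
    """Average the per-example logit difference over a set of prompts."""
    diffs = [anchor_logit_diff(logits, c, a) for logits, c, a in examples]
    if not diffs:
        raise ValueError("no examples given")
    return sum(diffs) / len(diffs)


# Hypothetical logits for four shared answer options (A, B, C, D).
logits = [2.0, 0.5, -1.0, 0.0]
print(anchor_logit_diff(logits, correct_idx=0, anchor_idx=1))
```

Under this assumed convention, a drop in the averaged value when an anchor is added to the prompt would indicate a shift toward the anchor-matching option, which is the kind of behavioral tracking the abstract describes validating.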