{"slug": "the-readout-shortcut-positional-number-copying-dominates-arithmetic-cot-readout", "title": "The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models", "summary": "A new study of small language models reveals that chain-of-thought prompting for arithmetic relies on a positional shortcut: the model copies whichever number appears last before the answer delimiter, regardless of the logical reasoning steps. This copy channel accounts for 89-92% of each model's teacher-forcing accuracy ceiling on GSM8K, and replacing the trailing number with a wrong value collapses accuracy to near zero even when correct intermediate steps remain. The findings indicate that step-level faithfulness evaluations may conflate positional number transport with genuine computation, posing a failure mode for chain-of-thought-based oversight.", "body_md": "arXiv:2605.22870v1\nAbstract: Chain-of-thought (CoT) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance. What does CoT contribute if not logical sequencing? In three 1-3B instruction-tuned LMs on GSM8K, we isolate the answer-readout stage via prefix completion and identify a positional shortcut: the model copies whichever number occupies the trailing position before the answer delimiter, regardless of intermediate reasoning. Gold-answer presence accounts for 54-92 pp of accuracy (89-92% of each model's teacher-forcing ceiling); even on incorrect items, the final answer matches the last CoT number 95-96% of the time. The copy channel takes precedence over retained-context completion: replacing the trailing number with a wrong value collapses accuracy to near-zero despite correct intermediates, yet removing it recovers 5-32 pp above that floor--even single-step arithmetic the model can otherwise perform is suppressed when a copyable number is present. 
Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively. Head-level ablation implicates architecture-specific head sets; the effect replicates on GSM-Symbolic. On non-arithmetic BBH tasks, shuffle retention drops sharply; at 7-8B, content-selective gating emerges. Step-level faithfulness evaluations risk conflating positional answer transport with genuine computation--a failure mode for CoT-based oversight.", "url": "https://wpnews.pro/news/the-readout-shortcut-positional-number-copying-dominates-arithmetic-cot-readout", "canonical_source": "https://arxiv.org/abs/2605.22870", "published_at": "2026-05-25 04:00:00+00:00", "updated_at": "2026-05-25 15:13:10.213717+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "ai-research", "machine-learning"], "entities": ["GSM8K", "Qwen", "Llama", "Gemma", "BBH", "GSM-Symbolic"], "alternates": {"html": "https://wpnews.pro/news/the-readout-shortcut-positional-number-copying-dominates-arithmetic-cot-readout", "markdown": "https://wpnews.pro/news/the-readout-shortcut-positional-number-copying-dominates-arithmetic-cot-readout.md", "text": "https://wpnews.pro/news/the-readout-shortcut-positional-number-copying-dominates-arithmetic-cot-readout.txt", "jsonld": "https://wpnews.pro/news/the-readout-shortcut-positional-number-copying-dominates-arithmetic-cot-readout.jsonld"}}