{"slug": "arbiter-reasoning-trajectory-basins-and-majority-vote-failures-in-test-time", "title": "ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling", "summary": "A new study reveals that language model reasoning trajectories during test-time sampling cluster into \"reasoning basins,\" causing majority votes to favor stable but potentially incorrect answers. The researchers introduce ARBITER, a model-agnostic method that recovers accuracy by adding same-model evidence to the majority prior, improving performance across multiple model families and math benchmarks without external information.", "body_md": "arXiv:2605.26172v1 Announce Type: new\nAbstract: When language models use test-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or reasoning basins, each defined by a normalized final answer and the solutions that reach it. A majority vote therefore selects the most stable basin rather than the most accurate one, which creates wrong-majority failures where the correct answer is present but outvoted. We introduce ARBITER, a model-agnostic approach that models interactions between basins using only the base model's own sampled outputs, hidden states, and derived evidence. Most direct correction strategies fail; ARBITER instead uses conservative additive evidence on top of consensus. In its simplest parameter-free form, ARBITER-Δ adds same-model evidence to the majority prior, while ARBITER-Enc augments this with bounded residual signals from hidden states over complete solutions. On GSM8K with Qwen3-4B, consensus over K=24 samples achieves around the mid-94% range, while a same-pool top-2 oracle reaches around the mid-96% range. ARBITER recovers a subset of these cases using zero external information. 
Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases; for example, on Llama-3.1-8B MMLU-HS-Math, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom, indicating that this headroom can be partially recovered from the sample pool itself.", "url": "https://wpnews.pro/news/arbiter-reasoning-trajectory-basins-and-majority-vote-failures-in-test-time", "canonical_source": "https://arxiv.org/abs/2605.26172", "published_at": "2026-05-27 04:00:00+00:00", "updated_at": "2026-05-27 04:28:57.479467+00:00", "lang": "en", "topics": ["large-language-models", "machine-learning", "artificial-intelligence", "ai-research"], "entities": ["ARBITER", "Qwen3-4B", "Llama-3.1-8B", "GSM8K", "MMLU-HS-Math"], "alternates": {"html": "https://wpnews.pro/news/arbiter-reasoning-trajectory-basins-and-majority-vote-failures-in-test-time", "markdown": "https://wpnews.pro/news/arbiter-reasoning-trajectory-basins-and-majority-vote-failures-in-test-time.md", "text": "https://wpnews.pro/news/arbiter-reasoning-trajectory-basins-and-majority-vote-failures-in-test-time.txt", "jsonld": "https://wpnews.pro/news/arbiter-reasoning-trajectory-basins-and-majority-vote-failures-in-test-time.jsonld"}}