ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

wpnews.pro


Source: arxiv.org · Published: 2026-05-27T04:00Z · Topic: large-language-models


A new study reveals that language model reasoning trajectories during test-time sampling cluster into "reasoning basins," causing majority votes to favor stable but potentially incorrect answers. The researchers introduce ARBITER, a model-agnostic method that recovers accuracy by adding same-model evidence to the majority prior, improving performance across multiple model families and math benchmarks without external information.

1 min read · Published May 27, 2026

arXiv:2605.26172v1. Abstract: When language models use test-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or reasoning basins, each defined by a normalized final answer and the solutions that reach it. A majority vote therefore selects the most stable basin rather than the most accurate one, which creates wrong-majority failures where the correct answer is present but outvoted. We introduce ARBITER, a model-agnostic approach that models interactions between basins using only the base model's own sampled outputs, hidden states, and derived evidence. Most direct correction strategies fail; ARBITER instead uses conservative additive evidence on top of consensus. In its simplest parameter-free form, ARBITER-Δ adds same-model evidence to the majority prior, while ARBITER-Enc augments this with bounded residual signals from hidden states over complete solutions. On GSM8K with Qwen3-4B, consensus over K=24 samples reaches accuracy in the mid-94% range, while a same-pool top-2 oracle reaches the mid-96% range. ARBITER recovers a subset of the cases in this gap using zero external information. Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases; for example, on Llama-3.1-8B MMLU-HS-Math, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom, indicating that this headroom can be partially recovered from the sample pool itself.
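The basin and majority-vote vocabulary in the abstract can be made concrete with a small sketch. It is not the authors' code and does not implement ARBITER. It only illustrates, under stated assumptions, how sampled solutions might be grouped by normalized final answer, how a majority vote picks the largest basin, what a same-pool top-2 oracle checks, and how a "fraction of oracle headroom recovered" figure could be computed. Every function name, the normalization rule, and the sample data are hypothetical, and the headroom formula is one plausible reading of the abstract's phrase.

```python
from collections import defaultdict
from fractions import Fraction


def normalize_answer(raw: str) -> str:
    """Hypothetical normalizer: canonicalize numeric answers, lowercase the rest."""
    text = raw.strip().rstrip(".").replace(",", "").replace("$", "")
    try:
        return str(Fraction(text))  # "18.0" and "18" both become "18"
    except (ValueError, ZeroDivisionError):
        return text.lower()


def group_into_basins(samples):
    """Group (solution_text, raw_final_answer) pairs by normalized answer."""
    basins = defaultdict(list)
    for solution, raw_answer in samples:
        basins[normalize_answer(raw_answer)].append(solution)
    return dict(basins)


def ranked_answers(basins):
    """Answers sorted by basin size (largest first); ties broken alphabetically."""
    return sorted(basins, key=lambda ans: (-len(basins[ans]), ans))


def majority_vote(basins):
    """Consensus answer: the answer of the largest basin."""
    return ranked_answers(basins)[0]


def top2_oracle_correct(basins, gold):
    """True if the gold answer is one of the two largest basins."""
    return normalize_answer(gold) in ranked_answers(basins)[:2]


def is_wrong_majority(basins, gold):
    """True if the gold answer was sampled but lost the vote."""
    gold_norm = normalize_answer(gold)
    return gold_norm in basins and majority_vote(basins) != gold_norm


def headroom_recovered(consensus_acc, method_acc, oracle_acc):
    """Assumed definition: share of the consensus-to-oracle gap closed by a method."""
    return (method_acc - consensus_acc) / (oracle_acc - consensus_acc)


# Made-up pool of K=7 samples for one question whose gold answer is 20.
samples = [
    ("solution A", "18"), ("solution B", "18.0"), ("solution C", "$18"),
    ("solution D", "18"), ("solution E", "20"), ("solution F", "20."),
    ("solution G", "16"),
]
basins = group_into_basins(samples)
print({answer: len(members) for answer, members in basins.items()})
print("majority vote:", majority_vote(basins))
print("top-2 oracle correct:", top2_oracle_correct(basins, "20"))
print("wrong-majority failure:", is_wrong_majority(basins, "20"))
```

In this made-up pool the largest basin holds the answer 18, so the vote is wrong even though the correct answer 20 sits in the second-largest basin. That is the kind of case a top-2 oracle counts as recoverable, and the kind the abstract says ARBITER partially recovers.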

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2605.26172
