Weight-Space Geometry of Offline Reasoning Training

wpnews.pro

[ARTICLE · art-37227] src=arxiv.org ↗ pub=2026-06-24T04:00Z topic=machine-learning verified=true sentiment=· neutral

Researchers compared six offline training methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) for distilling reasoning from large language models into smaller ones. SFT, RFT, and RIFT produce nearly collinear weight updates and comparable GSM8K accuracy, while DPO reaches the highest accuracy (93.5% on GSM8K, 30.0% on AIME26) but occupies a near-orthogonal weight subspace separated by a mode-connectivity barrier. The methods differ mechanistically even though they train on identical data; DPO was trained with a 10x smaller learning rate, so its gap cannot be attributed to the loss function alone.

Published Jun 24, 2026 · 1 min read

arXiv:2606.23740v1

Abstract: Offline reinforcement-learning losses (RFT, RIFT, DFT, Offline GRPO, DPO) are widely used to distill reasoning from large teachers into smaller students, and are typically compared on downstream accuracy alone. We ask whether they are mechanistically distinct or converge to a similar weight update. Training six methods (SFT, RFT, DFT, RIFT, Offline GRPO, DPO) on identical math rollouts from a single base model (Qwen3-4B) with attention-only LoRA, we analyze the resulting deltas via cosine similarity, principal-angle subspace analysis, linear mode connectivity, and CKA.

We observe:

(i) SFT, RFT, and RIFT have nearly collinear weight deltas (cosine >= 0.97, top-1 principal angle ~7 deg median over 144 modules) and comparable GSM8K accuracy (87-88%, n=1319; pairwise McNemar p >= 0.15);
(ii) DFT diverges further in direction than any reward-weighted method despite using the same data;
(iii) Offline GRPO adds a substantial component orthogonal to the SFT direction (~67% globally, up to ~86% in late layers) while staying in the SFT loss basin;
(iv) DPO sits in a near-orthogonal subspace, shows a mode-connectivity barrier, and collapses late-layer CKA to ~0.46.

DPO also reaches the highest accuracy in our protocol on both GSM8K (93.5%, McNemar p < 10^-9 vs. each other method) and AIME26 (30.0% vs. 3.3-10.0%). Its training uses a 10x smaller learning rate than the others (the standard convention), so the update-norm and accuracy gaps reflect loss-function and optimizer choices jointly, and a learning-rate-matched DPO comparison is left for future work.
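Three of the quantities quoted in the abstract (cosine similarity between weight deltas, the component of one delta orthogonal to another, and a McNemar p-value on paired accuracy) can be illustrated with a minimal pure-Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the toy vectors are invented rather than taken from the paper, the orthogonal share is defined as a ratio of norms (the abstract does not say whether norms or squared norms are used), and the exact binomial form of McNemar's test is used because the abstract does not state which variant the authors ran. In practice each vector would presumably be a flattened per-module LoRA delta.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two flattened weight deltas."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def orthogonal_fraction(delta, ref):
    """Share of delta's norm that lies outside the direction of ref.

    Assumption: measured as ||residual|| / ||delta||; the paper may
    define its percentage differently (e.g. on squared norms).
    """
    scale = dot(delta, ref) / dot(ref, ref)
    residual = [d - scale * r for d, r in zip(delta, ref)]
    return math.sqrt(dot(residual, residual) / dot(delta, delta))


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: problems only method A solved; c: problems only method B solved.
    Problems both solved or both failed do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy deltas (hypothetical values, not results from the paper).
sft = [1.0, 0.0, 0.0]
rft = [0.98, 0.1, 0.0]
dpo = [0.05, 0.6, 0.8]

print("cos(SFT, RFT):", round(cosine(sft, rft), 3))
print("cos(SFT, DPO):", round(cosine(sft, dpo), 3))
print("DPO share orthogonal to SFT:", round(orthogonal_fraction(dpo, sft), 3))
print("McNemar p for b=0, c=10:", mcnemar_exact(0, 10))
```

Under the norm-ratio definition assumed here, the orthogonal share equals sqrt(1 - cos^2), so a high cosine and a small orthogonal share are two views of the same measurement. The McNemar helper depends only on the problems where the two methods disagree, which is why two methods with similar overall accuracy can still yield a large p-value.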

source & further reading

arxiv.org — original article

Read original on arxiv.org → arxiv.org/abs/2606.23740
