Where Does Social Reasoning Come From? Capability Provenance in Language Models

Researchers at the Allen Institute for AI used training-data attribution to map which regions of the pretraining corpus support social reasoning versus STEM reasoning in the OLMo3-7B language model. They found that social and STEM reasoning draw on qualitatively distinct corpus regions, and that the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning of high-attribution topic bins degraded the aligned benchmark more than within-bin random baselines, which the authors present as partial causal validation.

Published Jun 19, 2026

arXiv:2606.19625v1

Abstract: We use training-data attribution as an interpretable tool for capability discovery, mapping which regions of the pretraining corpus support social-reasoning versus STEM-reasoning in OLMo3-7B. Training-data attribution measures how strongly each training document influences a model's predictions on a benchmark, but document-level scores are too noisy to identify which corpus regions support which capabilities, and prior work has emphasized factual knowledge rather than reasoning. We compute gradient-based attribution (TrackStar via Bergson) over a working set drawn from the de-duplicated Dolma3 mix, aggregate influence across WebOrganizer's 24-format × 24-topic taxonomy (576 bins), and contrast benchmark pairs in a 2×2 design that varies domain (social vs. STEM) and capability type (reasoning vs. knowledge): SocialIQA and MMLU Social Sciences against ARC-Challenge and MMLU STEM. Social and STEM reasoning draw on qualitatively distinct corpus regions, and the contrast is sharper at the reasoning level than at the knowledge level. Targeted machine unlearning provides partial causal validation: forgetting high-attribution topic bins (e.g., Literature for SocialIQA) degrades the aligned benchmark more than within-bin random baselines. We open-source all code, sampling manifests, the bin-level influence matrix, and unlearning checkpoints.
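To make the bin-level aggregation described in the abstract more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' code and does not call TrackStar or Bergson. The per-document scores are synthetic random numbers, and the choices of mean aggregation per bin, z-scoring each matrix, and taking a simple difference as the "contrast" are assumptions made for illustration. The topic indices `SOCIAL_TOPIC` and `STEM_TOPIC` are hypothetical placeholders, not real WebOrganizer labels.

```python
import numpy as np

# Taxonomy size stated in the abstract: 24 formats x 24 topics = 576 bins.
N_FORMATS, N_TOPICS = 24, 24

# Hypothetical topic indices used only to plant a signal in the synthetic data.
SOCIAL_TOPIC, STEM_TOPIC = 3, 7


def bin_influence(scores, format_ids, topic_ids):
    """Average per-document scores within each (format, topic) bin.

    Assumption: mean aggregation. The abstract does not say how
    document-level influence is pooled into bins.
    """
    sums = np.zeros((N_FORMATS, N_TOPICS))
    counts = np.zeros((N_FORMATS, N_TOPICS))
    # np.add.at accumulates correctly when the same bin index repeats.
    np.add.at(sums, (format_ids, topic_ids), scores)
    np.add.at(counts, (format_ids, topic_ids), 1)
    # Bins with no documents are left as NaN rather than 0.
    empty = np.full_like(sums, np.nan)
    return np.divide(sums, counts, out=empty, where=counts > 0)


def zscore(matrix):
    """Standardize a bin matrix, ignoring empty (NaN) bins."""
    return (matrix - np.nanmean(matrix)) / np.nanstd(matrix)


def contrast(bins_a, bins_b):
    """Assumed contrast: difference of z-scored bin matrices."""
    return zscore(bins_a) - zscore(bins_b)


def synthetic_topic_contrast(seed=0, n_docs=50_000):
    """Build noisy synthetic scores for two benchmarks and contrast them."""
    rng = np.random.default_rng(seed)
    format_ids = rng.integers(0, N_FORMATS, n_docs)
    topic_ids = rng.integers(0, N_TOPICS, n_docs)
    # Document-level scores are mostly noise, with a small planted boost
    # in one topic per benchmark.
    social = rng.normal(size=n_docs) + 0.5 * (topic_ids == SOCIAL_TOPIC)
    stem = rng.normal(size=n_docs) + 0.5 * (topic_ids == STEM_TOPIC)
    diff = contrast(
        bin_influence(social, format_ids, topic_ids),
        bin_influence(stem, format_ids, topic_ids),
    )
    # Collapse the format axis to get one contrast value per topic.
    return np.nanmean(diff, axis=0)


topic_contrast = synthetic_topic_contrast()
print("most social-leaning topic index:", int(np.argmax(topic_contrast)))
print("most STEM-leaning topic index:", int(np.argmin(topic_contrast)))
```

The sketch illustrates the motivation given in the abstract: individual document scores here are dominated by noise, but averaging within bins and contrasting two benchmarks makes the planted topic-level difference stand out.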

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.19625
