Characterizing Narrative Content in Web-scale LLM Pretraining Data

Source: arxiv.org · Published: 2026-06-19T04:00Z · Topic: large-language-models

Researchers at the University of Washington and the Allen Institute for AI conducted what they describe as the first fine-grained study of narrative features in Dolma, a 3-trillion-token open LLM pretraining corpus. They developed NarraBERT, a RoBERTa-based model fine-tuned on 400 annotated passages, and applied it to 3 million passages, finding that narrative qualities are unequally distributed across pretraining sources and topics. The study argues that current data curation practices neither measure nor account for these differences, and it provides a foundation for studying how data composition affects narrative reasoning tasks.

arXiv:2606.19468v1

Abstract: The narrative composition of web-scale LLM pretraining corpora remains largely unexplored even though narrative is a fundamental mode of human communication. We present the first fine-grained study of narrative features in Dolma, a 3-trillion-token open pretraining corpus. Drawing on narrative theory, we design a framework spanning three core narrative elements (agency, setting, and events) operationalized as 11 interpretable dimensions. After sampling and annotating a diverse set of 400 passages, we finetune and validate NarraBERT, a RoBERTa-based model for fine-grained narrative prediction. We apply NarraBERT to 3M passages, resulting in a new dataset, NarraDolma. We find that (i) narrative structure is measurable at scale across extremely heterogeneous data, (ii) a continuous, multidimensional narrative structure underlies web text, and (iii) narrative qualities are unequally distributed across pretraining sources and topics in ways that current curation practices neither measure nor account for. Our framework, dataset, and analyses provide a foundation for understanding how narrative qualities are distributed in LLM pretraining data and for studying how data composition affects narrative reasoning tasks. We publicly release NarraDolma and NarraBERT.
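To make the third finding concrete, here is a minimal sketch of how per-passage narrative scores could be averaged by source to compare narrative profiles. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not list the 11 dimensions, so the sketch stays at the level of the three named elements, and the source names, score range, and score values are all invented for the example rather than drawn from NarraDolma.

```python
from collections import defaultdict
from statistics import mean

# The three narrative elements named in the abstract. The paper's 11
# dimensions are not listed there, so this sketch stays at element level.
ELEMENTS = ("agency", "setting", "events")

# Hypothetical per-passage scores in [0, 1]. Source names and values are
# invented for illustration and are NOT taken from NarraDolma.
passages = [
    {"source": "source_a", "agency": 0.9, "setting": 0.7, "events": 0.8},
    {"source": "source_a", "agency": 0.7, "setting": 0.5, "events": 0.6},
    {"source": "source_b", "agency": 0.2, "setting": 0.1, "events": 0.3},
    {"source": "source_b", "agency": 0.4, "setting": 0.3, "events": 0.1},
]


def mean_scores_by_source(rows):
    """Average each element score over the passages of each source."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["source"]].append(row)
    return {
        source: {e: mean(r[e] for r in group) for e in ELEMENTS}
        for source, group in grouped.items()
    }


profiles = mean_scores_by_source(passages)
for source, profile in sorted(profiles.items()):
    print(source, {e: round(v, 2) for e, v in profile.items()})
```

In this toy data, `source_a` averages higher than `source_b` on every element, which is the kind of between-source gap the abstract reports at corpus scale.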

Read original on arxiv.org → arxiv.org/abs/2606.19468
