04:00
2026-06-19
arxiv.org
large-language-models
Characterizing Narrative Content in Web-scale LLM Pretraining Data
Researchers at the University of Washington and Allen Institute for AI conducted the first fine-grained study of narrative features in Dolma, a 3-trillion-token open LLM pretraining corpus. They devel…