
NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

Researchers introduced NarrativeWorldBench, a benchmark for long-horizon co-creative audio drama, and N-VSSM, a latent world model. In a writer study with 12 professional authors, N-VSSM was preferred over Claude Opus 4.5 on long-arc consistency 71% of the time. N-VSSM holds plot-beat F1 at or above 0.84 across all tested horizons, up to 200 episodes, at 4x lower compute than closed-frontier systems, and a learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points across four Indic languages.

Published Jun 17, 2026

arXiv:2606.17391v1

Abstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 >= 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.
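
The abstract reports plot-beat F1 at horizons h in {10, 20, 50, 100, 200} but does not say how beats are matched or aggregated. As a rough illustration only, the sketch below assumes each episode's beats are a set of string labels, scores each episode with set-overlap F1, and averages over the first h episodes. The names `beat_f1` and `f1_by_horizon`, the set representation, and the averaging rule are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence, Set

# Horizons listed in the abstract.
HORIZONS = (10, 20, 50, 100, 200)


def beat_f1(predicted: Set[str], reference: Set[str]) -> float:
    """Set-overlap F1 between predicted and reference beat labels (assumed metric)."""
    if not predicted and not reference:
        return 1.0  # nothing expected and nothing produced: count as a match
    true_pos = len(predicted & reference)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(predicted)
    recall = true_pos / len(reference)
    return 2 * precision * recall / (precision + recall)


def f1_by_horizon(
    predicted_eps: List[Set[str]],
    reference_eps: List[Set[str]],
    horizons: Sequence[int] = HORIZONS,
) -> Dict[int, float]:
    """Mean per-episode beat F1 over the first h episodes, for each horizon h."""
    n = min(len(predicted_eps), len(reference_eps))
    results: Dict[int, float] = {}
    for h in horizons:
        if h > n:
            continue  # not enough episodes to score this horizon
        scores = [beat_f1(p, r) for p, r in zip(predicted_eps[:h], reference_eps[:h])]
        results[h] = sum(scores) / h
    return results
```

Under these assumptions, a model that tracks the reference beats for ten episodes and then drifts would show a high score at h=10 and a lower one at h=20, which is the kind of horizon-dependent decline the abstract describes for closed-frontier systems.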
