{"slug": "narrativeworldbench-a-frontier-saturated-benchmark-and-a-latent-world-model-for", "title": "NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama", "summary": "Researchers introduced NarrativeWorldBench, a benchmark for long-horizon co-creative audio drama, and N-VSSM, a latent world model that professional writers preferred over Claude Opus 4.5 on long-arc consistency 71% of the time in a within-subjects study. N-VSSM holds plot-beat F1 at or above 0.84 across all tested horizons up to 200 episodes at 4x lower compute than closed-frontier systems, and a learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points across four Indic languages.", "body_md": "arXiv:2606.17391v1 Announce Type: new\nAbstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and collapse by about -0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 >= 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. 
In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.", "url": "https://wpnews.pro/news/narrativeworldbench-a-frontier-saturated-benchmark-and-a-latent-world-model-for", "canonical_source": "https://arxiv.org/abs/2606.17391", "published_at": "2026-06-17 04:00:00+00:00", "updated_at": "2026-06-17 04:27:37.358760+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "ai-research"], "entities": ["NarrativeWorldBench", "N-VSSM", "Claude Opus 4.5", "Mamba-2"], "alternates": {"html": "https://wpnews.pro/news/narrativeworldbench-a-frontier-saturated-benchmark-and-a-latent-world-model-for", "markdown": "https://wpnews.pro/news/narrativeworldbench-a-frontier-saturated-benchmark-and-a-latent-world-model-for.md", "text": "https://wpnews.pro/news/narrativeworldbench-a-frontier-saturated-benchmark-and-a-latent-world-model-for.txt", "jsonld": "https://wpnews.pro/news/narrativeworldbench-a-frontier-saturated-benchmark-and-a-latent-world-model-for.jsonld"}}