NarrativeWorldBench: A Frontier-Saturated Benchmark and a Latent World Model for Long-Horizon Co-Creative Audio Drama

wpnews.pro


[ARTICLE · art-30532] src=arxiv.org pub=2026-06-17T04:00Z topic=large-language-models


Researchers introduced NarrativeWorldBench, a benchmark for long-horizon co-creative audio drama, and N-VSSM, a latent world model. In a within-subjects writer study (12 professional authors, 240 trials), N-VSSM was preferred over Claude Opus 4.5 on long-arc consistency 71% of the time. N-VSSM holds plot-beat F1 at or above 0.84 across all evaluated horizons, up to 200 episodes, at 4x lower compute than closed-frontier systems, and a learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points in an evaluation covering four Indic languages.

Published Jun 17, 2026

arXiv:2606.17391v1 (announce type: new)

Abstract: Long-form serialized audio drama, with arcs that run for 200 to 800 episodes, is a major creative medium and a setting where frontier large language models (LLMs) fail. We benchmark 21 models, spanning classical, fine-tuned, open-frontier, closed-frontier, and reasoning tiers, on a uniform set of structural narrative metrics. All closed-frontier systems saturate at a plot-beat F1 in the band [0.78, 0.81] and drop by about 0.20 F1 at horizon h=200. We introduce NarrativeWorldBench, an open benchmark of nine narrative-structure metrics evaluated across horizons h in {10, 20, 50, 100, 200}, with cross-lingual evaluation across four Indic languages (Hindi, Tamil, Telugu, Marathi). We introduce N-VSSM, a Narrative Variational State-Space Model that maintains a structured 256-dimensional latent world state over more than 200 episodes via a Mamba-2 backbone with an event-conditioned posterior and an 8B decoder. N-VSSM holds plot-beat F1 >= 0.84 across all horizons at 4x lower compute than the closed-frontier band. A learned Cultural Transfer Function lifts cross-language fidelity by +0.20 to +0.23 Likert points. In a within-subjects writer study (n = 12 professional authors, 240 trials), N-VSSM is preferred over Claude Opus 4.5 on long-arc consistency 71% of the time and rated +1.3 Likert points higher on controllability.
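The abstract reports plot-beat F1 at several horizons but does not spell out how the metric is computed. The following is a minimal Python sketch of one plausible reading, assuming each episode is annotated with a set of plot-beat labels and that the score at horizon h pools the beats of the first h episodes. The function names `beat_f1` and `f1_by_horizon` and the set-based matching rule are illustrative assumptions, not the benchmark's actual definition.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Horizons listed in the abstract.
HORIZONS: Tuple[int, ...] = (10, 20, 50, 100, 200)


def beat_f1(predicted: Set, reference: Set) -> float:
    """Set-based F1 between predicted and reference plot beats (assumed matching rule)."""
    if not predicted and not reference:
        return 1.0  # nothing to find and nothing predicted
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def f1_by_horizon(
    predicted: List[Set[str]],
    reference: List[Set[str]],
    horizons: Sequence[int] = HORIZONS,
) -> Dict[int, float]:
    """Score the first h episodes for each horizon h that the data covers.

    Beats are keyed by (episode index, label), so a correct label placed in
    the wrong episode does not count as a match.
    """
    scores: Dict[int, float] = {}
    for h in horizons:
        if h > len(reference) or h > len(predicted):
            continue  # not enough episodes for this horizon
        pred = {(i, b) for i, beats in enumerate(predicted[:h]) for b in beats}
        ref = {(i, b) for i, beats in enumerate(reference[:h]) for b in beats}
        scores[h] = beat_f1(pred, ref)
    return scores
```

Under this reading, a model whose beats drift after episode 10 scores well at h=10 and lower at h=20, which is the kind of horizon-dependent decline the abstract describes for closed-frontier systems.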

source & further reading

Read original on arxiv.org → arxiv.org/abs/2606.17391
