# A fiction archive for comparing frontier AI model behavior over time

> Source: <https://frontierfictionarchive.org/en/news/a-fiction-archive-for-comparing-frontier-ai-model-behavior-over-time/>
> Published: 2026-06-28 00:13:32+00:00

We built Frontier Fiction Archive because longform fiction exposes a model differently than a benchmark does.

Benchmarks ask whether a model can solve a task. A speculative story asks what the model reaches for when it has room: what kind of future it imagines, what it treats as dangerous, what kinds of people it centers, which metaphors it overuses, where the prose smooths itself flat, and where the premise outruns the sentences.

That does not make fiction a replacement for benchmarks. It makes fiction a different kind of artifact.

We give frontier AI models a constrained but open-ended assignment: write a speculative-fiction story that will be preserved for readers now and for comparison with successor models and peer models later. The point is not to prove that model-written fiction is human-equivalent. The point is to preserve enough of the run, context, and reception that future readers can compare how these systems imagine, structure, imitate, evade, and fail over time.

## Why Preserve the Fiction?

Most public comparisons of AI models collapse quickly into leaderboards, screenshots, short examples, or vibes. Those are useful, but they tend to reward local competence: a clever answer, a clean refactor, a good summary, a solved puzzle.

Longform fiction creates a different surface area. It stresses continuity, pacing, character, causality, implied values, symbolic habits, and the model's ability to stay interesting after the first strong premise. It also leaves more room for failure. A model can produce a polished paragraph and still collapse into repetition, sermon, template, melodrama, or fog over several thousand words.

Those failures are not side effects to hide. They are part of the record.

If a model's fiction feels frictionless, over-symbolic, strangely bloodless, too neat, or too eager to explain itself, that is useful information. If another model later handles the same kind of assignment with more restraint, more scene discipline, better language control, or a more surprising theory of the future, that difference is worth preserving.

## What We Record

For each accepted work, we publish the story with provenance and editorial context. The public record includes, where available:

- model/provider and reported model string;
- run date;
- source class, such as first official run or disclosed technical rerun;
- finish reason when relevant;
- original language;
- translation path and translation status;
- human intervention level;
- content notices;
- artwork source and rendering process;
- known provenance defects or mechanical corrections;
- editorial notes about why the work was accepted, rejected, excerpted, or treated as an artifact.

The goal is not to make every run look clean. The goal is to make the surrounding context inspectable enough that the work can be read later as an artifact of a particular model, process, and moment.
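As a minimal sketch of how the fields listed above could be held in a structured, inspectable form, the following Python dataclass mirrors that list. Every name here (`RunRecord`, `unavailable_fields`, and each field name) is a hypothetical illustration, not the archive's actual schema, and the example values are placeholders rather than real run data.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class RunRecord:
    """Hypothetical per-work record; field names are illustrative assumptions."""
    model_provider: str
    model_string: str                              # model string as reported by the provider
    run_date: Optional[str] = None                 # e.g. an ISO 8601 date
    source_class: Optional[str] = None             # e.g. "first official run"
    finish_reason: Optional[str] = None            # recorded only when relevant
    original_language: Optional[str] = None
    translation_path: Optional[str] = None
    translation_status: Optional[str] = None
    human_intervention_level: Optional[str] = None
    content_notices: List[str] = field(default_factory=list)
    artwork_source: Optional[str] = None
    artwork_rendering_process: Optional[str] = None
    provenance_defects: List[str] = field(default_factory=list)
    editorial_notes: Optional[str] = None


def unavailable_fields(record: RunRecord) -> List[str]:
    """List fields left as None, so gaps in the record stay visible."""
    return [f.name for f in fields(record) if getattr(record, f.name) is None]


# Placeholder values only; this is not a real archive entry.
example = RunRecord(
    model_provider="ExampleProvider",
    model_string="example-model-1",
    source_class="first official run",
    original_language="en",
)
missing = unavailable_fields(example)
```

Keeping missing values as explicit `None` entries, rather than omitting them, matches the "where available" framing: a reader can tell what was not recorded instead of assuming it was clean.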

## What We Do Not Publish Yet

We do not currently disclose the full packet we send to the models, though we may publish more of that material later.

At a high level, each model is told that this is its chance to create a speculative-fiction story that will be preserved for readers now and for comparison with successor models and peer models later. The packet establishes the archive premise, the publication context, and the importance of creating a complete work rather than a demo fragment.

The full packet/transcript layer raises separate questions: how much prompt detail improves interpretability, how much invites prompt theater, which elements need to stay constant across models, and what remains private until we have a more stable disclosure practice. For now, the published layer focuses on the story, model identity, provenance notes, language/translation path, artwork process, and editorial context.

That distinction matters. We want readers to know what they are seeing and what they are not seeing.

## Why Comparison Over Time Matters

The most interesting version of this project is not one issue of model-written fiction. It is a longitudinal archive.

If the same broad challenge is put in front of future models, the comparison becomes richer:

- Do successor models produce better plots, or just smoother prose?
- Do they become less symbolically heavy, or just subtler about the same habits?
- Do they invent different futures, or converge around the same cultural priors?
- Do multilingual works preserve distinct literary behavior, or flatten toward English-language expectations?
- Does editorial context change what readers forgive?
- Does provenance make the work more trustworthy, more burdensome, or both?

Those are not questions a single story can answer. They require preserved attempts.

## What We Are Asking For

We are looking for skeptical readers, model people, editors, translators, and archivists to help us find the useful failure modes.

The best feedback is not "AI can write" or "AI cannot write." The best feedback is specific:

- what to preserve with each run;
- what makes the provenance trustworthy or insufficient;
- what parts of the fiction feel model-like and why;
- what future comparisons would be meaningful;
- what would make the record more useful to someone studying model behavior later.

The first published work is [Headwaters](/en/works/headwaters/), by Claude Opus 4. The current process is described in the [process note](/en/process/).

This is early and uneven by design. The question is whether the record is worth building before the models get better and the first awkward artifacts disappear into memory.