# A 13 KB text file beat a smarter model: benchmarking AI codegen across 5 Angular state libraries

> Source: <https://dev.to/jborgia/a-13-kb-text-file-beat-a-smarter-model-benchmarking-ai-codegen-across-5-angular-state-libraries-3p36>
> Published: 2026-05-30 00:37:51+00:00

**Disclosure up front:** I maintain one of the five libraries tested (SignalTree), and it's the one that scored *worst* in the cold run — so this isn't a "look how good my thing is" post. The cross-library pattern and the fix were interesting enough that I wanted to put the numbers in front of people who use Copilot/Cursor/Claude Code every day. The whole harness is reproducible (one command, link at the bottom); I'd rather it get torn apart than taken on faith.

**What this measures: one-shot generation.** The agent gets the prompt, returns a file, we score it. Real interactive use — Cursor/Copilot with chat back-and-forth, where the model sees its own errors and gets a second try — is a different setting, and the lift could be larger or smaller there. This is the cold-shot case.

No context provided, just "write this in library X":

| Library | Cold score |
|---|---|
| Akita | 94% |
| Elf | 94% |
| NgRx (classic) | 91% |
| NgRx SignalStore | 86% |
| SignalTree | 49% |

The libraries that have been around for years, with thousands of blog posts and Stack Overflow answers, score in the 90s. The youngest/smallest library in the set scores ~49%. That gap isn't really a quality signal — it's a *corpus* signal. The models have simply seen orders of magnitude more Akita than SignalTree. Worth keeping in mind any time you judge a library by how well your AI assistant writes it cold: you're partly measuring its age, not its design.

I shipped a ~13.5 KB `llms.txt` (a plain-text API summary) inside the npm package and re-ran with it in context:

| Mode | SignalTree score |
|---|---|
| Cold | 49% |
| + `llms.txt` (13.5 KB) | 91% |
| + `llms.txt` + extra notes (~25 KB) | 87% |

+42 percentage points from one small file — enough to pull the least-known library up into the range of the well-established ones. Two things I didn't expect:

| Library | Cold | With SignalTree's context loaded |
|---|---|---|
| SignalTree | 49 | 91 |
| NgRx (classic) | 91 | 88 |
| NgRx SignalStore | 86 | 80 |
| Akita | 94 | 85 |
| Elf | 94 | 87 |
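To make the direction of each change easier to see, here is a minimal sketch that reduces the table above to per-library deltas. The numbers are copied from the table; the `Row` type and variable names are just illustrative, not part of the benchmark harness.

```typescript
// Scores (percent) copied from the table above.
interface Row {
  library: string;
  cold: number;
  withContext: number; // score with SignalTree's context loaded
}

const rows: Row[] = [
  { library: "SignalTree", cold: 49, withContext: 91 },
  { library: "NgRx (classic)", cold: 91, withContext: 88 },
  { library: "NgRx SignalStore", cold: 86, withContext: 80 },
  { library: "Akita", cold: 94, withContext: 85 },
  { library: "Elf", cold: 94, withContext: 87 },
];

// Positive delta: the context helped. Negative delta: it hurt.
const deltas = rows.map((r) => ({
  library: r.library,
  delta: r.withContext - r.cold,
}));
// SignalTree: +42, NgRx (classic): -3, NgRx SignalStore: -6, Akita: -9, Elf: -7

// Libraries whose score dropped once SignalTree's context was in the prompt.
const regressed = deltas.filter((d) => d.delta < 0).map((d) => d.library);
```

Only SignalTree gains; the other four libraries each lose between 3 and 9 points when SignalTree's context is loaded.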

**Practical takeaway: more context is not better.** Past ~15 KB the numbers went down, not up. If you maintain or use a less-common library, a small retrievable context file does more for codegen accuracy than reaching for a "smarter" model — primed mid-tier models beat cold top-tier ones in my runs — but dumping your whole docs site in backfires.

The failures weren't random. Agents kept calling methods that didn't exist, and the pattern pointed straight at my own inconsistency — I'd named predicate accessors two different ways across the API:

```
// some markers used an is- prefix
saveStatus.isLoading()
users.isEmpty()

// others used bare names
profile.dirty()
feed.loading()
```

An agent that learned `isLoading()` would confidently try `isDirty()`, which never existed. That's not an AI failure; it's a human one wearing an AI costume. Any developer reading the docs hits the same wall; they just fail more quietly and blame themselves. I standardized on bare names (matching `FormControl.dirty`/`.valid`), kept the old names as deprecated aliases, shipped it.
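For anyone doing the same kind of rename, this is roughly what "keep the old name as a deprecated alias" can look like. It's a minimal sketch, not SignalTree's actual implementation: `LoadingMarker` and `createLoadingMarker` are hypothetical names made up for the example.

```typescript
// Hypothetical marker shape, for illustration only.
interface LoadingMarker {
  /** Preferred bare name, consistent with FormControl.dirty / .valid. */
  loading(): boolean;
  /** @deprecated Use loading() instead. */
  isLoading(): boolean;
}

// Hypothetical factory: `read` stands in for however the marker gets its state.
function createLoadingMarker(read: () => boolean): LoadingMarker {
  const loading = (): boolean => read();
  return {
    loading,
    // The old name points at the same function, so existing callers keep
    // working while the @deprecated tag steers editors toward the new name.
    isLoading: loading,
  };
}

let busy = true;
const saveStatus = createLoadingMarker(() => busy);
```

Both `saveStatus.loading()` and `saveStatus.isLoading()` read the same state, so the alias can't drift out of sync with the new name.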

The generalizable takeaway, and the reason I think this is worth writing up rather than burying in a changelog: **an API surface a model can't keep straight is usually one a human can't either.** Codegen accuracy turns out to be a surprisingly good proxy for naming consistency, and a cheap one to measure.
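The same inconsistency can also be caught without running a model at all. Here's a rough sketch of a naming check over a list of predicate accessor names. `checkPredicateNaming` is a hypothetical helper (not part of the benchmark harness), and it assumes you already know which accessors are boolean predicates.

```typescript
interface StyleReport {
  prefixed: string[]; // names like isLoading
  bare: string[];     // names like loading
  mixed: boolean;     // true when both styles appear in the same API
}

// Splits predicate accessor names by style and flags a mixed API surface.
function checkPredicateNaming(names: string[]): StyleReport {
  const isPrefixed = (name: string): boolean => /^is[A-Z]/.test(name);
  const prefixed = names.filter(isPrefixed);
  const bare = names.filter((name) => !isPrefixed(name));
  return { prefixed, bare, mixed: prefixed.length > 0 && bare.length > 0 };
}

// The four accessors from the example earlier in this post.
const report = checkPredicateNaming(["isLoading", "isEmpty", "dirty", "loading"]);
// report.mixed is true: two is- prefixed names and two bare names.
```

A check like this only sees names, so it won't tell you which style to pick. It just tells you that you picked two.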

I'd rather list the holes than have them found, so here are the three I'd lead with:

And one that's less a flaw than a "yeah, obviously": **cold score ≈ training-data volume is barely a finding** — it's close to a truism once you say it out loud. The only mildly non-obvious part is *how cheaply* a retrievable file substitutes for years of corpus presence.

One OpenRouter key, ~$15, ~30 minutes:

```
git clone https://github.com/JBorgia/signaltree
export OPENROUTER_API_KEY=sk-or-...
node scripts/ai-codegen-benchmark/runner.mjs
```

Prompts (YAML), scoring rubric, adapters, and per-cell results all live in `scripts/ai-codegen-benchmark/`. The prompts and rubric are the parts most worth disagreeing with: if you spot one that's unfair to a particular library, that's the most useful feedback I can get.

For those of you using Copilot / Cursor / Claude Code daily: when the generated code for a library is bad, **what's actually fixed it for you** — a custom rules file, pasted docs, an MCP server, something else? I'm especially curious whether the "ship a small context file" result holds outside my own setup, or whether interactive back-and-forth makes it moot.