[ARTICLE · art-18235] src=dev.to pub= topic=large-language-models verified=true sentiment=neutral

A 13 KB text file beat a smarter model: benchmarking AI codegen across 5 Angular state libraries

A developer benchmarked AI code generation across five Angular state management libraries and found that a 13 KB text file boosted the worst-performing library's score from 49% to 91%, matching established libraries with years of training data. The test revealed that smaller, targeted context files improved codegen accuracy more effectively than larger documentation dumps, and that AI failures often reflected inconsistent API naming rather than model limitations.

4 min read · published May 30, 2026

Disclosure up front: I maintain one of the five libraries tested (SignalTree), and it's the one that scored worst in the cold run, so this isn't a "look how good my thing is" post. The cross-library pattern and the fix were interesting enough that I wanted to put the numbers in front of people who use Copilot/Cursor/Claude Code every day. The whole harness is reproducible (one command, link at the bottom); I'd rather it get torn apart than taken on faith.

What this measures: one-shot generation. The agent gets the prompt, returns a file, we score it. Real interactive use (Cursor/Copilot with chat back-and-forth, where the model sees its own errors and gets a second try) is a different setting, and the lift could be larger or smaller there. This is the cold-shot case.
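To make "returns a file, we score it" concrete, here is a minimal sketch of a one-shot scorer. Everything in it is an assumption for illustration: the `RubricCheck` shape, the `scoreOneShot` function, and the two example checks are hypothetical, and the actual rubric lives in the repository's benchmark directory.

```typescript
// Hypothetical rubric entry: a label plus a predicate over the generated source.
interface RubricCheck {
  label: string;
  passes: (source: string) => boolean;
}

// Score one generated file as the percentage of rubric checks it passes.
// There is no retry loop: the model gets exactly one attempt.
function scoreOneShot(source: string, rubric: RubricCheck[]): number {
  if (rubric.length === 0) return 0;
  const passed = rubric.filter((check) => check.passes(source)).length;
  return Math.round((passed / rubric.length) * 100);
}

// Illustrative checks only, not the benchmark's real rubric.
const exampleRubric: RubricCheck[] = [
  { label: "uses the bare predicate name", passes: (s) => /profile\.dirty\(\)/.test(s) },
  { label: "avoids the nonexistent isDirty()", passes: (s) => !/isDirty\(\)/.test(s) },
];

const generated = "if (profile.dirty()) { save(); }";
const score = scoreOneShot(generated, exampleRubric);
```

An interactive setting would wrap this in a loop that feeds errors back to the model, which is exactly what this benchmark does not do.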

No context provided, just "write this in library X":

Library Cold score
Akita 94%
Elf 94%
NgRx (classic) 91%
NgRx SignalStore 86%
SignalTree 49%

The libraries that have been around for years, with thousands of blog posts and Stack Overflow answers, score in the 90s. The youngest/smallest library in the set scores 49%. That gap isn't really a quality signal; it's a corpus signal. The models have simply seen orders of magnitude more Akita than SignalTree. Worth keeping in mind any time you judge a library by how well your AI assistant writes it cold: you're partly measuring its age, not its design.

I shipped a ~13.5 KB llms.txt (a plain-text API summary) inside the npm package and re-ran with it in context:

Mode SignalTree score
Cold 49%
+ llms.txt (13.5 KB) 91%
+ llms.txt + extra notes (~25 KB) 87%

+42 percentage points from one small file, enough to pull the least-known library up into the range of the well-established ones. Two things I didn't expect:

Library Cold With SignalTree's context loaded
SignalTree 49% 91%
NgRx (classic) 91% 88%
NgRx SignalStore 86% 80%
Akita 94% 85%
Elf 94% 87%
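The table is easier to read as deltas. A short sketch that computes them from the numbers above (the variable names are mine, the scores are copied from the table):

```typescript
// Scores in percent, copied from the table above.
const scores: Array<{ library: string; cold: number; primed: number }> = [
  { library: "SignalTree", cold: 49, primed: 91 },
  { library: "NgRx (classic)", cold: 91, primed: 88 },
  { library: "NgRx SignalStore", cold: 86, primed: 80 },
  { library: "Akita", cold: 94, primed: 85 },
  { library: "Elf", cold: 94, primed: 87 },
];

// Change in percentage points once SignalTree's context is loaded.
const deltas = scores.map((row) => ({
  library: row.library,
  delta: row.primed - row.cold,
}));
// SignalTree: +42, NgRx (classic): -3, NgRx SignalStore: -6, Akita: -9, Elf: -7
```

SignalTree gains 42 points, while each of the other four libraries loses between 3 and 9 points with SignalTree's context in the prompt.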

Practical takeaway: more context is not better. Past ~15 KB the numbers went down, not up. If you maintain or use a less-common library, a small retrievable context file does more for codegen accuracy than reaching for a "smarter" model (primed mid-tier models beat cold top-tier ones in my runs), but dumping your whole docs site in backfires.
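If you want to hold a context file to that kind of size, a small guard is enough. This is a sketch under assumptions: the 15 KB budget is just the rough threshold from my runs, not a universal limit, and `utf8Bytes` and `fitsBudget` are hypothetical helper names.

```typescript
// Assumed budget, taken from the rough ~15 KB threshold observed in these runs.
const CONTEXT_BUDGET_BYTES = 15 * 1024;

// UTF-8 byte length of a string, assuming it contains no unpaired surrogates.
function utf8Bytes(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) {
      bytes += 1;
    } else if (code < 0x800) {
      bytes += 2;
    } else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // high surrogate: the pair encodes one 4-byte code point
      i++; // skip the low surrogate
    } else {
      bytes += 3;
    }
  }
  return bytes;
}

// True when the context text fits within the byte budget.
function fitsBudget(text: string, budget: number = CONTEXT_BUDGET_BYTES): boolean {
  return utf8Bytes(text) <= budget;
}
```

A check like this could run in CI against the shipped llms.txt so the file doesn't quietly grow past the point where it stops helping.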

The failures weren't random. Agents kept calling methods that didn't exist, and the pattern pointed straight at my own inconsistency: I'd named predicate accessors two different ways across the API:

// some markers used an is- prefix
saveStatus.is()
users.isEmpty()

// others used bare names
profile.dirty()
feed.()

An agent that learned is() would confidently try isDirty(), which never existed. That's not an AI failure; it's a human one wearing an AI costume. Any developer reading the docs hits the same wall; they just fail more quietly and blame themselves. I standardized on bare names (matching FormControl.dirty/.valid), kept the old names as deprecated aliases, shipped it.
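For readers who haven't done this kind of rename, here is a minimal sketch of the deprecated-alias pattern. The `CollectionMarker` class and the `empty()` name are hypothetical stand-ins, not SignalTree's actual implementation.

```typescript
// Hypothetical marker type used only to illustrate the pattern.
class CollectionMarker<T> {
  constructor(private readonly items: T[]) {}

  // Standardized bare predicate name.
  empty(): boolean {
    return this.items.length === 0;
  }

  /** @deprecated Use `empty()` instead. Kept so existing callers keep working. */
  isEmpty(): boolean {
    return this.empty();
  }
}

const users = new CollectionMarker<string>([]);
const viaNewName = users.empty();
const viaOldName = users.isEmpty();
```

The `@deprecated` JSDoc tag makes editors strike through the old name, so both humans and agents reading the types get nudged toward the bare form without a breaking change.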

The generalizable takeaway, and the reason I think this is worth writing up rather than burying in a changelog: an API surface a model can't keep straight is usually one a human can't either. Codegen accuracy turns out to be a surprisingly good proxy for naming consistency, and a cheap one to measure.
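You can also check for this kind of inconsistency directly, without a model in the loop. A rough sketch, assuming you can list your predicate accessor names yourself; `predicateStyles` and `hasMixedStyles` are hypothetical helpers, not part of the benchmark:

```typescript
// Split predicate-style accessor names by whether they use an is- prefix.
function predicateStyles(names: string[]): { prefixed: string[]; bare: string[] } {
  const prefixed: string[] = [];
  const bare: string[] = [];
  for (const name of names) {
    // "isEmpty" counts as prefixed; "island" does not (no capital after "is").
    if (/^is[A-Z]/.test(name)) {
      prefixed.push(name);
    } else {
      bare.push(name);
    }
  }
  return { prefixed, bare };
}

// True when the same API mixes both naming styles.
function hasMixedStyles(names: string[]): boolean {
  const styles = predicateStyles(names);
  return styles.prefixed.length > 0 && styles.bare.length > 0;
}

const mixed = hasMixedStyles(["isEmpty", "dirty"]);
const consistent = hasMixedStyles(["dirty", "valid"]);
```

This only catches the one pattern described here. The appeal of the codegen score is that it surfaces inconsistencies you didn't think to write a rule for.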

I'd rather list the holes than have them found, so here are the three I'd lead with:

And one that's less a flaw than a "yeah, obviously": cold score ≈ training-data volume is barely a finding; it's close to a truism once you say it out loud. The only mildly non-obvious part is how cheaply a retrievable file substitutes for years of corpus presence.

One OpenRouter key, ~$15, ~30 minutes:

git clone https://github.com/JBorgia/signaltree
cd signaltree
export OPENROUTER_API_KEY=sk-or-...
node scripts/ai-codegen-benchmark/runner.mjs

Prompts (YAML), scoring rubric, adapters, and per-cell results all live in scripts/ai-codegen-benchmark/. The prompts and rubric are the parts most worth disagreeing with. If you spot one that's unfair to a particular library, that's the most useful feedback I can get.

For those of you using Copilot / Cursor / Claude Code daily: when the generated code for a library is bad, what's actually fixed it for you: a custom rules file, pasted docs, an MCP server, something else? I'm especially curious whether the "ship a small context file" result holds outside my own setup, or whether interactive back-and-forth makes it moot.
