Disclosure up front: I maintain one of the five libraries tested (SignalTree), and it's the one that scored worst in the cold run, so this isn't a "look how good my thing is" post. The cross-library pattern and the fix were interesting enough that I wanted to put the numbers in front of people who use Copilot/Cursor/Claude Code every day. The whole harness is reproducible (one command, link at the bottom); I'd rather it get torn apart than taken on faith.
What this measures: one-shot generation. The agent gets the prompt, returns a file, we score it. Real interactive use (Cursor/Copilot with chat back-and-forth, where the model sees its own errors and gets a second try) is a different setting, and the lift could be larger or smaller there. This is the cold-shot case.
No context provided, just "write this in library X":
| Library | Cold score |
|---|---|
| Akita | 94% |
| Elf | 94% |
| NgRx (classic) | 91% |
| NgRx SignalStore | 86% |
| SignalTree | 49% |
The libraries that have been around for years, with thousands of blog posts and Stack Overflow answers, score in the 90s. The youngest/smallest library in the set scores ~49%. That gap isn't really a quality signal; it's a corpus signal. The models have simply seen orders of magnitude more Akita than SignalTree. Worth keeping in mind any time you judge a library by how well your AI assistant writes it cold: you're partly measuring its age, not its design.
I shipped a ~13.5 KB llms.txt (a plain-text API summary) inside the npm package and re-ran with it in context:
| Mode | SignalTree score |
|---|---|
| Cold | 49% |
| llms.txt (13.5 KB) | 91% |
| llms.txt + extra notes (~25 KB) | 87% |
+42 percentage points from one small file, enough to pull the least-known library up into the range of the well-established ones. Two things I didn't expect:
| Library | Cold | With SignalTree's context loaded |
|---|---|---|
| SignalTree | 49 | 91 |
| NgRx (classic) | 91 | 88 |
| NgRx SignalStore | 86 | 80 |
| Akita | 94 | 85 |
| Elf | 94 | 87 |
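
As a quick sanity check on the table above, the per-library change can be computed directly from the two columns. This is only arithmetic on the published numbers; the `Row` and `computeDeltas` names are mine and are not part of the benchmark harness.

```typescript
// Scores (percent) copied from the table above.
interface Row {
  library: string;
  cold: number;
  withContext: number;
}

const rows: Row[] = [
  { library: "SignalTree", cold: 49, withContext: 91 },
  { library: "NgRx (classic)", cold: 91, withContext: 88 },
  { library: "NgRx SignalStore", cold: 86, withContext: 80 },
  { library: "Akita", cold: 94, withContext: 85 },
  { library: "Elf", cold: 94, withContext: 87 },
];

interface Delta {
  library: string;
  delta: number; // percentage points; positive means the context helped
}

// Subtract the cold score from the with-context score for every row.
function computeDeltas(table: Row[]): Delta[] {
  return table.map((r) => ({ library: r.library, delta: r.withContext - r.cold }));
}

const deltas = computeDeltas(rows);
for (const d of deltas) {
  const sign = d.delta > 0 ? "+" : "";
  console.log(`${d.library}: ${sign}${d.delta} pts`);
}
```

SignalTree gains 42 points, while each of the other four libraries loses between 3 and 9 points in the same column.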
Practical takeaway: more context is not better. Past ~15 KB the numbers went down, not up. If you maintain or use a less-common library, a small retrievable context file does more for codegen accuracy than reaching for a "smarter" model (primed mid-tier models beat cold top-tier ones in my runs), but dumping your whole docs site in backfires.
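
If you want to keep a context file from creeping past that size, a small guard in CI is one option. The sketch below is a hypothetical helper, not something in the harness; the 15 KB default just mirrors the rough threshold from these runs and is not a general rule.

```typescript
// Hypothetical guard. The 15 KB default reflects the rough threshold seen in
// these runs only; tune it against your own measurements.
const DEFAULT_BUDGET_BYTES = 15 * 1024;

interface BudgetReport {
  bytes: number;
  budgetBytes: number;
  withinBudget: boolean;
}

function checkContextBudget(
  text: string,
  budgetBytes: number = DEFAULT_BUDGET_BYTES
): BudgetReport {
  // Measure the UTF-8 encoded size, since that is what ships in the package.
  const bytes = new TextEncoder().encode(text).length;
  return { bytes, budgetBytes, withinBudget: bytes <= budgetBytes };
}

// Stand-in content; in practice this would be the contents of llms.txt.
const smallFile = new Array(13824 + 1).join("a"); // 13.5 KB of ASCII
const largeFile = new Array(25600 + 1).join("a"); // 25 KB of ASCII

console.log(checkContextBudget(smallFile).withinBudget);
console.log(checkContextBudget(largeFile).withinBudget);
```

A byte count is a crude proxy for token count, so treat the budget as a tripwire that prompts a re-run of the benchmark rather than as a hard limit.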
The failures weren't random. Agents kept calling methods that didn't exist, and the pattern pointed straight at my own inconsistency: I'd named predicate accessors two different ways across the API:
// some markers used an is- prefix
saveStatus.is()
users.isEmpty()
// others used bare names
profile.dirty()
feed.()
An agent that learned is() would confidently try isDirty(), which never existed. That's not an AI failure; it's a human one wearing an AI costume. Any developer reading the docs hits the same wall; they just fail more quietly and blame themselves. I standardized on bare names (matching FormControl.dirty/.valid), kept the old names as deprecated aliases, and shipped it.
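
For anyone doing a similar rename, the "bare name plus deprecated alias" pattern can be sketched roughly as below. This is a hypothetical class for illustration, not SignalTree's actual implementation; `CollectionMarker` and the bare `empty()` name are assumptions.

```typescript
// Hypothetical marker type for illustration; not SignalTree's real API.
class CollectionMarker<T> {
  private items: T[] = [];

  add(item: T): void {
    this.items.push(item);
  }

  /** Canonical bare-name predicate (the `empty` name is assumed here). */
  empty(): boolean {
    return this.items.length === 0;
  }

  /**
   * @deprecated Use empty() instead. Kept so existing callers keep working.
   */
  isEmpty(): boolean {
    // Delegate so the alias can never drift from the canonical method.
    return this.empty();
  }
}

const users = new CollectionMarker<string>();
console.log(users.empty(), users.isEmpty());
users.add("ada");
console.log(users.empty(), users.isEmpty());
```

The `@deprecated` JSDoc tag makes editors strike through the old name, which nudges humans, and any tooling that reads the type declarations, toward the canonical one.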
The generalizable takeaway, and the reason I think this is worth writing up rather than burying in a changelog: an API surface a model can't keep straight is usually one a human can't either. Codegen accuracy turns out to be a surprisingly good proxy for naming consistency, and a cheap one to measure.
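
A cheaper first pass than a full codegen benchmark is to lint the predicate names themselves. The sketch below flags an API that mixes is-prefixed and bare predicate names; the function name and the sample name lists are hypothetical, and you would feed it your own list of boolean accessors.

```typescript
interface StyleReport {
  prefixed: string[]; // names like isEmpty
  bare: string[]; // names like dirty
  consistent: boolean; // true when only one style is in use
}

// Split predicate names by style and report whether the styles are mixed.
function checkPredicateNaming(names: string[]): StyleReport {
  const isPrefixed = (name: string): boolean => /^is[A-Z]/.test(name);
  const prefixed = names.filter(isPrefixed);
  const bare = names.filter((name) => !isPrefixed(name));
  return {
    prefixed,
    bare,
    consistent: prefixed.length === 0 || bare.length === 0,
  };
}

// Sample inputs (hypothetical lists, loosely based on the names in this post).
const mixed = checkPredicateNaming(["isEmpty", "dirty", "valid"]);
const uniform = checkPredicateNaming(["empty", "dirty", "valid"]);

console.log(mixed.consistent, mixed.prefixed, mixed.bare);
console.log(uniform.consistent);
```

This only catches one kind of inconsistency, so it complements the benchmark rather than replacing it.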
I'd rather list the holes than have them found, so here are the three I'd lead with:
And one that's less a flaw than a "yeah, obviously": that cold score tracks training-data volume is barely a finding; it's close to a truism once you say it out loud. The only mildly non-obvious part is how cheaply a retrievable file substitutes for years of corpus presence.
One OpenRouter key, ~$15, ~30 minutes:
git clone https://github.com/JBorgia/signaltree
cd signaltree
export OPENROUTER_API_KEY=sk-or-...
node scripts/ai-codegen-benchmark/runner.mjs
Prompts (YAML), scoring rubric, adapters, and per-cell results all live in scripts/ai-codegen-benchmark/. The prompts and rubric are the parts most worth disagreeing with: if you spot one that's unfair to a particular library, that's the most useful feedback I can get.
For those of you using Copilot / Cursor / Claude Code daily: when the generated code for a library is bad, what's actually fixed it for you: a custom rules file, pasted docs, an MCP server, something else? I'm especially curious whether the "ship a small context file" result holds outside my own setup, or whether interactive back-and-forth makes it moot.