A 13 KB text file beat a smarter model: benchmarking AI codegen across 5 Angular state libraries





A developer benchmarked AI code generation across five Angular state management libraries and found that a 13 KB text file boosted the worst-performing library's score from 49% to 91%, matching established libraries with years of training data. The test revealed that smaller, targeted context files improved codegen accuracy more effectively than larger documentation dumps, and that AI failures often reflected inconsistent API naming rather than model limitations.

Published May 30, 2026 · 4 min read

Disclosure up front: I maintain one of the five libraries tested (SignalTree), and it's the one that scored worst in the cold run — so this isn't a "look how good my thing is" post. The cross-library pattern and the fix were interesting enough that I wanted to put the numbers in front of people who use Copilot/Cursor/Claude Code every day. The whole harness is reproducible (one command, link at the bottom); I'd rather it get torn apart than taken on faith.

What this measures: one-shot generation. The agent gets the prompt, returns a file, we score it. Real interactive use — Cursor/Copilot with chat back-and-forth, where the model sees its own errors and gets a second try — is a different setting, and the lift could be larger or smaller there. This is the cold-shot case.

No context provided, just "write this in library X":

Library	Cold score
Akita	94%
Elf	94%
NgRx (classic)	91%
NgRx SignalStore	86%
SignalTree	49%

The libraries that have been around for years, with thousands of blog posts and Stack Overflow answers, score in the 90s. The youngest/smallest library in the set scores ~49%. That gap isn't really a quality signal — it's a corpus signal. The models have simply seen orders of magnitude more Akita than SignalTree. Worth keeping in mind any time you judge a library by how well your AI assistant writes it cold: you're partly measuring its age, not its design.

I shipped a ~13.5 KB llms.txt (a plain-text API summary) inside the npm package and re-ran with it in context:

Mode	SignalTree score
Cold	49%
llms.txt (13.5 KB)	91%
llms.txt + extra notes (~25 KB)	87%

+42 percentage points from one small file — enough to pull the least-known library up into the range of the well-established ones. Two things I didn't expect:

Library	Cold score	Score with SignalTree's context loaded
SignalTree	49%	91%
NgRx (classic)	91%	88%
NgRx SignalStore	86%	80%
Akita	94%	85%
Elf	94%	87%
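The per-library change in that table is easy to compute directly. The sketch below only copies the scores from the table and subtracts; the field names `cold` and `primed` are labels chosen here for illustration, not names from the benchmark harness.

```typescript
// Scores (percent) copied from the table above.
interface ScoreRow {
  library: string;
  cold: number;   // no context provided
  primed: number; // SignalTree's llms.txt loaded into context
}

const rows: ScoreRow[] = [
  { library: "SignalTree", cold: 49, primed: 91 },
  { library: "NgRx (classic)", cold: 91, primed: 88 },
  { library: "NgRx SignalStore", cold: 86, primed: 80 },
  { library: "Akita", cold: 94, primed: 85 },
  { library: "Elf", cold: 94, primed: 87 },
];

// Change in percentage points when the context file is present.
const deltas = rows.map((row) => ({
  library: row.library,
  delta: row.primed - row.cold,
}));

for (const d of deltas) {
  const sign = d.delta > 0 ? "+" : "";
  console.log(d.library + ": " + sign + d.delta + " pp");
}
```

Only SignalTree gains (+42 pp); the other four libraries each lose between 3 and 9 points when SignalTree's context file is in the prompt.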

Practical takeaway: more context is not better. Past ~15 KB the numbers went down, not up. If you maintain or use a less-common library, a small retrievable context file does more for codegen accuracy than reaching for a "smarter" model — primed mid-tier models beat cold top-tier ones in my runs — but dumping your whole docs site in backfires.
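One way to act on that takeaway is to guard the size of whatever gets prepended to the prompt. This is a minimal sketch, not the harness's actual code: the prompt layout, the `buildPrompt` name, and the choice of 1 KB = 1024 bytes are all assumptions, and the 15 KB budget is simply the rough threshold reported above.

```typescript
// Assumption: 1 KB = 1024 bytes; the ~15 KB budget mirrors the threshold in the text.
const KB = 1024;
const MAX_CONTEXT_BYTES = 15 * KB;

// UTF-8 size of a string, counted without any runtime-specific APIs.
function utf8ByteLength(text: string): number {
  let bytes = 0;
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    if (code < 0x80) bytes += 1;
    else if (code < 0x800) bytes += 2;
    else if (code >= 0xd800 && code <= 0xdbff) {
      bytes += 4; // surrogate pair encodes one 4-byte code point
      i++;
    } else bytes += 3;
  }
  return bytes;
}

interface PrimedPrompt {
  prompt: string;
  contextBytes: number;
  overBudget: boolean;
}

// Hypothetical prompt layout: context file first, then the task.
function buildPrompt(task: string, context: string): PrimedPrompt {
  const contextBytes = utf8ByteLength(context);
  return {
    prompt: "Library reference:\n" + context + "\n\nTask:\n" + task,
    contextBytes,
    overBudget: contextBytes > MAX_CONTEXT_BYTES,
  };
}

const small = buildPrompt("Write a todo store in library X.", "store.add(item): void");
console.log(small.contextBytes, small.overBudget);
```

Flagging rather than throwing leaves the decision to the caller: a file slightly over budget may still be worth trying, but the flag makes the size visible before a run is paid for.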

The failures weren't random. Agents kept calling methods that didn't exist, and the pattern pointed straight at my own inconsistency — I'd named predicate accessors two different ways across the API:

// some markers used an is- prefix
saveStatus.is()
users.isEmpty()

// others used bare names
profile.dirty()
feed.()

An agent that learned is() would confidently try isDirty(), which never existed. That's not an AI failure; it's a human one wearing an AI costume. Any developer reading the docs hits the same wall; they just fail more quietly and blame themselves. I standardized on bare names (matching FormControl.dirty/.valid), kept the old names as deprecated aliases, shipped it.
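The "bare name plus deprecated alias" pattern can be sketched as follows. The `UserList` class and the `empty()` accessor are hypothetical names used for illustration; they are not taken from SignalTree's actual API, which may be shaped quite differently.

```typescript
// Hypothetical class: illustrates the aliasing pattern only.
class UserList {
  private items: string[] = [];

  add(name: string): void {
    this.items.push(name);
  }

  /** Preferred bare-name predicate, in the style of FormControl.dirty / .valid. */
  empty(): boolean {
    return this.items.length === 0;
  }

  /** @deprecated Use empty() instead. Kept so existing callers keep working. */
  isEmpty(): boolean {
    return this.empty();
  }
}

const users = new UserList();
console.log(users.empty(), users.isEmpty()); // both accessors report the same state
users.add("ada");
console.log(users.empty(), users.isEmpty());
```

Delegating the alias to the new method keeps a single source of truth, and the `@deprecated` JSDoc tag lets editors strike through the old name without breaking existing code.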

The generalizable takeaway, and the reason I think this is worth writing up rather than burying in a changelog: an API surface a model can't keep straight is usually one a human can't either. Codegen accuracy turns out to be a surprisingly good proxy for naming consistency, and a cheap one to measure.
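The same inconsistency can also be caught without running a model at all. Below is a rough sketch of a naming check over a list of predicate accessor names; the function name, the report shape, and the `is` + capital letter heuristic are assumptions made here, not part of the benchmark.

```typescript
// Heuristic: an "is-prefixed" predicate starts with "is" followed by a capital letter.
const IS_PREFIX = /^is[A-Z]/;

interface NamingReport {
  prefixed: string[];
  bare: string[];
  consistent: boolean;
}

// Splits accessor names by style; mixing both styles counts as inconsistent.
function checkPredicateNaming(names: string[]): NamingReport {
  const prefixed = names.filter((n) => IS_PREFIX.test(n));
  const bare = names.filter((n) => !IS_PREFIX.test(n));
  return {
    prefixed,
    bare,
    consistent: prefixed.length === 0 || bare.length === 0,
  };
}

// Two accessor names that appear in the article's own example.
const report = checkPredicateNaming(["isEmpty", "dirty"]);
console.log(report.consistent, report.prefixed, report.bare);
```

A check like this only sees the names it is given, so it complements a codegen benchmark rather than replacing it: the benchmark surfaces which inconsistencies actually trip up a reader.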

I'd rather list the holes than have them found, so here are the three I'd lead with:

And one that's less a flaw than a "yeah, obviously": cold score ≈ training-data volume is barely a finding — it's close to a truism once you say it out loud. The only mildly non-obvious part is how cheaply a retrievable file substitutes for years of corpus presence.

One OpenRouter key, ~$15, ~30 minutes:

git clone https://github.com/JBorgia/signaltree
cd signaltree
export OPENROUTER_API_KEY=sk-or-...
node scripts/ai-codegen-benchmark/runner.mjs

Prompts (YAML), scoring rubric, adapters, and per-cell results all live in scripts/ai-codegen-benchmark/. The prompts and rubric are the parts most worth disagreeing with: if you spot one that's unfair to a particular library, that's the most useful feedback I can get.

For those of you using Copilot / Cursor / Claude Code daily: when the generated code for a library is bad, what's actually fixed it for you — a custom rules file, pasted docs, an MCP server, something else? I'm especially curious whether the "ship a small context file" result holds outside my own setup, or whether interactive back-and-forth makes it moot.


Read original on dev.to → dev.to/jborgia/a-13-kb-text-file-beat-a-smarter-…
