# What 'quality-tested' actually means for a library of 394 AI skills

> Source: <https://dev.to/urgrue/what-quality-tested-actually-means-for-a-library-of-394-ai-skills-g0d>
> Published: 2026-06-30 15:25:49+00:00

"Quality-tested" is the kind of phrase that usually means nothing. Every tool claims it. Most mean "we tried it once and it didn't crash." So when a library of 394 free Claude skills puts "all quality-tested, mean 4.38/5" on the tin, the fair response is: prove it.

Here's exactly what the claim means, including where it's soft.

## `stable` only if it clears two bars

Every skill carries a status. To reach `stable`, the only status the library promotes, it has to pass a seven-dimension evaluation.

The library mean across all stable skills is 4.38. The whole framework — dimensions, thresholds, the banned-phrase list — is in the repo, not a marketing page.
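To make the gating idea concrete, here is a minimal sketch of how a status gate and a library-wide mean could be computed. Everything specific in it is an assumption: the article does not name the two bars, so this sketch guesses at an overall-mean threshold plus a per-dimension floor, and the values `MIN_MEAN` and `MIN_DIMENSION` are placeholders rather than the library's real thresholds, which live in its repo.

```python
from statistics import mean

# ASSUMPTION: placeholder thresholds, not the library's published values.
MIN_MEAN = 4.0        # hypothetical bar 1: average across all dimensions
MIN_DIMENSION = 3.0   # hypothetical bar 2: no single dimension below this
N_DIMENSIONS = 7      # the article states a seven-dimension evaluation


def is_stable(scores: list[float]) -> bool:
    """Return True only if the scores clear both hypothetical bars."""
    if len(scores) != N_DIMENSIONS:
        raise ValueError(f"expected {N_DIMENSIONS} scores, got {len(scores)}")
    return mean(scores) >= MIN_MEAN and min(scores) >= MIN_DIMENSION


def library_mean(skills: dict[str, list[float]]) -> float:
    """Mean of per-skill means, counting stable skills only."""
    stable_means = [mean(s) for s in skills.values() if is_stable(s)]
    if not stable_means:
        raise ValueError("no stable skills to average")
    return round(mean(stable_means), 2)
```

The per-dimension floor matters in a design like this because a skill that scores 5 on six dimensions and 2 on the seventh would otherwise pass on its average alone.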

Code eval is binary: it runs or it doesn't. Prose has no green checkmark, so the library tests in two layers. First, **binary assertions** catch the mechanical failures — did it produce the required sections, did it refuse to fabricate a quote with no source. Across thousands of these the pass rate is high, and the few "failures" were skills *correctly* refusing to invent content on deliberately thin inputs — the behaviour you want. Second, the **graded rubric** above handles the judgment calls binary checks can't.
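The binary layer described above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's actual test harness: the section names in `REQUIRED_SECTIONS` are hypothetical, and "fabricated quote" is approximated here as any double-quoted span that does not appear verbatim in the source material.

```python
import re

# ASSUMPTION: hypothetical section names; each real skill defines its own.
REQUIRED_SECTIONS = ("Summary", "Key quotes", "Sources")


def has_required_sections(output: str, sections=REQUIRED_SECTIONS) -> bool:
    """Pass only if every required name appears as a markdown heading."""
    headings = {
        m.group(1).strip().lower()
        for m in re.finditer(r"^#+[ \t]+(.+)$", output, flags=re.MULTILINE)
    }
    return all(s.lower() in headings for s in sections)


def quotes_are_sourced(output: str, source_text: str) -> bool:
    """Pass only if every double-quoted span occurs verbatim in the source."""
    quotes = re.findall(r'"([^"]+)"', output)
    return all(q in source_text for q in quotes)


def run_assertions(output: str, source_text: str) -> dict[str, bool]:
    """Run each binary check and report pass/fail by name."""
    return {
        "required_sections": has_required_sections(output),
        "no_fabricated_quotes": quotes_are_sourced(output, source_text),
    }
```

Note that an output containing no quotes at all passes the quote check. That matches the point made above: on a deliberately thin input, declining to produce a quote is the correct behaviour, so the harness should not count it as a fabrication failure.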

The graded scoring uses a model as judge, and models are generous: they tend to like fluent text, including fluent AI text. So the scores are treated as a **filter, not a verdict**. Three things keep them honest:

"Quality-tested" is not a guarantee that every output is perfect. It's a documented, repeatable bar that's a lot higher than "we tried it once."

The bar is set this high because the audience is media professionals, and they detect generic writing instantly. A skill library for people who notice bad writing has to be testable on exactly that axis, or the whole premise collapses. The eval framework isn't a credential; it's the thing that makes "doesn't sound like AI" a claim you can check instead of a vibe.

Open the repo, open any skill, read its example, and judge for yourself. That's the test that matters.
