{"slug": "what-quality-tested-actually-means-for-a-library-of-394-ai-skills", "title": "What 'quality-tested' actually means for a library of 394 AI skills", "summary": "A library of 394 free Claude skills claims 'all quality-tested, mean 4.38/5', backed by a seven-dimension evaluation framework. Skills must pass binary assertions and a graded rubric using a model-as-judge to reach 'stable' status. The framework is documented in the repo, not marketing, and aims to provide a repeatable quality bar for media professionals.", "body_md": "\"Quality-tested\" is the kind of phrase that usually means nothing. Every tool claims it. Most mean \"we tried it once and it didn't crash.\" So when a library of 394 free Claude skills puts \"all quality-tested, mean 4.38/5\" on the tin, the fair response is: prove it.\n\nHere's exactly what the claim means, including where it's soft.\n\n`stable` only if it clears two bars\n\nEvery skill carries a status. To reach `stable` — the only status the library promotes — it has to pass a seven-dimension evaluation:\n\nThe library mean across all stable skills is 4.38. The whole framework — dimensions, thresholds, the banned-phrase list — is in the repo, not a marketing page.\n\nCode eval is binary: it runs or it doesn't. Prose has no green checkmark, so the library tests in two layers. First, **binary assertions** catch the mechanical failures — did it produce the required sections, did it refuse to fabricate a quote with no source. Across thousands of these the pass rate is high, and the few \"failures\" were skills *correctly* refusing to invent content on deliberately thin inputs — the behaviour you want. Second, the **graded rubric** above handles the judgment calls binary checks can't.\n\nThe graded scoring uses a model as judge, and models are generous: they tend to like fluent text, including fluent AI text. So the scores are treated as a **filter, not a verdict**. 
Three things keep them honest:\n\nIt is not a guarantee every output is perfect. It's a documented, repeatable bar that's a lot higher than \"we tried it once.\"\n\nBecause the audience is media professionals, and they detect generic instantly. A skill library for people who notice bad writing has to be testable on exactly that axis, or the whole premise collapses. The eval framework isn't a credential — it's the thing that makes \"doesn't sound like AI\" a claim you can check instead of a vibe.\n\nOpen the repo, open any skill, read its example, and judge for yourself. That's the test that matters.", "url": "https://wpnews.pro/news/what-quality-tested-actually-means-for-a-library-of-394-ai-skills", "canonical_source": "https://dev.to/urgrue/what-quality-tested-actually-means-for-a-library-of-394-ai-skills-g0d", "published_at": "2026-06-30 15:25:49+00:00", "updated_at": "2026-06-30 15:48:53.732536+00:00", "lang": "en", "topics": ["large-language-models", "ai-tools", "ai-products", "developer-tools", "natural-language-processing"], "entities": ["Claude", "Anthropic"], "alternates": {"html": "https://wpnews.pro/news/what-quality-tested-actually-means-for-a-library-of-394-ai-skills", "markdown": "https://wpnews.pro/news/what-quality-tested-actually-means-for-a-library-of-394-ai-skills.md", "text": "https://wpnews.pro/news/what-quality-tested-actually-means-for-a-library-of-394-ai-skills.txt", "jsonld": "https://wpnews.pro/news/what-quality-tested-actually-means-for-a-library-of-394-ai-skills.jsonld"}}