[ARTICLE · art-32268] src=dev.to topic=developer-tools verified=true sentiment=positive

I built a Homebrew for AI skills: install flow and eval harness inside

A developer built SkillForge, an open-source tool that installs AI coding skills like Homebrew packages, using two pip commands. The tool generates stack-specific skills named after their target technologies, such as 'backend-fastapi-postgres', rather than generic persona-based skills. SkillForge runs locally, binds to 127.0.0.1, ships no telemetry, and supports six LLM provider families.

9 min read · 2 views · published Jun 18, 2026

Last quarter I spent an afternoon writing a SKILL.md for backend FastAPI work by hand. By the time I had a usable prompt set, three templates, and a config file, I realized nobody else on the team would ever do this. Engineering skills are either too rigid, a hand-written prompt for one specific stack, or too generic, a "full-stack" skill that tells you to "use best practices." The GPT Store proved the failure mode in public: when quality is not measurable, prompt wrappers win and nobody comes back.

SkillForge is my attempt at a fix. Install is two pip commands and a serve:

cd apps/api && pip install -e ".[dev]"
cd ../cli   && pip install -e .
skillforge serve --build-web    # → http://localhost:8000

The thing I kept hitting was the category axis. I wanted a skill for a specific stack, FastAPI plus Postgres plus Alembic plus pytest plus Docker, but every skill marketplace wanted me to pick a category first. Categories are the wrong axis. Skills should be named after the stack they target, the way Homebrew formulae are named after the binary they install. A skill called backend-fastapi-postgres is honest about what it does. A skill called "Senior Backend Engineer" is vibes.

The GPT Store angle was the other half. When you cannot measure skill quality, the people who optimize for thumbnails win. The people who optimize for output leave. The marketplace fills with wrappers and stays that way.

Flip the workflow. Describe the need in plain English. The planner picks the tools and explains each pick. The generator produces a focused skill. Skills are named for their stack, not for a persona: backend-fastapi-postgres, data-airflow-dbt-bigquery, devops-kubernetes-helm-terraform, ai-rag-langchain-pgvector, observability-opentelemetry-grafana, web-scraping-python-playwright.

Local-first means local-first. SkillForge binds to 127.0.0.1, ships no telemetry, and does not auto-execute generated scripts. The only outbound traffic is to the LLM provider you configure. Skills land on disk under ~/.skillforge/skills. The filesystem is the source of truth.

Three apps, one service layer. The CLI and the API import the same Python services, so skillforge plan on the command line and POST /api/chat/plan-skill from the browser run the exact same code path.

┌──────────────────────────────────────────────────────────────────┐
│  apps/web   (Next.js + TS + Tailwind)   :3000 / :8000 (bundled)  │
│  ChatPanel → ManifestEditor → SkillPreview → InstallButton       │
└────────────────────────────┬─────────────────────────────────────┘
                             │ HTTP
┌────────────────────────────▼─────────────────────────────────────┐
│  apps/api   (FastAPI)                       :8000                │
│  routers/  ──►  services/  ──►  repositories/                    │
└────────────────────────────┬─────────────────────────────────────┘
                             │ reuses the same service layer
┌────────────────────────────▼─────────────────────────────────────┐
│  apps/cli   (Typer + Rich)                                       │
│  serve | plan | generate | install | list | validate | remove    │
└──────────────────────────────────────────────────────────────────┘
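
To make the "one service layer" idea concrete, here is a minimal sketch in plain Python. It is not SkillForge's code: plan_skill, cli_plan, api_plan_skill, and the KNOWN_TOOLS table are hypothetical names, and the real adapters are built on Typer and FastAPI rather than bare functions. The point is only that both entry points delegate to one function.

```python
from dataclasses import dataclass, field


@dataclass
class ToolPick:
    name: str
    reason: str


@dataclass
class Plan:
    skill_name: str
    tools: list[ToolPick] = field(default_factory=list)


# Hypothetical stand-in for the tool catalog.
KNOWN_TOOLS = {
    "fastapi": "HTTP framework named in the request.",
    "postgresql": "Relational database named in the request.",
    "pytest": "Test runner named in the request.",
}


def plan_skill(description: str) -> Plan:
    """Service layer: the only place the planning logic lives."""
    text = description.lower()
    picks = [ToolPick(n, r) for n, r in KNOWN_TOOLS.items() if n in text]
    # The "backend" prefix is hardcoded purely for illustration.
    name = "-".join(["backend"] + [p.name for p in picks]) if picks else "unplanned"
    return Plan(name, picks)


def cli_plan(argv: list[str]) -> str:
    """CLI adapter: join the arguments, call the service, format for a terminal."""
    plan = plan_skill(" ".join(argv))
    return "\n".join([plan.skill_name] + [f"  {p.name}: {p.reason}" for p in plan.tools])


def api_plan_skill(body: dict) -> dict:
    """API adapter: read the JSON body, call the same service, return JSON."""
    plan = plan_skill(body["description"])
    return {
        "skill_name": plan.skill_name,
        "tools": [{"name": p.name, "reason": p.reason} for p in plan.tools],
    }
```

Because neither adapter contains planning logic, a fix in plan_skill reaches the terminal and the browser at the same time.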

Six provider families ship today, all swappable live from the Settings page with no restart: Mock (default, offline, deterministic), OpenAI-compatible (OpenAI, OpenRouter, Groq, Together, Mistral, DeepSeek, xAI, Fireworks, Z.ai), Ollama, Anthropic, Gemini, and Cohere. The Mock provider is the reason the project runs with zero configuration and the reason the test suite passes offline.

A first plan:

skillforge plan "I need a backend skill for FastAPI, PostgreSQL, Docker, and Pytest"

The CLI surface is intentionally small: serve, plan, generate, install, list, validate, remove. The generator is pure with respect to execution. It never calls subprocess. The scripts/ directory it writes is reference text on disk, runnable when you choose to run it.

A skill lands at ~/.skillforge/skills/<name>/ with this shape:

~/.skillforge/skills/backend-fastapi-postgres/
  SKILL.md
  README.md
  config.yaml
  prompts/
  templates/
  scripts/      # real FastAPI server, Alembic, pytest, Dockerfile, MCP server
  examples/

The scripts/ directory is the part that took the longest. Each artifact is a real, runnable file produced by a ToolArtifactRegistry at generation time, not a placeholder. The FastAPI skill ships a working dev_server.py. The data skill ships an Alembic migrate.sh. The devops skill ships a Helm Chart.yaml. The config.yaml is a manifest, not a config dump:

schema_version: "1.0"
skill:
  name: backend-fastapi-postgres
  domain: Backend Engineering
tools:
  - name: Python
    category: language
    reason: Primary language for this backend skill.
safety:
  auto_execute_scripts: false
  require_user_confirmation_before_install: true

Every generated skill carries the safety block. auto_execute_scripts is always false. The installer will not overwrite an existing skill without an explicit flag.
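
A validator could enforce that safety block with a check along these lines. This is a sketch under assumptions: check_safety is a hypothetical helper, the manifest is shown as an already-parsed dict so the example needs no YAML library, and treating require_user_confirmation_before_install as mandatory is my reading of the example manifest rather than a documented rule.

```python
def check_safety(manifest: dict) -> list[str]:
    """Return a list of problems; an empty list means the safety block passes."""
    safety = manifest.get("safety")
    if not isinstance(safety, dict):
        return ["missing safety block"]
    problems = []
    # The article states this flag is always false.
    if safety.get("auto_execute_scripts") is not False:
        problems.append("auto_execute_scripts must be false")
    # Assumption: confirmation is required, as in the example manifest.
    if safety.get("require_user_confirmation_before_install") is not True:
        problems.append("require_user_confirmation_before_install must be true")
    return problems


# The example manifest from above, as a parsed dict.
manifest = {
    "schema_version": "1.0",
    "skill": {"name": "backend-fastapi-postgres", "domain": "Backend Engineering"},
    "safety": {
        "auto_execute_scripts": False,
        "require_user_confirmation_before_install": True,
    },
}
```

Using "is not False" instead of a truthiness test means a missing key or a stray string fails the check too, not only an explicit true.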

SkillForge is not a new shape. It is the same shape as four package managers you already know.

SkillForge           | Homebrew        | npm                   | VS Code                    | Helm
config.yaml          | formula (.rb)   | package.json          | package.json (contributes) | Chart.yaml
SKILL.md             | install block   | main / bin            | extension.ts               | templates/
scripts/             | resource blocks | bin/                  | bundled commands           | n/a
~/.skillforge/skills | Cellar          | node_modules          | ~/.vscode/extensions       | release namespace
.skillpkg tarball    | bottle          | pack tarball          | .vsix                      | chart .tgz
marketplace bridge   | tap (git repo)  | scoped pkg + token    | Marketplace publisher      | OCI registry
skillforge validate  | brew audit      | npm publish --dry-run | vsce package               | helm lint

Homebrew taps federate anyone's git repo of formulae (https://docs.brew.sh/Formula-Cookbook). npm scoped packages use auth tokens for verified namespaces. Helm OCI registries host signed charts. The SkillForge marketplace bridge is the same model: anyone can host a marketplace, and the local app pairs with it via a 6-character code.

The marketplace is not a website. It is an HTTP contract. Anyone can run one. The local app does not care whose marketplace it is paired with. Today the repo ships a LocalStubAdapter that implements the full publish, search, download, install flow offline, so you can try the entire loop without a cloud backend.
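
The "contract, not website" idea can be sketched as a typing.Protocol with an in-memory stand-in. Everything here is hypothetical: MarketplaceAdapter, InMemoryStub, and the method signatures are illustrations, not the repo's LocalStubAdapter or its HTTP contract.

```python
from typing import Protocol


class MarketplaceAdapter(Protocol):
    """Hypothetical contract; the real one is an HTTP API and may differ."""

    def publish(self, name: str, bundle: bytes) -> None: ...
    def search(self, query: str) -> list[str]: ...
    def download(self, name: str) -> bytes: ...


class InMemoryStub:
    """Offline stand-in that satisfies the contract with no network at all."""

    def __init__(self) -> None:
        self._bundles: dict[str, bytes] = {}

    def publish(self, name: str, bundle: bytes) -> None:
        self._bundles[name] = bundle

    def search(self, query: str) -> list[str]:
        return sorted(n for n in self._bundles if query in n)

    def download(self, name: str) -> bytes:
        return self._bundles[name]


def publish_then_fetch(adapter: MarketplaceAdapter, name: str, bundle: bytes) -> bytes:
    """Runs against any adapter; it never asks whose marketplace this is."""
    adapter.publish(name, bundle)
    if name not in adapter.search(name):
        raise LookupError(f"{name} was published but is not searchable")
    return adapter.download(name)
```

Client code written against the protocol keeps working when the stub is swapped for a networked implementation.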

The pairing flow borrows from VS Code plus GitHub. The local user generates a single-use 6-character code with a 10-minute TTL:

POST /api/marketplace/pair/code
→ {"code": "AB3X9K", "ttl_minutes": 10}

POST /api/bridge/pair/complete
Body: {"code": "AB3X9K", "label": "skillforge-marketplace"}
→ 200: {"token": "<32-byte-urlsafe>", "token_id": "...", "scopes": [...]}

Only the SHA-256 hash of the token is stored locally, in a chmod 600 file. Comparison is constant-time via hmac.compare_digest. Pairing endpoints are rate-limited to 10 attempts per minute per client. Within the 10-minute TTL that leaves a single client about 100 guesses against an 887-million code space, which makes brute force impractical.
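
The token handling and the brute-force arithmetic can be sketched with the standard library alone. The 31-symbol alphabet is an assumption, picked because 31**6 = 887,503,681 matches the "887-million" figure; the function names are hypothetical, and the real implementation also writes the hash to a chmod 600 file, which this sketch skips.

```python
import hashlib
import hmac
import secrets

# Assumption: 24 letters (no I or O) plus 7 digits gives 31 symbols.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ3456789"
CODE_LENGTH = 6


def new_pairing_code() -> str:
    """Single-use code drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(CODE_LENGTH))


def issue_token() -> tuple[str, str]:
    """Return (token, sha256_hex). Only the hash would be kept on disk."""
    token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe encoded
    return token, hashlib.sha256(token.encode()).hexdigest()


def verify_token(presented: str, stored_hash: str) -> bool:
    """Hash the presented token and compare in constant time."""
    presented_hash = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(presented_hash, stored_hash)


def guess_probability(attempts_per_minute: int = 10, ttl_minutes: int = 10) -> float:
    """Chance that one rate-limited client hits a live code before it expires."""
    guesses = attempts_per_minute * ttl_minutes
    return guesses / len(ALPHABET) ** CODE_LENGTH
```

With the defaults, one client gets 100 guesses out of 887,503,681 codes, roughly a one-in-nine-million chance per pairing window. The limit is per client, so that figure describes a single attacker, not a distributed one.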

Default scopes are registry:read, skills:install, skills:publish. The dangerous one, skills:install:unattended, is off by default and requires explicit grant. Marketplace-originated installs land in an approval queue. The local user clicks Approve. No silent installs, ever.

The wire bundle is a gzipped tarball called .skillpkg:

skill-creator.skillpkg
├── PACKAGING        # JSON: name, version, packaged_at, packaged_by
├── manifest.json    # canonical SkillManifest
├── SKILL.md
├── config.yaml
├── prompts/
├── templates/
├── scripts/
└── examples/

Path traversal is rejected on unpack. The manifest is the source of truth.
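
One common way to reject path traversal on unpack is to resolve every member path against the destination before extracting anything. The sketch below shows that technique with tarfile; safe_extract and UnsafeArchiveError are hypothetical names, and refusing links outright is my own conservative choice, not something the article states.

```python
import tarfile
from pathlib import Path


class UnsafeArchiveError(Exception):
    """Raised when an archive member would land outside the destination."""


def safe_extract(archive: tarfile.TarFile, dest: Path) -> None:
    """Validate every member first, then extract; nothing is written on failure."""
    dest = dest.resolve()
    for member in archive.getmembers():
        # Assumption: links are refused outright, the conservative choice.
        if member.issym() or member.islnk():
            raise UnsafeArchiveError(f"link not allowed: {member.name}")
        target = (dest / member.name).resolve()
        # Catches "../" escapes and absolute member names alike.
        if target != dest and dest not in target.parents:
            raise UnsafeArchiveError(f"path escapes destination: {member.name}")
    archive.extractall(dest)
```

Validating the whole archive before the first write means a malicious bundle leaves no partial files behind. On recent Python versions, tarfile's built-in extraction filters offer a similar safeguard.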

The vision: anyone designs a skill, anyone hosts a marketplace, anyone installs. The reason the GPT Store filled with prompt wrappers is that there was no quality signal. SkillForge ships with one.

The eval harness is the quality signal. It lives at /eval in the Web UI.

Pick a skill. Pick a prompt suite. For each prompt, the harness runs the skill's SKILL.md as guidance against the configured provider, then asks an LLM-as-judge to score the response 0 through 10 against the skill's own output_standards. Results stream into an expandable table with color-coded scores and reasoning. Runs persist to SQLite so you can track scores across iterations.

POST /api/eval/run
{
  "skill_name": "backend-fastapi-postgres",
  "suite": "general",
  "provider": "anthropic",
  "judge_provider": "openai-compatible"
}
→ 200: {"run_id": "...", "results": [...]}

Compare mode is where the harness earns its keep. Pick two or more skills and a shared suite. You get an aggregate score, a win count, and per-prompt side-by-side cards with the winner highlighted. Manual override is supported. When five people publish a backend-fastapi-postgres skill, you can run them head to head on the same suite and see which one actually wins. Quality becomes measurable.
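
The aggregate score and win count reduce to a small computation over per-prompt judge scores. This is a sketch under assumptions: compare is a hypothetical function, the aggregate is taken to be a plain mean, and a tied prompt awards no win. The article specifies none of these details.

```python
from statistics import mean


def compare(scores: dict[str, list[float]]) -> dict:
    """scores maps skill name -> judge scores, one per prompt of a shared suite."""
    lengths = {len(s) for s in scores.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("every skill needs the same non-empty list of scores")
    n_prompts = lengths.pop()

    wins = {name: 0 for name in scores}
    for i in range(n_prompts):
        best = max(s[i] for s in scores.values())
        leaders = [name for name, s in scores.items() if s[i] == best]
        if len(leaders) == 1:  # assumption: ties award no win
            wins[leaders[0]] += 1

    return {
        "aggregate": {name: mean(s) for name, s in scores.items()},
        "wins": wins,
    }
```

Reporting both numbers matters: a skill can win most prompts narrowly and still lose on aggregate if it fails badly on one.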

The cost guard caps completions per run at SKILLFORGE_EVAL_MAX_CALLS=50. Eval never executes generated scripts. It only calls the chat API.

Two honest caveats. First, self-grading bias. The judge and the generator can already be configured to use different providers, which kills the worst of it, but the default config still uses the same provider for both, and most users will not change it. Roadmap item Tier 2.1 is to make the judge default to a different provider than the generator. Second, per-domain output standards are still generic. Every domain currently judges against one shared standards list. Tier 1.2 on the roadmap specializes them.

The GPT Store failed partly because there was no quality signal. SkillForge ships with one. It is rough, it has known bias, and it is better than nothing.

Safety first. The local server binds to 127.0.0.1 by default. A LocalOriginGuardMiddleware rejects browser requests whose Origin or Referer header names a non-loopback host, which is the CSRF and DNS-rebinding defense Jupyter and VS Code use. Token comparison is constant-time. Install is atomic: write to a staging directory, then os.replace. SQLite runs in WAL mode with busy_timeout=5000 and synchronous=NORMAL, so concurrent eval and registry access no longer hits "database is locked." Generated scripts are never executed automatically. The tool executor requires explicit confirm=True, an allowlist match, and a 30-second timeout.
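
Two of those guarantees are easy to sketch with the standard library: the loopback origin check and the staged install. Both functions are hypothetical illustrations, not the code behind LocalOriginGuardMiddleware or the installer, and the overwrite branch here deletes before replacing, so only a fresh install is a single atomic rename.

```python
import ipaddress
import os
import shutil
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def is_loopback_origin(header_value: str) -> bool:
    """True when an Origin or Referer value names localhost or a loopback IP."""
    try:
        host = urlsplit(header_value).hostname
    except ValueError:
        return False
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:  # a non-IP hostname such as evil.example
        return False


def atomic_install(files: dict[str, str], skills_dir: Path, name: str,
                   overwrite: bool = False) -> Path:
    """Write into a staging directory, then move it into place in one step."""
    final = skills_dir / name
    if final.exists() and not overwrite:
        raise FileExistsError(f"{name} is already installed")
    skills_dir.mkdir(parents=True, exist_ok=True)
    # Staging sits next to the target so the rename stays on one filesystem.
    staging = Path(tempfile.mkdtemp(prefix=f".{name}-", dir=skills_dir))
    try:
        for rel, text in files.items():
            path = staging / rel
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(text, encoding="utf-8")
        if final.exists():
            shutil.rmtree(final)  # the overwrite path is not atomic in this sketch
        os.replace(staging, final)
    except BaseException:
        shutil.rmtree(staging, ignore_errors=True)
        raise
    return final
```

A reader of the skills directory sees either no skill or a complete one, never a half-written tree, because the final name only appears at the rename.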

Now the limits. These are the engagement magnet, so I will be direct about them:

./scripts/build-binary.sh.

If you want to find a bug, those bullets are where to look first.

The big vision is a federated marketplace where anyone can design a skill, publish it, and have end users compare skill outputs head to head. Best skills rise. Weak skills get forked and improved.

Concrete items on the roadmap:

Out of scope by design: cloud sync, multi-user workspaces, auth, team permissions, remote deployment, auto-executing generated scripts. If those matter to you, the MIT license and the clean service layer make forking straightforward.

git clone https://github.com/sulthonzh/skillforge
cd skillforge
cd apps/api && pip install -e ".[dev]"
cd ../cli   && pip install -e .
skillforge serve --build-web

I want bug reports and edge cases, not stars. Tell me where it breaks. Specific things I would like feedback on:

The file apps/api/skillforge_api/data/tool_catalog.yaml is a great first PR. Add a domain. Add a tool. Add a reason. The catalog is the part that gets better with every contributor.

── more in #developer-tools 4 stories Β· sorted by recency
── more on @skillforge 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain β€” perfect for shipping the agent you just read about.

$git push zahid main
β†’ Live at https://your-agent.zahid.host βœ“
Get free account β†’ Pricing
from €0/mo Β· no card required
LIVE [news/i-built-a-homebrew-f…] indexed:0 read:9min 2026-06-18 Β· β€”