CompletionKit – stop guessing whether your prompt change helped

wpnews.pro


[ARTICLE · art-18904] src=completionkit.com ↗ pub=2026-05-31T02:13Z topic=ai-tools verified=true sentiment=↑ positive


CompletionKit has launched an open-source tool for testing AI prompts against real inputs, scoring outputs with custom metrics, and identifying regressions before deployment. The platform supports multiple AI providers including OpenAI, Anthropic, and Ollama, and offers a free self-hostable option alongside cloud and Rails engine deployments. The tool aims to replace subjective "vibes-based" prompt evaluation with reproducible, data-driven testing workflows.

5 min read · 23 views · Published May 31, 2026

Open source · Hosted Cloud, standalone app, or Rails engine

Every change to your AI is a guess until you test it. CompletionKit runs it against real inputs, scores the outputs, and shows you what improved and what broke before it ships.

Free + self-hostable · OpenAI · Anthropic · Ollama · 100+ via OpenRouter

01 The problem

You changed the prompt. It feels better. You need to know.

"Seems better" isn't a metric you can ship behind — but without the right tools, it's usually all you've got.

You're shipping on vibes.

Without numbers, "looks right" is the bar — and "looks right" depends entirely on which inputs you happened to test.

Prompts drift, silently.

A one-line tweak to a string literal doesn't show up in review as "prompt change." Behavior shifts in prod, and nobody can point to when or why.

Every fix risks a quiet regression.

v4 nails the case that prompted the change. The inputs you didn't think to retest? They might be worse — and you'd never know.
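A quiet regression is easy to miss when only the average is watched. The sketch below is a hypothetical illustration, not CompletionKit's code: it assumes each dataset row has a score under both prompt versions, and the row ids, scores, and the `find_regressions` helper are all made up for the example.

```python
from statistics import mean


def find_regressions(old: dict[str, int], new: dict[str, int]) -> list[str]:
    """Return the row ids whose score dropped between two prompt versions."""
    return sorted(row for row in old if row in new and new[row] < old[row])


# Hypothetical per-row scores keyed by dataset row id (invented data).
v3 = {"row-1": 2, "row-2": 4, "row-3": 5}
v4 = {"row-1": 5, "row-2": 4, "row-3": 3}

# The average goes up, which is what "it feels better" tends to reflect.
average_change = mean(v4.values()) - mean(v3.values())

# The per-row comparison still surfaces the input that got worse.
regressions = find_regressions(v3, v4)
print(f"average change: {average_change:+.2f}, regressed rows: {regressions}")
```

Here v4 fixes row-1 and raises the average, yet row-3 drops from 5 to 3. Comparing row by row is what makes that visible.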

02 How a run works

Four moves. Then you have evidence.

A run is the unit of work — reproducible by prompt, dataset, model, metrics.

1. Write a prompt. Template with {{vars}}. Upload a CSV of real inputs. CompletionKit merges them, one prompt per row.
2. Run a model. OpenAI, Anthropic, Ollama, or any of 100+ via OpenRouter. Same dataset, any number of models in parallel.
3. Score against your metrics. An LLM judge scores each output 1–5 on the metrics you define. Empathy, clarity, policy: whatever "good" means for you.
4. Iterate. Re-run. Edit and the prompt forks a new version. Ask CompletionKit to suggest one, grounded in the judge's feedback on your data.
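The merge in the first step can be pictured with a minimal sketch. This is an assumption about the general technique (substituting `{{vars}}` from CSV columns), not CompletionKit's implementation; the `render` and `merge` helpers, the column names, and the sample rows are hypothetical.

```python
import csv
import io
import re

# Matches {{name}} placeholders, allowing optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template: str, row: dict[str, str]) -> str:
    """Fill each {{var}} in the template from one CSV row."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in row:
            raise KeyError(f"template variable {name!r} has no CSV column")
        return row[name]

    return PLACEHOLDER.sub(substitute, template)


def merge(template: str, csv_text: str) -> list[str]:
    """Return one rendered prompt per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [render(template, row) for row in reader]


# Hypothetical template and dataset, for illustration only.
template = "Reply to {{customer}} about: {{issue}}"
dataset = "customer,issue\nAda,late delivery\nLin,double charge\n"
prompts = merge(template, dataset)
for prompt in prompts:
    print(prompt)
```

Failing loudly on a missing column is a design choice in this sketch: a silently empty variable would produce a prompt that still runs but tests the wrong thing.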

Got outputs already — production logs, another model, hand-curated examples? Skip steps 1 and 2: a judge-only run grades any column in your dataset against your metrics, no generation needed.

03 The unlock

Point your coding agent at CompletionKit. Walk away.

REST, MCP, the standalone Bench — same shape. Plug Claude Code or Cursor into the MCP server and tell it to make a prompt better. It does.

You set the bar.

Define the metrics and the score a prompt has to clear. That judgment is yours; it's the one part of the loop the agent can't do for you.

The agent runs the loop.

Point Claude Code or Cursor at the MCP server. It revises the prompt, runs it on your dataset, and re-scores, pass after pass, with no babysitting.

You approve the result.

It stops at the first version that clears your bar and hands it back. Review the diff and ship it, or send it round again.
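The loop described above (revise, run, re-score, stop at the first version that clears the bar) can be sketched in a few lines. This is a hypothetical outline, not the MCP server's interface: `revise` and `evaluate` stand in for the agent and the scored run, and the `max_passes` cap is an added assumption so the sketch always terminates.

```python
from typing import Callable


def improve_until_bar(
    prompt: str,
    revise: Callable[[str], str],
    evaluate: Callable[[str], float],
    bar: float,
    max_passes: int = 10,
) -> tuple[str, float, int]:
    """Revise and re-score until a version clears the bar or passes run out.

    Returns (prompt, score, number of revisions made).
    """
    score = evaluate(prompt)
    passes = 0
    while score < bar and passes < max_passes:
        prompt = revise(prompt)
        score = evaluate(prompt)
        passes += 1
    return prompt, score, passes


# Stand-ins for the agent and the scored run: each revision appends a hint,
# and the score rises by 0.5 for every hint present.
def fake_revise(prompt: str) -> str:
    return prompt + " Be concise."


def fake_evaluate(prompt: str) -> float:
    return 3.0 + 0.5 * prompt.count("Be concise.")


final_prompt, final_score, passes = improve_until_bar(
    "Summarize the ticket.", fake_revise, fake_evaluate, bar=4.0
)
print(passes, final_score, final_prompt)
```

The bar is a plain argument here because it is the part you own. Everything inside the loop is what the agent repeats on your behalf.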

04 Compared

What you actually get, line by line.

Every tool here is doing real work. CompletionKit is the shape we kept needing.

OpenAI Evals	Workbench	Braintrust	Langfuse	Promptfoo	CompletionKit
Multi-provider	—	—
Local models (Ollama)	—	—	—
Custom scoring metrics	Partial	Partial
AI suggestions from your data	—	Generic	—	—	—
Versioned prompts via API	—	—	—
MCP server	—	—	—	—	—
Free + self-hostable	Partial	Partial	—

05 Three ways to run it

Same product. Three ways to ship it.

Cloud is the fastest start. The standalone app is for self-hosting on your own infra. The engine is a Ruby gem you mount into an existing Rails app. Same code underneath — pick the deployment that fits.

Cloud

Free tier: 20 runs / month
Bring your own provider keys
Team workspaces and roles

Standalone app

Same webapp, your infra. Clone the repo, point at a Postgres, run web + worker.

No multi-tenancy, no phone-home
Provider keys via env or Settings
Source-available · BSL 1.1

Rails engine

Add the gem, mount the engine, run the migrations — shares your app's auth and DB.

Plays nice with your existing auth
Active Job (Solid Queue, Sidekiq, …)
No separate deploy

FAQ

Which providers does it support?

OpenAI, Anthropic, Ollama (or any OpenAI-compatible local endpoint), and 100+ models via OpenRouter.

Do I have to self-host?

No. CompletionKit Cloud is the hosted version with a free tier — sign up and you're running. If you'd rather host it yourself, you've got two flavors: deploy the bundled standalone app on your own infra, or mount the Rails engine inside an existing Rails app. Same product on every path; you bring the provider keys.

How does my app use a prompt?

Each published prompt has a versioned URL. Your app calls it, gets the template and model, and runs the LLM the way it already does. No SDK required — it's just JSON over HTTP.
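As a rough sketch of what "JSON over HTTP" could look like from the app's side: the article does not give the URL format or the response schema, so the `template` and `model` key names and the sample payload below are assumptions made purely for illustration.

```python
import json
import urllib.request


def fetch_prompt(url: str) -> dict:
    """GET a published prompt and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def parse_prompt(payload: dict) -> tuple[str, str]:
    """Pull out the two things the app needs: template text and model name."""
    # "template" and "model" are assumed key names, not a documented schema.
    return payload["template"], payload["model"]


# Offline example with a made-up payload; in an app the payload would come
# from fetch_prompt() called with your prompt's versioned URL.
sample = json.loads('{"template": "Summarize: {{text}}", "model": "example-model"}')
template, model = parse_prompt(sample)
print(model, "|", template)
```

Because the app only needs a template and a model name, it can keep calling its LLM provider the way it already does. The fetched values replace the string literal that used to live in the code.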

What if I already have outputs from another system?

Use a judge-only run. Drop the outputs into a column of your dataset, point the run at that column, define your metrics — the LLM judge scores every row against your rubric. No prompt, no generation step. Same scoring, same per-row review you'd get from a full run. Great for grading production logs, comparing a different model's outputs, or auditing hand-written examples.
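The shape of a judge-only run can be sketched as follows. This is a hypothetical outline rather than CompletionKit's code: the `judge_only_run` helper, the column and metric names, and the sample log rows are invented, and `toy_judge` is a simple rule standing in for the LLM judge.

```python
import csv
import io
from typing import Callable

Judge = Callable[[str, str], int]  # (output text, metric name) -> score 1-5


def judge_only_run(
    csv_text: str, column: str, metrics: list[str], judge: Judge
) -> list[dict[str, int]]:
    """Score an existing output column against each metric, row by row."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{metric: judge(row[column], metric) for metric in metrics} for row in rows]


def toy_judge(output: str, metric: str) -> int:
    """Placeholder for an LLM judge: rewards short outputs on 'clarity'."""
    if metric == "clarity":
        return 5 if len(output.split()) <= 5 else 2
    return 3


# Hypothetical production log with replies that were generated elsewhere.
logs = (
    "ticket,reply\n"
    "101,Refund issued today.\n"
    "102,We are looking into the matter and will revert in due course.\n"
)
scores = judge_only_run(logs, "reply", ["clarity", "empathy"], toy_judge)
for row_scores in scores:
    print(row_scores)
```

No prompt or model call appears anywhere in the sketch. The only inputs are the existing column, the metric names, and the judge.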

Is the engine free?

Yes — free for any use, including in production and inside your own commercial product. Source-available under BSL 1.1; the only carve-out is offering CompletionKit itself to third parties as a hosted or managed service. Auto-converts to GPL-3 after three years. Versions 0.2.x and earlier remain MIT.

Who made it?

Built by Homemade Software.

source & further reading

completionkit.com — original article



