# CompletionKit – stop guessing whether your prompt change helped

> Source: <https://completionkit.com/>
> Published: 2026-05-31 02:13:36+00:00

Open source · Hosted Cloud, standalone app, or Rails engine

# Test your AI app *properly*. Improve it with confidence.

Every change to your AI is a guess until you test it. CompletionKit runs it against real inputs, scores the outputs, and shows you what improved and what broke before it ships.

Free + self-hostable · OpenAI · Anthropic · Ollama · 100+ via OpenRouter

01 The problem

## You changed the prompt. It feels better. You need to know.

"Seems better" isn't a metric you can ship behind — but without the right tools, it's usually all you've got.

#### You're shipping on vibes.

Without numbers, "looks right" is the bar — and "looks right" depends entirely on which inputs you happened to test.

#### Prompts drift, silently.

A one-line tweak to a string literal doesn't show up in review as "prompt change." Behavior shifts in prod, and nobody can point to when or why.

#### Every fix risks a quiet regression.

v4 nails the case that prompted the change. The inputs you didn't think to retest? They might be worse — and you'd never know.
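A small sketch of why an average can hide this, using made-up scores on the 1–5 scale. The row ids, the score values, and the `find_regressions` helper are all illustrative assumptions, not CompletionKit output or API.

```python
def find_regressions(old_scores: dict[str, int],
                     new_scores: dict[str, int],
                     tolerance: int = 0) -> list[str]:
    """Return ids of rows whose score dropped by more than `tolerance`."""
    return sorted(
        row_id
        for row_id, old in old_scores.items()
        if row_id in new_scores and old - new_scores[row_id] > tolerance
    )

# Hypothetical per-row judge scores for two prompt versions.
v3 = {"row-1": 2, "row-2": 4, "row-3": 5}
v4 = {"row-1": 5, "row-2": 4, "row-3": 3}

mean_v3 = sum(v3.values()) / len(v3)
mean_v4 = sum(v4.values()) / len(v4)
regressed = find_regressions(v3, v4)

print(f"mean v3={mean_v3:.2f}, mean v4={mean_v4:.2f}, regressed={regressed}")
```

Here the mean goes up because row-1 improved, yet row-3 got worse. Only a per-row comparison on the same dataset surfaces that.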

02 How a run works

## Four moves. Then you have evidence.

A run is the unit of work, reproducible from its prompt, dataset, model, and metrics.

- **Write a prompt.** Template with `{{vars}}`. Upload a CSV of real inputs. CompletionKit merges them, one prompt per row.
- **Run a model.** OpenAI, Anthropic, Ollama, or any of 100+ via OpenRouter. Same dataset, any number of models in parallel.
- **Score against your metrics.** An LLM judge scores each output 1–5 on the metrics you define. Empathy, clarity, policy: whatever *good* means for you.
- **Iterate. Re-run.** Edit and the prompt forks a new version. Ask CompletionKit to suggest one, grounded in the judge's feedback on your data.
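The merge in the first step can be pictured roughly like this. It is a minimal sketch that assumes `{{name}}` placeholders map to CSV column names; the `render` and `merge` helpers and the sample data are hypothetical, and CompletionKit's own templating may behave differently.

```python
import csv
import io
import re

# Matches {{name}} placeholders, tolerating spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template: str, row: dict[str, str]) -> str:
    """Fill each {{var}} from the row; fail loudly on a missing column."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in row:
            raise KeyError(f"dataset has no column named {name!r}")
        return row[name]
    return PLACEHOLDER.sub(substitute, template)

def merge(template: str, csv_text: str) -> list[str]:
    """Return one rendered prompt per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [render(template, row) for row in reader]

# Hypothetical template and dataset.
TEMPLATE = "Reply to {{customer}} about: {{issue}}"
DATASET = "customer,issue\nAda,late delivery\nLin,double charge\n"

prompts = merge(TEMPLATE, DATASET)
for prompt in prompts:
    print(prompt)
```

Raising on a missing column is a design choice for the sketch: a typo in a placeholder fails the whole merge instead of silently sending a half-filled prompt to the model.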

Got outputs already — production logs, another model, hand-curated examples? Skip steps 1 and 2: a **judge-only run** grades any column in your dataset against your metrics, no generation needed.

03 The unlock

## Point your coding agent at CompletionKit. *Walk away.*

REST, MCP, the standalone Bench — same shape. Plug Claude Code or Cursor into the MCP server and tell it to make a prompt better. It does.

#### You set the bar.

Define the metrics and the score a prompt has to clear. That judgment is yours; it's the one part of the loop the agent can't do for you.

#### The agent runs the loop.

Point Claude Code or Cursor at the MCP server. It revises the prompt, runs it on your dataset, and re-scores, pass after pass, with no babysitting.

#### You approve the result.

It stops at the first version that clears your bar and hands it back. Review the diff and ship it, or send it round again.
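The loop described above has roughly this shape. This is a sketch under assumptions: `revise` and `evaluate` are hypothetical stand-ins for whatever the agent and the MCP server actually do, and the stub scoring below is invented purely so the example runs.

```python
from typing import Callable, Optional

def improve_until(prompt: str,
                  revise: Callable[[str, float], str],
                  evaluate: Callable[[str], float],
                  bar: float,
                  max_passes: int = 5) -> Optional[tuple[str, float, int]]:
    """Return (prompt, score, passes) for the first version that clears
    the bar, or None if no version does within max_passes."""
    for passes in range(1, max_passes + 1):
        score = evaluate(prompt)
        if score >= bar:
            return prompt, score, passes
        prompt = revise(prompt, score)
    return None

# Stand-ins for a real run and a real revision step (assumptions).
def fake_evaluate(prompt: str) -> float:
    return 3.0 + 0.5 * prompt.count("Be specific.")

def fake_revise(prompt: str, score: float) -> str:
    return prompt + " Be specific."

result = improve_until("Summarize the ticket.", fake_revise, fake_evaluate, bar=4.0)
print(result)
```

The `max_passes` cap is an addition for the sketch, so an unreachable bar ends the loop instead of running indefinitely.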

04 Compared

## What you actually get, line by line.

Every tool here is doing real work. CompletionKit is the shape we kept needing.

| | OpenAI Evals | Workbench | Braintrust | Langfuse | Promptfoo | CompletionKit |
|---|---|---|---|---|---|---|
| Multi-provider | — | — | ✓ | ✓ | ✓ | ✓ |
| Local models (Ollama) | — | — | — | ✓ | ✓ | ✓ |
| Custom scoring metrics | Partial | Partial | ✓ | ✓ | ✓ | ✓ |
| AI suggestions from your data | — | Generic | — | — | — | ✓ |
| Versioned prompts via API | — | — | — | ✓ | ✓ | ✓ |
| MCP server | — | — | — | — | — | ✓ |
| Free + self-hostable | Partial | Partial | — | ✓ | ✓ | ✓ |

05 Three ways to run it

## Same product. *Three* ways to ship it.

Cloud is the fastest start. The standalone app is for self-hosting on your own infra. The engine is a Ruby gem you mount into an existing Rails app. Same code underneath — pick the deployment that fits.

Cloud

Sign up, paste your provider keys, run your first eval in under five minutes.

- Free tier: 20 runs / month
- Bring your own provider keys
- Team workspaces and roles

Standalone app

Same webapp, your infra. Clone the repo, point at a Postgres, run web + worker.

- No multi-tenancy, no phone-home
- Provider keys via env or Settings
- Source-available · BSL 1.1

Rails engine

Add the gem, mount the engine, run the migrations — shares your app's auth and DB.

- Plays nice with your existing auth
- Active Job (Solid Queue, Sidekiq, …)
- No separate deploy

## FAQ

### Which providers does it support?

OpenAI, Anthropic, Ollama (or any OpenAI-compatible local endpoint), and 100+ models via OpenRouter.

### Do I have to self-host?

No. [CompletionKit Cloud](/registration/new) is the hosted version with a free tier — sign up and you're running. If you'd rather host it yourself, you've got two flavors: deploy the bundled [standalone app](https://github.com/homemade-software-inc/completion-kit#or-run-the-standalone-app) on your own infra, or mount the [Rails engine](https://github.com/homemade-software-inc/completion-kit#or-mount-as-an-engine-in-your-existing-rails-app) inside an existing Rails app. Same product on every path; you bring the provider keys.

### How does my app use a prompt?

Each published prompt has a versioned URL. Your app calls it, gets the template and model, and runs the LLM the way it already does. No SDK required — it's just JSON over HTTP.
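From the app's side, that could look something like the sketch below. The JSON field names (`template`, `model`), the sample payload, and the idea that a plain GET is enough are all assumptions for illustration; check the real response shape and any authentication requirements before relying on this.

```python
import json
import urllib.request

def parse_prompt(raw: str) -> tuple[str, str]:
    """Extract (template, model) from a JSON payload.
    The field names are assumed, not taken from the CompletionKit docs."""
    payload = json.loads(raw)
    return payload["template"], payload["model"]

def fetch_prompt(url: str, timeout: float = 10.0) -> tuple[str, str]:
    """GET a versioned prompt URL and parse the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_prompt(response.read().decode("utf-8"))

# Hypothetical payload, used here so the example runs without a network call.
SAMPLE = '{"template": "Reply to {{customer}}", "model": "example-model"}'
template, model = parse_prompt(SAMPLE)
print(model, "->", template)
```

In a real app you would call `fetch_prompt` with the published prompt's versioned URL, then pass the template and model to the LLM client you already use.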

### What if I already have outputs from another system?

Use a **judge-only run**. Drop the outputs into a column of your dataset, point the run at that column, define your metrics — the LLM judge scores every row against your rubric. No prompt, no generation step. Same scoring, same per-row review you'd get from a full run. Great for grading production logs, comparing a different model's outputs, or auditing hand-written examples.
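Conceptually, a judge-only run reduces to scoring one column per row. The sketch below assumes a judge is a function from (output, metric) to a 1–5 score; `judge_column`, `stub_judge`, and the sample log are hypothetical, and the stub replaces the LLM call so the example is runnable.

```python
import csv
import io
from statistics import mean
from typing import Callable

def judge_column(csv_text: str,
                 column: str,
                 metrics: list[str],
                 judge: Callable[[str, str], int]) -> list[dict[str, int]]:
    """Score the given column of every row on every metric."""
    rows = csv.DictReader(io.StringIO(csv_text))
    return [{metric: judge(row[column], metric) for metric in metrics}
            for row in rows]

def stub_judge(output: str, metric: str) -> int:
    """Stand-in for an LLM judge; returns a score from 1 to 5."""
    return 5 if "sorry" in output.lower() else 2

# Hypothetical production log with existing outputs in the "reply" column.
LOGS = "ticket,reply\n1,Sorry about the delay.\n2,Check the FAQ.\n"

scores = judge_column(LOGS, "reply", ["empathy"], stub_judge)
average = mean(row["empathy"] for row in scores)
print(scores, average)
```

No prompt or generation step appears anywhere in the sketch; the outputs are read straight from the dataset.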

### Is the engine free?

Yes — free for any use, including in production and inside your own commercial product. Source-available under [BSL 1.1](https://github.com/homemade-software-inc/completion-kit/blob/main/LICENSE); the only carve-out is offering CompletionKit itself to third parties as a hosted or managed service. Auto-converts to GPL-3 after three years. Versions 0.2.x and earlier remain [MIT](https://github.com/homemade-software-inc/completion-kit/blob/v0.2.0/MIT-LICENSE).

### Who made it?

Built by [Homemade Software](https://www.homemade.software).