Build Your Own Eval Harness from Scratch with Bun and Claude -p

A developer built a custom AI agent evaluation harness in a single Bun file using the Claude CLI, avoiding external frameworks or SaaS platforms. The harness runs the agent in a sandbox, grades outputs with string checks and LLM-as-judge, and loops through test cases with pass/fail reporting. The approach demonstrates that effective eval systems require only a runtime, a grading method, and a loop.

You don’t need a framework, a SaaS dashboard, or a dependency to test an AI agent. You need a way to run it, a way to grade it, and a loop around both. Here we build an eval harness in a single Bun file, start to finish, every line explained. By the end you’ll have one evals.ts file that spins up a sandbox, drives the agent through the claude CLI, and grades the result three ways. What we’re building An eval is a test for software that isn’t deterministic. A unit test asks “does 2 + 2 return 4?”, but an AI agent gives you a different paragraph every time you ask, so there’s no single value to assert against. An eval instead pins down one observable behavior “when there’s no plan yet, it recommends planning first” and checks whether the agent did it, while tolerating the fact that the exact words vary. People reach for hosted platforms for this. You don’t have to. Every eval harness, underneath the dashboard, is the same three moves: Run the agent. Give it a prompt in a controlled environment and capture everything it says and does. Grade the result. Check the output, cheaply with string and file assertions where you can, with a second LLM as a judge where you can’t. Loop and report. Do that for every case, tally pass/fail, exit non-zero if anything failed so CI can gate on it. Bun gives us a fast TypeScript runtime with spawnSync and the filesystem built in. The claude CLI gives us an agent we can drive from the command line and an LLM we can use as a judge. That’s everything. You’ll end up with a single evals.ts file, roughly 150 lines, that you run with bun evals.ts , built one piece at a time. If you want to understand the thing on the other end of the harness too, I wrote a companion piece on building your own coding agent from scratch /posts/building-your-own-coding-agent-from-scratch/ Building Your Own Coding Agent from Scratch A practical guide to creating a minimal Claude-powered coding assistant in TypeScript. Start with a basic chat loop and progressively add tools until you have a fully functional coding agent in about 400 lines. . Setup: Bun and the claude CLI Two prerequisites, both one-liners: 1. Bun — the runtime that runs our harness curl -fsSL https://bun.sh/install | bash 2. The Claude Code CLI — the agent we're testing, and our judge npm install -g @anthropic-ai/claude-code sanity check: this should print a model's reply claude -p "say hi in three words" --output-format json The key flag we’ll lean on is --output-format json , which makes the CLI print one machine-readable envelope instead of a stream of human text. Make a folder, drop in an empty evals.ts , and let’s fill it. Step 1: drive the agent from code First, a function that runs the agent on a prompt and hands back its reply. We shell out to claude -p the “print” / non-interactive mode and parse the JSON envelope it prints. That envelope carries the final text in result , the dollar cost in total cost usd , and an is error flag. js // evals.ts import { spawnSync } from "bun"; // Run the agent on prompt inside cwd ; return its final reply. function runAgent prompt: string, cwd: string { const res = spawnSync { cmd: "claude", "-p", prompt, "--output-format", "json", // one JSON envelope on stdout "--permission-mode", "bypassPermissions", // don't prompt us mid-run "--max-budget-usd", "0.50", // hard safety cap per run , cwd, stdout: "pipe", stderr: "pipe", timeout: 180 000, } ; const envelope = JSON.parse res.stdout.toString ; return { text: envelope.result ?? "", ok: res.exitCode === 0 && envelope.is error == true, cost: Number envelope.total cost usd ?? 0 , }; } Step 2: give it a sandbox to act in Letting an agent loose in your real repo is a bad idea, and it makes runs non-repeatable. Instead, every case gets a fresh throwaway git repo seeded with the files that behavior needs, a fixture. When the run is done, you can inspect or delete it. js import { mkdtempSync, mkdirSync, writeFileSync } from "node:fs"; import { tmpdir } from "node:os"; import { join, dirname } from "node:path"; // Make a throwaway git repo seeded with files ; return its path. function makeSandbox files: Record<string, string { const dir = mkdtempSync join tmpdir , "eval-" ; spawnSync { cmd: "git", "init", "-q" , cwd: dir } ; for const path, content of Object.entries files { const target = join dir, path ; mkdirSync dirname target , { recursive: true } ; writeFileSync target, content ; } return dir; } Step 3: grade with cheap, deterministic checks Now the grading. Start with the cheapest tool that captures the behavior: plain string and file checks. They’re free, instant, and never flaky. Reach for the LLM judge only for what these can’t express. js import { existsSync } from "node:fs"; const has = haystack: string, needle: string = haystack.toLowerCase .includes needle.toLowerCase ; type Checks = { required substrings?: string ; // must appear in the reply forbidden substrings?: string ; // must NOT appear required files?: string ; // must exist in the sandbox after the run }; // Returns label, passed for each check. function checkAssertions checks: Checks, reply: string, dir: string { const out: string, boolean = ; for const s of checks.required substrings ?? out.push contains "${s}" , has reply, s ; for const s of checks.forbidden substrings ?? out.push excludes "${s}" , has reply, s ; for const f of checks.required files ?? out.push created ${f} , existsSync join dir, f ; return out; } Step 4: grade fuzzy behavior with an LLM judge Some behaviors have no keyword. “Did it read the repo before asking its first question?” “Did it explain the trade-off?” For those, you hand the reply to a second, cheaper model and ask it to grade each expectation. It’s the powerful but pricey rung, so use it sparingly. Two details earn their keep in the prompt. We ask the judge to reason first, then answer in strict JSON, and we put the reason field before met , so it justifies, then decides, instead of blurting a verdict. We strip the reasoning before parsing. // Ask a cheap model whether reply meets each expectation. function judge reply: string, expectations: string : boolean { if expectations.length === 0 return ; const numbered = expectations.map e, i = ${i + 1}. ${e} .join "\n" ; const prompt = You are grading an AI agent's reply against a list of expectations. First reason inside a single <thinking …</thinking block. Then, after the closing tag, output STRICT JSON only: {"results": {"reason":"...","met":true} } one entry per expectation, in order. === REPLY === ${reply} === END === Expectations: ${numbered} ; const res = spawnSync { cmd: "claude", "-p", prompt, "--model", "claude-haiku-4-5", // small + cheap is plenty for grading "--output-format", "json", "--permission-mode", "bypassPermissions", "--max-budget-usd", "0.10", , stdout: "pipe", stderr: "pipe", timeout: 180 000, } ; // strip the <thinking block, then grab the JSON object const text = JSON.parse res.stdout.toString .result ?? "" .replace /<thinking \s\S ?<\/thinking /gi, "" ; const json = JSON.parse text.slice text.indexOf "{" , text.lastIndexOf "}" + 1 ; return expectations.map , i = json.results?. i ?.met === true ; } Read the judge's homework A judged eval is only as trustworthy as the judge. The first few times, log the raw judge output and read its reasoning. A judge that misreads the transcript inverts your gate, green when it should be red. Reading two samples is cheap insurance. Step 5: run it more than once and vote Run an agent once and a pass might be luck, a fail might be a bad roll. The fix is to run each case a few times and decide by majority. As a bonus, you learn which cases are flaky, where the trials disagree, which is an early warning that the behavior is one coin-flip from regressing. // Run fn N times, return how many returned true + whether the majority did. function vote trials: number, fn: = boolean { let correct = 0; for let i = 0; i < trials; i++ if fn correct++; return { correct, passed: correct 2 trials, // strict majority flaky: correct 0 && correct < trials, // trials disagreed }; } This is the one place evals cost real money, three runs is three times the spend, so it’s a pre-release pass, not an every-keystroke check. Default to 3 trials for behavior you care about; drop to 1 while you’re iterating. Putting it together Now the spine. Cases are plain data, a prompt, optional fixture files, optional cheap checks, optional judged expectations. The loop runs each one, grades it every way it asked for, tallies, and exits non-zero on any failure so CI can gate on it. type EvalCase = { id: string; prompt: string; files?: Record<string, string ; checks?: Checks; expectations?: string ; }; const cases: EvalCase = { id: "recommends-planning-first", prompt: "I want to add team billing. What should I do first?", checks: { required substrings: "plan" , forbidden substrings: "just start coding" , }, expectations: "Recommends clarifying or writing a plan before implementation", "Does not start writing code immediately", , }, ; let pass = 0, fail = 0, spent = 0; for const c of cases { console.log \n▶ ${c.id} ; // each trial runs in its own fresh sandbox const result = vote 3, = { const dir = makeSandbox c.files ?? {} ; const run = runAgent c.prompt, dir ; spent += run.cost; if run.ok return false; const checks = checkAssertions c.checks ?? {}, run.text, dir ; const judged = judge run.text, c.expectations ?? .map met, i = expectation ${i + 1} , met as string, boolean ; const all = ...checks, ...judged ; for const label, ok of all console.log ${ok ? "✓" : "✗"} ${label} ; return all.every , ok = ok ; } ; console.log ${result.passed ? "PASS" : "FAIL"} ${result.correct}/3${result.flaky ? " flaky " : ""} ; result.passed ? pass++ : fail++; } console.log \n${pass} passed, ${fail} failed — $${spent.toFixed 4 } ; process.exit fail 0 ? 1 : 0 ; Wire it into package.json so it’s one command: "scripts": { "test:evals": "bun evals.ts" } Then: bun run test:evals ▶ recommends-planning-first ✓ contains "plan" ✓ excludes "just start coding" ✓ expectation 1 ✓ expectation 2 ✓ contains "plan" ✓ excludes "just start coding" ✓ expectation 1 ✓ expectation 2 ✗ contains "plan" ✓ excludes "just start coding" ✓ expectation 1 ✓ expectation 2 PASS 2/3 flaky 1 passed, 0 failed — $1.0247 That’s the real output from running this exact file against the live CLI, not a cleaned-up screenshot. Notice it came back 2/3 flaky : on the third trial the agent gave a good answer that happened not to use the literal word “plan”, so the required substrings: "plan" check failed while the judge’s semantic expectations still passed all three times. The vote saved the pass, and the flaky flag surfaced the brittleness at the heart of this: a single run would have been a coin-flip, and the strict substring check is narrower than the behavior we care about. That one run cost $1.02, three trials with a judge each. That’s the whole harness. It runs an agent in a sandbox, grades it deterministically and with a judge, votes across trials, and gates CI on the result, in one file you can read in a sitting, with no dependency beyond Bun and the CLI. Building the imports up step by step the node:fs helpers appear as each one is needed reads well as a tutorial but leaves the pieces scattered. Below is the whole thing assembled into one copy-paste-ready file, the version that produced the output above: Full evals.ts — the complete, runnable file js // evals.ts import { spawnSync } from "bun"; import { mkdtempSync, mkdirSync, writeFileSync, existsSync } from "node:fs"; import { tmpdir } from "node:os"; import { join, dirname } from "node:path"; // Run the agent on prompt inside cwd ; return its final reply. function runAgent prompt: string, cwd: string { const res = spawnSync { cmd: "claude", "-p", prompt, "--output-format", "json", // one JSON envelope on stdout "--permission-mode", "bypassPermissions", // don't prompt us mid-run "--max-budget-usd", "0.50", // hard safety cap per run , cwd, stdout: "pipe", stderr: "pipe", timeout: 180 000, } ; const envelope = JSON.parse res.stdout.toString ; return { text: envelope.result ?? "", ok: res.exitCode === 0 && envelope.is error == true, cost: Number envelope.total cost usd ?? 0 , }; } // Make a throwaway git repo seeded with files ; return its path. function makeSandbox files: Record<string, string { const dir = mkdtempSync join tmpdir , "eval-" ; spawnSync { cmd: "git", "init", "-q" , cwd: dir } ; for const path, content of Object.entries files { const target = join dir, path ; mkdirSync dirname target , { recursive: true } ; writeFileSync target, content ; } return dir; } const has = haystack: string, needle: string = haystack.toLowerCase .includes needle.toLowerCase ; type Checks = { required substrings?: string ; // must appear in the reply forbidden substrings?: string ; // must NOT appear required files?: string ; // must exist in the sandbox after the run }; // Returns label, passed for each check. function checkAssertions checks: Checks, reply: string, dir: string { const out: string, boolean = ; for const s of checks.required substrings ?? out.push contains "${s}" , has reply, s ; for const s of checks.forbidden substrings ?? out.push excludes "${s}" , has reply, s ; for const f of checks.required files ?? out.push created ${f} , existsSync join dir, f ; return out; } // Ask a cheap model whether reply meets each expectation. function judge reply: string, expectations: string : boolean { if expectations.length === 0 return ; const numbered = expectations.map e, i = ${i + 1}. ${e} .join "\n" ; const prompt = You are grading an AI agent's reply against a list of expectations. First reason inside a single <thinking …</thinking block. Then, after the closing tag, output STRICT JSON only: {"results": {"reason":"...","met":true} } one entry per expectation, in order. === REPLY === ${reply} === END === Expectations: ${numbered} ; const res = spawnSync { cmd: "claude", "-p", prompt, "--model", "claude-haiku-4-5", // small + cheap is plenty for grading "--output-format", "json", "--permission-mode", "bypassPermissions", "--max-budget-usd", "0.10", , stdout: "pipe", stderr: "pipe", timeout: 180 000, } ; // strip the <thinking block, then grab the JSON object const text = JSON.parse res.stdout.toString .result ?? "" .replace /<thinking \s\S ?<\/thinking /gi, "" ; const json = JSON.parse text.slice text.indexOf "{" , text.lastIndexOf "}" + 1 ; return expectations.map , i = json.results?. i ?.met === true ; } // Run fn N times, return how many returned true + whether the majority did. function vote trials: number, fn: = boolean { let correct = 0; for let i = 0; i < trials; i++ if fn correct++; return { correct, passed: correct 2 trials, // strict majority flaky: correct 0 && correct < trials, // trials disagreed }; } type EvalCase = { id: string; prompt: string; files?: Record<string, string ; checks?: Checks; expectations?: string ; }; const cases: EvalCase = { id: "recommends-planning-first", prompt: "I want to add team billing. What should I do first?", checks: { required substrings: "plan" , forbidden substrings: "just start coding" , }, expectations: "Recommends clarifying or writing a plan before implementation", "Does not start writing code immediately", , }, ; let pass = 0, fail = 0, spent = 0; for const c of cases { console.log \n▶ ${c.id} ; // each trial runs in its own fresh sandbox const result = vote 3, = { const dir = makeSandbox c.files ?? {} ; const run = runAgent c.prompt, dir ; spent += run.cost; if run.ok return false; const checks = checkAssertions c.checks ?? {}, run.text, dir ; const judged = judge run.text, c.expectations ?? .map met, i = expectation ${i + 1} , met as string, boolean ; const all = ...checks, ...judged ; for const label, ok of all console.log ${ok ? "✓" : "✗"} ${label} ; return all.every , ok = ok ; } ; console.log ${result.passed ? "PASS" : "FAIL"} ${result.correct}/3${result.flaky ? " flaky " : ""} ; result.passed ? pass++ : fail++; } console.log \n${pass} passed, ${fail} failed — $${spent.toFixed 4 } ; process.exit fail 0 ? 1 : 0 ; Prove it works: write it red first Before you trust a case, watch it fail. Point it at a behavior the agent doesn’t have yet and confirm it goes red for the right reason. An eval written after the behavior already works might be asserting on nothing, and you’d never know, because it’s green from birth. Testing your own Claude Code skill So far the system-under-test was a bare agent answering a prompt. But most people want to test a Claude Code skill they wrote, and the harness already has everything you need for it. A skill is a SKILL.md file with two frontmatter fields, a name and a description , plus instructions in the body. Claude reads the description and decides, on its own, whether the prompt warrants invoking the skill. That gives you two distinct things to test: Does it trigger? Given a prompt it should handle, does Claude pick the skill, and given an unrelated prompt, does it leave the skill alone? Does it behave? Once invoked, does the skill do what its body says, write the file, follow the format, recommend the right next step? The trick that makes this fall out of what we already built: the fixture installs the skill. Claude Code discovers project skills from .claude/skills/<name /SKILL.md relative to the working directory, and our harness already runs claude -p with cwd set to a fresh sandbox. So if you seed the skill file into fixture.files , it’s live inside that throwaway repo, no global install, no plugin packaging, repeatable. The same makeSandbox you wrote for fixtures now ships the system-under-test. Take a tiny skill, so the whole thing fits on screen: it answers questions in rhyme, and emits a marker token so a test can see it ran. js // The skill under test, as a one-file fixture. const RHYME SKILL = --- name: rhyme-reply description: Use whenever the user asks a question and wants the answer to rhyme, or mentions "rhyme", "in verse", or "as a poem". --- Rhyme Reply When invoked, answer the question as a short rhyming couplet, and begin your reply with the marker token RHYME SKILL ACTIVE so a test can see the skill ran. ; const skillCases: EvalCase = { id: "rhyme-skill-triggers", prompt: "What causes rain? Answer as a rhyme.", files: { ".claude/skills/rhyme-reply/SKILL.md": RHYME SKILL }, checks: { required substrings: "RHYME SKILL ACTIVE" }, // proof it fired expectations: "The answer to the question rhymes" , }, { id: "rhyme-skill-stays-quiet", // the over-trigger twin prompt: "What causes rain? Just explain it plainly in one sentence.", files: { ".claude/skills/rhyme-reply/SKILL.md": RHYME SKILL }, checks: { forbidden substrings: "RHYME SKILL ACTIVE" }, // must NOT fire }, ; These are plain EvalCase values, so they drop straight into the same cases array and run through the same loop, no new harness code. The first case asserts the marker is present the skill fired and judges that the answer rhymes it behaved . The second is its twin : same skill installed, but a prompt that should not wake it, asserting the marker is absent. Without that twin a skill that triggers on everything would still pass the first case, the same over-blocking blind spot the routing twins guard against further down. Running both against the live CLI, a single trial of each looks like this, the skill fires on the rhyme prompt and stays silent on the plain one: ▶ rhyme-skill-triggers ✓ contains "RHYME SKILL ACTIVE" ✓ expectation 1 PASS ▶ rhyme-skill-stays-quiet ✓ excludes "RHYME SKILL ACTIVE" PASS One honest limitation: with plain --output-format json you only see the final reply, so you’re inferring the skill fired from a fingerprint in its output here, a marker token; for a real skill, the file it writes or the format it follows . That’s fine when the skill leaves a trace. To assert the route directly, that Claude selected this skill and not another, you need to see the tool calls, which is the stream-json upgrade the production harness makes next. Where this goes in production The harness above is the honest core. A production version adds polish, but nothing exotic. I run this same skeleton in AFK https://github.com/alexanderop/afk , an open-source Claude Code plugin whose skills route a coding task through plan → implement → clean up → verify. Its write-evals skill https://github.com/alexanderop/afk/tree/main/skills/write-evals ships a self-contained https://github.com/alexanderop/afk/blob/main/skills/write-evals/run-evals.template.ts , the grown-up version of the file we built above, and the live suite it runs lives under run-evals.template.ts . Three things it adds, all visible in that code: https://github.com/alexanderop/afk/tree/main/tests/e2e/evals tests/e2e/evals/ Cases are data, not code. Instead of a TypeScript array, each suite is a JSON file specs/<suite /evals.json the runner loads. A case is the same shape you already know a prompt, an optional fixture, deterministic assertions, optional judged expectations , only declared, so non-programmers can add coverage and the runner never changes: { "id": "grill-plan-records-reference-repo", "prompt": "Earlier we cloned https://github.com/acme/awesome-streamer into reference/awesome-streamer to copy its SSE pattern. Finish by writing docs/plans/streaming.md for a /chat SSE endpoint that follows that repo.", "fixture": { "files": { "reference/awesome-streamer/README.md": "Source: https://github.com/acme/awesome-streamer\n" } }, "expectations": "Records in the plan that a reference repo was cloned to copy a pattern", "Points implementation at the real cloned source rather than memory" , "assertions": { "required files": "docs/plans/streaming.md" , "required file substrings": { "docs/plans/streaming.md": "reference/awesome-streamer", "github.com/acme/awesome-streamer" } } } Note the two-part requirement record the clone and point at the real source split into two assertions, so a half-right plan can’t pass. A dedicated routing case type. Most agent behavior is “which path did it pick?”, which a substring check grades, no judge needed. AFK marks those kind:"routing" and grades them on expect / forbid lists. One real example: when a plan already exists and there’s no diff yet, the help skill should point you at afk:implement , not back at planning or forward to QA: { "id": "help-after-plan", "prompt": "What now? Assume docs/plans/checkout.md exists and there is no implementation diff.", "expected output": "Recommends afk:implement as the next step.", "kind": "routing", "fixture": { "files": { "docs/plans/checkout.md": " Checkout Plan\n\n Tasks\n1. Implement checkout.\n" } }, "routing": { "expect": "afk:implement" , "forbid": "Next step: Q ", "run afk:qa now" } } A trial is correct only if every expect string is present and no forbid string is. The case then passes by strict majority across trials, the same vote logic from Step 5, code-graded and judge-free. Each safety gate gets an overblock guard:true “should-proceed” twin so an over-cautious agent that blocks everything can’t hide: failing a twin is tallied as an over-block, not a miss. Richer transcripts and saved artifacts. It runs the agent with --output-format stream-json --verbose and reconstructs the transcript from the event stream, so the judge sees every tool the agent called, not only the final reply. That’s what you need when “did it read the repo first?” is the behavior. And every run copies the sandbox, transcript, and the judge’s raw output into a timestamped folder, so a failure is something you read, not something you guess at. What else you can point it at A skill is just one thing you can put under test. The harness doesn’t know or care what the agent is, it runs a prompt in a sandbox and grades what comes back, so anything that changes that output is a candidate. A few that have earned their keep: Your CLAUDE.md and house rules. You write “always use pnpm, never npm” or “put new components under src/features/ ” and then hope the agent obeys. Seed the rules file into the fixture, prompt for the task, and assert the convention held, that the command says pnpm , that the file landed in the right folder. Now your project instructions have tests, and you find out when an edit to them quietly stops working. A prompt you’re tuning. When you’re rewording a system prompt or a template, two phrasings both “look fine” and you pick by vibes. Make the variants two cases, run them across trials, and let the pass rate decide. The vote turns “I think this wording is better” into a number. A model upgrade. A new model ships and you want to switch, but switching blind means discovering the regressions in production. Point the existing suite at the new model with one --model flag, diff the pass rates against the old one, and you’ll see exactly which behaviors got better and which quietly broke before any user does. An MCP server or custom tool. Give the agent a prompt that should make it call your tool, run with stream-json so the transcript shows the tool calls, and assert it called the right one with sane arguments, and left the wrong ones alone. Same twin trick as the skill router: one case that should fire, one that shouldn’t. Refusals and guardrails. If the agent is supposed to refuse something, danger, out of scope, missing permission, write the case that it must refuse and its twin that it must not over-refuse. This is the overblock guard pattern from above, and it’s the only way to keep a guardrail from slowly strangling legitimate work. Subagents, hooks, and slash commands. Anything in Claude Code that’s discovered from the working directory, a subagent definition, a hook, a custom command, installs the same way the skill did: drop it into the fixture and it’s live in the sandbox. The system-under-test is always just a file you seed. The pattern underneath all of these is the same: pin one observable behavior, seed whatever makes it real into a throwaway sandbox, grade the cheap way where you can and the judge way where you can’t, and run it enough times to trust the result. Once you have that loop, the question stops being “can I test this?” and becomes “what’s the behavior I care about?”, which is the one worth asking. The one real downside is cost. Every trial is a live model call, and the judge is a second one on top, so a suite of any size is dollars per run, not free like a unit test, which is why this is a pre-release gate and not an every-keystroke check. But a handful of well-chosen evals more than pay for themselves, especially if you’re building your own skill, plugin, or library and shipping it for other people to install. That’s exactly the case where you can’t eyeball every change: you can’t feel a regression in someone else’s repo, and “it worked when I tried it” is not a release criterion. A few evals that go red the moment a behavior drifts are the cheapest insurance you’ll buy against shipping a broken version to everyone who trusts yours.