{"slug": "build-your-own-eval-harness-from-scratch-with-bun-and-claude-p", "title": "Build Your Own Eval Harness from Scratch with Bun and Claude -p", "summary": "A developer built a custom AI agent evaluation harness in a single Bun file using the Claude CLI, avoiding external frameworks or SaaS platforms. The harness runs the agent in a sandbox, grades outputs with string checks and LLM-as-judge, and loops through test cases with pass/fail reporting. The approach demonstrates that effective eval systems require only a runtime, a grading method, and a loop.", "body_md": "You don’t need a framework, a SaaS dashboard, or a dependency to test an AI agent. You need a way to run it, a way to grade it, and a loop around both. Here we build an eval harness in a single Bun file, start to finish, every line explained.\n\nBy the end you’ll have one `evals.ts`\n\nfile that spins up a sandbox, drives the agent through the `claude`\n\nCLI, and grades the result three ways.\n\n## What we’re building\n\nAn eval is a test for software that isn’t deterministic. A unit test asks “does 2 + 2 return 4?”, but an AI agent gives you a different paragraph every time you ask, so there’s no single value to assert against. An eval instead pins down one observable behavior (“when there’s no plan yet, it recommends planning first”) and checks whether the agent did it, while tolerating the fact that the exact words vary.\n\nPeople reach for hosted platforms for this. You don’t have to. Every eval harness, underneath the dashboard, is the same three moves:\n\n**Run the agent.** Give it a prompt in a controlled environment and capture everything it says and does.**Grade the result.** Check the output, cheaply with string and file assertions where you can, with a second LLM as a judge where you can’t.**Loop and report.** Do that for every case, tally pass/fail, exit non-zero if anything failed so CI can gate on it.\n\nBun gives us a fast TypeScript runtime with `spawnSync`\n\nand the filesystem built in. The `claude`\n\nCLI gives us an agent we can drive from the command line and an LLM we can use as a judge. That’s everything. You’ll end up with a single `evals.ts`\n\nfile, roughly 150 lines, that you run with `bun evals.ts`\n\n, built one piece at a time.\n\nIf you want to understand the thing on the other end of the harness too, I wrote a companion piece on [ building your own coding agent from scratch ](/posts/building-your-own-coding-agent-from-scratch/) Building Your Own Coding Agent from Scratch A practical guide to creating a minimal Claude-powered coding assistant in TypeScript. Start with a basic chat loop and progressively add tools until you have a fully functional coding agent in about 400 lines. .\n\n## Setup: Bun and the claude CLI\n\nTwo prerequisites, both one-liners:\n\n```\n# 1. Bun — the runtime that runs our harness\ncurl -fsSL https://bun.sh/install | bash\n\n# 2. The Claude Code CLI — the agent we're testing, and our judge\nnpm install -g @anthropic-ai/claude-code\n\n# sanity check: this should print a model's reply\nclaude -p \"say hi in three words\" --output-format json\n```\n\nThe key flag we’ll lean on is `--output-format json`\n\n, which makes the CLI print one machine-readable envelope instead of a stream of human text. Make a folder, drop in an empty `evals.ts`\n\n, and let’s fill it.\n\n## Step 1: drive the agent from code\n\nFirst, a function that runs the agent on a prompt and hands back its reply. We shell out to `claude -p`\n\n(the “print” / non-interactive mode) and parse the JSON envelope it prints. That envelope carries the final text in `result`\n\n, the dollar cost in `total_cost_usd`\n\n, and an `is_error`\n\nflag.\n\n``` js\n// evals.ts\nimport { spawnSync } from \"bun\";\n\n// Run the agent on `prompt` inside `cwd`; return its final reply.\nfunction runAgent(prompt: string, cwd: string) {\n  const res = spawnSync({\n    cmd: [\n      \"claude\", \"-p\", prompt,\n      \"--output-format\", \"json\",          // one JSON envelope on stdout\n      \"--permission-mode\", \"bypassPermissions\", // don't prompt us mid-run\n      \"--max-budget-usd\", \"0.50\",           // hard safety cap per run\n    ],\n    cwd,\n    stdout: \"pipe\",\n    stderr: \"pipe\",\n    timeout: 180_000,\n  });\n\n  const envelope = JSON.parse(res.stdout.toString());\n  return {\n    text: envelope.result ?? \"\",\n    ok: res.exitCode === 0 && envelope.is_error !== true,\n    cost: Number(envelope.total_cost_usd ?? 0),\n  };\n}\n```\n\n## Step 2: give it a sandbox to act in\n\nLetting an agent loose in your real repo is a bad idea, and it makes runs non-repeatable. Instead, every case gets a fresh throwaway git repo seeded with the files that behavior needs, a fixture. When the run is done, you can inspect or delete it.\n\n``` js\nimport { mkdtempSync, mkdirSync, writeFileSync } from \"node:fs\";\nimport { tmpdir } from \"node:os\";\nimport { join, dirname } from \"node:path\";\n\n// Make a throwaway git repo seeded with `files`; return its path.\nfunction makeSandbox(files: Record<string, string>) {\n  const dir = mkdtempSync(join(tmpdir(), \"eval-\"));\n  spawnSync({ cmd: [\"git\", \"init\", \"-q\"], cwd: dir });\n  for (const [path, content] of Object.entries(files)) {\n    const target = join(dir, path);\n    mkdirSync(dirname(target), { recursive: true });\n    writeFileSync(target, content);\n  }\n  return dir;\n}\n```\n\n## Step 3: grade with cheap, deterministic checks\n\nNow the grading. Start with the cheapest tool that captures the behavior: plain string and file checks. They’re free, instant, and never flaky. Reach for the LLM judge only for what these can’t express.\n\n``` js\nimport { existsSync } from \"node:fs\";\n\nconst has = (haystack: string, needle: string) =>\n  haystack.toLowerCase().includes(needle.toLowerCase());\n\ntype Checks = {\n  required_substrings?: string[];   // must appear in the reply\n  forbidden_substrings?: string[];  // must NOT appear\n  required_files?: string[];        // must exist in the sandbox after the run\n};\n\n// Returns [label, passed] for each check.\nfunction checkAssertions(checks: Checks, reply: string, dir: string) {\n  const out: [string, boolean][] = [];\n  for (const s of checks.required_substrings ?? [])\n    out.push([`contains \"${s}\"`, has(reply, s)]);\n  for (const s of checks.forbidden_substrings ?? [])\n    out.push([`excludes \"${s}\"`, !has(reply, s)]);\n  for (const f of checks.required_files ?? [])\n    out.push([`created ${f}`, existsSync(join(dir, f))]);\n  return out;\n}\n```\n\n## Step 4: grade fuzzy behavior with an LLM judge\n\nSome behaviors have no keyword. “Did it read the repo before asking its first question?” “Did it explain the trade-off?” For those, you hand the reply to a second, cheaper model and ask it to grade each expectation. It’s the powerful but pricey rung, so use it sparingly.\n\nTwo details earn their keep in the prompt. We ask the judge to reason first, then answer in strict JSON, and we put the `reason`\n\nfield before `met`\n\n, so it justifies, then decides, instead of blurting a verdict. We strip the reasoning before parsing.\n\n```\n// Ask a cheap model whether `reply` meets each expectation.\nfunction judge(reply: string, expectations: string[]): boolean[] {\n  if (expectations.length === 0) return [];\n  const numbered = expectations.map((e, i) => `${i + 1}. ${e}`).join(\"\\n\");\n\n  const prompt = `You are grading an AI agent's reply against a list of expectations.\n\nFirst reason inside a single <thinking>…</thinking> block. Then, after the\nclosing tag, output STRICT JSON only:\n{\"results\":[{\"reason\":\"...\",\"met\":true}]}\none entry per expectation, in order.\n\n=== REPLY ===\n${reply}\n=== END ===\n\nExpectations:\n${numbered}`;\n\n  const res = spawnSync({\n    cmd: [\n      \"claude\", \"-p\", prompt,\n      \"--model\", \"claude-haiku-4-5\",   // small + cheap is plenty for grading\n      \"--output-format\", \"json\",\n      \"--permission-mode\", \"bypassPermissions\",\n      \"--max-budget-usd\", \"0.10\",\n    ],\n    stdout: \"pipe\", stderr: \"pipe\", timeout: 180_000,\n  });\n\n  // strip the <thinking> block, then grab the JSON object\n  const text = (JSON.parse(res.stdout.toString()).result ?? \"\")\n    .replace(/<thinking>[\\s\\S]*?<\\/thinking>/gi, \"\");\n  const json = JSON.parse(text.slice(text.indexOf(\"{\"), text.lastIndexOf(\"}\") + 1));\n  return expectations.map((_, i) => json.results?.[i]?.met === true);\n}\n```\n\nRead the judge's homework\n\nA judged eval is only as trustworthy as the judge. The first few times, log the raw judge output and read its reasoning. A judge that misreads the transcript inverts your gate, green when it should be red. Reading two samples is cheap insurance.\n\n## Step 5: run it more than once and vote\n\nRun an agent once and a pass might be luck, a fail might be a bad roll. The fix is to run each case a few times and decide by majority. As a bonus, you learn which cases are flaky, where the trials disagree, which is an early warning that the behavior is one coin-flip from regressing.\n\n```\n// Run `fn` N times, return how many returned true + whether the majority did.\nfunction vote(trials: number, fn: () => boolean) {\n  let correct = 0;\n  for (let i = 0; i < trials; i++) if (fn()) correct++;\n  return {\n    correct,\n    passed: correct * 2 > trials,              // strict majority\n    flaky: correct > 0 && correct < trials,     // trials disagreed\n  };\n}\n```\n\nThis is the one place evals cost real money, three runs is three times the spend, so it’s a pre-release pass, not an every-keystroke check. Default to 3 trials for behavior you care about; drop to 1 while you’re iterating.\n\n## Putting it together\n\nNow the spine. Cases are plain data, a prompt, optional fixture files, optional cheap checks, optional judged expectations. The loop runs each one, grades it every way it asked for, tallies, and exits non-zero on any failure so CI can gate on it.\n\n```\ntype EvalCase = {\n  id: string;\n  prompt: string;\n  files?: Record<string, string>;\n  checks?: Checks;\n  expectations?: string[];\n};\n\nconst cases: EvalCase[] = [\n  {\n    id: \"recommends-planning-first\",\n    prompt: \"I want to add team billing. What should I do first?\",\n    checks: {\n      required_substrings: [\"plan\"],\n      forbidden_substrings: [\"just start coding\"],\n    },\n    expectations: [\n      \"Recommends clarifying or writing a plan before implementation\",\n      \"Does not start writing code immediately\",\n    ],\n  },\n];\n\nlet pass = 0, fail = 0, spent = 0;\n\nfor (const c of cases) {\n  console.log(`\\n▶ ${c.id}`);\n\n  // each trial runs in its own fresh sandbox\n  const result = vote(3, () => {\n    const dir = makeSandbox(c.files ?? {});\n    const run = runAgent(c.prompt, dir);\n    spent += run.cost;\n    if (!run.ok) return false;\n\n    const checks = checkAssertions(c.checks ?? {}, run.text, dir);\n    const judged = judge(run.text, c.expectations ?? [])\n      .map((met, i) => [`expectation ${i + 1}`, met] as [string, boolean]);\n\n    const all = [...checks, ...judged];\n    for (const [label, ok] of all) console.log(`   ${ok ? \"✓\" : \"✗\"} ${label}`);\n    return all.every(([, ok]) => ok);\n  });\n\n  console.log(`  ${result.passed ? \"PASS\" : \"FAIL\"} ${result.correct}/3${result.flaky ? \"  (flaky)\" : \"\"}`);\n  result.passed ? pass++ : fail++;\n}\n\nconsole.log(`\\n${pass} passed, ${fail} failed — $${spent.toFixed(4)}`);\nprocess.exit(fail > 0 ? 1 : 0);\n```\n\nWire it into `package.json`\n\nso it’s one command:\n\n```\n\"scripts\": {\n  \"test:evals\": \"bun evals.ts\"\n}\n```\n\nThen:\n\n```\nbun run test:evals\n\n# ▶ recommends-planning-first\n#    ✓ contains \"plan\"\n#    ✓ excludes \"just start coding\"\n#    ✓ expectation 1\n#    ✓ expectation 2\n#    ✓ contains \"plan\"\n#    ✓ excludes \"just start coding\"\n#    ✓ expectation 1\n#    ✓ expectation 2\n#    ✗ contains \"plan\"\n#    ✓ excludes \"just start coding\"\n#    ✓ expectation 1\n#    ✓ expectation 2\n#   PASS 2/3  (flaky)\n#\n# 1 passed, 0 failed — $1.0247\n```\n\nThat’s the real output from running this exact file against the live CLI, not a cleaned-up screenshot. Notice it came back `2/3 (flaky)`\n\n: on the third trial the agent gave a good answer that happened not to use the literal word “plan”, so the `required_substrings: [\"plan\"]`\n\ncheck failed while the judge’s semantic expectations still passed all three times. The vote saved the pass, and the `(flaky)`\n\nflag surfaced the brittleness at the heart of this: a single run would have been a coin-flip, and the strict substring check is narrower than the behavior we care about. That one run cost $1.02, three trials with a judge each.\n\nThat’s the whole harness. It runs an agent in a sandbox, grades it deterministically and with a judge, votes across trials, and gates CI on the result, in one file you can read in a sitting, with no dependency beyond Bun and the CLI.\n\nBuilding the imports up step by step (the `node:fs`\n\nhelpers appear as each one is needed) reads well as a tutorial but leaves the pieces scattered. Below is the whole thing assembled into one copy-paste-ready file, the version that produced the output above:\n\n## Full evals.ts — the complete, runnable file\n\n``` js\n// evals.ts\nimport { spawnSync } from \"bun\";\nimport { mkdtempSync, mkdirSync, writeFileSync, existsSync } from \"node:fs\";\nimport { tmpdir } from \"node:os\";\nimport { join, dirname } from \"node:path\";\n\n// Run the agent on `prompt` inside `cwd`; return its final reply.\nfunction runAgent(prompt: string, cwd: string) {\n  const res = spawnSync({\n    cmd: [\n      \"claude\", \"-p\", prompt,\n      \"--output-format\", \"json\",          // one JSON envelope on stdout\n      \"--permission-mode\", \"bypassPermissions\", // don't prompt us mid-run\n      \"--max-budget-usd\", \"0.50\",           // hard safety cap per run\n    ],\n    cwd,\n    stdout: \"pipe\",\n    stderr: \"pipe\",\n    timeout: 180_000,\n  });\n\n  const envelope = JSON.parse(res.stdout.toString());\n  return {\n    text: envelope.result ?? \"\",\n    ok: res.exitCode === 0 && envelope.is_error !== true,\n    cost: Number(envelope.total_cost_usd ?? 0),\n  };\n}\n\n// Make a throwaway git repo seeded with `files`; return its path.\nfunction makeSandbox(files: Record<string, string>) {\n  const dir = mkdtempSync(join(tmpdir(), \"eval-\"));\n  spawnSync({ cmd: [\"git\", \"init\", \"-q\"], cwd: dir });\n  for (const [path, content] of Object.entries(files)) {\n    const target = join(dir, path);\n    mkdirSync(dirname(target), { recursive: true });\n    writeFileSync(target, content);\n  }\n  return dir;\n}\n\nconst has = (haystack: string, needle: string) =>\n  haystack.toLowerCase().includes(needle.toLowerCase());\n\ntype Checks = {\n  required_substrings?: string[];   // must appear in the reply\n  forbidden_substrings?: string[];  // must NOT appear\n  required_files?: string[];        // must exist in the sandbox after the run\n};\n\n// Returns [label, passed] for each check.\nfunction checkAssertions(checks: Checks, reply: string, dir: string) {\n  const out: [string, boolean][] = [];\n  for (const s of checks.required_substrings ?? [])\n    out.push([`contains \"${s}\"`, has(reply, s)]);\n  for (const s of checks.forbidden_substrings ?? [])\n    out.push([`excludes \"${s}\"`, !has(reply, s)]);\n  for (const f of checks.required_files ?? [])\n    out.push([`created ${f}`, existsSync(join(dir, f))]);\n  return out;\n}\n\n// Ask a cheap model whether `reply` meets each expectation.\nfunction judge(reply: string, expectations: string[]): boolean[] {\n  if (expectations.length === 0) return [];\n  const numbered = expectations.map((e, i) => `${i + 1}. ${e}`).join(\"\\n\");\n\n  const prompt = `You are grading an AI agent's reply against a list of expectations.\n\nFirst reason inside a single <thinking>…</thinking> block. Then, after the\nclosing tag, output STRICT JSON only:\n{\"results\":[{\"reason\":\"...\",\"met\":true}]}\none entry per expectation, in order.\n\n=== REPLY ===\n${reply}\n=== END ===\n\nExpectations:\n${numbered}`;\n\n  const res = spawnSync({\n    cmd: [\n      \"claude\", \"-p\", prompt,\n      \"--model\", \"claude-haiku-4-5\",   // small + cheap is plenty for grading\n      \"--output-format\", \"json\",\n      \"--permission-mode\", \"bypassPermissions\",\n      \"--max-budget-usd\", \"0.10\",\n    ],\n    stdout: \"pipe\", stderr: \"pipe\", timeout: 180_000,\n  });\n\n  // strip the <thinking> block, then grab the JSON object\n  const text = (JSON.parse(res.stdout.toString()).result ?? \"\")\n    .replace(/<thinking>[\\s\\S]*?<\\/thinking>/gi, \"\");\n  const json = JSON.parse(text.slice(text.indexOf(\"{\"), text.lastIndexOf(\"}\") + 1));\n  return expectations.map((_, i) => json.results?.[i]?.met === true);\n}\n\n// Run `fn` N times, return how many returned true + whether the majority did.\nfunction vote(trials: number, fn: () => boolean) {\n  let correct = 0;\n  for (let i = 0; i < trials; i++) if (fn()) correct++;\n  return {\n    correct,\n    passed: correct * 2 > trials,              // strict majority\n    flaky: correct > 0 && correct < trials,     // trials disagreed\n  };\n}\n\ntype EvalCase = {\n  id: string;\n  prompt: string;\n  files?: Record<string, string>;\n  checks?: Checks;\n  expectations?: string[];\n};\n\nconst cases: EvalCase[] = [\n  {\n    id: \"recommends-planning-first\",\n    prompt: \"I want to add team billing. What should I do first?\",\n    checks: {\n      required_substrings: [\"plan\"],\n      forbidden_substrings: [\"just start coding\"],\n    },\n    expectations: [\n      \"Recommends clarifying or writing a plan before implementation\",\n      \"Does not start writing code immediately\",\n    ],\n  },\n];\n\nlet pass = 0, fail = 0, spent = 0;\n\nfor (const c of cases) {\n  console.log(`\\n▶ ${c.id}`);\n\n  // each trial runs in its own fresh sandbox\n  const result = vote(3, () => {\n    const dir = makeSandbox(c.files ?? {});\n    const run = runAgent(c.prompt, dir);\n    spent += run.cost;\n    if (!run.ok) return false;\n\n    const checks = checkAssertions(c.checks ?? {}, run.text, dir);\n    const judged = judge(run.text, c.expectations ?? [])\n      .map((met, i) => [`expectation ${i + 1}`, met] as [string, boolean]);\n\n    const all = [...checks, ...judged];\n    for (const [label, ok] of all) console.log(`   ${ok ? \"✓\" : \"✗\"} ${label}`);\n    return all.every(([, ok]) => ok);\n  });\n\n  console.log(`  ${result.passed ? \"PASS\" : \"FAIL\"} ${result.correct}/3${result.flaky ? \"  (flaky)\" : \"\"}`);\n  result.passed ? pass++ : fail++;\n}\n\nconsole.log(`\\n${pass} passed, ${fail} failed — $${spent.toFixed(4)}`);\nprocess.exit(fail > 0 ? 1 : 0);\n```\n\nProve it works: write it red first\n\nBefore you trust a case, watch it fail. Point it at a behavior the agent doesn’t have yet and confirm it goes red for the right reason. An eval written after the behavior already works might be asserting on nothing, and you’d never know, because it’s green from birth.\n\n## Testing your own Claude Code skill\n\nSo far the system-under-test was a bare agent answering a prompt. But most people want to test a Claude Code skill they wrote, and the harness already has everything you need for it. A skill is a `SKILL.md`\n\nfile with two frontmatter fields, a `name`\n\nand a `description`\n\n, plus instructions in the body. Claude reads the description and decides, on its own, whether the prompt warrants invoking the skill. That gives you two distinct things to test:\n\n**Does it trigger?** Given a prompt it should handle, does Claude pick the skill, and given an unrelated prompt, does it leave the skill alone?**Does it behave?** Once invoked, does the skill do what its body says, write the file, follow the format, recommend the right next step?\n\nThe trick that makes this fall out of what we already built: **the fixture installs the skill.** Claude Code discovers project skills from `.claude/skills/<name>/SKILL.md`\n\nrelative to the working directory, and our harness already runs `claude -p`\n\nwith `cwd`\n\nset to a fresh sandbox. So if you seed the skill file into `fixture.files`\n\n, it’s live inside that throwaway repo, no global install, no plugin packaging, repeatable. The same `makeSandbox`\n\nyou wrote for fixtures now ships the system-under-test.\n\nTake a tiny skill, so the whole thing fits on screen: it answers questions in rhyme, and emits a marker token so a test can see it ran.\n\n``` js\n// The skill under test, as a one-file fixture.\nconst RHYME_SKILL = `---\nname: rhyme-reply\ndescription: Use whenever the user asks a question and wants the answer to\n  rhyme, or mentions \"rhyme\", \"in verse\", or \"as a poem\".\n---\n# Rhyme Reply\nWhen invoked, answer the question as a short rhyming couplet, and begin your\nreply with the marker token RHYME_SKILL_ACTIVE so a test can see the skill ran.\n`;\n\nconst skillCases: EvalCase[] = [\n  {\n    id: \"rhyme-skill-triggers\",\n    prompt: \"What causes rain? Answer as a rhyme.\",\n    files: { \".claude/skills/rhyme-reply/SKILL.md\": RHYME_SKILL },\n    checks: { required_substrings: [\"RHYME_SKILL_ACTIVE\"] }, // proof it fired\n    expectations: [\"The answer to the question rhymes\"],\n  },\n  {\n    id: \"rhyme-skill-stays-quiet\", // the over-trigger twin\n    prompt: \"What causes rain? Just explain it plainly in one sentence.\",\n    files: { \".claude/skills/rhyme-reply/SKILL.md\": RHYME_SKILL },\n    checks: { forbidden_substrings: [\"RHYME_SKILL_ACTIVE\"] }, // must NOT fire\n  },\n];\n```\n\nThese are plain `EvalCase`\n\nvalues, so they drop straight into the same `cases`\n\narray and run through the same loop, no new harness code. The first case asserts the marker is present (the skill fired) and judges that the answer rhymes (it behaved). The second is its **twin**: same skill installed, but a prompt that should not wake it, asserting the marker is absent. Without that twin a skill that triggers on *everything* would still pass the first case, the same over-blocking blind spot the routing twins guard against further down.\n\nRunning both against the live CLI, a single trial of each looks like this, the skill fires on the rhyme prompt and stays silent on the plain one:\n\n```\n▶ rhyme-skill-triggers\n   ✓ contains \"RHYME_SKILL_ACTIVE\"\n   ✓ expectation 1\n  PASS\n\n▶ rhyme-skill-stays-quiet\n   ✓ excludes \"RHYME_SKILL_ACTIVE\"\n  PASS\n```\n\nOne honest limitation: with plain `--output-format json`\n\nyou only see the final reply, so you’re inferring the skill fired from a fingerprint in its output (here, a marker token; for a real skill, the file it writes or the format it follows). That’s fine when the skill leaves a trace. To assert the *route* directly, that Claude selected this skill and not another, you need to see the tool calls, which is the `stream-json`\n\nupgrade the production harness makes next.\n\n## Where this goes in production\n\nThe harness above is the honest core. A production version adds polish, but nothing exotic. I run this same skeleton in [AFK](https://github.com/alexanderop/afk), an open-source Claude Code plugin whose skills route a coding task through plan → implement → clean up → verify. Its [ write-evals skill](https://github.com/alexanderop/afk/tree/main/skills/write-evals) ships a\n\n[self-contained](https://github.com/alexanderop/afk/blob/main/skills/write-evals/run-evals.template.ts), the grown-up version of the file we built above, and the live suite it runs lives under\n\n`run-evals.template.ts`\n\n[. Three things it adds, all visible in that code:](https://github.com/alexanderop/afk/tree/main/tests/e2e/evals)\n\n`tests/e2e/evals/`\n\n**Cases are data, not code.** Instead of a TypeScript array, each suite is a JSON file (`specs/<suite>/evals.json`\n\n) the runner loads. A case is the same shape you already know (a prompt, an optional fixture, deterministic assertions, optional judged expectations), only declared, so non-programmers can add coverage and the runner never changes:\n\n```\n{\n  \"id\": \"grill-plan-records-reference-repo\",\n  \"prompt\": \"Earlier we cloned https://github.com/acme/awesome-streamer into reference/awesome-streamer to copy its SSE pattern. Finish by writing docs/plans/streaming.md for a /chat SSE endpoint that follows that repo.\",\n  \"fixture\": {\n    \"files\": {\n      \"reference/awesome-streamer/README.md\": \"Source: https://github.com/acme/awesome-streamer\\n\"\n    }\n  },\n  \"expectations\": [\n    \"Records in the plan that a reference repo was cloned to copy a pattern\",\n    \"Points implementation at the real cloned source rather than memory\"\n  ],\n  \"assertions\": {\n    \"required_files\": [\"docs/plans/streaming.md\"],\n    \"required_file_substrings\": {\n      \"docs/plans/streaming.md\": [\"reference/awesome-streamer\", \"github.com/acme/awesome-streamer\"]\n    }\n  }\n}\n```\n\nNote the two-part requirement (record the clone *and* point at the real source) split into two assertions, so a half-right plan can’t pass.\n\n**A dedicated routing case type.** Most agent behavior is “which path did it pick?”, which a substring check grades, no judge needed. AFK marks those `kind:\"routing\"`\n\nand grades them on `expect`\n\n/ `forbid`\n\nlists. One real example: when a plan already exists and there’s no diff yet, the help skill should point you at `afk:implement`\n\n, not back at planning or forward to QA:\n\n```\n{\n  \"id\": \"help-after-plan\",\n  \"prompt\": \"What now? Assume docs/plans/checkout.md exists and there is no implementation diff.\",\n  \"expected_output\": \"Recommends afk:implement as the next step.\",\n  \"kind\": \"routing\",\n  \"fixture\": {\n    \"files\": {\n      \"docs/plans/checkout.md\": \"# Checkout Plan\\n\\n## Tasks\\n1. Implement checkout.\\n\"\n    }\n  },\n  \"routing\": {\n    \"expect\": [\"afk:implement\"],\n    \"forbid\": [\"Next step: [Q]\", \"run afk:qa now\"]\n  }\n}\n```\n\nA trial is correct only if every `expect`\n\nstring is present and no `forbid`\n\nstring is. The case then passes by strict majority across trials, the same `vote()`\n\nlogic from Step 5, code-graded and judge-free. Each safety gate gets an `overblock_guard:true`\n\n“should-proceed” twin so an over-cautious agent that blocks everything can’t hide: failing a twin is tallied as an over-block, not a miss.\n\n**Richer transcripts and saved artifacts.** It runs the agent with `--output-format stream-json --verbose`\n\nand reconstructs the transcript from the event stream, so the judge sees every tool the agent called, not only the final reply. That’s what you need when “did it read the repo first?” is the behavior. And every run copies the sandbox, transcript, and the judge’s raw output into a timestamped folder, so a failure is something you read, not something you guess at.\n\n## What else you can point it at\n\nA skill is just one thing you can put under test. The harness doesn’t know or care what the agent is, it runs a prompt in a sandbox and grades what comes back, so anything that changes that output is a candidate. A few that have earned their keep:\n\n**Your CLAUDE.md and house rules.** You write “always use pnpm, never npm” or “put new components under\n\n`src/features/`\n\n” and then hope the agent obeys. Seed the rules file into the fixture, prompt for the task, and assert the convention held, that the command says `pnpm`\n\n, that the file landed in the right folder. Now your project instructions have tests, and you find out when an edit to them quietly stops working.**A prompt you’re tuning.** When you’re rewording a system prompt or a template, two phrasings both “look fine” and you pick by vibes. Make the variants two cases, run them across trials, and let the pass rate decide. The vote turns “I think this wording is better” into a number.\n\n**A model upgrade.** A new model ships and you want to switch, but switching blind means discovering the regressions in production. Point the existing suite at the new model with one `--model`\n\nflag, diff the pass rates against the old one, and you’ll see exactly which behaviors got better and which quietly broke before any user does.\n\n**An MCP server or custom tool.** Give the agent a prompt that should make it call your tool, run with `stream-json`\n\nso the transcript shows the tool calls, and assert it called the right one with sane arguments, and left the wrong ones alone. Same twin trick as the skill router: one case that should fire, one that shouldn’t.\n\n**Refusals and guardrails.** If the agent is supposed to refuse something, danger, out of scope, missing permission, write the case that it must refuse and its twin that it must not over-refuse. This is the `overblock_guard`\n\npattern from above, and it’s the only way to keep a guardrail from slowly strangling legitimate work.\n\n**Subagents, hooks, and slash commands.** Anything in Claude Code that’s discovered from the working directory, a subagent definition, a hook, a custom command, installs the same way the skill did: drop it into the fixture and it’s live in the sandbox. The system-under-test is always just a file you seed.\n\nThe pattern underneath all of these is the same: pin one observable behavior, seed whatever makes it real into a throwaway sandbox, grade the cheap way where you can and the judge way where you can’t, and run it enough times to trust the result. Once you have that loop, the question stops being “can I test this?” and becomes “what’s the behavior I care about?”, which is the one worth asking.\n\nThe one real downside is cost. Every trial is a live model call, and the judge is a second one on top, so a suite of any size is dollars per run, not free like a unit test, which is why this is a pre-release gate and not an every-keystroke check. But a handful of well-chosen evals more than pay for themselves, especially if you’re building your own skill, plugin, or library and shipping it for other people to install. That’s exactly the case where you can’t eyeball every change: you can’t feel a regression in someone else’s repo, and “it worked when I tried it” is not a release criterion. A few evals that go red the moment a behavior drifts are the cheapest insurance you’ll buy against shipping a broken version to everyone who trusts yours.", "url": "https://wpnews.pro/news/build-your-own-eval-harness-from-scratch-with-bun-and-claude-p", "canonical_source": "https://alexop.dev/posts/build-your-own-eval-harness-bun-claude-p/", "published_at": "2026-06-16 19:15:37+00:00", "updated_at": "2026-06-16 19:19:01.945187+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "artificial-intelligence", "large-language-models"], "entities": ["Bun", "Claude", "Anthropic"], "alternates": {"html": "https://wpnews.pro/news/build-your-own-eval-harness-from-scratch-with-bun-and-claude-p", "markdown": "https://wpnews.pro/news/build-your-own-eval-harness-from-scratch-with-bun-and-claude-p.md", "text": "https://wpnews.pro/news/build-your-own-eval-harness-from-scratch-with-bun-and-claude-p.txt", "jsonld": "https://wpnews.pro/news/build-your-own-eval-harness-from-scratch-with-bun-and-claude-p.jsonld"}}