# Build Your Own Eval Harness from Scratch with Bun and Claude -p

A developer built a custom AI agent evaluation harness in a single Bun file using the Claude CLI, avoiding external frameworks or SaaS platforms. The harness runs the agent in a sandbox, grades outputs with string checks and LLM-as-judge, and loops through test cases with pass/fail reporting. The approach demonstrates that effective eval systems require only a runtime, a grading method, and a loop.

You don’t need a framework, a SaaS dashboard, or a dependency to test an AI agent. You need a way to run it, a way to grade it, and a loop around both. Here we build an eval harness in a single Bun file, start to finish, every line explained. By the end you’ll have one `evals.ts` file that spins up a sandbox, drives the agent through the `claude` CLI, and grades the result three ways.

## What we’re building

An eval is a test for software that isn’t deterministic. A unit test asks “does 2 + 2 return 4?”, but an AI agent gives you a different paragraph every time you ask, so there’s no single value to assert against. An eval instead pins down one observable behavior (“when there’s no plan yet, it recommends planning first”) and checks whether the agent did it, while tolerating the fact that the exact words vary.

People reach for hosted platforms for this. You don’t have to. Every eval harness, underneath the dashboard, is the same three moves:

1. Run the agent. Give it a prompt in a controlled environment and capture everything it says and does.
2. Grade the result. Check the output cheaply, with string and file assertions, where you can, and with a second LLM as a judge where you can’t.
3. Loop and report. Do that for every case, tally pass/fail, and exit non-zero if anything failed so CI can gate on it.

Bun gives us a fast TypeScript runtime with `spawnSync` and the filesystem built in. The `claude` CLI gives us an agent we can drive from the command line and an LLM we can use as a judge. That’s everything.
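The three moves above can be sketched in a few lines before any real agent is involved. This is only an illustration, not the final `evals.ts`: the names `EvalCase`, `grade`, `runCases`, and `fakeAgent` are hypothetical, and a stub function stands in for the `claude` CLI so the loop can run on its own.

```typescript
// A minimal sketch of run -> grade -> loop. All names here are hypothetical.

type AgentFn = (prompt: string) => string;

interface EvalCase {
  name: string;
  prompt: string;
  // Cheap grader input: every substring must appear in the reply (case-insensitive).
  mustInclude: string[];
}

interface Summary {
  passed: number;
  failed: number;
  failures: string[];
}

// Grade: a string assertion that tolerates variation in the exact wording.
function grade(reply: string, c: EvalCase): boolean {
  const haystack = reply.toLowerCase();
  return c.mustInclude.every((needle) => haystack.includes(needle.toLowerCase()));
}

// Loop and report: run every case, grade it, tally pass/fail.
function runCases(cases: EvalCase[], agent: AgentFn): Summary {
  const summary: Summary = { passed: 0, failed: 0, failures: [] };
  for (const c of cases) {
    const reply = agent(c.prompt); // 1. run
    if (grade(reply, c)) {         // 2. grade
      summary.passed++;
    } else {
      summary.failed++;
      summary.failures.push(c.name);
    }
  }
  return summary;
}

// Stub agent standing in for the real CLI call (assumption for this sketch).
const fakeAgent: AgentFn = (prompt) =>
  prompt.includes("no plan") ? "I recommend planning first." : "Done.";

const summary = runCases(
  [
    { name: "recommends-planning", prompt: "There is no plan yet. What next?", mustInclude: ["planning first"] },
    { name: "mentions-tests", prompt: "Ship it.", mustInclude: ["tests"] },
  ],
  fakeAgent,
);

// 3. report: a CI gate would exit with this code.
const exitCode = summary.failed > 0 ? 1 : 0;
console.log(`${summary.passed} passed, ${summary.failed} failed, exit code ${exitCode}`);
```

With the stub, the first case passes and the second fails, so the computed exit code is 1. In a real harness the stub would be swapped for a function that shells out to the agent, and the exit code would be handed to the process so CI can gate on it.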
You’ll end up with a single `evals.ts` file, roughly 150 lines, that you run with `bun evals.ts`, built one piece at a time. If you want to understand the thing on the other end of the harness too, I wrote a companion piece, Building Your Own Coding Agent from Scratch (/posts/building-your-own-coding-agent-from-scratch/): a practical guide to creating a minimal Claude-powered coding assistant in TypeScript, starting with a basic chat loop and progressively adding tools until you have a fully functional coding agent in about 400 lines.

## Setup: Bun and the claude CLI

Two prerequisites, both one-liners:

```bash
# 1. Bun: the runtime that runs our harness
curl -fsSL https://bun.sh/install | bash

# 2. The Claude Code CLI: the agent we're testing, and our judge
npm install -g @anthropic-ai/claude-code

# sanity check: this should print a model's reply
claude -p "say hi in three words" --output-format json
```

The key flag we’ll lean on is `--output-format json`, which makes the CLI print one machine-readable envelope instead of a stream of human text. Make a folder, drop in an empty `evals.ts`, and let’s fill it.

## Step 1: drive the agent from code

First, a function that runs the agent on a prompt and hands back its reply. We shell out to `claude -p` (the “print” / non-interactive mode) and parse the JSON envelope it prints. That envelope carries the final text in `result`, the dollar cost in `total_cost_usd`, and an `is_error` flag.

```ts
// evals.ts
import { spawnSync } from "bun";

// Run the agent on `prompt` inside `cwd`; return its final reply.
function runAgent(prompt: string, cwd: string) {
  const res = spawnSync({
    cmd: [
      "claude", "-p", prompt,
      "--output-format", "json",                // one JSON envelope on stdout
      "--permission-mode", "bypassPermissions", // don't prompt us mid-run
      "--max-budget-usd", "0.50",               // hard safety cap per run
    ],
    cwd,
    stdout: "pipe",
    stderr: "pipe",
    timeout: 180_000, // milliseconds
  });
  const envelope = JSON.parse(res.stdout.toString());
  return {
    text: envelope.result ?? "",
    // A run is ok only if the process exited cleanly AND the envelope reports no error.
    ok: res.exitCode === 0 && envelope.is_error !== true,
    cost: Number(envelope.total_cost_usd ?? 0),
  };
}
```

## Step 2: give it a sandbox to act in

Letting an agent loose in your real repo is a bad idea, and it makes runs non-repeatable. Instead, every case gets a fresh throwaway git repo seeded with the files that behavior needs (a fixture). When the run is done, you can inspect or delete it.

```ts
import { mkdtempSync, mkdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, dirname } from "node:path";

// Make a throwaway git repo seeded with `files`; return its path.
function makeSandbox(files: Record