# Honest Perf Benchmarks for a Paid-API Compiler

This post describes a benchmark suite for `intentional-cognition-os`, a TypeScript compiler that uses paid Claude APIs. The benchmarks address key design challenges, including a deterministic, seeded random corpus generator that keeps results reproducible across runs, and a two-env-var gate (`ANTHROPIC_API_KEY` and `ICO_BENCH_INCLUDE_CLAUDE`) that prevents accidental API token consumption. The suite also records skipped scenarios with a `skipReason` rather than omitting them, so regressions, improvements, and tests that did not run stay clearly distinguishable.

`intentional-cognition-os` is a TypeScript "compiler": markdown sources go in one end, a structured artifact comes out the other, and several of the middle stages call paid Claude APIs to do the cognitive work. Until today there were zero performance gates on any of it. No baseline, no regression alarm, no "did that refactor make ingest 4× slower" check.

The benchmark suite that landed across four PRs answers two design questions that had to be settled before a single line of timing code got written:

- How do you compare numbers across machines when half the corpus is randomly generated text?
- What do you do about the steps that cost real money on every run?

Get either answer wrong and the benchmark suite is worse than no benchmark suite: it produces numbers that look authoritative and aren't.

## The corpus has to be byte-identical

The first scenario, ingest, needs a corpus. Hand-curated fixtures committed to disk were considered and rejected: they don't scale, they go stale, and they encode whoever-wrote-them's idea of "representative." A generator is the right answer, but a generator has to be deterministic or before/after diffs are noise.
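One way to make the determinism requirement checkable is to fingerprint the generated corpus and compare fingerprints between runs. The following is a minimal sketch under stated assumptions, not the suite's code: `generateCorpus`, `fingerprint`, and `isDeterministic` are hypothetical names, the word list is made up, and a small linear congruential generator stands in for whatever PRNG the real generator uses.

```typescript
// Hypothetical stand-in PRNG: a small linear congruential generator
// (Numerical Recipes constants). Not the suite's generator.
function lcg(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296; // float in [0, 1)
  };
}

// Made-up vocabulary, for illustration only.
const WORDS: string[] = ["ingest", "compile", "render", "ask", "wiki", "source"];

// Hypothetical generator: output depends only on its three arguments.
function generateCorpus(seed: number, count: number, bodyWords: number): string[] {
  const rand = lcg(seed);
  const docs: string[] = [];
  for (let d = 0; d < count; d++) {
    const body: string[] = [];
    for (let w = 0; w < bodyWords; w++) {
      body.push(WORDS[Math.floor(rand() * WORDS.length)]);
    }
    docs.push(`---\nid: doc-${d}\n---\n${body.join(" ")}\n`);
  }
  return docs;
}

// FNV-1a-style 32-bit hash over UTF-16 code units; enough to detect drift.
function fingerprint(docs: string[]): number {
  let h = 0x811c9dc5;
  for (const doc of docs) {
    for (let i = 0; i < doc.length; i++) {
      h ^= doc.charCodeAt(i);
      h = Math.imul(h, 0x01000193) >>> 0;
    }
  }
  return h >>> 0;
}

// Generates the corpus twice with identical arguments and compares fingerprints.
function isDeterministic(seed: number, count: number, bodyWords: number): boolean {
  return (
    fingerprint(generateCorpus(seed, count, bodyWords)) ===
    fingerprint(generateCorpus(seed, count, bodyWords))
  );
}
```

A check like `isDeterministic(42, 5, 20)` only proves stability within one process. Comparing a stored fingerprint across machines is the stronger test, since that is where a hidden source of entropy would show up.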
The generator uses a seeded mulberry32 PRNG and pulls UUIDs from the same stream:

```ts
function mulberry32(seed: number) {
  // Returns a function producing floats in [0, 1); same seed, same sequence.
  return function () {
    let t = (seed += 0x6d2b79f5);
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function seededUuidV4(rand: () => number): string {
  // 16 bytes from the seeded stream, version + variant nibbles set per RFC 4122
  const bytes = new Uint8Array(16);
  for (let i = 0; i < 16; i++) bytes[i] = Math.floor(rand() * 256);
  bytes[6] = (bytes[6] & 0x0f) | 0x40;
  bytes[8] = (bytes[8] & 0x3f) | 0x80;
  return formatUuid(bytes); // formatUuid is a helper not shown in this post
}
```

The non-obvious trap is `crypto.randomUUID()`. It would have looked correct, passed every unit test, and silently produced different UUIDs on every run, so every "identical" corpus would have differed in the front-matter `id` field. That breaks ingest's content-hash cache in different ways on different machines.

Same seed, same count, same body-word count yield byte-identical output everywhere. That's the contract.

One more gotcha worth a sentence: the corpus generator writes front matter through `gray-matter`, which quotes string values. The compiler's wiki-page validator uses a hand-rolled YAML parser that does NOT strip quotes, so wiki fixtures emit all values unquoted. A quoted `compiled_at` would arrive at Zod's datetime check with literal `"` characters in it and fail. Two parsers, two rules, documented inline at the parser boundary.

## An API key is not consent

The render, compile, and ask scenarios call Claude. Running them on every CI pass would either drain a budget or quietly stop running when the budget hit zero. Neither is acceptable.

The gate is two env vars, both required:

```bash
ANTHROPIC_API_KEY=sk-ant-... \
ICO_BENCH_INCLUDE_CLAUDE=1 \
pnpm --filter @ico/benchmarks bench
```

From PR 70's design notes, kept verbatim because the framing matters:

> The double gate is intentional. An API key alone is not consent — many developers have it set for normal CLI use. `ICO_BENCH_INCLUDE_CLAUDE` is the explicit "yes, burn tokens on this benchmark run" signal.

This pattern shows up elsewhere: `CI=true` plus `RUN_E2E=1`, prod credentials plus `--really-really-yes`. The shape is the same: one signal proves capability, the second proves intent. A single-gate design fails open the first time someone forgets which shell they're in.

## Skipped is not zero

The interesting design call was what to do when the gate is closed. The wrong answers:

- Don't run, don't record. Trend tooling then can't tell "we stopped running render" from "render still passes."
- Record a zero. Trend tooling thinks render got infinitely fast and stops alarming.

The right answer: record the scenario as `skipped: true` with a stable `skipReason`. ScenarioRecord is Partial