{"slug": "honest-perf-benchmarks-for-a-paid-api-compiler", "title": "Honest Perf Benchmarks for a Paid-API Compiler", "summary": "Implementation of a benchmark suite for `intentional-cognition-os`, a TypeScript compiler that uses paid Claude APIs. The benchmarks address key design challenges, including a deterministic, seeded random corpus generator to ensure reproducible results across runs, and a two-env-var gate (`ANTHROPIC_API_KEY` and `ICO_BENCH_INCLUDE_CLAUDE`) to prevent accidental API token consumption. The suite also handles skipped scenarios by recording them with a `skipReason` rather than omitting them, allowing for clear differentiation between regressions, improvements, and tests that did not run.", "body_md": "`intentional-cognition-os` is a TypeScript \"compiler\" — markdown sources go in one end, a structured artifact comes out the other, and several of the middle stages call paid Claude APIs to do the cognitive work. Up to today there were zero performance gates on any of it. No baseline, no regression alarm, no \"did that refactor make ingest 4× slower\" check.\n\nThe benchmark suite that landed across four PRs answers two design questions that had to be settled before a single line of timing code got written:\n\n- How do you compare numbers across machines when half the corpus is randomly generated text?\n- What do you do about the steps that cost real money on every run?\n\nGet either answer wrong and the benchmark suite is worse than no benchmark suite — it produces numbers that look authoritative and aren't.\n\n## The corpus has to be byte-identical\n\nThe first scenario — `ingest` — needs a corpus. 
Hand-curated fixtures committed to disk were considered and rejected: they don't scale, they go stale, and they encode whoever-wrote-them's idea of \"representative.\" A generator is the right answer, but a generator has to be deterministic or before/after diffs are noise.\n\nThe generator uses a seeded `mulberry32` PRNG and pulls UUIDs from the same stream:\n\n```ts\nfunction mulberry32(seed: number) {\n  return function () {\n    let t = (seed += 0x6d2b79f5);\n    t = Math.imul(t ^ (t >>> 15), t | 1);\n    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);\n    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;\n  };\n}\n\nfunction seededUuidV4(rand: () => number): string {\n  // 16 bytes from the seeded stream, version + variant nibbles set per RFC 4122\n  const bytes = new Uint8Array(16);\n  for (let i = 0; i < 16; i++) bytes[i] = Math.floor(rand() * 256);\n  bytes[6] = (bytes[6] & 0x0f) | 0x40;\n  bytes[8] = (bytes[8] & 0x3f) | 0x80;\n  return formatUuid(bytes);\n}\n```\n\nThe non-obvious trap is `crypto.randomUUID`. It would have looked correct, passed every unit test, and silently produced different UUIDs on every run — so every \"identical\" corpus would have differed in the front-matter `id` field. That breaks ingest's content-hash cache in different ways on different machines. Same seed, same count, same body-word count yields byte-identical output everywhere. That's the contract.\n\nOne more gotcha worth a sentence: the corpus generator writes front matter through `gray-matter`, which quotes string values. The compiler's wiki-page validator uses a hand-rolled YAML parser that does NOT strip quotes — so wiki fixtures emit all values unquoted. A quoted `compiled_at` would arrive at Zod's datetime check with literal `\"` characters in it and fail. Two parsers, two rules, documented inline at the parser boundary.\n\n## An API key is not consent\n\nThe render, compile, and ask scenarios call Claude. 
Running them on every CI pass would either drain a budget or quietly stop running when the budget hit zero. Neither is acceptable.\n\nThe gate is two env vars, both required:\n\n```\nANTHROPIC_API_KEY=sk-ant-... \\\nICO_BENCH_INCLUDE_CLAUDE=1 \\\npnpm --filter @ico/benchmarks bench\n```\n\nFrom PR #70's design notes, kept verbatim because the framing matters:\n\nThe double gate is intentional. An API key alone is not consent — many developers have it set for normal CLI use.\n\n`ICO_BENCH_INCLUDE_CLAUDE` is the explicit \"yes, burn tokens on this benchmark run\" signal.\n\nThis pattern shows up elsewhere — `CI=true` plus `RUN_E2E=1`, prod credentials plus `--really-really-yes`. The shape is the same: one signal proves capability, the second proves intent. A single-gate design fails open the first time someone forgets which shell they're in.\n\n## Skipped is not zero\n\nThe interesting design call was what to do when the gate is closed. The wrong answers:\n\n- Don't run, don't record. Trend tooling then can't tell \"we stopped running render\" from \"render still passes.\"\n- Record a zero. Trend tooling thinks render got infinitely fast and stops alarming.\n\nThe right answer: record the scenario as `skipped: true` with a stable `skipReason`. `ScenarioRecord` is `Partial<CommonTiming>` so the timing fields legitimately don't exist on skipped records:\n\n```\n{\n  \"name\": \"render\",\n  \"skipped\": true,\n  \"skipReason\": \"ICO_BENCH_INCLUDE_CLAUDE not set\",\n  \"git_sha\": \"9c14f02\",\n  \"node\": \"v22.21.0\",\n  \"platform\": \"linux-x64\"\n}\n```\n\nA baseline-comparison script can now answer three different questions instead of two: did this scenario regress, did it improve, or did it not run? Skipped runs stay visible in the JSON timeline. 
They don't pollute the histogram, but they prove the scenario still exists and the runner saw it.\n\n## The four PRs, briefly\n\n- **PR #68** scaffolded the `packages/benchmarks/` workspace, the corpus generator, a `bench()` timer with warmup + N-iteration median + RSS delta, and the runner that captures git SHA, Node version, and platform into `results/<iso>-<sha>.json`. The `results/` directory is gitignored except `.gitkeep` — baselines get tracked explicitly, not by accident.\n- **PR #69** added the `lint` scenario and moved `runLint`, `scanWikiPages`, `extractWikilinks`, `detectOrphans`, `LintResult`, and `SchemaError` out of `packages/cli/src/commands/lint.ts` into a new `packages/compiler/src/lint.ts`. The function only composes compiler + kernel primitives and has no CLI dependency — it belonged in the compiler the whole time. The CLI's lint command shrank to a thin wrapper around commander wiring and `renderLintReport`. Side fix: `extractWikilinks` had a module-level `/g` regex whose `lastIndex` carried state between calls — the same class of bug that landed in PR #67 the day before. Fixed by constructing the regex per call.\n- **PR #70** added the `render` scenario and the double-gate.\n- **PR #71** added `compile` and `ask`, each using the same gating pattern. Roughly 70 lines of additions across both files — the gate had already done the hard work.\n\n## Why not the obvious alternatives\n\nVitest's built-in `bench` was considered. It does microbenchmarks well and integrates with the existing test runner. It does not produce the JSON timeline shape needed for cross-run comparison, and bolting that on means owning the storage layer anyway. Build it once, build it right.\n\nCommitting fixture corpora to disk was considered. 
They go stale, balloon the repo, and encode one author's idea of \"moderate.\" The seeded generator is reproducible AND parameterizable — same determinism guarantee, no committed binary blobs.\n\nAlways running the Claude scenarios was considered for about a minute, then rejected on cost grounds. Even with caching, a benchmark suite that costs $2 per run on a busy day stops getting run.\n\n## What the numbers say\n\nThe dev-box run from this afternoon, with the Claude-gated `render` recorded as skipped because the opt-in wasn't set:\n\n| Scenario | Median | Target | Headroom |\n|---|---|---|---|\n| ingest (per-file, 50 sources × 500 words) | ~9 ms | < 2 s | 220× |\n| lint (50 sources + 30 wiki pages) | ~12 ms | < 30 s | 2400× |\n| render | SKIPPED (no opt-in) | — | recorded |\n\nThe headroom isn't the point — those targets are deliberately generous because the goal is regression detection, not perf bragging. The point is that there are now numbers to regress *against*.\n\n## Also shipped today\n\n**claude-code-plugins repo audit.** A 232-line audit landed at `266-RA-AUDT-repo-quality-audit-2026-05-17.md` cataloguing a broken `/about` route, missing 404 handling, 14 stale `MS-OLDV` files still claiming v1.0.0 while the repo is at v4.30.0, and notebook content teaching the old 6-required-fields skill spec when the current spec requires 8. The first commit incorrectly flagged the wiki as empty, because `gh api repos/.../wiki` returns 404 even when the wiki has content — that endpoint isn't a content probe, it's a metadata probe with bad error semantics. A follow-up commit cloned the wiki, found 23 pages, and refreshed all of them with current numbers. Lesson noted inline: don't use API existence probes as content probes. Clone and read.\n\n**claude-code-slack-channel threat model.** Added T11 (EchoLeak — instructions exfiltrated via legitimate-looking message replies) and invariant #7: admin verbs are not chat content. 
An operational key-management doc for the audit-signing key landed alongside the threat model update.\n\n## The transferable pattern\n\nFive scenarios in the source tree: two actively measured, three gated behind explicit consent. The numbers that get reported are honest because the inputs are reproducible and the skipped runs are visible. Forget the opt-in flag and three scenarios show up as `skipped` in the JSON — they don't disappear, and they don't pretend to be zero.\n\nAny benchmark suite that mixes deterministic and paid steps needs all three pieces: a deterministic corpus that survives machine swaps, an opt-in gate strong enough to mean something, and a record shape that distinguishes \"didn't run\" from \"ran fast.\" Miss one and the suite will quietly lie to you the first time someone forgets which mode they're in. The lie is worse than the gap it filled.\n\n## Related posts\n\n- [Five Silent Failures in One Day](https://dev.to/posts/five-silent-failures-one-day/) — the regex `lastIndex` bug that re-appeared in PR #69 was one of these.\n- [Deterministic-First, LLM-Advisory CI](https://dev.to/posts/deterministic-first-llm-advisory-ci/) — same principle: the deterministic gate decides, the paid gate informs. 
\n- [Transitive CVE Clearance: A Dual-Layer Pattern](https://dev.to/posts/transitive-cve-clearance-dual-layer-pattern/) — the double-gate is the same shape as that two-layer defense.", "url": "https://wpnews.pro/news/honest-perf-benchmarks-for-a-paid-api-compiler", "canonical_source": "https://dev.to/jeremy_longshore/honest-perf-benchmarks-for-a-paid-api-compiler-56h4", "published_at": "2026-05-20 13:00:40+00:00", "updated_at": "2026-05-20 13:05:06.928173+00:00", "lang": "en", "topics": ["developer-tools", "artificial-intelligence", "large-language-models", "open-source", "data"], "entities": ["Claude", "TypeScript", "RFC 4122", "mulberry32"], "alternates": {"html": "https://wpnews.pro/news/honest-perf-benchmarks-for-a-paid-api-compiler", "markdown": "https://wpnews.pro/news/honest-perf-benchmarks-for-a-paid-api-compiler.md", "text": "https://wpnews.pro/news/honest-perf-benchmarks-for-a-paid-api-compiler.txt", "jsonld": "https://wpnews.pro/news/honest-perf-benchmarks-for-a-paid-api-compiler.jsonld"}}