{"slug": "token-warden-is-a-thrifty-office-manager-for-your-ai-assistants", "title": "Token-warden is a thrifty office manager for your AI assistants", "summary": "Token-warden, a Claude Code plugin, treats AI agent memory as an engineering problem by requiring every rule to prove it saves more tokens than it costs on a fixed benchmark or face eviction. The plugin uses a four-stage feed-forward loop to collect session data, distill candidate rules, benchmark them, and select only those with a measured positive return, making coding agents measurably cheaper over time.", "body_md": "**A Claude Code plugin that makes coding agents measurably cheaper over time.**\n\nMost \"agent memory\" accumulates advice nobody ever verifies. token-warden treats agent\nmemory as an engineering problem: every rule that wants space in an agent's context must\n**prove, on a fixed benchmark, that it saves more tokens than it costs** — or it gets\nevicted. The result is a per-agent memory file containing only rules with measured,\npositive return.\n\n**Measured, not vibes**— every rule carries a token delta from real benchmark runs** Self-funding**— rules must save ≥ 2× their own context rent to stay** Self-auditing**— active rules are re-benchmarked round-robin and evicted when they stop earning** Zero session overhead**— collection runs in a Stop hook that never blocks or fails your work\n\n[How it works](#how-it-works)[Getting started](#getting-started)[Commands](#commands)[The benchmark system](#the-benchmark-system)[Architecture](#architecture)[The agents](#the-agents)[Inter-agent approval gate](#inter-agent-approval-gate-experimental)[Design invariants](#design-invariants)[A recorded demonstration](#a-recorded-demonstration)[Testing](#testing)[Data layout](#data-layout)[Security notes](#security-notes)[Roadmap](#roadmap)\n\nThe optimizer is a four-stage, feed-forward loop. Lessons are extracted from finished sessions and applied to future ones — past work is never re-done.\n\n```\n                  agent session (any project, any repo)\n                                  │\n                                  │  Stop hook · parses the transcript:\n                                  │  tokens, tool calls, file re-reads, completion\n                                  ▼\n                ┌─────────────────────────────────────┐\n                │  1 · COLLECT                        │\n                │  one row per session in SQLite      │\n                └─────────────────────────────────────┘\n                                  │\n                                  │  fires only when a run exceeds the\n                                  │  agent's rolling p75 token cost\n                                  ▼\n                ┌─────────────────────────────────────┐\n                │  2 · DISTILL                        │\n                │  one haiku call over the waste      │\n                │  stats → 0–2 candidate rules        │\n                └─────────────────────────────────────┘\n                                  │\n                                  │  candidates wait in SQLite —\n                                  │  never injected until measured\n                                  ▼\n                ┌─────────────────────────────────────┐\n                │  3 · BENCH                          │\n                │  golden suite on a frozen fixture,  │\n                │  run with vs. without the candidate │\n                └─────────────────────────────────────┘\n                                  │\n                                  │  measured delta vs. context rent\n                                  ▼\n                ┌─────────────────────────────────────┐\n                │  4 · SELECT                         │\n                │  keep if savings ≥ 2× rent, else    │\n                │  evict · re-audit the oldest rule   │\n                └─────────────────────────────────────┘\n                                  │\n                                  ▼\n              ~/.claude/agent-memory/<agent>/MEMORY.md\n        compiled wholesale from surviving rules and injected\n            into the agent's system prompt next session\n```\n\n**1 · Collect.** `Stop`\n\nand `SubagentStop`\n\nhooks fire after every turn (main session and\nsubagent work respectively) and parse the session transcript\ninto one ledger row: input/output/cache tokens (deduplicated by API message id — the\ntranscript repeats usage per streamed block), tool-call count, files read more than once,\nand whether the session completed. The hook is hard-capped under the 2-second budget,\nwraps every failure, and exits 0 regardless — it can never break your session.\n\n**2 · Distill.** Collection is cheap, analysis is not, so analysis is rationed: only runs\nabove the agent's rolling 75th-percentile cost (minimum 5 prior runs) are distilled. A\nsingle detached haiku-tier call receives the waste statistics plus an 8 KB action trace\nand must return strict JSON: at most two one-sentence, generalizable rules. Invalid output\nis dropped, never retried. Near-duplicates of *any* existing rule — including evicted\nones — are rejected by trigram similarity, so a falsified rule cannot be re-proposed.\n\n**3 · Bench.** Candidates are measured on a golden task suite against a frozen fixture\nrepository (see [The benchmark system](#the-benchmark-system)). Each configuration runs\nthe suite headlessly in a throwaway copy with the candidate compiled into a temporary,\nfully isolated agent memory.\n\n**4 · Select.** A rule's verdict is the spec inequality: with `delta`\n\n= mean tokens saved\nper completed golden run and `rent`\n\n= the rule's own size in tokens, the rule goes active\niff `delta × sessions/week ≥ 2 × rent × sessions/week`\n\n. Failing a previously-passing task\nis instant eviction regardless of tokens. Every selector run also re-benchmarks the\nleast-recently-audited active rule — memory must keep earning its place. Survivors are\ncompiled into `MEMORY.md`\n\n, which Claude Code injects into the agent's system prompt.\n\n- Node.js 22+\n- Claude Code v2.1+ (\n`claude --version`\n\n) - macOS or Linux (Windows via WSL — benchmarks need a POSIX shell)\n\n```\ngit clone https://github.com/vukkt/token-warden.git\ncd token-warden\nnpm install        # the hooks run via the plugin's own tsx + better-sqlite3\n```\n\nFor the current session:\n\n```\nclaude --plugin-dir /path/to/token-warden\n```\n\nOr install persistently — this repository is also its own marketplace:\n\n```\n/plugin marketplace add vukkt/token-warden\n/plugin install token-warden@vukkt-plugins\n```\n\nMarketplace installs are copied to\n\n`~/.claude/plugins/cache`\n\nwithout`node_modules`\n\n. The Stop hook bootstraps its own dependencies on first run (one-time`npm install`\n\n, silent); collection begins from the second session at the latest.\n\nWork normally for a turn or two, then:\n\n```\n/warden-status\n```\n\nYou should see a `runs`\n\ncount for `main`\n\n. Every session in every project is now being\nmeasured into `~/.token-warden/warden.db`\n\n.\n\n```\nnpm run bench -- --agent all      # or one agent at a time\n```\n\nThis runs each agent's three golden tasks twice and freezes `run1_tokens`\n\n— the permanent\ndenominator of every future improvement claim. Do this once, before any rules exist.\n\nUse the four subagents (`frontend`\n\n, `backend`\n\n, `sql`\n\n, `testing`\n\n) for real work.\nExpensive sessions distill into candidates automatically. When `/warden-status`\n\nshows\ncandidates pending, measure them:\n\n```\nnpx tsx src/select.ts --agent sql\n```\n\nActive rules land in the agent's memory; the next session starts cheaper.\n\n| Command | What it does |\n|---|---|\n`/warden-status` |\nRead-only report: per-agent run/rule counts, suite total vs. frozen baseline (absolute + %), learning curve over time, active rules with measured deltas and provenance, recent evictions with reasons, real-work tokens by project, cross-agent question volume |\n`/warden-bench <agent|all> [--runs N] [--task id]` |\nRuns the golden suite, compares against `run1` and `best` , and reports benchmarking meta-cost (warns above 10% of the week's real-work tokens) |\n`/warden-select <agent> [--runs N] [--top-up N]` |\nMeasures pending candidates, evicts or activates them, re-audits the oldest active rule, and recompiles the agent's memory |\n`/warden-modelbench <agent> --model <id> [--baseline <id>] [--runs N]` |\nRuns the agent's golden suite under two models (candidate vs. the agent's current model, rules held constant) and reports which uses fewer tokens for that workload |\n`/warden-promptbench <agent> --variant <file.md> [--runs N]` |\nRuns the agent's golden suite under two prompts (a variant agent definition vs. the shipped one, rules and model held constant) and reports which uses fewer tokens |\n`/warden-evolve <agent> [--runs N]` |\nProposes a token-cheaper rewrite of the agent's prompt (model call), benchmarks it, and recommends it only if it provably wins — never auto-applied |\n\nWhen candidate rules are waiting, a lightweight `SessionStart`\n\nhook injects a one-line\nnudge into new sessions — selection itself always stays a user decision, because it\nspends real benchmark tokens.\n\nHeadless or when names collide, use the namespaced forms\n(`/token-warden:warden-status`\n\n). CLI equivalents:\n\n```\nnpx tsx src/status.ts                              # status report\nnpm run bench -- --agent sql [--rule N]            # benchmark runner\nnpx tsx src/select.ts --agent sql                  # selector (measure + evict + compile)\nnpx tsx src/modelbench.ts --agent sql --model haiku  # compare a model against the agent's default\nnpx tsx src/promptbench.ts --agent sql --variant v.md  # compare a prompt variant against the shipped one\nnpx tsx src/evolve.ts --agent sql                      # propose + measure a cheaper prompt variant\n```\n\nMeasurement is only as good as its control variables. token-warden controls them aggressively:\n\n**The fixture** (`benchmarks/fixture/`\n\n) is a small but realistic full-stack TypeScript\nproject — Express routes → services → repositories over SQLite, a React admin UI, a\npartial vitest suite — **frozen at Phase 2 and never modified**, so baselines stay\ncomparable across months. It ships with documented, deliberate flaws (`BUGS.md`\n\n, which\nagents never see: the benchmark runner excludes it from every copy) that the golden tasks\ntarget.\n\n**Golden tasks** (`benchmarks/<agent>/golden-NN.md`\n\n) — three per agent, each a frontmatter\nfile with a one-sentence `prompt`\n\nand a shell `success_check`\n\n(greps and/or a full\n`vitest run`\n\n). A run only counts as *completed* if its check passes: a cheap failed run\nis worse than an expensive successful one, and incomplete runs are excluded from all\nsavings math.\n\n**A benchmark run**, end to end:\n\n- Copy the fixture to a temp dir (\n`node_modules`\n\nsymlinked;`BUGS.md`\n\nexcluded). - Install the agent definition into the copy with its memory scope rewritten to\n`project`\n\n, so the compiled`MEMORY.md`\n\nunder test resolves*inside the temp dir*— real agent memory is never read or written by benchmarks. - Compile the rule set under test (active rules ± one candidate) into that memory.\n- Run\n`claude -p --agent <name>`\n\nheadlessly with**scoped permissions**:`acceptEdits`\n\nplus a Bash allowlist of test commands only — never`bypassPermissions`\n\n. - Run the\n`success_check`\n\n; parse the transcript; record one`runs`\n\nrow. - First-ever completed run per (agent, task) freezes\n`baselines.run1_tokens`\n\nforever; later completed runs only ratchet`best_tokens`\n\ndownward.\n\n**Variance and honesty.** Each configuration runs twice and pairs of runs differing by\nmore than 25% are flagged in the output. LLM variance is the dominant error source at\nsmall effect sizes — the recorded demonstration below shows it evicting a rule. The\nselector is variance-aware: it computes the standard error of the per-task savings, and\nwhen a verdict sits within one standard error of the keep/evict threshold it spends one\nbounded **top-up pass** (extra suite runs of the measured configuration, budget\nconfigurable via `--top-up`\n\n, default 1) before deciding; verdicts that remain within\nnoise are recorded with an explicit low-confidence annotation. The benchmark also\nreports its own **meta-cost** after every invocation: when benchmarking exceeds 10% of\nthe week's collected real-work tokens, it tells you to bench less.\n\n| Module | Responsibility |\n|---|---|\n`src/db.ts` |\nSQLite schema, versioned migrations (`PRAGMA user_version` ), typed query helpers |\n`src/transcript.ts` |\nPure transcript JSONL parser — usage dedup, tool calls, re-reads, completion heuristic, distiller digest |\n`src/collect.ts` |\nStop-hook entrypoint; p75 trigger; spawns the distiller detached |\n`src/distill.ts` |\nWaste analysis → 0–2 strict-JSON candidate rules; trigram dedupe |\n`src/bench.ts` |\nGolden-suite runner; baseline freezing; meta-cost accounting |\n`src/select.ts` |\nKeep/evict verdicts; round-robin re-audit; `MEMORY.md` compiler |\n`src/status.ts` |\nRead-only reporting behind `/warden-status` |\n`src/gate.ts` |\nInter-agent `SendMessage` approval gate (Agent Teams) |\n`src/notify.ts` |\nSessionStart nudge when candidates await measurement |\n`src/compare.ts` |\nGeneric A/B comparison engine (processing-token verdict, variance top-up, `runComparison` orchestration) shared by model, prompt, and prompt-evolution benchmarking |\n`src/modelbench.ts` |\nModel-migration benchmarking: candidate model vs. agent default |\n`src/promptbench.ts` |\nPrompt A/B benchmarking: variant agent definition vs. shipped |\n`src/evolve.ts` |\nAutomated prompt evolution: propose a cheaper prompt (model call) → measure → recommend |\n\nData model (`~/.token-warden/warden.db`\n\n): `runs`\n\n(one row per session or golden run,\ntagged `real`\n\n/`active`\n\n/`candidate`\n\n/`audit`\n\n), `rules`\n\n(the ledger — candidates, active\nrules with measured deltas, and evicted rules kept as the negative dataset),\n`baselines`\n\n(frozen `run1_tokens`\n\n, ratcheting `best_tokens`\n\n), `ruleset_versions`\n\n, and\n`questions`\n\n(the inter-agent ledger). Every deviation from the original specification is\ndocumented in [ DECISIONS.md](/vukkt/token-warden/blob/main/DECISIONS.md).\n\n`frontend`\n\n, `backend`\n\n, `sql`\n\n, and `testing`\n\n(`agents/*.md`\n\n) are standard Claude Code\nsubagents with `memory: user`\n\nand domain-scoped prompts seeded with efficiency behaviors\n(Grep before Read, never re-read a file, one-line plan before editing). Use them like any\nsubagent — the optimizer extends each one's memory independently. Per-agent isolation is\ndeliberate: a rule that pays rent for the sql agent is never charged to the frontend\nagent's context.\n\nWith `CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1`\n\n, a `PreToolUse`\n\nhook intercepts every\n`SendMessage`\n\nbetween agents and escalates to you:\n\n```\n[frontend → backend] \"What does the orders service return on partial failure?\" — approve?\n```\n\nEvery question is logged to the `questions`\n\ntable — approved sends are confirmed by a\n`PostToolUse`\n\nhook; denied ones stay pending — and per-agent question volume surfaces in\n`/warden-status`\n\n. An agent that asks a lot is an agent whose memory is missing something.\nWithout the env flag the gate is structurally inert and everything else works untouched.\nThe gate fails open: an internal error defers to the normal permission flow rather than\nblocking team messaging.\n\n**Candidate rules are never injected until measured.** Unverified rules get no context space; candidates live only in SQLite.— compiled from the rule ledger, overwritten wholesale, never hand-edited or agent-appended.`MEMORY.md`\n\nis a build artifact**Fitness = tokens per completed task.** Incomplete runs are excluded from savings math.**Golden tasks run against the frozen fixture**, never a live codebase.** First-run baselines are frozen forever.**`run1_tokens`\n\nis the permanent denominator of every improvement claim.**The optimizer never re-does past work**— all learning is feed-forward.** Eviction is mandatory.**Rules must earn at least 2× their context rent, and active rules are re-audited round-robin.\n\nRecorded 2026-06-12; every number is from real headless runs.\n\n**A candidate is born.** Run #13, an `sql`\n\ngolden run, cost **61,003 tokens** — above the\nagent's rolling p75. The distiller proposed two candidates:\n\n| rule | body | rent |\n|---|---|---|\n| #3 | \"Consolidate file discovery into single queries instead of multiple find/ls operations across related paths.\" | 27 |\n| #4 | \"Parse task descriptions for technical direction; verify schema/dependencies only if code doesn't clarify them.\" | 28 |\n\n**The selector measures them** (24 headless runs: shared baseline, one configuration per\ncandidate, one re-audit). Mean completed tokens per task:\n\n| configuration | sql-01 | sql-02 | sql-03 | delta |\n|---|---|---|---|---|\n| baseline (active set) | 39,572 | 70,762 ⚠ | 50,304 | — |\n| + rule #3 | 39,541 | 67,114 | 52,116 | +622 saved/run |\n| + rule #4 | 39,664 | 54,244 | 49,538 | +5,731 saved/run |\n| − rule #1 (re-audit) | 39,671 | 49,006 | 44,315 ⚠ | rule #1 worth −9,215 |\n\n⚠ = the two same-configuration runs differed by >25%.\n\n**Verdicts** (threshold: savings ≥ 2× rent):\n\n- rule #3 →\n**ACTIVE**(622 ≥ 54) - rule #4 →\n**ACTIVE**(5,731 ≥ 56) - rule #1 (\"Use Grep to locate symbols before reading any file.\"), active since the\nprevious selector run at +3,673, was\n**EVICTED** on re-audit at −9,215: with the two new rules present, removing it made the suite cheaper. This is mandatory eviction working as designed — and an honest illustration that run-to-run variance dominates at small effect sizes. Evicted rules are retained as the negative dataset, and trigram dedupe prevents a falsified rule from being re-proposed.\n\n**The compiled memory** (`~/.claude/agent-memory/sql/MEMORY.md`\n\n, ruleset v2):\n\n``` php\n<!-- GENERATED BY token-warden — do not hand-edit -->\n# Efficiency rules\n\n- Parse task descriptions for technical direction; verify schema/dependencies only if code doesn't clarify them.\n- Consolidate file discovery into single queries instead of multiple find/ls operations across related paths.\nnpm run typecheck && npm run lint && npm run test\n```\n\nThe unit suite (count in the CI badge above — hard-coding it here rotted once already)\nspans every module. The transcript parser carries the densest coverage\n(usage dedup, completion heuristics, malformed-line tolerance, a 5 MB / 2 s performance\nbudget) against committed anonymized fixtures. The hook entrypoints (`collect.ts`\n\n,\n`gate.ts`\n\n) are tested as real child processes against temp databases, including\ncorrupt-input and fail-open paths. The selector core is tested with an injected fake\nsuite-runner, so verdict logic, regression eviction, re-audit, and memory compilation\nare verified without spending model tokens. Strict TypeScript (`noUncheckedIndexedAccess`\n\n),\nBiome for lint/format, vitest for tests.\n\nThe fixture has its own independent suite (`cd benchmarks/fixture && npm test`\n\n) and is\nexcluded from plugin CI — its deliberate flaws are benchmark material, not bugs.\n\n| Path | Contents |\n|---|---|\n`~/.token-warden/warden.db` |\nThe ledger (override with `TOKEN_WARDEN_DB` ) |\n`~/.token-warden/{collect,distill,gate}.log` |\nComponent logs — hooks never surface errors into sessions |\n`~/.claude/agent-memory/<agent>/MEMORY.md` |\nCompiled rules (generated; do not hand-edit) |\n`benchmarks/fixture/` |\nThe frozen benchmark codebase |\n\nThe ledger contains untrusted text: rule bodies and eviction reasons are model-generated, project paths and question senders come from the environment. Defenses, in order:\n\n- The distiller rejects rule bodies containing control characters or newlines at the source.\n`renderStatus`\n\nsanitizes every untrusted string it displays (ANSI/control characters stripped, newlines collapsed, length clamped), so collected data cannot forge report sections.- The\n`/warden-status`\n\ncommand instructs the relaying Claude to treat report contents as data, never as instructions.\n\nThe inter-agent gate is an observability and approval layer, not a security boundary — it fails open by design so a broken gate can never block team messaging.\n\nShipped since v0.1.0:\n\n- ✅\n**Subagent collection**—`Stop`\n\n*and*`SubagentStop`\n\nhooks, so the four domain agents' real work reaches the ledger (previously only the main session was collected and the learning loop could not engage on real work at all). - ✅\n**Variance-aware verdicts**— standard-error analysis of per-task savings with a bounded top-up pass when a verdict is within noise of the threshold (`--top-up`\n\n). - ✅\n**Selection nudge**— a`SessionStart`\n\nhook surfaces pending candidates;`/warden-select`\n\nruns the measurement on demand. - ✅\n**Question-driven distillation**— an agent's recent cross-agent questions are fed to the distiller as a memory-gap signal. - ✅\n**Per-project tracking**— real-work sessions record their project; status breaks down token volume per project. - ✅\n**Rule provenance**— active rules show the run they were distilled from. - ✅\n**Cross-project learning curves**—`/warden-status`\n\ncharts average completed real-work session cost per ruleset version, per agent and per project (domain agents only;`main`\n\nnever has compiled rules). This is the test of the system's core thesis: golden-suite gains must show up in real work. - ✅\n**Model-migration benchmarking**—`/warden-modelbench`\n\nruns an agent's golden suite under a candidate model vs. its current one (rules held constant) and reports which uses fewer tokens for that workload. The frozen suite*is*the fixed workload you need when a new model ships. The verdict uses processing tokens (input + output + cache_creation); cache-read is reported separately because it skews raw cross-model totals, and token counts are never converted to dollars (models are priced differently per token). - ✅\n**Prompt / agent-definition A/B testing**—`/warden-promptbench`\n\nruns an agent's golden suite under a variant agent definition vs. the shipped one (rules and model held constant), so a proposed prompt edit can be kept or rejected on measured token savings rather than vibes. The comparison engine (`compare.ts`\n\n) is shared with model benchmarking: the discipline generalizes from \"rule selection\" to \"any context change.\" - ✅\n**Automated prompt evolution**—`/warden-evolve`\n\nproposes a token-cheaper rewrite of an agent's prompt (a model call, like the rule distiller proposes rules), measures it through the shared engine, and recommends a winner to a proposals file. Deliberately*not*auto-applied:`agents/<name>.md`\n\nis committed source, and three golden tasks can't fully capture an agent's behavior, so a human reviews and applies. Protected frontmatter (name/tools/model/memory) is enforced unchanged before measurement.\n\nNear-term:\n\n**Golden suite growth**— heavy tasks (`testing-02`\n\n≈ 150k tokens/run) deserve splitting into new tasks (existing baselines stay frozen; replacing a task would invalidate its denominator, so growth means*adding*task files, never editing them).**Fully scheduled selection**— auto-running the selector on a cron/routine once variance handling has earned trust; today it deliberately stays a user decision.**Transcript provenance**— link a rule's`born-of`\n\nrun to its archived transcript digest for post-hoc review.\n\nBigger directions — the reusable asset here is the *frozen-benchmark + measured-verdict*\ndiscipline, which generalizes well beyond efficiency rules:\n\n**Model-migration benchmarking**— the frozen golden suite is exactly the fixed workload you need when a new model ships:`warden-bench`\n\ncould answer \"is the new model cheaper on*my*workload\" with the rigor rules already get.**Prompt / agent-definition A/B testing**— the benchmark measures any context change, not just rules; treat an agent's system prompt as a candidate and let the selector keep edits that earn their place.**Team-shared rule ledgers**— commit measured rules to a repo (via project-scoped memory) with token-warden as the CI gate, so a PR adding a rule must carry its measured delta. Memory review becomes code review.**Real-time cost anomaly alerting**— the p75 machinery already detects expensive sessions; a Stop-hook message could tell the session itself where its tokens went.\n\nMIT", "url": "https://wpnews.pro/news/token-warden-is-a-thrifty-office-manager-for-your-ai-assistants", "canonical_source": "https://github.com/vukkt/token-warden", "published_at": "2026-06-15 07:22:18+00:00", "updated_at": "2026-06-15 07:41:55.176883+00:00", "lang": "en", "topics": ["ai-agents", "developer-tools", "large-language-models"], "entities": ["Claude Code", "token-warden"], "alternates": {"html": "https://wpnews.pro/news/token-warden-is-a-thrifty-office-manager-for-your-ai-assistants", "markdown": "https://wpnews.pro/news/token-warden-is-a-thrifty-office-manager-for-your-ai-assistants.md", "text": "https://wpnews.pro/news/token-warden-is-a-thrifty-office-manager-for-your-ai-assistants.txt", "jsonld": "https://wpnews.pro/news/token-warden-is-a-thrifty-office-manager-for-your-ai-assistants.jsonld"}}