Token-warden is a thrifty office manager for your AI assistants

Token-warden, a Claude Code plugin, treats AI agent memory as an engineering problem by requiring every rule to prove it saves more tokens than it costs on a fixed benchmark or face eviction. The plugin uses a four-stage feed-forward loop to collect session data, distill candidate rules, benchmark them, and select only those with a measured positive return, making coding agents measurably cheaper over time.

A Claude Code plugin that makes coding agents measurably cheaper over time. Most "agent memory" accumulates advice nobody ever verifies. token-warden treats agent memory as an engineering problem: every rule that wants space in an agent's context must prove, on a fixed benchmark, that it saves more tokens than it costs — or it gets evicted. The result is a per-agent memory file containing only rules with measured, positive return. Measured, not vibes — every rule carries a token delta from real benchmark runs Self-funding — rules must save ≥ 2× their own context rent to stay Self-auditing — active rules are re-benchmarked round-robin and evicted when they stop earning Zero session overhead — collection runs in a Stop hook that never blocks or fails your work How it works how-it-works Getting started getting-started Commands commands The benchmark system the-benchmark-system Architecture architecture The agents the-agents Inter-agent approval gate inter-agent-approval-gate-experimental Design invariants design-invariants A recorded demonstration a-recorded-demonstration Testing testing Data layout data-layout Security notes security-notes Roadmap roadmap The optimizer is a four-stage, feed-forward loop. Lessons are extracted from finished sessions and applied to future ones — past work is never re-done. agent session any project, any repo │ │ Stop hook · parses the transcript: │ tokens, tool calls, file re-reads, completion ▼ ┌─────────────────────────────────────┐ │ 1 · COLLECT │ │ one row per session in SQLite │ └─────────────────────────────────────┘ │ │ fires only when a run exceeds the │ agent's rolling p75 token cost ▼ ┌─────────────────────────────────────┐ │ 2 · DISTILL │ │ one haiku call over the waste │ │ stats → 0–2 candidate rules │ └─────────────────────────────────────┘ │ │ candidates wait in SQLite — │ never injected until measured ▼ ┌─────────────────────────────────────┐ │ 3 · BENCH │ │ golden suite on a frozen fixture, │ │ run with vs. without the candidate │ └─────────────────────────────────────┘ │ │ measured delta vs. context rent ▼ ┌─────────────────────────────────────┐ │ 4 · SELECT │ │ keep if savings ≥ 2× rent, else │ │ evict · re-audit the oldest rule │ └─────────────────────────────────────┘ │ ▼ ~/.claude/agent-memory/<agent /MEMORY.md compiled wholesale from surviving rules and injected into the agent's system prompt next session 1 · Collect. Stop and SubagentStop hooks fire after every turn main session and subagent work respectively and parse the session transcript into one ledger row: input/output/cache tokens deduplicated by API message id — the transcript repeats usage per streamed block , tool-call count, files read more than once, and whether the session completed. The hook is hard-capped under the 2-second budget, wraps every failure, and exits 0 regardless — it can never break your session. 2 · Distill. Collection is cheap, analysis is not, so analysis is rationed: only runs above the agent's rolling 75th-percentile cost minimum 5 prior runs are distilled. A single detached haiku-tier call receives the waste statistics plus an 8 KB action trace and must return strict JSON: at most two one-sentence, generalizable rules. Invalid output is dropped, never retried. Near-duplicates of any existing rule — including evicted ones — are rejected by trigram similarity, so a falsified rule cannot be re-proposed. 3 · Bench. Candidates are measured on a golden task suite against a frozen fixture repository see The benchmark system the-benchmark-system . Each configuration runs the suite headlessly in a throwaway copy with the candidate compiled into a temporary, fully isolated agent memory. 4 · Select. A rule's verdict is the spec inequality: with delta = mean tokens saved per completed golden run and rent = the rule's own size in tokens, the rule goes active iff delta × sessions/week ≥ 2 × rent × sessions/week . Failing a previously-passing task is instant eviction regardless of tokens. Every selector run also re-benchmarks the least-recently-audited active rule — memory must keep earning its place. Survivors are compiled into MEMORY.md , which Claude Code injects into the agent's system prompt. - Node.js 22+ - Claude Code v2.1+ claude --version - macOS or Linux Windows via WSL — benchmarks need a POSIX shell git clone https://github.com/vukkt/token-warden.git cd token-warden npm install the hooks run via the plugin's own tsx + better-sqlite3 For the current session: claude --plugin-dir /path/to/token-warden Or install persistently — this repository is also its own marketplace: /plugin marketplace add vukkt/token-warden /plugin install token-warden@vukkt-plugins Marketplace installs are copied to ~/.claude/plugins/cache without node modules . The Stop hook bootstraps its own dependencies on first run one-time npm install , silent ; collection begins from the second session at the latest. Work normally for a turn or two, then: /warden-status You should see a runs count for main . Every session in every project is now being measured into ~/.token-warden/warden.db . npm run bench -- --agent all or one agent at a time This runs each agent's three golden tasks twice and freezes run1 tokens — the permanent denominator of every future improvement claim. Do this once, before any rules exist. Use the four subagents frontend , backend , sql , testing for real work. Expensive sessions distill into candidates automatically. When /warden-status shows candidates pending, measure them: npx tsx src/select.ts --agent sql Active rules land in the agent's memory; the next session starts cheaper. | Command | What it does | |---|---| /warden-status | Read-only report: per-agent run/rule counts, suite total vs. frozen baseline absolute + % , learning curve over time, active rules with measured deltas and provenance, recent evictions with reasons, real-work tokens by project, cross-agent question volume | /warden-bench <agent|all --runs N --task id | Runs the golden suite, compares against run1 and best , and reports benchmarking meta-cost warns above 10% of the week's real-work tokens | /warden-select <agent --runs N --top-up N | Measures pending candidates, evicts or activates them, re-audits the oldest active rule, and recompiles the agent's memory | /warden-modelbench <agent --model <id --baseline <id --runs N | Runs the agent's golden suite under two models candidate vs. the agent's current model, rules held constant and reports which uses fewer tokens for that workload | /warden-promptbench <agent --variant <file.md --runs N | Runs the agent's golden suite under two prompts a variant agent definition vs. the shipped one, rules and model held constant and reports which uses fewer tokens | /warden-evolve <agent --runs N | Proposes a token-cheaper rewrite of the agent's prompt model call , benchmarks it, and recommends it only if it provably wins — never auto-applied | When candidate rules are waiting, a lightweight SessionStart hook injects a one-line nudge into new sessions — selection itself always stays a user decision, because it spends real benchmark tokens. Headless or when names collide, use the namespaced forms /token-warden:warden-status . CLI equivalents: npx tsx src/status.ts status report npm run bench -- --agent sql --rule N benchmark runner npx tsx src/select.ts --agent sql selector measure + evict + compile npx tsx src/modelbench.ts --agent sql --model haiku compare a model against the agent's default npx tsx src/promptbench.ts --agent sql --variant v.md compare a prompt variant against the shipped one npx tsx src/evolve.ts --agent sql propose + measure a cheaper prompt variant Measurement is only as good as its control variables. token-warden controls them aggressively: The fixture benchmarks/fixture/ is a small but realistic full-stack TypeScript project — Express routes → services → repositories over SQLite, a React admin UI, a partial vitest suite — frozen at Phase 2 and never modified , so baselines stay comparable across months. It ships with documented, deliberate flaws BUGS.md , which agents never see: the benchmark runner excludes it from every copy that the golden tasks target. Golden tasks benchmarks/<agent /golden-NN.md — three per agent, each a frontmatter file with a one-sentence prompt and a shell success check greps and/or a full vitest run . A run only counts as completed if its check passes: a cheap failed run is worse than an expensive successful one, and incomplete runs are excluded from all savings math. A benchmark run , end to end: - Copy the fixture to a temp dir node modules symlinked; BUGS.md excluded . - Install the agent definition into the copy with its memory scope rewritten to project , so the compiled MEMORY.md under test resolves inside the temp dir — real agent memory is never read or written by benchmarks. - Compile the rule set under test active rules ± one candidate into that memory. - Run claude -p --agent <name headlessly with scoped permissions : acceptEdits plus a Bash allowlist of test commands only — never bypassPermissions . - Run the success check ; parse the transcript; record one runs row. - First-ever completed run per agent, task freezes baselines.run1 tokens forever; later completed runs only ratchet best tokens downward. Variance and honesty. Each configuration runs twice and pairs of runs differing by more than 25% are flagged in the output. LLM variance is the dominant error source at small effect sizes — the recorded demonstration below shows it evicting a rule. The selector is variance-aware: it computes the standard error of the per-task savings, and when a verdict sits within one standard error of the keep/evict threshold it spends one bounded top-up pass extra suite runs of the measured configuration, budget configurable via --top-up , default 1 before deciding; verdicts that remain within noise are recorded with an explicit low-confidence annotation. The benchmark also reports its own meta-cost after every invocation: when benchmarking exceeds 10% of the week's collected real-work tokens, it tells you to bench less. | Module | Responsibility | |---|---| src/db.ts | SQLite schema, versioned migrations PRAGMA user version , typed query helpers | src/transcript.ts | Pure transcript JSONL parser — usage dedup, tool calls, re-reads, completion heuristic, distiller digest | src/collect.ts | Stop-hook entrypoint; p75 trigger; spawns the distiller detached | src/distill.ts | Waste analysis → 0–2 strict-JSON candidate rules; trigram dedupe | src/bench.ts | Golden-suite runner; baseline freezing; meta-cost accounting | src/select.ts | Keep/evict verdicts; round-robin re-audit; MEMORY.md compiler | src/status.ts | Read-only reporting behind /warden-status | src/gate.ts | Inter-agent SendMessage approval gate Agent Teams | src/notify.ts | SessionStart nudge when candidates await measurement | src/compare.ts | Generic A/B comparison engine processing-token verdict, variance top-up, runComparison orchestration shared by model, prompt, and prompt-evolution benchmarking | src/modelbench.ts | Model-migration benchmarking: candidate model vs. agent default | src/promptbench.ts | Prompt A/B benchmarking: variant agent definition vs. shipped | src/evolve.ts | Automated prompt evolution: propose a cheaper prompt model call → measure → recommend | Data model ~/.token-warden/warden.db : runs one row per session or golden run, tagged real / active / candidate / audit , rules the ledger — candidates, active rules with measured deltas, and evicted rules kept as the negative dataset , baselines frozen run1 tokens , ratcheting best tokens , ruleset versions , and questions the inter-agent ledger . Every deviation from the original specification is documented in DECISIONS.md /vukkt/token-warden/blob/main/DECISIONS.md . frontend , backend , sql , and testing agents/ .md are standard Claude Code subagents with memory: user and domain-scoped prompts seeded with efficiency behaviors Grep before Read, never re-read a file, one-line plan before editing . Use them like any subagent — the optimizer extends each one's memory independently. Per-agent isolation is deliberate: a rule that pays rent for the sql agent is never charged to the frontend agent's context. With CLAUDE CODE EXPERIMENTAL AGENT TEAMS=1 , a PreToolUse hook intercepts every SendMessage between agents and escalates to you: frontend → backend "What does the orders service return on partial failure?" — approve? Every question is logged to the questions table — approved sends are confirmed by a PostToolUse hook; denied ones stay pending — and per-agent question volume surfaces in /warden-status . An agent that asks a lot is an agent whose memory is missing something. Without the env flag the gate is structurally inert and everything else works untouched. The gate fails open: an internal error defers to the normal permission flow rather than blocking team messaging. Candidate rules are never injected until measured. Unverified rules get no context space; candidates live only in SQLite.— compiled from the rule ledger, overwritten wholesale, never hand-edited or agent-appended. MEMORY.md is a build artifact Fitness = tokens per completed task. Incomplete runs are excluded from savings math. Golden tasks run against the frozen fixture , never a live codebase. First-run baselines are frozen forever. run1 tokens is the permanent denominator of every improvement claim. The optimizer never re-does past work — all learning is feed-forward. Eviction is mandatory. Rules must earn at least 2× their context rent, and active rules are re-audited round-robin. Recorded 2026-06-12; every number is from real headless runs. A candidate is born. Run 13, an sql golden run, cost 61,003 tokens — above the agent's rolling p75. The distiller proposed two candidates: | rule | body | rent | |---|---|---| | 3 | "Consolidate file discovery into single queries instead of multiple find/ls operations across related paths." | 27 | | 4 | "Parse task descriptions for technical direction; verify schema/dependencies only if code doesn't clarify them." | 28 | The selector measures them 24 headless runs: shared baseline, one configuration per candidate, one re-audit . Mean completed tokens per task: | configuration | sql-01 | sql-02 | sql-03 | delta | |---|---|---|---|---| | baseline active set | 39,572 | 70,762 ⚠ | 50,304 | — | | + rule 3 | 39,541 | 67,114 | 52,116 | +622 saved/run | | + rule 4 | 39,664 | 54,244 | 49,538 | +5,731 saved/run | | − rule 1 re-audit | 39,671 | 49,006 | 44,315 ⚠ | rule 1 worth −9,215 | ⚠ = the two same-configuration runs differed by 25%. Verdicts threshold: savings ≥ 2× rent : - rule 3 → ACTIVE 622 ≥ 54 - rule 4 → ACTIVE 5,731 ≥ 56 - rule 1 "Use Grep to locate symbols before reading any file." , active since the previous selector run at +3,673, was EVICTED on re-audit at −9,215: with the two new rules present, removing it made the suite cheaper. This is mandatory eviction working as designed — and an honest illustration that run-to-run variance dominates at small effect sizes. Evicted rules are retained as the negative dataset, and trigram dedupe prevents a falsified rule from being re-proposed. The compiled memory ~/.claude/agent-memory/sql/MEMORY.md , ruleset v2 : php < -- GENERATED BY token-warden — do not hand-edit -- Efficiency rules - Parse task descriptions for technical direction; verify schema/dependencies only if code doesn't clarify them. - Consolidate file discovery into single queries instead of multiple find/ls operations across related paths. npm run typecheck && npm run lint && npm run test The unit suite count in the CI badge above — hard-coding it here rotted once already spans every module. The transcript parser carries the densest coverage usage dedup, completion heuristics, malformed-line tolerance, a 5 MB / 2 s performance budget against committed anonymized fixtures. The hook entrypoints collect.ts , gate.ts are tested as real child processes against temp databases, including corrupt-input and fail-open paths. The selector core is tested with an injected fake suite-runner, so verdict logic, regression eviction, re-audit, and memory compilation are verified without spending model tokens. Strict TypeScript noUncheckedIndexedAccess , Biome for lint/format, vitest for tests. The fixture has its own independent suite cd benchmarks/fixture && npm test and is excluded from plugin CI — its deliberate flaws are benchmark material, not bugs. | Path | Contents | |---|---| ~/.token-warden/warden.db | The ledger override with TOKEN WARDEN DB | ~/.token-warden/{collect,distill,gate}.log | Component logs — hooks never surface errors into sessions | ~/.claude/agent-memory/<agent /MEMORY.md | Compiled rules generated; do not hand-edit | benchmarks/fixture/ | The frozen benchmark codebase | The ledger contains untrusted text: rule bodies and eviction reasons are model-generated, project paths and question senders come from the environment. Defenses, in order: - The distiller rejects rule bodies containing control characters or newlines at the source. renderStatus sanitizes every untrusted string it displays ANSI/control characters stripped, newlines collapsed, length clamped , so collected data cannot forge report sections.- The /warden-status command instructs the relaying Claude to treat report contents as data, never as instructions. The inter-agent gate is an observability and approval layer, not a security boundary — it fails open by design so a broken gate can never block team messaging. Shipped since v0.1.0: - ✅ Subagent collection — Stop and SubagentStop hooks, so the four domain agents' real work reaches the ledger previously only the main session was collected and the learning loop could not engage on real work at all . - ✅ Variance-aware verdicts — standard-error analysis of per-task savings with a bounded top-up pass when a verdict is within noise of the threshold --top-up . - ✅ Selection nudge — a SessionStart hook surfaces pending candidates; /warden-select runs the measurement on demand. - ✅ Question-driven distillation — an agent's recent cross-agent questions are fed to the distiller as a memory-gap signal. - ✅ Per-project tracking — real-work sessions record their project; status breaks down token volume per project. - ✅ Rule provenance — active rules show the run they were distilled from. - ✅ Cross-project learning curves — /warden-status charts average completed real-work session cost per ruleset version, per agent and per project domain agents only; main never has compiled rules . This is the test of the system's core thesis: golden-suite gains must show up in real work. - ✅ Model-migration benchmarking — /warden-modelbench runs an agent's golden suite under a candidate model vs. its current one rules held constant and reports which uses fewer tokens for that workload. The frozen suite is the fixed workload you need when a new model ships. The verdict uses processing tokens input + output + cache creation ; cache-read is reported separately because it skews raw cross-model totals, and token counts are never converted to dollars models are priced differently per token . - ✅ Prompt / agent-definition A/B testing — /warden-promptbench runs an agent's golden suite under a variant agent definition vs. the shipped one rules and model held constant , so a proposed prompt edit can be kept or rejected on measured token savings rather than vibes. The comparison engine compare.ts is shared with model benchmarking: the discipline generalizes from "rule selection" to "any context change." - ✅ Automated prompt evolution — /warden-evolve proposes a token-cheaper rewrite of an agent's prompt a model call, like the rule distiller proposes rules , measures it through the shared engine, and recommends a winner to a proposals file. Deliberately not auto-applied: agents/<name .md is committed source, and three golden tasks can't fully capture an agent's behavior, so a human reviews and applies. Protected frontmatter name/tools/model/memory is enforced unchanged before measurement. Near-term: Golden suite growth — heavy tasks testing-02 ≈ 150k tokens/run deserve splitting into new tasks existing baselines stay frozen; replacing a task would invalidate its denominator, so growth means adding task files, never editing them . Fully scheduled selection — auto-running the selector on a cron/routine once variance handling has earned trust; today it deliberately stays a user decision. Transcript provenance — link a rule's born-of run to its archived transcript digest for post-hoc review. Bigger directions — the reusable asset here is the frozen-benchmark + measured-verdict discipline, which generalizes well beyond efficiency rules: Model-migration benchmarking — the frozen golden suite is exactly the fixed workload you need when a new model ships: warden-bench could answer "is the new model cheaper on my workload" with the rigor rules already get. Prompt / agent-definition A/B testing — the benchmark measures any context change, not just rules; treat an agent's system prompt as a candidate and let the selector keep edits that earn their place. Team-shared rule ledgers — commit measured rules to a repo via project-scoped memory with token-warden as the CI gate, so a PR adding a rule must carry its measured delta. Memory review becomes code review. Real-time cost anomaly alerting — the p75 machinery already detects expensive sessions; a Stop-hook message could tell the session itself where its tokens went. MIT