A Claude Code plugin that makes coding agents measurably cheaper over time.
Most "agent memory" accumulates advice nobody ever verifies. token-warden treats agent memory as an engineering problem: every rule that wants space in an agent's context must prove, on a fixed benchmark, that it saves more tokens than it costs β or it gets evicted. The result is a per-agent memory file containing only rules with measured, positive return.
Measured, not vibesβ every rule carries a token delta from real benchmark runs** Self-funding**β rules must save β₯ 2Γ their own context rent to stay** Self-auditing**β active rules are re-benchmarked round-robin and evicted when they stop earning** Zero session overhead**β collection runs in a Stop hook that never blocks or fails your work
How it worksGetting startedCommandsThe benchmark systemArchitectureThe agentsInter-agent approval gateDesign invariantsA recorded demonstrationTestingData layoutSecurity notesRoadmap
The optimizer is a four-stage, feed-forward loop. Lessons are extracted from finished sessions and applied to future ones β past work is never re-done.
agent session (any project, any repo)
β
β Stop hook Β· parses the transcript:
β tokens, tool calls, file re-reads, completion
βΌ
βββββββββββββββββββββββββββββββββββββββ
β 1 Β· COLLECT β
β one row per session in SQLite β
βββββββββββββββββββββββββββββββββββββββ
β
β fires only when a run exceeds the
β agent's rolling p75 token cost
βΌ
βββββββββββββββββββββββββββββββββββββββ
β 2 Β· DISTILL β
β one haiku call over the waste β
β stats β 0β2 candidate rules β
βββββββββββββββββββββββββββββββββββββββ
β
β candidates wait in SQLite β
β never injected until measured
βΌ
βββββββββββββββββββββββββββββββββββββββ
β 3 Β· BENCH β
β golden suite on a frozen fixture, β
β run with vs. without the candidate β
βββββββββββββββββββββββββββββββββββββββ
β
β measured delta vs. context rent
βΌ
βββββββββββββββββββββββββββββββββββββββ
β 4 Β· SELECT β
β keep if savings β₯ 2Γ rent, else β
β evict Β· re-audit the oldest rule β
βββββββββββββββββββββββββββββββββββββββ
β
βΌ
~/.claude/agent-memory/<agent>/MEMORY.md
compiled wholesale from surviving rules and injected
into the agent's system prompt next session
1 Β· Collect. Stop
and SubagentStop
hooks fire after every turn (main session and subagent work respectively) and parse the session transcript into one ledger row: input/output/cache tokens (deduplicated by API message id β the transcript repeats usage per streamed block), tool-call count, files read more than once, and whether the session completed. The hook is hard-capped under the 2-second budget, wraps every failure, and exits 0 regardless β it can never break your session.
2 Β· Distill. Collection is cheap, analysis is not, so analysis is rationed: only runs above the agent's rolling 75th-percentile cost (minimum 5 prior runs) are distilled. A single detached haiku-tier call receives the waste statistics plus an 8 KB action trace and must return strict JSON: at most two one-sentence, generalizable rules. Invalid output is dropped, never retried. Near-duplicates of any existing rule β including evicted ones β are rejected by trigram similarity, so a falsified rule cannot be re-proposed.
3 Β· Bench. Candidates are measured on a golden task suite against a frozen fixture repository (see The benchmark system). Each configuration runs the suite headlessly in a throwaway copy with the candidate compiled into a temporary, fully isolated agent memory.
4 Β· Select. A rule's verdict is the spec inequality: with delta
= mean tokens saved
per completed golden run and rent
= the rule's own size in tokens, the rule goes active
iff delta Γ sessions/week β₯ 2 Γ rent Γ sessions/week
. Failing a previously-passing task
is instant eviction regardless of tokens. Every selector run also re-benchmarks the
least-recently-audited active rule β memory must keep earning its place. Survivors are
compiled into MEMORY.md
, which Claude Code injects into the agent's system prompt.
- Node.js 22+
- Claude Code v2.1+ (
claude --version
) - macOS or Linux (Windows via WSL β benchmarks need a POSIX shell)
git clone https://github.com/vukkt/token-warden.git
cd token-warden
npm install # the hooks run via the plugin's own tsx + better-sqlite3
For the current session:
claude --plugin-dir /path/to/token-warden
Or install persistently β this repository is also its own marketplace:
/plugin marketplace add vukkt/token-warden
/plugin install token-warden@vukkt-plugins
Marketplace installs are copied to
~/.claude/plugins/cache
withoutnode_modules
. The Stop hook bootstraps its own dependencies on first run (one-timenpm install
, silent); collection begins from the second session at the latest.
Work normally for a turn or two, then:
/warden-status
You should see a runs
count for main
. Every session in every project is now being
measured into ~/.token-warden/warden.db
.
npm run bench -- --agent all # or one agent at a time
This runs each agent's three golden tasks twice and freezes run1_tokens
β the permanent denominator of every future improvement claim. Do this once, before any rules exist.
Use the four subagents (frontend
, backend
, sql
, testing
) for real work.
Expensive sessions distill into candidates automatically. When /warden-status
shows candidates pending, measure them:
npx tsx src/select.ts --agent sql
Active rules land in the agent's memory; the next session starts cheaper.
| Command | What it does |
|---|---|
/warden-status |
|
| Read-only report: per-agent run/rule counts, suite total vs. frozen baseline (absolute + %), learning curve over time, active rules with measured deltas and provenance, recent evictions with reasons, real-work tokens by project, cross-agent question volume | |
| `/warden-bench <agent | all> [--runs N] [--task id]` |
Runs the golden suite, compares against run1 and best , and reports benchmarking meta-cost (warns above 10% of the week's real-work tokens) |
|
/warden-select <agent> [--runs N] [--top-up N] |
|
| Measures pending candidates, evicts or activates them, re-audits the oldest active rule, and recompiles the agent's memory | |
/warden-modelbench <agent> --model <id> [--baseline <id>] [--runs N] |
|
| Runs the agent's golden suite under two models (candidate vs. the agent's current model, rules held constant) and reports which uses fewer tokens for that workload | |
/warden-promptbench <agent> --variant <file.md> [--runs N] |
|
| Runs the agent's golden suite under two prompts (a variant agent definition vs. the shipped one, rules and model held constant) and reports which uses fewer tokens | |
/warden-evolve <agent> [--runs N] |
|
| Proposes a token-cheaper rewrite of the agent's prompt (model call), benchmarks it, and recommends it only if it provably wins β never auto-applied |
When candidate rules are waiting, a lightweight SessionStart
hook injects a one-line nudge into new sessions β selection itself always stays a user decision, because it spends real benchmark tokens.
Headless or when names collide, use the namespaced forms
(/token-warden:warden-status
). CLI equivalents:
npx tsx src/status.ts # status report
npm run bench -- --agent sql [--rule N] # benchmark runner
npx tsx src/select.ts --agent sql # selector (measure + evict + compile)
npx tsx src/modelbench.ts --agent sql --model haiku # compare a model against the agent's default
npx tsx src/promptbench.ts --agent sql --variant v.md # compare a prompt variant against the shipped one
npx tsx src/evolve.ts --agent sql # propose + measure a cheaper prompt variant
Measurement is only as good as its control variables. token-warden controls them aggressively:
The fixture (benchmarks/fixture/
) is a small but realistic full-stack TypeScript
project β Express routes β services β repositories over SQLite, a React admin UI, a
partial vitest suite β frozen at Phase 2 and never modified, so baselines stay
comparable across months. It ships with documented, deliberate flaws (BUGS.md
, which agents never see: the benchmark runner excludes it from every copy) that the golden tasks target.
Golden tasks (benchmarks/<agent>/golden-NN.md
) β three per agent, each a frontmatter
file with a one-sentence prompt
and a shell success_check
(greps and/or a full
vitest run
). A run only counts as completed if its check passes: a cheap failed run is worse than an expensive successful one, and incomplete runs are excluded from all savings math.
A benchmark run, end to end:
- Copy the fixture to a temp dir (
node_modules
symlinked;BUGS.md
excluded). - Install the agent definition into the copy with its memory scope rewritten to
project
, so the compiledMEMORY.md
under test resolvesinside the temp dirβ real agent memory is never read or written by benchmarks. - Compile the rule set under test (active rules Β± one candidate) into that memory.
- Run
claude -p --agent <name>
headlessly withscoped permissions:acceptEdits
plus a Bash allowlist of test commands only β neverbypassPermissions
. - Run the
success_check
; parse the transcript; record oneruns
row. - First-ever completed run per (agent, task) freezes
baselines.run1_tokens
forever; later completed runs only ratchetbest_tokens
downward.
Variance and honesty. Each configuration runs twice and pairs of runs differing by
more than 25% are flagged in the output. LLM variance is the dominant error source at
small effect sizes β the recorded demonstration below shows it evicting a rule. The
selector is variance-aware: it computes the standard error of the per-task savings, and
when a verdict sits within one standard error of the keep/evict threshold it spends one
bounded top-up pass (extra suite runs of the measured configuration, budget
configurable via --top-up
, default 1) before deciding; verdicts that remain within noise are recorded with an explicit low-confidence annotation. The benchmark also reports its own meta-cost after every invocation: when benchmarking exceeds 10% of the week's collected real-work tokens, it tells you to bench less.
| Module | Responsibility |
|---|---|
src/db.ts |
|
SQLite schema, versioned migrations (PRAGMA user_version ), typed query helpers |
|
src/transcript.ts |
|
| Pure transcript JSONL parser β usage dedup, tool calls, re-reads, completion heuristic, distiller digest | |
src/collect.ts |
|
| Stop-hook entrypoint; p75 trigger; spawns the distiller detached | |
src/distill.ts |
|
| Waste analysis β 0β2 strict-JSON candidate rules; trigram dedupe | |
src/bench.ts |
|
| Golden-suite runner; baseline freezing; meta-cost accounting | |
src/select.ts |
|
Keep/evict verdicts; round-robin re-audit; MEMORY.md compiler |
|
src/status.ts |
|
Read-only reporting behind /warden-status |
|
src/gate.ts |
|
Inter-agent SendMessage approval gate (Agent Teams) |
|
src/notify.ts |
|
| SessionStart nudge when candidates await measurement | |
src/compare.ts |
|
Generic A/B comparison engine (processing-token verdict, variance top-up, runComparison orchestration) shared by model, prompt, and prompt-evolution benchmarking |
|
src/modelbench.ts |
|
| Model-migration benchmarking: candidate model vs. agent default | |
src/promptbench.ts |
|
| Prompt A/B benchmarking: variant agent definition vs. shipped | |
src/evolve.ts |
|
| Automated prompt evolution: propose a cheaper prompt (model call) β measure β recommend |
Data model (~/.token-warden/warden.db
): runs
(one row per session or golden run,
tagged real
/active
/candidate
/audit
), rules
(the ledger β candidates, active
rules with measured deltas, and evicted rules kept as the negative dataset),
baselines
(frozen run1_tokens
, ratcheting best_tokens
), ruleset_versions
, and
questions
(the inter-agent ledger). Every deviation from the original specification is documented in DECISIONS.md.
frontend
, backend
, sql
, and testing
(agents/*.md
) are standard Claude Code
subagents with memory: user
and domain-scoped prompts seeded with efficiency behaviors (Grep before Read, never re-read a file, one-line plan before editing). Use them like any subagent β the optimizer extends each one's memory independently. Per-agent isolation is deliberate: a rule that pays rent for the sql agent is never charged to the frontend agent's context.
With CLAUDE_CODE_EXPERIMENTAL_AGENT_TEAMS=1
, a PreToolUse
hook intercepts every
SendMessage
between agents and escalates to you:
[frontend β backend] "What does the orders service return on partial failure?" β approve?
Every question is logged to the questions
table β approved sends are confirmed by a
PostToolUse
hook; denied ones stay pending β and per-agent question volume surfaces in
/warden-status
. An agent that asks a lot is an agent whose memory is missing something. Without the env flag the gate is structurally inert and everything else works untouched. The gate fails open: an internal error defers to the normal permission flow rather than blocking team messaging.
Candidate rules are never injected until measured. Unverified rules get no context space; candidates live only in SQLite.β compiled from the rule ledger, overwritten wholesale, never hand-edited or agent-appended.MEMORY.md
is a build artifactFitness = tokens per completed task. Incomplete runs are excluded from savings math.Golden tasks run against the frozen fixture, never a live codebase.** First-run baselines are frozen forever.**run1_tokens
is the permanent denominator of every improvement claim.The optimizer never re-does past workβ all learning is feed-forward.** Eviction is mandatory.**Rules must earn at least 2Γ their context rent, and active rules are re-audited round-robin.
Recorded 2026-06-12; every number is from real headless runs.
A candidate is born. Run #13, an sql
golden run, cost 61,003 tokens β above the agent's rolling p75. The distiller proposed two candidates:
| rule | body | rent |
|---|---|---|
| #3 | "Consolidate file discovery into single queries instead of multiple find/ls operations across related paths." | 27 |
| #4 | "Parse task descriptions for technical direction; verify schema/dependencies only if code doesn't clarify them." | 28 |
The selector measures them (24 headless runs: shared baseline, one configuration per candidate, one re-audit). Mean completed tokens per task:
| configuration | sql-01 | sql-02 | sql-03 | delta |
|---|---|---|---|---|
| baseline (active set) | 39,572 | 70,762 β | 50,304 | β |
| + rule #3 | 39,541 | 67,114 | 52,116 | +622 saved/run |
| + rule #4 | 39,664 | 54,244 | 49,538 | +5,731 saved/run |
| β rule #1 (re-audit) | 39,671 | 49,006 | 44,315 β | rule #1 worth β9,215 |
β = the two same-configuration runs differed by >25%.
Verdicts (threshold: savings β₯ 2Γ rent):
- rule #3 β ACTIVE(622 β₯ 54) - rule #4 β ACTIVE(5,731 β₯ 56) - rule #1 ("Use Grep to locate symbols before reading any file."), active since the previous selector run at +3,673, was EVICTED on re-audit at β9,215: with the two new rules present, removing it made the suite cheaper. This is mandatory eviction working as designed β and an honest illustration that run-to-run variance dominates at small effect sizes. Evicted rules are retained as the negative dataset, and trigram dedupe prevents a falsified rule from being re-proposed.
The compiled memory (~/.claude/agent-memory/sql/MEMORY.md
, ruleset v2):
<!-- GENERATED BY token-warden β do not hand-edit -->
- Parse task descriptions for technical direction; verify schema/dependencies only if code doesn't clarify them.
- Consolidate file discovery into single queries instead of multiple find/ls operations across related paths.
npm run typecheck && npm run lint && npm run test
The unit suite (count in the CI badge above β hard-coding it here rotted once already)
spans every module. The transcript parser carries the densest coverage
(usage dedup, completion heuristics, malformed-line tolerance, a 5 MB / 2 s performance
budget) against committed anonymized fixtures. The hook entrypoints (collect.ts
,
gate.ts
) are tested as real child processes against temp databases, including
corrupt-input and fail-open paths. The selector core is tested with an injected fake
suite-runner, so verdict logic, regression eviction, re-audit, and memory compilation
are verified without spending model tokens. Strict TypeScript (noUncheckedIndexedAccess
), Biome for lint/format, vitest for tests.
The fixture has its own independent suite (cd benchmarks/fixture && npm test
) and is excluded from plugin CI β its deliberate flaws are benchmark material, not bugs.
| Path | Contents |
|---|---|
~/.token-warden/warden.db |
|
The ledger (override with TOKEN_WARDEN_DB ) |
|
~/.token-warden/{collect,distill,gate}.log |
|
| Component logs β hooks never surface errors into sessions | |
~/.claude/agent-memory/<agent>/MEMORY.md |
|
| Compiled rules (generated; do not hand-edit) | |
benchmarks/fixture/ |
|
| The frozen benchmark codebase |
The ledger contains untrusted text: rule bodies and eviction reasons are model-generated, project paths and question senders come from the environment. Defenses, in order:
- The distiller rejects rule bodies containing control characters or newlines at the source.
renderStatus
sanitizes every untrusted string it displays (ANSI/control characters stripped, newlines collapsed, length clamped), so collected data cannot forge report sections.- The
/warden-status
command instructs the relaying Claude to treat report contents as data, never as instructions.
The inter-agent gate is an observability and approval layer, not a security boundary β it fails open by design so a broken gate can never block team messaging.
Shipped since v0.1.0:
- β
Subagent collectionβ
Stop
andSubagentStop
hooks, so the four domain agents' real work reaches the ledger (previously only the main session was collected and the learning loop could not engage on real work at all). - β
Variance-aware verdictsβ standard-error analysis of per-task savings with a bounded top-up pass when a verdict is within noise of the threshold (--top-up
). - β
Selection nudgeβ aSessionStart
hook surfaces pending candidates;/warden-select
runs the measurement on demand. - β
Question-driven distillationβ an agent's recent cross-agent questions are fed to the distiller as a memory-gap signal. - β
Per-project trackingβ real-work sessions record their project; status breaks down token volume per project. - β
Rule provenanceβ active rules show the run they were distilled from. - β
Cross-project learning curvesβ/warden-status
charts average completed real-work session cost per ruleset version, per agent and per project (domain agents only;main
never has compiled rules). This is the test of the system's core thesis: golden-suite gains must show up in real work. - β
Model-migration benchmarkingβ/warden-modelbench
runs an agent's golden suite under a candidate model vs. its current one (rules held constant) and reports which uses fewer tokens for that workload. The frozen suiteisthe fixed workload you need when a new model ships. The verdict uses processing tokens (input + output + cache_creation); cache-read is reported separately because it skews raw cross-model totals, and token counts are never converted to dollars (models are priced differently per token). - β
Prompt / agent-definition A/B testingβ/warden-promptbench
runs an agent's golden suite under a variant agent definition vs. the shipped one (rules and model held constant), so a proposed prompt edit can be kept or rejected on measured token savings rather than vibes. The comparison engine (compare.ts
) is shared with model benchmarking: the discipline generalizes from "rule selection" to "any context change." - β
Automated prompt evolutionβ/warden-evolve
proposes a token-cheaper rewrite of an agent's prompt (a model call, like the rule distiller proposes rules), measures it through the shared engine, and recommends a winner to a proposals file. Deliberatelynotauto-applied:agents/<name>.md
is committed source, and three golden tasks can't fully capture an agent's behavior, so a human reviews and applies. Protected frontmatter (name/tools/model/memory) is enforced unchanged before measurement.
Near-term:
Golden suite growthβ heavy tasks (testing-02
β 150k tokens/run) deserve splitting into new tasks (existing baselines stay frozen; replacing a task would invalidate its denominator, so growth meansaddingtask files, never editing them).Fully scheduled selectionβ auto-running the selector on a cron/routine once variance handling has earned trust; today it deliberately stays a user decision.Transcript provenanceβ link a rule'sborn-of
run to its archived transcript digest for post-hoc review.
Bigger directions β the reusable asset here is the frozen-benchmark + measured-verdict discipline, which generalizes well beyond efficiency rules:
Model-migration benchmarkingβ the frozen golden suite is exactly the fixed workload you need when a new model ships:warden-bench
could answer "is the new model cheaper onmyworkload" with the rigor rules already get.Prompt / agent-definition A/B testingβ the benchmark measures any context change, not just rules; treat an agent's system prompt as a candidate and let the selector keep edits that earn their place.Team-shared rule ledgersβ commit measured rules to a repo (via project-scoped memory) with token-warden as the CI gate, so a PR adding a rule must carry its measured delta. Memory review becomes code review.Real-time cost anomaly alertingβ the p75 machinery already detects expensive sessions; a Stop-hook message could tell the session itself where its tokens went.
MIT