Every AI coding agent has the same hidden tax: the context window. Anthropic’s guide to effective context engineering calls context “a critical yet limited resource,” and the research behind it is blunt: as the window fills, model accuracy degrades. The team calls this context rot. Token optimization isn’t only about cost. Every token your tooling injects is a token the model can’t spend reasoning about the user’s actual work.
That tax lands on plugin authors too. The Postman plugin for Claude Code is pure instructional Markdown: commands, skills, and agents that teach Claude the full Postman API lifecycle. There’s no runtime to profile and no binary to shrink. Its entire footprint is context-window tokens, paid inside every user’s session.
I recently ran a token-usage review on the plugin and shipped an optimization pass. The headline numbers: the plugin’s largest skill is now 60% lighter per trigger, the always-on overhead every session pays dropped by 20%, and a typical “explore an API and generate a client” session starts roughly 3,600 tokens lighter — about 65% less plugin overhead before any work happens. Here’s where the savings came from.
Where a Claude Code plugin spends tokens
A plugin built from Markdown spends tokens in three distinct ways, and they’re not equally expensive:
- Always-on cost. Every skill, command, and agent `description` in the YAML front matter is injected into every session’s system prompt, whether or not the user touches Postman that day. This is the most expensive token in the plugin: every user pays it, every session.
- Per-trigger cost. When Claude decides a skill is relevant, the entire `SKILL.md` body loads into context. A 19 KB skill costs roughly 4,800 tokens every time it fires, even if the user only needed a third of it. This layered model is documented in the Claude Code skills docs.
- Runtime cost. Tool output, async polling loops, and verbose narration while a command runs. Tool schemas add up fast here too: a discussion on the MCP specification repo measured roughly 1,000 tokens per complex tool definition.
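The byte and token figures in this post line up with a rough rule of about four bytes of Markdown per token. Here is a minimal sketch of that sizing rule. The ratio is an assumption that happens to fit these numbers, not a tokenizer guarantee, and `estimate_tokens` is a hypothetical helper name.

```python
# Rough sizing heuristic: about 4 bytes of Markdown per token.
# BYTES_PER_TOKEN is an assumption that fits the figures in this post;
# a real tokenizer will give somewhat different counts.
BYTES_PER_TOKEN = 4


def estimate_tokens(n_bytes: int) -> int:
    """Estimate the token cost of n_bytes of Markdown text."""
    return round(n_bytes / BYTES_PER_TOKEN)


if __name__ == "__main__":
    # A 19 KB SKILL.md body, paid on every trigger.
    print(estimate_tokens(19 * 1024))  # 4864
    # A 620-byte description block, paid in every session.
    print(estimate_tokens(620))  # 155
```

For real measurements, the `/context` output in Claude Code is the better source; the heuristic is only useful for quick comparisons between files.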
60% lighter skills with progressive disclosure
The biggest saving came from the per-trigger cost. Anthropic’s engineering team describes progressive disclosure as the foundational pattern of the Agent Skills standard: metadata loads at startup, the `SKILL.md` body loads when relevant, and bundled reference files load only when a specific step needs them.
A skill doesn’t need to front-load every rule it might ever apply. It needs the workflow, plus pointers to detailed rules that Claude reads at the step that needs them:
```markdown
## Step 4: Generate the client code
Before writing any code, read `references/code-generation.md` in this
skill's directory. It contains the full rule catalog for idiomatic
client generation.
```
We applied this split to the plugin’s two largest skills:
| Skill | Before | After | Saving per trigger |
|---|---|---|---|
| `postman-context` | ~4,760 tokens | ~1,930 tokens | ~2,800 tokens (60%) |
| `generate-spec` | ~2,640 tokens | ~1,800 tokens | ~840 tokens (32%) |
No content was deleted. The detailed rules and templates are intact, deferred rather than removed. A user who asks “find me an email API” no longer pays ~2,800 tokens for code-generation rules they aren’t using. A user who does generate code pays the same total as before.
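That trade-off is easy to model. The sketch below uses the `postman-context` figures from the table and assumes, as a simplification, that all deferred content sits in one reference bundle whose size is the difference between the before and after bodies. The names are hypothetical.

```python
# Per-trigger cost of a skill split into a core body plus on-demand references.
# Token figures come from the postman-context row of the table; the reference
# size is derived as (before - after), assuming nothing was deleted.
BEFORE_TOKENS = 4760  # monolithic SKILL.md body
CORE_TOKENS = 1930    # SKILL.md body after the split
REFERENCE_TOKENS = BEFORE_TOKENS - CORE_TOKENS  # deferred rules and templates


def trigger_cost(reads_references: bool) -> int:
    """Tokens loaded when the skill fires, with or without the reference read."""
    return CORE_TOKENS + (REFERENCE_TOKENS if reads_references else 0)


search_only = trigger_cost(reads_references=False)
generates_code = trigger_cost(reads_references=True)

print(search_only)                  # 1930
print(generates_code)               # 4760
print(BEFORE_TOKENS - search_only)  # 2830
```

The search-only path keeps the saving; the code-generation path lands back on the original total, which is the point of deferring rather than deleting.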
100% of the routing skill, gone
The plugin shipped a `postman-routing` skill (roughly 835 tokens) whose trigger was “use when user mentions APIs.” That’s broad enough to fire in nearly any backend coding session, Postman-related or not. Its body was a routing table that restated what every command’s `description` already tells Claude.
Modern Claude Code routes natively by matching user intent against component descriptions, so the skill was pure duplicate state. We deleted it.
Saving: ~835 tokens in every session where it fired — and given that trigger, it fired in most sessions in API codebases.
20% off the always-on overhead
Several command descriptions enumerated long quoted trigger-phrase lists:
```yaml
description: Run Postman collection tests using Postman CLI - use when
  user says "run tests", "run collection", "run my postman tests",
  "verify changes", "check if tests pass", or wants to execute API
  test suites after code changes
```
Claude’s router doesn’t need a phrasebook. It needs the capability and when to use it:
```yaml
description: Run Postman Collection tests with the Postman CLI and
  report failures. Use after code changes or when the user asks to
  run API tests.
```
Rewriting these, combined with the routing-skill removal, shrank the always-on description block from 3,182 to 2,562 bytes, a 20% reduction worth ~155 tokens in every session of every plugin user. That makes descriptions the highest-leverage bytes in the repo. Anthropic’s Agent Skills announcement makes the same point: discovery metadata works best as one or two tight sentences.
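To track this number over time, you can measure the `description` value in each component’s front matter. The sketch below is a deliberately simple parser, not the tooling I used: it assumes the front matter sits between `---` lines at the top of the file and that the description is a plain scalar with indented continuation lines. `description_bytes` is a hypothetical name.

```python
import re


def description_bytes(markdown: str) -> int:
    """Return the UTF-8 size of the front-matter `description` value, or 0."""
    match = re.match(r"---\n(.*?)\n---", markdown, re.DOTALL)
    if not match:
        return 0
    collected = []
    in_description = False
    for line in match.group(1).split("\n"):
        if line.startswith("description:"):
            in_description = True
            collected.append(line[len("description:"):].strip())
        elif in_description and line.startswith((" ", "\t")):
            collected.append(line.strip())  # indented continuation line
        else:
            in_description = False
    return len(" ".join(collected).encode("utf-8"))


BEFORE = """\
---
description: Run Postman collection tests using Postman CLI - use when
  user says "run tests", "run collection", "run my postman tests",
  "verify changes", "check if tests pass", or wants to execute API
  test suites after code changes
---
"""

AFTER = """\
---
description: Run Postman Collection tests with the Postman CLI and
  report failures. Use after code changes or when the user asks to
  run API tests.
---
"""

print(description_bytes(BEFORE), description_bytes(AFTER))
```

Summing this over every command, skill, and agent file gives the always-on byte count; a real YAML parser would be the safer choice for block scalars or quoted values.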
90% fewer tool schemas for subagents
Every MCP-backed command and the plugin’s readiness-analyzer agent previously declared a wildcard, granting access to all 100+ tools on the Postman MCP Server. Each now lists exactly what it uses. The readiness analyzer went from 111 tools to 11 (90% fewer), and the `setup` command declares six.
For subagents and clients that resolve tool schemas from the allowlist, that’s an order of magnitude fewer schemas loaded into context. It also keeps the model from wandering into unrelated tools mid-command, which matches Anthropic’s guidance that overlapping tool sets create ambiguity. As a bonus, the audit surfaced three latent permission bugs where commands were instructed to write or edit files without the permissions to do so. Wildcards hide that class of bug; explicit lists make it visible in review.
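A review check for this can be very small. The sketch below flags allowlist entries that end in a wildcard and computes the schema reduction for the readiness analyzer. The function names are hypothetical, and the tool names in the example are illustrative placeholders, not the plugin’s actual allowlist.

```python
def wildcard_entries(allowed_tools: list[str]) -> list[str]:
    """Return the allowlist entries that end in a wildcard."""
    return [tool for tool in allowed_tools if tool.endswith("*")]


def schema_reduction(before: int, after: int) -> float:
    """Fraction of tool schemas no longer loaded after scoping the allowlist."""
    return (before - after) / before


# Illustrative entries only; the real plugin lists its own tool names.
example_allowlist = ["mcp__postman__*", "Read", "mcp__postman__exampleTool"]

print(wildcard_entries(example_allowlist))    # ['mcp__postman__*']
print(f"{schema_reduction(111, 11):.0%}")     # 90%
```

Running a check like this in CI keeps a wildcard from creeping back in unnoticed.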
Less chatter from async workflows
Some Postman MCP Server tools return HTTP 202 and require polling for completion. Left to its own devices, the model will happily narrate every poll, and all of it accumulates in context. The affected commands now carry two lines of instruction: poll with increasing waits (2s, then 4s, then 8s), and report only the final outcome. Fewer round-trips, less narration, same result.
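In the plugin this is an instruction to the model, not code, but the schedule is the same one you would write in a client. Here is a minimal sketch of that backoff in Python, assuming a hypothetical `check` callable that returns `None` while the job is pending and a result once it finishes.

```python
import time
from typing import Callable, Optional


def poll_until_done(
    check: Callable[[], Optional[str]],
    waits: tuple[float, ...] = (2, 4, 8),
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll check() with increasing waits and return only the final result.

    Returns None if the job is still pending after the last wait.
    """
    result = check()
    for wait in waits:
        if result is not None:
            break
        sleep(wait)  # 2s, then 4s, then 8s
        result = check()
    return result


# Demo with a fake job that finishes on the third check and a fake sleep
# that records the waits instead of blocking.
_responses = iter([None, None, "completed"])
_waited: list[float] = []
print(poll_until_done(lambda: next(_responses), sleep=_waited.append))  # completed
print(_waited)  # [2, 4]
```

Nothing is reported between polls; the caller sees one value at the end, which is the behavior the two instruction lines ask the model for.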
What it adds up to
| Change | Who pays today | Saving |
|---|---|---|
| Routing skill removed | Nearly every session in an API codebase | ~835 tokens/session (100% of the skill) |
| Description trims | Every session, every user | ~155 tokens/session (20% of always-on overhead) |
| `postman-context` split | Every session that triggers the skill | up to ~2,800 tokens/trigger (60%) |
| `generate-spec` split | Every session that triggers the skill | up to ~840 tokens/trigger (32%) |
| Scoped `allowed-tools` | Subagent spawns; clients that eagerly load tool schemas | 90% fewer schemas loaded |
| Polling guidance | Long-running async commands | variable; fewer round-trips and less narration |
A typical “explore an API and generate a client” session that previously loaded the routing skill plus the full `postman-context` skill now starts roughly 3,600 tokens (about 65%) lighter. A session that never touches Postman saves ~990 tokens it used to spend on routing overhead. We also updated the plugin’s contributor docs so the conventions stick: descriptions stay short, bulky skill content goes in `references/`, and `allowed-tools` lists explicit tool names.
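The two session figures are sums of rows in the table. The sketch below shows one reading of how they compose; the grouping of rows into each scenario is my interpretation of the paragraph above.

```python
# Per-session savings from the summary table, in tokens.
ROUTING_SKILL_REMOVED = 835
DESCRIPTION_TRIMS = 155
POSTMAN_CONTEXT_SPLIT = 2800

# "Explore an API and generate a client": routing skill plus the skill split.
explore_session = ROUTING_SKILL_REMOVED + POSTMAN_CONTEXT_SPLIT
# A session that never touches Postman: routing skill plus description trims.
idle_session = ROUTING_SKILL_REMOVED + DESCRIPTION_TRIMS

print(explore_session)  # 3635
print(idle_session)     # 990
```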
Real-world test: API spec drift detection end-to-end
Numbers in a table are one thing. Watching them play out on a real task is more useful.
To validate the optimization work, I ran the same prompt against a live GitHub repository using both the pre-optimization and post-optimization versions of the plugin. The task was a non-trivial agentic workflow: scan a workspace for API spec drift, validate the contract, and open a pull request with any code fixes.
```text
look at my workspace. Identify all API spec drift. Ensure the API contract is valid.
Open a PR in https://github.com/buildwithtalia/enterprise-resource-planning
with any code fixes.
```
Pre-optimization (v1.1.x)
```text
Total cost: $3.60
Total duration (API): 15m 52s
Total duration (wall): 17m 10s
Total code changes: 1159 lines added, 13 lines removed
Usage by model:
claude-haiku-4-5: 104.3k input, 22.8k output, 0 cache read, 0 cache write ($0.2181)
claude-sonnet-4-6: 1.8k input, 60.5k output, 4.5m cache read, 298.7k cache write ($3.38)
```
Actual tokens of new work (input + cache write + output, excluding re-reads): ~361k tokens for Sonnet alone, plus ~127k for Haiku background tasks.
Post-optimization (v1.2.0)
```text
Total cost: $2.64
Total duration (API): 7m 47s
Total duration (wall): 9m 12s
Total code changes: 119 lines added, 12 lines removed
Usage by model:
claude-haiku-4-5: 482 input, 18 output, 0 cache read, 0 cache write ($0.0006)
claude-sonnet-4-6: 395 input, 31.4k output, 5.5m cache read, 137.3k cache write ($2.64)
```
Actual tokens of new work: ~169k tokens (395 input + 137.3k cache write + 31.4k output). Haiku dropped to ~500 tokens — effectively background housekeeping.
What the numbers say
| Metric | Pre-optimization | Post-optimization | Change |
|---|---|---|---|
| Session cost | $3.60 | $2.64 | 27% cheaper |
| Wall time | 17m 10s | 9m 12s | ~46% faster |
| Actual new tokens (Sonnet) | ~361k | ~169k | ~53% fewer |
| Cache efficiency (read:write ratio) | ~15:1 | ~40:1 | 2.7× better |
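The token rows in this table can be re-derived from the two Sonnet usage lines. Here is a short sketch of that arithmetic; the class and method names are hypothetical, and the inputs are the rounded figures the CLI printed.

```python
from dataclasses import dataclass


@dataclass
class SonnetUsage:
    input_tokens: int
    output_tokens: int
    cache_read: int
    cache_write: int

    def new_work(self) -> int:
        """Input + cache write + output, excluding cache re-reads."""
        return self.input_tokens + self.cache_write + self.output_tokens

    def read_write_ratio(self) -> float:
        """Cache reads per cache write."""
        return self.cache_read / self.cache_write


pre = SonnetUsage(1_800, 60_500, 4_500_000, 298_700)
post = SonnetUsage(395, 31_400, 5_500_000, 137_300)

print(round(pre.new_work() / 1000))     # 361
print(round(post.new_work() / 1000))    # 169
print(round(pre.read_write_ratio()))    # 15
print(round(post.read_write_ratio()))   # 40
print(f"{1 - post.new_work() / pre.new_work():.0%} fewer new tokens")  # 53% fewer new tokens
```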
The 5.5 million cache reads in the post-optimization session might look alarming until you remember that cache reads are billed at about 10% of the base input-token price. The 40:1 read-to-write ratio means the context window was stable across turns, so almost nothing had to be re-cached. Of the $2.64, cache reads account for roughly $1.65, cache writes ~$0.51, and output tokens ~$0.47. That’s exactly what a well-cached session should look like.
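That breakdown can be reproduced from the Sonnet usage line. The per-million-token prices below are assumptions on my part (they are not printed in the session summary), chosen because they match the 10% cache-read ratio and land on the same totals.

```python
# Assumed USD prices per million tokens; not taken from the session output.
PRICE_PER_MTOK = {
    "input": 3.00,
    "cache_write": 3.75,
    "cache_read": 0.30,  # 10% of the assumed base input price
    "output": 15.00,
}

# Post-optimization Sonnet usage, as printed in the session summary.
usage = {
    "input": 395,
    "cache_write": 137_300,
    "cache_read": 5_500_000,
    "output": 31_400,
}

cost = {kind: usage[kind] * PRICE_PER_MTOK[kind] / 1_000_000 for kind in usage}

for kind, dollars in cost.items():
    print(f"{kind}: ${dollars:.2f}")
print(f"total: ${sum(cost.values()):.2f}")  # total: $2.64
```

Under these assumed prices, fresh input is a rounding error and cache reads dominate the bill, which is why shrinking what gets re-read on every turn matters more than shrinking any single prompt.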
There’s also a more interesting data point buried in the code-changes line: the pre-optimization session added 1,159 lines; the post-optimization session added 119. The optimized plugin didn’t just use fewer tokens; it used them more precisely, producing a tighter, more surgical diff for the same task. Less context noise, more signal.
Takeaways for plugin authors
If you maintain or are about to publish a Claude Code plugin, these are the rules I’d start with:
- Treat front-matter descriptions as the most expensive real estate you own. They’re injected into every session. One or two sentences: what it does, when to use it.
- Progressive disclosure beats monolithic skills. Keep `SKILL.md` to the workflow, around 6 KB or less, and move templates, rule catalogs, and edge-case handling to `references/*.md` files the skill reads on demand.
- Don’t build what the harness already does. A routing skill that duplicates Claude’s native description-based routing costs tokens twice and creates a maintenance hazard.
- Scope `allowed-tools` to what each component calls. It’s least-privilege hygiene, it loads fewer schemas where that matters, and the audit itself tends to find permission bugs.
- Make polling cheap. Any async workflow needs explicit backoff and final-result-only reporting, or the model narrates every poll.
What I like about this kind of token optimization work is that none of it required clever engineering. It required measuring where the tokens go and being honest about which ones earn their place. The same context-engineering discipline Anthropic recommends for agents applies one level down, to the tooling we hand them.
All of these savings are live today: install the latest version of the Postman plugin from the Claude Plugin Marketplace and your next session picks them up automatically. If you’re building your own plugin, clone the plugin repo to see the patterns in place, then check what your own skills cost per trigger. If your largest `SKILL.md` is over 10 KB, try the `references/` split and compare your `/context` output before and after.
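A quick way to find candidates for that split is to scan the plugin directory for large skill bodies. Here is a minimal sketch, assuming skills live in files named `SKILL.md` somewhere under the directory you point it at; `oversized_skills` is a hypothetical helper.

```python
from pathlib import Path

THRESHOLD_BYTES = 10 * 1024  # the 10 KB rule of thumb from this post


def oversized_skills(
    root: Path, threshold: int = THRESHOLD_BYTES
) -> list[tuple[Path, int]]:
    """Return (path, size_in_bytes) for every SKILL.md under root above threshold."""
    found = []
    for path in sorted(root.rglob("SKILL.md")):
        size = path.stat().st_size
        if size > threshold:
            found.append((path, size))
    return found


if __name__ == "__main__":
    # Scan the current directory and list the skills worth splitting.
    for path, size in oversized_skills(Path(".")):
        print(f"{path}: {size / 1024:.1f} KB")
```

File size is only a proxy; the `/context` comparison is still the measurement that counts.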
Resources
- Postman plugin on the Claude Plugin Marketplace
- Postman plugin for Claude Code on GitHub
- Postman MCP Server on GitHub
- Announcing the Postman plugin for Claude Code
- Effective context engineering for AI agents (Anthropic)
- Equipping agents for the real world with Agent Skills (Anthropic)
- Agent Skills open standard
- Claude Code skills documentation
- MCP tool schema token overhead discussion