Measure Your MCP Server's Token Tax in 60 Seconds

A developer has created a 60-second audit script to measure the "MCP server token tax"—the context budget consumed by tool definitions before an agent performs any useful work. Running the script against the real filesystem MCP server revealed 14 tools consuming 2,638 tokens, approximately 1.3% of a 200K context window. The measurement, performed using `tiktoken`'s `o200k_base` encoding, aims to provide developers with actual per-tool costs rather than relying on repeated industry figures.

The MCP server token tax is the context budget every tool definition eats before your agent does a single useful thing. To measure it, pull the server's `tools/list` JSON and tokenize each definition with `tiktoken`. Claude Code's Tool Search defers loading; it doesn't reduce the tax. Run the 60-second audit below and you'll see your real per-tool cost instead of repeating someone else's number. My run of the real filesystem server: 14 tools, 2,638 tokens, ~1.3% of a 200K window.

AI disclosure: I wrote `mcp_token_tax.py` with AI assistance and ran it myself before publishing. Every number below is pasted from a real run of that script, or it's an external figure with a dated link next to it. I label which is which.

You've seen the figure quoted everywhere this spring: "the GitHub MCP server costs you tens of thousands of tokens before you ask anything." It gets repeated in threads, in newsletters, in conference hallway chatter. Here's a question almost nobody answers when they quote it: with which tokenizer, against which `tools/list`?

I didn't want to repeat a number. I wanted to measure one. So I drove the real, published filesystem MCP server, captured its actual `tools/list`, and counted. The answer surprised me, and it's the reason this post exists.

TL;DR: I measured the real `@modelcontextprotocol/server-filesystem` server with `tiktoken`. Copy the script, run it, audit your own stack.

This is the first post in a small thread on MCP FinOps: measure before you cut. It sits next to the control side of my work: a hard spend-cap that stops a runaway agent loop (https://finops.spinov.online/blog/a-47k-agent-loop-spend-cap/) and the pre-execution gate that refuses a bad agent action before it runs (https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/). Those stop bad actions.
This one just gives you a number, because you can't cut what you haven't measured.

A tool definition is text: name, title, a human-readable description, the JSON Schema for its inputs and now, often, an output schema and annotations. When you connect an MCP server, the host serializes all of that and injects it into the model's context so it knows the tool exists and how to call it.

That text doesn't get charged once. It rides along on turn after turn, because the model has to keep "seeing" the tools to use them. Ten tools you never call still sit in the window, quietly, on every request. That's the tax: rent on capability you've declared but may not be using.

Two costs come out of it. The obvious one is dollars: input tokens you pay for repeatedly. The sneakier one is room. Every token of definition is a token not available for the actual conversation, the retrieved docs, the file you pasted.

The MCP spec is moving toward a stateless core in the 2026-07-28 release candidate (https://blog.modelcontextprotocol.io/posts/2026-07-28-release-candidate/), which reshapes a lot, but it doesn't change the basic physics here. Definitions still have to reach the model somehow.

Ken Alger named the downstream symptom plainly in his March 2026 piece on multi-agent MCP (https://www.kenwalger.com/blog/ai/mcp-multi-agent-orchestration-forensics/): "A single agent juggling too many tools often suffers from… Tool Confusion: choosing the wrong function when multiple tools are available," plus "Latency and Cost." Tokens are one face of that. Accuracy is the other. Anthropic's own testing, in its November 2025 post on advanced tool use (https://www.anthropic.com/engineering/advanced-tool-use), showed a model's tool-selection accuracy climbing from 49% to 74% once it stopped carrying every definition at once. Fewer tools in context, better choices. The tax isn't only financial.

Here's the whole thing.
It does one job: read a `tools/list`, tokenize each tool with `tiktoken`'s `o200k_base` encoding (the gpt-4o family encoding; swap in `cl100k_base` for older models), and print a per-tool table with the share of your context window and a dollars-per-round estimate.

You feed it tools two ways. Point it at a published stdio server and it'll do the JSON-RPC handshake and capture the real `tools/list` live: keyless, read-only, and it never calls `tools/call`, so nothing executes. Or hand it a JSON fixture you saved earlier, for a deterministic run that reproduces byte-for-byte.

```python
#!/usr/bin/env python3
"""mcp_token_tax.py - measure the token tax of an MCP server's tool definitions."""
import argparse, json, subprocess, sys, threading

try:
    import tiktoken
except ImportError:
    sys.exit("tiktoken is required: pip install tiktoken")


def serialize_tool(tool: dict) -> str:
    """The text a host puts in context for one tool, as compact JSON.

    Hosts frame this differently, so it's a close approximation, not a
    provider's billing meter. Counted the same way for every tool, so the
    ranking and relative shares hold even where the absolute number drifts.
    """
    return json.dumps(tool, ensure_ascii=False, separators=(",", ":"))


def measure(tools, encoding_name="o200k_base"):
    """Return one row per tool (name, tokens, param count), heaviest first."""
    enc = tiktoken.get_encoding(encoding_name)
    rows = []
    for t in tools:
        rows.append({
            "name": t.get("name", "<unnamed>"),
            "tokens": len(enc.encode(serialize_tool(t))),
            "n_params": len((t.get("inputSchema") or {}).get("properties", {})),
        })
    rows.sort(key=lambda r: r["tokens"], reverse=True)
    return rows


def from_fixture(path):
    """Load a saved tools/list capture: {"serverInfo": {...}, "tools": [...]}."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return data.get("serverInfo", {}), data["tools"]


def from_server(server_cmd, timeout=90):
    """Start a stdio MCP server, do the handshake, and capture tools/list."""
    proc = subprocess.Popen(server_cmd.split(), stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                            text=True, bufsize=1)

    def send(o):
        proc.stdin.write(json.dumps(o) + "\n")
        proc.stdin.flush()

    def read_id(tid, budget=60):
        # Read stdout lines until the response with the matching id shows up.
        # Each readline runs in a thread so a silent server can't block forever.
        for _ in range(80):
            box = {}
            th = threading.Thread(
                target=lambda: box.update(line=proc.stdout.readline()),
                daemon=True)
            th.start()
            th.join(budget)
            line = box.get("line", "")
            if not line:
                continue
            try:
                msg = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON log output
            if msg.get("id") == tid:
                return msg
        return None

    send({"jsonrpc": "2.0", "id": 1, "method": "initialize", "params": {
        "protocolVersion": "2025-06-18", "capabilities": {},
        "clientInfo": {"name": "mcp-token-tax", "version": "0.1"}}})
    init = read_id(1, timeout)
    send({"jsonrpc": "2.0", "method": "notifications/initialized", "params": {}})
    send({"jsonrpc": "2.0", "id": 2, "method": "tools/list", "params": {}})
    listed = read_id(2, timeout)
    proc.terminate()
    if listed is None:
        sys.exit("Could not get tools/list. " + (proc.stderr.read() or "")[:500])
    return (init or {}).get("result", {}).get("serverInfo", {}), listed["result"]["tools"]


def report(server, rows, ctx, price):
    """Print the per-tool table and the context/cost summary."""
    total = sum(r["tokens"] for r in rows)
    print(f"MCP token tax - {server.get('name', '?')} v{server.get('version', '?')}")
    print(f"encoding: o200k_base (gpt-4o family) | {len(rows)} tools\n")
    print(f"{'tool':<28}{'tokens':>8}{'% toolset':>11}{'params':>8}")
    print("-" * 55)
    for r in rows:
        print(f"{r['name']:<28}{r['tokens']:>8}{r['tokens']/total*100:>10.1f}%{r['n_params']:>8}")
    print("-" * 55)
    print(f"{'TOTAL':<28}{total:>8}{'100.0%':>11}\n")
    print("Context budget burned before the first prompt:")
    print(f"  total definition tokens : {total:,}")
    print(f"  share of a {ctx:,}-token window : {total/ctx*100:.1f}%")
    print(f"  est. cost per round @ ${price:.2f}/1M input : ${total/1_000_000*price:.4f}")


def main():
    ap = argparse.ArgumentParser()
    g = ap.add_mutually_exclusive_group(required=True)
    g.add_argument("--fixture")
    g.add_argument("--server")
    ap.add_argument("--ctx", type=int, default=200_000)
    ap.add_argument("--price", type=float, default=3.00)
    a = ap.parse_args()
    server, tools = from_fixture(a.fixture) if a.fixture else from_server(a.server)
    report(server, measure(tools), a.ctx, a.price)


if __name__ == "__main__":
    main()
```

Two ways to run it. Live against the real npm package:

```bash
pip install tiktoken
python3 mcp_token_tax.py \
  --server "npx -y @modelcontextprotocol/server-filesystem@latest /tmp"
```

Or against a `tools/list` you've already saved (deterministic):

```bash
python3 mcp_token_tax.py --fixture filesystem_toolslist.json --price 3.00
```

I ran the live and the fixture paths side by side; for the same captured version they produce identical token counts, which is the point. The fixture is captured real output, not a guess. One caveat: `@latest` is a moving target. When the package ships new descriptions or schema fields, the live count will drift from the fixture. Pin a version (`@2025...`) if you need a number you can diff next month.
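If you pin a version and keep the per-tool counts from each run, comparing two runs is a small dictionary diff. The sketch below is a minimal illustration under that assumption: `diff_counts` is a hypothetical helper, not part of `mcp_token_tax.py`, and both sample dictionaries hold made-up numbers.

```python
def diff_counts(old: dict[str, int], new: dict[str, int]) -> list[tuple[str, int, int, int]]:
    """Return (tool, old_tokens, new_tokens, delta) for every tool whose count changed.

    A tool missing from one side counts as 0 there, so added and removed
    tools show up too. Sorted by absolute delta, largest first.
    """
    changes = []
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name, 0), new.get(name, 0)
        if before != after:
            changes.append((name, before, after, after - before))
    changes.sort(key=lambda c: abs(c[3]), reverse=True)
    return changes


# Hypothetical per-tool counts from two pinned runs (illustrative numbers only).
run_a = {"read_file": 170, "write_file": 168, "move_file": 186}
run_b = {"read_file": 173, "write_file": 168, "search_files": 212}

for name, before, after, delta in diff_counts(run_a, run_b):
    print(f"{name:<16}{before:>6}{after:>6}{delta:>+7}")
```

Treating a missing tool as zero keeps additions and removals in the same table as description edits, which is usually what you want when a package update changes the toolset.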
Here's the verbatim run against the real `@modelcontextprotocol/server-filesystem` (the server reports itself as `secure-filesystem-server` v0.2.0), captured today:

```
MCP token tax - secure-filesystem-server v0.2.0
encoding: o200k_base (gpt-4o family) | 14 tools

tool                          tokens  % toolset  params
-------------------------------------------------------
read_text_file                   250       9.5%       3
edit_file                        239       9.1%       3
search_files                     212       8.0%       3
read_multiple_files              204       7.7%       1
directory_tree                   196       7.4%       2
list_directory_with_sizes        195       7.4%       2
move_file                        186       7.1%       2
read_media_file                  185       7.0%       1
read_file                        173       6.6%       3
create_directory                 171       6.5%       1
write_file                       168       6.4%       2
list_directory                   160       6.1%       1
get_file_info                    156       5.9%       1
list_allowed_directories         143       5.4%       0
-------------------------------------------------------
TOTAL                           2638     100.0%

Context budget burned before the first prompt:
  total definition tokens : 2,638
  share of a 200,000-token window : 1.3%
  est. cost per round @ $3.00/1M input : $0.0079
  heaviest tool : read_text_file (250 tok, 9.5%)
  lightest tool : list_allowed_directories (143 tok)
```

Two notes on units before you quote that 1.3% at me. First, I count with `o200k_base`, the gpt-4o tokenizer. Claude ships no public tokenizer, so on a Claude 200K window this is a proxy, and the real Claude-token figure will differ a little. Second, I count the whole tool object as compact JSON; a host that injects only `name`/`description`/`inputSchema` would see closer to ~1,640 tokens here, while the same objects pretty-printed run ~4,036. So the honest band for "what reaches the model" is roughly 1.6K–4K, and 2,638 is my single defensible point inside it, counted identically for every tool. The ranking and the shares below are rock-solid; treat the absolute total as an order-of-magnitude estimate, not a meter.

Now the honest part. 2,638 tokens. 1.3% of a 200K window. For one server, that is not scary. If you came here expecting me to confirm that any single MCP server is a five-alarm fire, I can't, not this one.
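To make the serialization band above concrete, here is a minimal sketch of the three framings (core fields only, full compact, full pretty-printed) applied to one tool object. Everything in it is an assumption for illustration: `sample_tool` is a hypothetical definition, not one from the filesystem server, and sizes are reported in characters so the sketch runs without `tiktoken`. Pass an encoder's `encode` method as `encode` to get token counts instead.

```python
import json

# Hypothetical tool definition, shaped like one tools/list entry.
sample_tool = {
    "name": "read_note",
    "title": "Read Note",
    "description": "Read a note by id and return its text.",
    "inputSchema": {
        "type": "object",
        "properties": {"id": {"type": "string"}},
        "required": ["id"],
    },
    "outputSchema": {
        "type": "object",
        "properties": {"text": {"type": "string"}},
    },
    "annotations": {"readOnlyHint": True},
}

CORE_FIELDS = ("name", "description", "inputSchema")


def framings(tool: dict) -> dict[str, str]:
    """Serialize one tool three ways: core fields only, full compact, full pretty."""
    core = {k: tool[k] for k in CORE_FIELDS if k in tool}
    return {
        "core_compact": json.dumps(core, ensure_ascii=False, separators=(",", ":")),
        "full_compact": json.dumps(tool, ensure_ascii=False, separators=(",", ":")),
        "full_pretty": json.dumps(tool, ensure_ascii=False, indent=2),
    }


def size(text: str, encode=None) -> int:
    """Length in characters by default; in tokens when an encode function is given."""
    return len(encode(text)) if encode else len(text)


sizes = {label: size(text) for label, text in framings(sample_tool).items()}
for label, n in sizes.items():
    print(f"{label:<14}{n:>6} chars")
```

The ordering (core below full compact below pretty-printed) is the part that carries over to any tool; the gaps between the three depend on how many extra fields a server ships and how deeply its schemas nest.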
The filesystem server is lean: fourteen tools, terse descriptions, simple schemas. Its single most expensive tool, `read_text_file`, costs 250 tokens, mostly because its description spells out the `head`/`tail` behavior in prose.

So where does the panic come from? Two places, and both are real.

First, verbose text drives the cost: prose and schema both, not the raw parameter count. Compare `read_text_file` (250 tokens, a long `head`/`tail` description) with `list_allowed_directories` (143 tokens, almost no prose and zero params). But don't over-credit descriptions alone: across these 14 tools, token count tracks the size of the input schema (correlation ~0.80) more tightly than the length of the description (~0.36). `edit_file` proves it: a short description but a fat nested `edits` schema lands it at 239 tokens, second-heaviest in the table. So the rule isn't "watch your prose," it's "watch your total surface area": paragraph descriptions and sprawling schemas with big enum lists both bill. Servers that ship both pay a tax this lean reference server doesn't. That's why the heavy ones in the wild, GitHub and Slack, clock in an order of magnitude higher.

Second, it compounds. One server at 1.3% is nothing. Now stack the real ones. Anthropic published actual measurements in November 2025: GitHub ~26K tokens, Slack ~21K, and a five-server setup of 58 tools landing at "approximately 55K tokens before the conversation even starts," with 134K "before optimization" on their internal deployment (Anthropic engineering, 2025-11-24, https://www.anthropic.com/engineering/advanced-tool-use). Those are their numbers, measured by them. I'm quoting, not claiming. But the shape is exactly what my run predicts in miniature: a single lean server is cheap, ten chatty servers are a tax bracket.

For the FinOps-minded: the dollar line is the ROI hook. At $3/1M input tokens, one filesystem server costs $0.0079 a round. Trivial.
But run it at the per-turn frequency of an agent loop across a 55K-token multi-server stack, and you're paying for 55K tokens of overhead (about $0.165 per call at that same $3/1M) on every single call, forever, whether or not the model touches those tools. That's the math worth checking against your own bill. Plug your real `--price` and `--ctx` in and read the bottom line.

Claude Code shipped Tool Search, and it's genuinely good. When your tool definitions would exceed roughly 10% of the context window, it stops loading all of them up front; it keeps a lightweight index and pulls the full definition for a tool only when the model decides it needs it. Anthropic reports about an 85% reduction in tokens carried (https://www.anthropic.com/engineering/advanced-tool-use), with the full library still reachable. Real win. Use it.

But notice the verb. It defers loading. It doesn't delete the definition. When the model actually reaches for `create_or_update_file`, that tool's full schema enters context at that moment: you pay the tax then, plus the cost of the search step that found it. The total bill for tools you use is roughly the same; what changes is you stop paying for the 50 tools you don't use this turn.

That's why "we enabled Tool Search, we're done" is the trap. Lazy loading is a great default. It is not a measurement, and it is not a decision. It quietly amortizes a cost you never looked at.

My contrarian line, and I'll happily be wrong in the comments: tucking the tax away is not controlling it. The control move is to open the table above, find your `read_text_file` (the one fat tool whose description you can halve, or the four near-duplicate read tools you can collapse into one) and cut. Tool Search makes the bloat invisible. The audit makes it editable. Measure first, then cut, then let lazy loading handle what's left.

I'd rather you trust the small honest number than oversell it. `serialize_tool` counts the whole tool object, including `outputSchema`, `annotations`, and `execution`, as compact JSON. Two honest biases ride along.
Those extra fields are ~38% of the total here, and a host that only sends `name`/`description`/`inputSchema` to the model wouldn't pay for them (that's the ~1,640 core I mentioned above). Meanwhile compact JSON has zero whitespace, so it undercounts a host that pretty-prints (the ~4,036 figure above).

[...] `read_text_file` earns its tokens with a description that prevents misuse. Sometimes the verbose tool is the correct one. The number tells you the [...]

Pick the heaviest MCP server in your config, the one with 40-plus tools and paragraph descriptions, and run the audit against it. I'd bet the cost is wildly uneven: a handful of tools eating a third of the budget while the rest are rounding errors. That asymmetry is where the cut lives.

So here's the real open question I keep hitting and haven't solved cleanly: at what point is collapsing five granular tools into one fat, well-described tool a net win? You get fewer definitions in context, but a longer single description and more tool-confusion risk inside it. I have hunches, not a rule. If you've measured both sides of that trade on a real server, drop the numbers in the comments; I read every one. And follow along; the next post in this thread takes the audit to a full multi-server stack and tries to find that break-even.
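That unevenness is quick to check once you have per-tool counts. A minimal sketch, using the filesystem numbers from the run earlier in this post; `tools_to_reach` is a hypothetical helper, not part of `mcp_token_tax.py`.

```python
# Per-tool token counts copied from the filesystem run earlier in this post.
counts = {
    "read_text_file": 250, "edit_file": 239, "search_files": 212,
    "read_multiple_files": 204, "directory_tree": 196,
    "list_directory_with_sizes": 195, "move_file": 186,
    "read_media_file": 185, "read_file": 173, "create_directory": 171,
    "write_file": 168, "list_directory": 160, "get_file_info": 156,
    "list_allowed_directories": 143,
}


def tools_to_reach(counts: dict[str, int], share: float) -> list[str]:
    """Return the heaviest tools that together reach `share` of all tokens."""
    total = sum(counts.values())
    picked, running = [], 0
    for name, tokens in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if running >= share * total:
            break
        picked.append(name)
        running += tokens
    return picked


top = tools_to_reach(counts, 1 / 3)
print(f"{len(top)} of {len(counts)} tools cover a third of the budget: {top}")
```

On this lean server it takes 4 of the 14 tools to reach a third of the tokens, so the distribution is fairly flat. The bet above is that a chattier server gets there with a much smaller fraction of its tools, and this is the number to compare.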