{"slug": "measure-your-mcp-server-s-token-tax-in-60-seconds", "title": "Measure Your MCP Server's Token Tax in 60 Seconds", "summary": "A developer has created a 60-second audit script to measure the \"MCP server token tax\"—the context budget consumed by tool definitions before an agent performs any useful work. Running the script against the real filesystem MCP server revealed 14 tools consuming 2,638 tokens, approximately 1.3% of a 200K context window. The measurement, performed using `tiktoken`'s `o200k_base` encoding, aims to provide developers with actual per-tool costs rather than relying on repeated industry figures.", "body_md": "The **MCP server token tax** is the context budget every tool definition eats *before* your agent does a single useful thing. To measure it, pull the server's `tools/list` JSON and tokenize each definition. Claude Code's Tool Search defers loading — it doesn't reduce the tax. Run the 60-second audit below and you'll see your real per-tool cost instead of repeating someone else's number.\n\n**In short:** the MCP server token tax is the context budget every tool definition eats before your agent does anything. To measure it, pull the server's `tools/list` and tokenize each definition with `tiktoken`. My run of the real filesystem server: 14 tools, 2,638 tokens, ~1.3% of a 200K window.\n\nAI disclosure: I wrote `mcp_token_tax.py` with AI assistance and ran it myself before publishing. Every number below is pasted from a real run of that script, or it's an external figure with a dated link next to it. I label which is which.\n\nYou've seen the figure quoted everywhere this spring: \"the GitHub MCP server costs you tens of thousands of tokens before you ask anything.\" It gets repeated in threads, in newsletters, in conference hallway chatter. Here's a question almost nobody answers when they quote it: *with which tokenizer, against which tools/list?*\n\nI didn't want to repeat a number. I wanted to measure one. 
So I drove the real, published filesystem MCP server, captured its actual `tools/list`, and counted. The answer surprised me, and it's the reason this post exists.\n\n**TL;DR**\n\n- The real `@modelcontextprotocol/server-filesystem` server: 14 tools, 2,638 tokens, ~1.3% of a 200K window.\n- The audit script below counts with `tiktoken`. Copy it, run it, audit your own stack.\n\nThis is the first post in a small thread on **MCP FinOps: measure before you cut**. It sits next to the control side of my work: a [hard spend-cap that stops a runaway agent loop](https://finops.spinov.online/blog/a-47k-agent-loop-spend-cap/) and the [pre-execution gate that refuses a bad agent action before it runs](https://finops.spinov.online/blog/pre-execution-gate-for-ai-agents/). Those stop bad actions. This one just gives you a number, because you can't cut what you haven't measured.\n\nA tool definition is text. Name, title, a human-readable description, the JSON Schema for its inputs (and now, often, an output schema and annotations). When you connect an MCP server, the host serializes all of that and injects it into the model's context so it knows the tool exists and how to call it.\n\nThat text doesn't get charged once. It rides along on turn after turn, because the model has to keep \"seeing\" the tools to use them. Ten tools you never call still sit in the window, quietly, on every request. That's the tax: rent on capability you've declared but may not be using.\n\nTwo costs come out of it. The obvious one is dollars: input tokens you pay for repeatedly. The sneakier one is *room*. Every token of definition is a token not available for the actual conversation, the retrieved docs, the file you pasted. The MCP spec is moving toward a stateless core in the [2026-07-28 release candidate](https://blog.modelcontextprotocol.io/posts/2026-07-28-release-candidate/), which reshapes a lot — but it doesn't change the basic physics here. 
Definitions still have to reach the model somehow.\n\nKen Alger named the downstream symptom plainly in his [March 2026 piece on multi-agent MCP](https://www.kenwalger.com/blog/ai/mcp-multi-agent-orchestration-forensics/): \"A single agent juggling too many tools often suffers from… Tool Confusion: choosing the wrong function when multiple tools are available,\" plus \"Latency and Cost.\" Tokens are one face of that. Accuracy is the other. Anthropic's own testing, in the November 2025 post I cite further down, showed a model's tool-selection accuracy climbing from 49% to 74% once it *stopped* carrying every definition at once. Fewer tools in context, better choices. The tax isn't only financial.\n\nHere's the whole thing. It does one job: read a `tools/list`, tokenize each tool with `tiktoken`'s `o200k_base` encoding (the gpt-4o family encoding — swap in `cl100k_base` for older models), and print a per-tool table with the share of your context window and a dollars-per-round estimate.\n\nYou feed it tools two ways. Point it at a published stdio server and it'll do the JSON-RPC handshake and capture the *real* `tools/list` live — keyless, read-only, and it never calls `tools/call`, so nothing executes. Or hand it a JSON fixture you saved earlier, for a deterministic run that reproduces byte-for-byte.\n\n```python\n#!/usr/bin/env python3\n\"\"\"mcp_token_tax.py - measure the token tax of an MCP server's tool definitions.\"\"\"\nimport argparse, json, subprocess, sys, threading\ntry:\n    import tiktoken\nexcept ImportError:\n    sys.exit(\"tiktoken is required:  pip install tiktoken\")\n\ndef serialize_tool(tool: dict) -> str:\n    # The text a host puts in context for one tool, as compact JSON.\n    # Hosts frame this differently, so it's a close approximation, not a\n    # provider's billing meter. 
Counted the same way for every tool, so the\n    # ranking and relative shares hold even where the absolute number drifts.\n    return json.dumps(tool, ensure_ascii=False, separators=(\",\", \":\"))\n\ndef measure(tools, encoding_name=\"o200k_base\"):\n    # Tokenize every tool definition and return rows sorted heaviest-first.\n    enc = tiktoken.get_encoding(encoding_name)\n    rows = []\n    for t in tools:\n        rows.append({\n            \"name\": t.get(\"name\", \"<unnamed>\"),\n            \"tokens\": len(enc.encode(serialize_tool(t))),\n            \"n_params\": len((t.get(\"inputSchema\") or {}).get(\"properties\", {})),\n        })\n    rows.sort(key=lambda r: r[\"tokens\"], reverse=True)\n    return rows\n\ndef from_fixture(path):\n    # Load a saved tools/list capture; the with-block closes the file handle.\n    with open(path, encoding=\"utf-8\") as f:\n        data = json.load(f)\n    return data.get(\"serverInfo\", {}), data[\"tools\"]\n\ndef from_server(server_cmd, timeout=90):\n    # Naive whitespace split: fine for the npx command shown in this post,\n    # but it won't honor quoted arguments.\n    proc = subprocess.Popen(server_cmd.split(), stdin=subprocess.PIPE,\n        stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True, bufsize=1)\n    def send(o): proc.stdin.write(json.dumps(o) + \"\\n\"); proc.stdin.flush()\n    def read_id(tid, budget=60):\n        # Read stdout lines until the response with this JSON-RPC id arrives.\n        # Each readline runs in a daemon thread so a silent server can't hang us.\n        for _ in range(80):\n            box = {}\n            th = threading.Thread(target=lambda: box.update(line=proc.stdout.readline()),\n                                  daemon=True)\n            th.start(); th.join(budget)\n            line = box.get(\"line\", \"\")\n            if not line: continue\n            try: msg = json.loads(line)\n            except json.JSONDecodeError: continue\n            if msg.get(\"id\") == tid: return msg\n        return None\n    # JSON-RPC handshake: initialize, initialized notification, then tools/list.\n    send({\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"initialize\",\"params\":{\n        \"protocolVersion\":\"2025-06-18\",\"capabilities\":{},\n        \"clientInfo\":{\"name\":\"mcp-token-tax\",\"version\":\"0.1\"}}})\n    init = read_id(1, timeout)\n    send({\"jsonrpc\":\"2.0\",\"method\":\"notifications/initialized\",\"params\":{}})\n    send({\"jsonrpc\":\"2.0\",\"id\":2,\"method\":\"tools/list\",\"params\":{}})\n    
listed = read_id(2, timeout)\n    proc.terminate()\n    # Bail out on a timeout or on a JSON-RPC error reply that carries no result.\n    if listed is None or \"result\" not in listed:\n        sys.exit(\"Could not get tools/list. \" + (proc.stderr.read() or \"\")[:500])\n    return (init or {}).get(\"result\", {}).get(\"serverInfo\", {}), listed[\"result\"][\"tools\"]\n\ndef report(server, rows, ctx, price):\n    total = sum(r[\"tokens\"] for r in rows)\n    print(f\"MCP token tax  -  {server.get('name','?')} v{server.get('version','?')}\")\n    print(f\"encoding: o200k_base (gpt-4o family)  |  {len(rows)} tools\\n\")\n    print(f\"{'tool':<28}{'tokens':>8}{'% toolset':>11}{'params':>8}\")\n    print(\"-\" * 55)\n    for r in rows:\n        print(f\"{r['name']:<28}{r['tokens']:>8}{r['tokens']/total*100:>10.1f}%{r['n_params']:>8}\")\n    print(\"-\" * 55)\n    print(f\"{'TOTAL':<28}{total:>8}{'100.0%':>11}\\n\")\n    print(\"Context budget burned before the first prompt:\")\n    print(f\"  total definition tokens : {total:,}\")\n    print(f\"  share of a {ctx:,}-token window : {total/ctx*100:.1f}%\")\n    print(f\"  est. cost per round @ ${price:.2f}/1M input : ${total/1_000_000*price:.4f}\")\n    if rows:  # rows are sorted heaviest-first, so the two ends are the extremes\n        print(f\"  heaviest tool           : {rows[0]['name']} ({rows[0]['tokens']} tok, {rows[0]['tokens']/total*100:.1f}%)\")\n        print(f\"  lightest tool           : {rows[-1]['name']} ({rows[-1]['tokens']} tok)\")\n\ndef main():\n    ap = argparse.ArgumentParser()\n    g = ap.add_mutually_exclusive_group(required=True)\n    g.add_argument(\"--fixture\"); g.add_argument(\"--server\")\n    ap.add_argument(\"--ctx\", type=int, default=200_000)\n    ap.add_argument(\"--price\", type=float, default=3.00)\n    a = ap.parse_args()\n    server, tools = (from_fixture(a.fixture) if a.fixture else from_server(a.server))\n    report(server, measure(tools), a.ctx, a.price)\n\nif __name__ == \"__main__\":\n    main()\n```\n\nTwo ways to run it. 
Live against the real npm package:\n\n```\npip install tiktoken\npython3 mcp_token_tax.py \\\n  --server \"npx -y @modelcontextprotocol/server-filesystem@latest /tmp\"\n```\n\nOr against a `tools/list` you've already saved (deterministic):\n\n```\npython3 mcp_token_tax.py --fixture filesystem_toolslist.json --price 3.00\n```\n\nI ran the live and the fixture paths side by side; for the same captured version they produce identical token counts, which is the point. The fixture is captured real output, not a guess. One caveat: `@latest` is a moving target — when the package ships new descriptions or schema fields, the live count will drift from the fixture. Pin a version (`@2025...`) if you need a number you can diff next month.\n\nHere's the verbatim run against the real `@modelcontextprotocol/server-filesystem` (server reports itself as `secure-filesystem-server v0.2.0`), captured today:\n\n```\nMCP token tax  -  secure-filesystem-server v0.2.0\nencoding: o200k_base (gpt-4o family)  |  14 tools\n\ntool                          tokens  % toolset  params\n-------------------------------------------------------\nread_text_file                   250       9.5%       3\nedit_file                        239       9.1%       3\nsearch_files                     212       8.0%       3\nread_multiple_files              204       7.7%       1\ndirectory_tree                   196       7.4%       2\nlist_directory_with_sizes        195       7.4%       2\nmove_file                        186       7.1%       2\nread_media_file                  185       7.0%       1\nread_file                        173       6.6%       3\ncreate_directory                 171       6.5%       1\nwrite_file                       168       6.4%       2\nlist_directory                   160       6.1%       1\nget_file_info                    156       5.9%       1\nlist_allowed_directories         143       5.4%       
0\n-------------------------------------------------------\nTOTAL                           2638     100.0%\n\nContext budget burned before the first prompt:\n  total definition tokens : 2,638\n  share of a 200,000-token window : 1.3%\n  est. cost per round @ $3.00/1M input : $0.0079\n  heaviest tool           : read_text_file (250 tok, 9.5%)\n  lightest tool           : list_allowed_directories (143 tok)\n```\n\nTwo honest caveats about units before you quote that 1.3% at me. I count with `o200k_base`, the gpt-4o tokenizer — Claude ships no public tokenizer, so on a Claude 200K window this is a *proxy*, and the real Claude-token figure will differ a little. And I count the whole tool object as compact JSON; a host that injects only `name`/`description`/`inputSchema` would see closer to ~1,640 tokens here, while the same objects pretty-printed run ~4,036. So the honest band for \"what reaches the model\" is roughly 1.6K–4K, and 2,638 is my single defensible point inside it, counted identically for every tool. The ranking and the shares below are rock-solid; treat the absolute total as an order-of-magnitude estimate, not a meter.\n\nNow the honest part. **2,638 tokens. 1.3% of a 200K window.** For one server, that is *not* scary. If you came here expecting me to confirm that any single MCP server is a five-alarm fire, I can't — not this one. The filesystem server is lean: fourteen tools, terse descriptions, simple schemas. Its single most expensive tool, `read_text_file`, costs 250 tokens, mostly because its description spells out the `head`/`tail` behavior in prose.\n\nSo where does the panic come from? Two places, and both are real.\n\nFirst, **verbose text drives the cost — prose and schema both, not the raw parameter count.** Compare `read_text_file` (250 tokens, a long `head`/`tail` description) with `list_allowed_directories` (143 tokens, almost no prose and zero params). 
But don't over-credit descriptions alone: across these 14 tools, token count tracks the size of the input schema (correlation ~0.80) more tightly than the length of the description (~0.36). `edit_file` proves it — a short description but a fat nested `edits` schema lands it at 239 tokens, second-heaviest in the table. So the rule isn't \"watch your prose,\" it's \"watch your total surface area\": paragraph descriptions *and* sprawling schemas with big enum lists both bill. Servers that ship both pay a tax this lean reference server doesn't. That's why the heavy ones in the wild, GitHub and Slack, clock in an order of magnitude higher.\n\nSecond, **it compounds.** One server at 1.3% is nothing. Now stack the real ones. Anthropic published actual measurements in November 2025: GitHub ~26K tokens, Slack ~21K, and a five-server setup of 58 tools landing at \"approximately 55K tokens before the conversation even starts,\" with 134K \"before optimization\" on their internal deployment ([Anthropic engineering, 2025-11-24](https://www.anthropic.com/engineering/advanced-tool-use)). Those are *their* numbers, measured by them — I'm quoting, not claiming. But the shape is exactly what my run predicts in miniature: a single lean server is cheap, ten chatty servers are a tax bracket.\n\nFor the FinOps-minded: the dollar line is the ROI hook. At $3/1M input tokens, one filesystem server costs $0.0079 a round. Trivial. But run it at the per-turn frequency of an agent loop across a 55K-token multi-server stack, and you're paying for 55K tokens of *overhead* on every single call, forever, whether or not the model touches those tools. That's the math worth checking against your own bill. Plug your real `--price` and `--ctx` in and read the bottom line.\n\nClaude Code shipped Tool Search, and it's genuinely good. 
When your tool definitions would exceed roughly 10% of the context window, it stops loading all of them up front; it keeps a lightweight index and pulls the full definition for a tool only when the model decides it needs it. Anthropic reports about an [85% reduction in tokens carried](https://www.anthropic.com/engineering/advanced-tool-use) with the full library still reachable. Real win. Use it.\n\nBut notice the verb. It *defers* loading. It doesn't delete the definition. When the model actually reaches for `create_or_update_file`, that tool's full schema enters context at that moment — you pay the tax then, plus the cost of the search step that found it. The total bill for tools you use is roughly the same; what changes is you stop paying for the 50 tools you *don't* use this turn.\n\nThat's why \"we enabled Tool Search, we're done\" is the trap. Lazy loading is a great default. It is not a measurement, and it is not a decision. It quietly amortizes a cost you never looked at. My contrarian line, and I'll happily be wrong in the comments: **tucking the tax away is not controlling it.** The control move is to open the table above, find your `read_text_file` — the one fat tool whose description you can halve, or the four near-duplicate read tools you can collapse into one — and cut. Tool Search makes the bloat invisible. The audit makes it *editable.* Measure first, then cut, then let lazy loading handle what's left.\n\nI'd rather you trust the small honest number than oversell it.\n\n`serialize_tool` counts the whole tool object, including `outputSchema`, `annotations`, and `execution`, as compact JSON. Two honest biases ride along. Those extra fields are ~38% of the total here, and a host that only sends `name`/`description`/`inputSchema` to the model wouldn't pay for them (that's the ~1,640 core I mentioned above). Meanwhile compact JSON has zero whitespace, so it undercounts any host that pretty-prints (the ~4,036 figure above).\n\n`read_text_file` earns its tokens with a description that prevents misuse. 
Sometimes the verbose tool is the correct one. The number tells you the cost, not whether the tool is worth it.\n\nPick the heaviest MCP server in your config, the one with 40-plus tools and paragraph descriptions, and run the audit against it. I'd bet the cost is wildly uneven: a handful of tools eating a third of the budget while the rest are rounding errors. That asymmetry is where the cut lives.\n\nSo here's the real open question I keep hitting and haven't solved cleanly: **at what point is collapsing five granular tools into one fat, well-described tool a net win** — fewer definitions in context, but a longer single description and more tool-confusion risk inside it? I have hunches, not a rule. If you've measured both sides of that trade on a real server, drop the numbers in the comments — I read every one. And follow along; the next post in this thread takes the audit to a full multi-server stack and tries to find that break-even.", "url": "https://wpnews.pro/news/measure-your-mcp-server-s-token-tax-in-60-seconds", "canonical_source": "https://dev.to/alex_spinov/measure-your-mcp-servers-token-tax-in-60-seconds-geo", "published_at": "2026-06-12 18:16:42+00:00", "updated_at": "2026-06-12 18:44:20.110379+00:00", "lang": "en", "topics": ["ai-tools", "large-language-models", "ai-agents"], "entities": ["MCP", "Claude Code", "GitHub", "tiktoken"], "alternates": {"html": "https://wpnews.pro/news/measure-your-mcp-server-s-token-tax-in-60-seconds", "markdown": "https://wpnews.pro/news/measure-your-mcp-server-s-token-tax-in-60-seconds.md", "text": "https://wpnews.pro/news/measure-your-mcp-server-s-token-tax-in-60-seconds.txt", "jsonld": "https://wpnews.pro/news/measure-your-mcp-server-s-token-tax-in-60-seconds.jsonld"}}