{"slug": "token-counting-done-right-stop-using-tiktoken-for-claude", "title": "Token Counting Done Right: Stop Using tiktoken for Claude", "summary": "A developer discovered that using OpenAI's tiktoken tokenizer to count tokens for Claude models leads to a 15-20% undercount, causing inaccurate cost estimates and context budgets. The correct approach is to use Claude's dedicated countTokens endpoint with the specific model version, as token counts vary between models and even between model versions. The developer recommends never caching token counts across model changes and using the endpoint to track prompt bloat.", "body_md": "I had a cost estimator that was wrong by 20%, and the reason was embarrassing: I was counting Claude tokens with `tiktoken`, which is OpenAI's tokenizer. Different model, different tokenizer, different counts. If you are estimating Claude costs or context budgets with a borrowed tokenizer, your numbers are fiction. Here is how to count correctly, and where the wrong way bites.\n\n`tiktoken` tokenizes for OpenAI models. Claude uses a different tokenizer. They do not agree on how text splits into tokens. On typical English prose, `tiktoken` undercounts Claude tokens by roughly 15 to 20%. On code or non-English text, the gap is worse, because tokenizers diverge most on the inputs they were not each optimized for.\n\nSo a \"cost estimate\" or \"will this fit in context\" check built on `tiktoken` is systematically off. It told me a prompt was 8,000 tokens when Claude saw closer to 9,500. Multiply that across a busy day and the budget projection is meaningfully wrong.\n\nClaude has a dedicated endpoint for this, and the SDK wraps it. 
Counts are model-specific, so you pass the same model you will use for inference:\n\n``` js\nimport Anthropic from \"@anthropic-ai/sdk\";\nconst client = new Anthropic();\n\n// contractSource is whatever text you want to measure, defined elsewhere\nconst result = await client.messages.countTokens({\n  model: \"claude-opus-4-8\",\n  messages: [{ role: \"user\", content: contractSource }],\n});\n\nconsole.log(result.input_tokens); // the real count Claude will charge for\n```\n\nThis is the actual count, from the actual tokenizer, for the actual model. No approximation.\n\nOnce you have the real input count, the cost math is straightforward. For Opus 4.8 at $5 per million input tokens:\n\n``` js\nconst tokens = result.input_tokens;\nconst inputCost = (tokens / 1_000_000) * 5; // $5/M for Opus 4.8 input\nconsole.log(`Estimated input cost: $${inputCost.toFixed(4)}`);\n```\n\nIf you are deciding between tiers, the per-million rates that matter in 2026:\n\n| Model | Input $/M | Output $/M |\n|---|---|---|\n| Haiku 4.5 | 1 | 5 |\n| Opus 4.8 | 5 | 25 |\n| Fable 5 | 10 | 50 |\n\n`countTokens` only covers the input side. Remember that output tokens dominate cost on generation-heavy tasks, and you do not know those until you run the request.\n\nOne subtlety that surprised me: token counts changed between Claude model versions. The same input text produces a *higher* count on Opus 4.7 than on Opus 4.6, because they count differently. So if you cached a token count from an older model and reused it, you would be wrong again, just less wrong than tiktoken.\n\nThe fix is to never cache a count across a model change. Re-run `countTokens` against the model you are actually using. Do not apply a blanket multiplier to convert between models; the divergence is not uniform.\n\nA handy pattern for \"how many tokens did this change add\" is to count both versions and subtract. 
The endpoint is stateless, so you just count each and diff:\n\n``` ts\nimport Anthropic from \"@anthropic-ai/sdk\";\nimport { execSync } from \"node:child_process\";\nimport fs from \"node:fs\";\n\nconst client = new Anthropic();\n\n// Count input tokens for a single user message on the target model\nasync function count(text: string): Promise<number> {\n  const r = await client.messages.countTokens({\n    model: \"claude-opus-4-8\",\n    messages: [{ role: \"user\", content: text }],\n  });\n  return r.input_tokens;\n}\n\n// Compare the committed version of the file with the working copy\nconst before = execSync(\"git show HEAD:CLAUDE.md\").toString();\nconst after = fs.readFileSync(\"CLAUDE.md\", \"utf8\");\nconsole.log(`Delta: ${(await count(after)) - (await count(before))} tokens`);\n```\n\nI use this to keep an eye on system-prompt bloat. When a prompt creeps up by a few thousand tokens, that is real money on every cache-miss request, and the diff makes it visible.\n\nThe tokenizer is part of the model. Borrowing another model's tokenizer to estimate counts is like measuring in the wrong units and hoping the error cancels. It does not cancel; it compounds. Use `countTokens` against the exact model, never reuse a count across model versions, and remember output tokens are the unknown that dominates generation cost. 
It is one API call, it is free, and it is the difference between a budget projection you can trust and one that is off by a fifth.", "url": "https://wpnews.pro/news/token-counting-done-right-stop-using-tiktoken-for-claude", "canonical_source": "https://dev.to/pavelespitia/token-counting-done-right-stop-using-tiktoken-for-claude-383c", "published_at": "2026-06-28 14:41:31+00:00", "updated_at": "2026-06-28 15:03:55.013601+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools"], "entities": ["OpenAI", "Anthropic", "Claude", "tiktoken", "Opus 4.8", "Haiku 4.5", "Fable 5"], "alternates": {"html": "https://wpnews.pro/news/token-counting-done-right-stop-using-tiktoken-for-claude", "markdown": "https://wpnews.pro/news/token-counting-done-right-stop-using-tiktoken-for-claude.md", "text": "https://wpnews.pro/news/token-counting-done-right-stop-using-tiktoken-for-claude.txt", "jsonld": "https://wpnews.pro/news/token-counting-done-right-stop-using-tiktoken-for-claude.jsonld"}}