Source: dev.to · Topic: large-language-models

Token Counting Done Right: Stop Using tiktoken for Claude

A developer discovered that using OpenAI's tiktoken tokenizer to count tokens for Claude models leads to a 15-20% undercount, causing inaccurate cost estimates and context budgets. The correct approach is to use Claude's dedicated countTokens endpoint with the specific model version, as token counts vary between models and even between model versions. The developer recommends never caching token counts across model changes and using the endpoint to track prompt bloat.

Published Jun 28, 2026 · 3 min read

I had a cost estimator that was wrong by 20%, and the reason was embarrassing: I was counting Claude tokens with tiktoken, which is OpenAI's tokenizer. Different model, different tokenizer, different counts. If you are estimating Claude costs or context budgets with a borrowed tokenizer, your numbers are fiction. Here is how to count correctly, and where the wrong way bites.

tiktoken tokenizes for OpenAI models. Claude uses a different tokenizer. They do not agree on how text splits into tokens. On typical English prose, tiktoken undercounts Claude tokens by roughly 15 to 20%. On code or non-English text, the gap is worse, because tokenizers diverge most on the inputs they were not each optimized for.

So a "cost estimate" or "will this fit in context" check built on tiktoken is systematically off. It told me a prompt was 8,000 tokens when Claude saw closer to 9,500. Multiply that across a busy day and the budget projection is meaningfully wrong.
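To make the size of that gap concrete, here is a small sketch of the arithmetic using the 8,000 versus 9,500 figures above. The daily request volume is an invented number for illustration, and the $5 per million input rate is the Opus 4.8 price quoted later in this article; neither is a measurement.

```typescript
// Relative undercount: how far the borrowed estimate falls short of the real count.
function undercountFraction(estimated: number, actual: number): number {
  return (actual - estimated) / actual;
}

// Dollars of input spend missing from a projection built on the low estimate.
function missedInputSpend(
  estimated: number,
  actual: number,
  requestsPerDay: number, // hypothetical traffic figure
  dollarsPerMillionInput: number,
): number {
  const missingTokens = (actual - estimated) * requestsPerDay;
  return (missingTokens / 1_000_000) * dollarsPerMillionInput;
}

const gap = undercountFraction(8_000, 9_500);
console.log(`${(gap * 100).toFixed(1)}% undercount`); // 15.8% undercount

// Assumed: 10,000 requests per day at $5 per million input tokens.
const missed = missedInputSpend(8_000, 9_500, 10_000, 5);
console.log(`$${missed.toFixed(2)} per day unaccounted for`); // $75.00 per day unaccounted for
```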

Claude has a dedicated endpoint for this, and the SDK wraps it. Counts are model-specific, so you pass the same model you will use for inference:

import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();

const result = await client.messages.countTokens({
  model: "claude-opus-4-8",
  messages: [{ role: "user", content: contractSource }],
});

console.log(result.input_tokens); // the real count Claude will charge for

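This is the count from the actual tokenizer for the actual model, not a guess from a borrowed one. Anthropic documents the result as an estimate, so the billed figure can differ by a small amount, but it is far closer than anything tiktoken will give you.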
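The same count answers the "will this fit in context" question. A minimal sketch, assuming you know your model's context window and how many tokens you want to hold back for the reply; the 200,000 window and the 4,096 reserve are placeholder values, and `ContextBudget` and `remainingInputTokens` are illustrative names, not part of the SDK.

```typescript
interface ContextBudget {
  contextWindow: number;     // total tokens the model accepts (assumed value, check your model)
  reservedForOutput: number; // tokens held back for the response
}

// Returns how many input tokens are left over.
// A negative result means the prompt does not fit.
function remainingInputTokens(inputTokens: number, budget: ContextBudget): number {
  return budget.contextWindow - budget.reservedForOutput - inputTokens;
}

const budget: ContextBudget = { contextWindow: 200_000, reservedForOutput: 4_096 };

// In practice inputTokens would come from result.input_tokens.
console.log(remainingInputTokens(9_500, budget));   // 186404
console.log(remainingInputTokens(197_000, budget)); // -1096
```

Running the check against the real count rather than a borrowed one matters most near the limit, where a 15 to 20% undercount is the difference between fitting and failing.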

Once you have the real input count, the cost math is straightforward. For Opus 4.8 at $5 per million input tokens:

const tokens = result.input_tokens;
const inputCost = (tokens / 1_000_000) * 5; // $5/M for Opus 4.8 input
console.log(`Estimated input cost: $${inputCost.toFixed(4)}`);

If you are deciding between tiers, the per-million rates that matter in 2026:

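| Model     | Input ($ per M tokens) | Output ($ per M tokens) |
|-----------|------------------------|-------------------------|
| Haiku 4.5 | 1                      | 5                       |
| Opus 4.8  | 5                      | 25                      |
| Fable 5   | 10                     | 50                      |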

countTokens only tells you the input side. Output tokens dominate cost on generation-heavy tasks, and you do not know those until you run the request.
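Once both sides are known, a small helper can price a request. This is a sketch: the rates are copied from the table above, the `RATES` map and `estimateCost` function are names made up for illustration, and the output count has to come from the response after the request has run (or from a guess you supply).

```typescript
type Tier = "Haiku 4.5" | "Opus 4.8" | "Fable 5";

// Dollars per million tokens, copied from the table above.
const RATES: Record<Tier, { input: number; output: number }> = {
  "Haiku 4.5": { input: 1, output: 5 },
  "Opus 4.8": { input: 5, output: 25 },
  "Fable 5": { input: 10, output: 50 },
};

function estimateCost(tier: Tier, inputTokens: number, outputTokens: number): number {
  const rate = RATES[tier];
  return (inputTokens / 1_000_000) * rate.input + (outputTokens / 1_000_000) * rate.output;
}

// 9,500 input tokens and 2,000 output tokens on Opus 4.8:
// input is $0.0475 and output is $0.05, so output already outweighs input.
console.log(estimateCost("Opus 4.8", 9_500, 2_000).toFixed(4)); // 0.0975
```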

One subtlety that surprised me: token counts changed between Claude model versions. The same input text produces a higher count on Opus 4.7 than on Opus 4.6, because the two versions tokenize the text differently. So if you cached a token count from an older model and reused it, you would be wrong again, just less wrong than with tiktoken.

The fix is to never cache a count across a model change. Re-run countTokens against the model you are actually using. Do not apply a blanket multiplier to convert between models; the divergence is not uniform.
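If you do want to cache counts, one way to respect this rule is to make the model part of the cache key, so a model change can never hit an old entry. A minimal sketch: `CountFn` stands in for whatever wraps countTokens in your code, and `ModelScopedCounter` is a hypothetical helper, not part of the SDK.

```typescript
// Stand-in for a function that calls countTokens for a given model and text.
type CountFn = (model: string, text: string) => Promise<number>;

class ModelScopedCounter {
  private readonly cache = new Map<string, number>();
  private readonly countFn: CountFn;

  constructor(countFn: CountFn) {
    this.countFn = countFn;
  }

  async count(model: string, text: string): Promise<number> {
    // The model is part of the key, so switching models always forces a fresh count.
    const key = `${model}\u0000${text}`;
    const hit = this.cache.get(key);
    if (hit !== undefined) return hit;

    const tokens = await this.countFn(model, text);
    this.cache.set(key, tokens);
    return tokens;
  }
}
```

The cache here lives in memory for the lifetime of the process. If you persist counts anywhere longer-lived, the same rule applies: store the model identifier next to the count and treat a mismatch as a miss.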

A handy pattern for "how many tokens did this change add" is to count both versions and subtract. The endpoint is stateless, so you just count each and diff:

import { execSync } from "node:child_process";
import fs from "node:fs";

// Reuses the `client` created in the first snippet.
async function count(text: string): Promise<number> {
  const r = await client.messages.countTokens({
    model: "claude-opus-4-8",
    messages: [{ role: "user", content: text }],
  });
  return r.input_tokens;
}

const before = execSync("git show HEAD:CLAUDE.md").toString();
const after = fs.readFileSync("CLAUDE.md", "utf8");
console.log(`Delta: ${(await count(after)) - (await count(before))} tokens`);

I use this to keep an eye on system-prompt bloat. When a prompt creeps up by a few thousand tokens, that is real money on every cache-miss request, and the diff makes it visible.
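To turn a delta into dollars, a rough sketch: multiply the added tokens by the input rate and by the number of requests that pay full price for the prompt. The request volume and cache-miss share below are invented for illustration, the $5 per million rate is the Opus 4.8 input price quoted earlier, and the calculation ignores whatever cache hits cost.

```typescript
// Extra input spend per day caused by a larger system prompt.
function dailyBloatCost(
  deltaTokens: number,    // tokens added to the prompt (after minus before)
  requestsPerDay: number, // hypothetical traffic
  cacheMissRate: number,  // assumed fraction of requests paying the full input price
  dollarsPerMillionInput: number,
): number {
  const fullPriceRequests = requestsPerDay * cacheMissRate;
  return ((deltaTokens * fullPriceRequests) / 1_000_000) * dollarsPerMillionInput;
}

// Assumed: +3,000 tokens, 40,000 requests per day, 25% cache misses, $5/M input.
console.log(dailyBloatCost(3_000, 40_000, 0.25, 5).toFixed(2)); // 150.00
```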

The tokenizer is part of the model. Borrowing another model's tokenizer to estimate counts is like measuring in the wrong units and hoping the error cancels. It does not cancel; it compounds. Use countTokens against the exact model, never reuse a count across model versions, and remember output tokens are the unknown that dominates generation cost. It is one API call, it is free, and it is the difference between a budget projection you can trust and one that is off by a fifth.
