Source: dev.to

Tracking token usage across OpenAI, Anthropic, and Gemini: every streaming gotcha I hit

A developer building Spanlens, an open-source LLM observability tool, found that OpenAI, Anthropic, and Gemini report token usage differently during streaming, requiring distinct parsing logic for each provider. OpenAI places usage in a final chunk only if requested, Anthropic splits input and output tokens across two events, and Gemini has two stream formats. Additionally, prompt caching accounting differs: OpenAI includes cached tokens in prompt_tokens, while Anthropic reports them separately.

6 min read · published Jun 20, 2026

OpenAI, Anthropic, and Gemini each report token usage differently, and it stops being trivia the moment you track LLM cost. I build Spanlens, an open-source LLM observability tool that sits in front of all three as a proxy and records every call with its model, latency, tokens, and cost. To do the cost part I read the token usage back out of every response, including the streaming ones.

I assumed the three providers would report usage in roughly the same way. They send the same kind of data, after all: input tokens, output tokens, maybe a cached count. How different could it be?

Pretty different, it turns out. Here is the whole thing in one table, then each gotcha in detail with the real parser code from the repo.

| Provider | Where usage lives (streaming) | Cache accounting | Field names |
| --- | --- | --- | --- |
| OpenAI | final chunk, needs stream_options: { include_usage: true } | prompt_tokens includes cache | prompt_tokens / completion_tokens |
| Anthropic | split across message_start + message_delta | input_tokens excludes cache, so add it | input_tokens / output_tokens |
| Gemini | usageMetadata, two stream formats | not applicable | promptTokenCount / candidatesTokenCount |

For a non-streaming call this is boring. Every provider hands you a usage object on the response body and you read it. Streaming is where it gets weird, because the token counts are not in the content chunks. They show up somewhere else, and "somewhere else" is different for each provider.

OpenAI puts the usage in a final chunk, after all the content, right before [DONE]. You only get it if you ask for it with stream_options: { include_usage: true }. Miss that flag and you stream the whole response and end up with no usage at all.

export function parseOpenAIStreamChunk(line: string): Partial<ParsedUsage> | null {
  if (!line.startsWith('data: ')) return null
  const data = line.slice(6).trim()   // drop the 'data: ' prefix
  if (data === '[DONE]') return null
  let json: any
  try {
    json = JSON.parse(data)
  } catch {
    return null   // malformed or partial line: skip it rather than throw mid-stream
  }
  const usage = json?.usage
  if (!usage) return null   // most chunks land here; only the last one has usage
  return {
    promptTokens: usage.prompt_tokens ?? 0,
    completionTokens: usage.completion_tokens ?? 0,
    totalTokens: usage.total_tokens ?? 0,
  }
}
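As a usage sketch, here is how the flag and the final chunk fit together end to end. The request body, the sample SSE lines, the `StreamUsage` shape, and the `lastUsageFromOpenAIStream` helper are all assumptions made for illustration; none of them are taken from the repo or from captured traffic.

```typescript
// Hypothetical shape for this sketch; the repo's ParsedUsage may differ.
interface StreamUsage {
  promptTokens: number
  completionTokens: number
  totalTokens: number
}

// Request body with the flag that makes OpenAI emit the final usage chunk.
// The model name and message are placeholders.
const requestBody = {
  model: 'gpt-4o-mini',
  stream: true,
  stream_options: { include_usage: true },
  messages: [{ role: 'user', content: 'Say hi' }],
}

// Scan SSE lines and keep the last usage object seen before [DONE].
function lastUsageFromOpenAIStream(lines: string[]): StreamUsage | null {
  let last: StreamUsage | null = null
  for (const line of lines) {
    if (!line.startsWith('data: ')) continue
    const data = line.slice(6).trim()
    if (data === '[DONE]') break
    let json: any
    try {
      json = JSON.parse(data)
    } catch {
      continue // skip malformed or partial lines instead of throwing
    }
    const usage = json?.usage
    if (!usage) continue // content chunks carry no usage (null or absent)
    last = {
      promptTokens: usage.prompt_tokens ?? 0,
      completionTokens: usage.completion_tokens ?? 0,
      totalTokens: usage.total_tokens ?? 0,
    }
  }
  return last
}

// Made-up sample stream: two content chunks, then the usage-only chunk.
const sampleLines = [
  'data: {"choices":[{"delta":{"content":"Hi"}}],"usage":null}',
  'data: {"choices":[{"delta":{"content":"!"}}],"usage":null}',
  'data: {"choices":[],"usage":{"prompt_tokens":9,"completion_tokens":2,"total_tokens":11}}',
  'data: [DONE]',
]

const openAIUsage = lastUsageFromOpenAIStream(sampleLines)
```

If the stream is cut off before the last chunk arrives, the helper returns null, which is a useful signal to record "usage unknown" rather than zero.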

Anthropic splits it across two different events. The input tokens come early, in message_start. The output tokens come at the end, in message_delta. If you only listen for one event, half your number is missing.

// input side: arrives in message_start
if (json.type === 'message_start') {
  const usage = json.message?.usage   // may be undefined, so guard the read below
  return { promptTokens: usage?.input_tokens ?? 0 /* + cache, see below */ }
}

// output side: arrives later, in message_delta
if (json.type === 'message_delta') {
  return { completionTokens: json.usage?.output_tokens ?? 0 }
}

So for OpenAI I keep the last chunk, and for Anthropic I have to stitch together the first event and a later one. Two providers, two mental models, already.
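The stitching step for Anthropic can be sketched as a small accumulator. This is a minimal illustration, not the repo's reassembly code: `AnthropicStreamEvent`, `AnthropicUsageTotals`, and `accumulateAnthropicUsage` are hypothetical names, the sample events are made up, and cache tokens are deliberately left out here.

```typescript
interface AnthropicUsageTotals {
  promptTokens: number
  completionTokens: number
}

// Minimal event shape for this sketch; real events carry more fields.
interface AnthropicStreamEvent {
  type: string
  message?: { usage?: { input_tokens?: number; output_tokens?: number } }
  usage?: { output_tokens?: number }
}

function accumulateAnthropicUsage(events: AnthropicStreamEvent[]): AnthropicUsageTotals {
  const totals: AnthropicUsageTotals = { promptTokens: 0, completionTokens: 0 }
  for (const event of events) {
    if (event.type === 'message_start') {
      // Input side arrives once, at the start of the stream.
      totals.promptTokens = event.message?.usage?.input_tokens ?? 0
    } else if (event.type === 'message_delta') {
      // The output count in message_delta is cumulative, so overwrite
      // rather than sum across events.
      totals.completionTokens = event.usage?.output_tokens ?? totals.completionTokens
    }
    // Every other event type (content_block_delta, ping, ...) is ignored.
  }
  return totals
}

// Made-up sample events in the order they would arrive.
const sampleEvents: AnthropicStreamEvent[] = [
  { type: 'message_start', message: { usage: { input_tokens: 25, output_tokens: 1 } } },
  { type: 'content_block_delta' },
  { type: 'message_delta', usage: { output_tokens: 15 } },
  { type: 'message_stop' },
]

const anthropicTotals = accumulateAnthropicUsage(sampleEvents)
```

Keeping the accumulator as one function over the whole event list makes the "half your number is missing" failure easy to test: drop either event from the sample and one of the two fields stays at zero.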

Cache accounting is the gotcha that can quietly corrupt your cost numbers, so it is worth slowing down on.

Both OpenAI and Anthropic support prompt caching, and both report a cached-token count. The trap is what the "input tokens" number means relative to that cached count.

On OpenAI, prompt_tokens already includes the cached tokens. The cached count is a subset of it. If you want the uncached portion you subtract.

On Anthropic, input_tokens is the uncached portion only. The cached tokens are reported separately and are not in that number. To get the real total you add them up.

Same idea, opposite math. Here is how I normalize Anthropic so that my promptTokens column always means "total input including cache" no matter which provider it came from:

const inputTokens = usage.input_tokens ?? 0
const cacheRead = usage.cache_read_input_tokens ?? 0
const cacheWrite = usage.cache_creation_input_tokens ?? 0
const promptTokens = inputTokens + cacheRead + cacheWrite   // Anthropic: add

And OpenAI, where the cached count is already inside prompt_tokens:

const promptTokens = usage.prompt_tokens ?? 0               // OpenAI: already total
const cacheReadTokens = usage.prompt_tokens_details?.cached_tokens ?? 0  // subset

If you write one function and feed both providers through it without thinking about this, you do not get an error. You get a cost number that is wrong by the size of the cache, and cache hits are exactly the high-volume calls where the error is largest. Wrong financial data that never throws is the worst kind of bug, so I now treat the cache convention as a per-provider fact I have to look up rather than guess.
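To make the size of that error concrete, here is a small worked sketch. The function names, the `NormalizedInput` shape, and the token counts are all invented for illustration; only the provider field names come from the snippets above.

```typescript
// Common shape for this sketch: promptTokens always means total input, cache included.
interface NormalizedInput {
  promptTokens: number
  cacheReadTokens: number
}

interface OpenAIUsageFields {
  prompt_tokens?: number
  prompt_tokens_details?: { cached_tokens?: number }
}

interface AnthropicUsageFields {
  input_tokens?: number
  cache_read_input_tokens?: number
  cache_creation_input_tokens?: number
}

// OpenAI: prompt_tokens is already the total; cached_tokens is a subset of it.
function normalizeOpenAIInput(usage: OpenAIUsageFields): NormalizedInput {
  return {
    promptTokens: usage.prompt_tokens ?? 0,
    cacheReadTokens: usage.prompt_tokens_details?.cached_tokens ?? 0,
  }
}

// Anthropic: input_tokens excludes cache, so the total is the sum of all three.
function normalizeAnthropicInput(usage: AnthropicUsageFields): NormalizedInput {
  const inputTokens = usage.input_tokens ?? 0
  const cacheRead = usage.cache_read_input_tokens ?? 0
  const cacheWrite = usage.cache_creation_input_tokens ?? 0
  return {
    promptTokens: inputTokens + cacheRead + cacheWrite,
    cacheReadTokens: cacheRead,
  }
}

// The same hypothetical request on each provider: 1000 input tokens, 800 served from cache.
const fromOpenAI = normalizeOpenAIInput({
  prompt_tokens: 1000,
  prompt_tokens_details: { cached_tokens: 800 },
})
const fromAnthropic = normalizeAnthropicInput({
  input_tokens: 200,
  cache_read_input_tokens: 800,
  cache_creation_input_tokens: 0,
})

// Reading Anthropic's input_tokens as if it were the total undercounts by the cache size.
const naiveAnthropicTotal = 200
const undercount = fromAnthropic.promptTokens - naiveAnthropicTotal
```

With these made-up numbers both providers normalize to 1000 total input tokens with 800 read from cache, while the naive Anthropic reading comes out 800 tokens short.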

OpenAI and Anthropic both stream server-sent events, lines that start with data:. Gemini can do that too, but only if you append ?alt=sse to the URL. Without it, the default streamGenerateContent endpoint streams a single giant JSON array, one big [ ... ] delivered character by character.

So a Gemini stream parser has to handle both. Mine tries SSE first, then falls back to parsing the buffer as a JSON array, then falls back again to scanning line by line for anything that looks like a chunk:

// 1. SSE form ("data: {json}")
for (const line of lines) {
  if (line.startsWith('data: ')) appendTextFromGeminiChunk(line.slice(6).trim(), parts)
}
// 2. default form: one JSON array streamed char by char
const joined = lines.join('\n').trim()
if (parts.length === 0 && joined.startsWith('[')) {
  for (const item of JSON.parse(joined)) appendTextFromGeminiChunk(JSON.stringify(item), parts)
}
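The snippet above reassembles text. The same two-format handling applies to usage, and a rough sketch of that might look like the following. It assumes the whole stream body has already been buffered, it checks for the array form first instead of trying SSE first as my parser does, and `parseGeminiStreamUsage`, `usageFromGeminiChunk`, and the sample bodies are hypothetical, not repo code.

```typescript
interface GeminiUsage {
  promptTokens: number
  completionTokens: number
  totalTokens: number
}

// Pull usageMetadata out of one parsed chunk, if it has any.
function usageFromGeminiChunk(chunk: unknown): GeminiUsage | null {
  if (typeof chunk !== 'object' || chunk === null) return null
  const meta = (chunk as {
    usageMetadata?: { promptTokenCount?: number; candidatesTokenCount?: number; totalTokenCount?: number }
  }).usageMetadata
  if (!meta) return null
  return {
    promptTokens: meta.promptTokenCount ?? 0,
    completionTokens: meta.candidatesTokenCount ?? 0,
    totalTokens: meta.totalTokenCount ?? 0,
  }
}

// Accept either stream form and return the last usage seen.
function parseGeminiStreamUsage(body: string): GeminiUsage | null {
  const trimmed = body.trim()
  const chunks: unknown[] = []
  if (trimmed.startsWith('[')) {
    // Default form: the completed body is one JSON array of chunks.
    try {
      const parsed: unknown = JSON.parse(trimmed)
      if (Array.isArray(parsed)) chunks.push(...parsed)
    } catch {
      // Incomplete array (stream cut off): nothing usable.
    }
  } else {
    // SSE form (?alt=sse): one JSON object per "data: " line.
    for (const line of trimmed.split(/\r?\n/)) {
      if (!line.startsWith('data: ')) continue
      try {
        chunks.push(JSON.parse(line.slice(6).trim()))
      } catch {
        // Skip malformed lines.
      }
    }
  }
  let last: GeminiUsage | null = null
  for (const chunk of chunks) {
    const usage = usageFromGeminiChunk(chunk)
    if (usage) last = usage
  }
  return last
}

// Made-up bodies carrying the same usage in each form.
const usageJson = '{"usageMetadata":{"promptTokenCount":7,"candidatesTokenCount":3,"totalTokenCount":10}}'
const sseBody = ['data: {"candidates":[]}', '', 'data: ' + usageJson, ''].join('\n')
const arrayBody = '[{"candidates":[]},' + usageJson + ']'

const geminiFromSse = parseGeminiStreamUsage(sseBody)
const geminiFromArray = parseGeminiStreamUsage(arrayBody)
```

Both sample bodies should yield the same usage object, which is the property worth pinning down in a test before adding any further fallbacks.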

The field names are different too. OpenAI gives you prompt_tokens and completion_tokens. Gemini gives you promptTokenCount and candidatesTokenCount inside a usageMetadata object. None of it lines up, so the normalizer earns its keep.

All three providers can report a service tier (default, flex, priority, and so on), and the cost depends on it. The thing to know is that the tier in the response is the tier they actually served, which is not always the one you requested. OpenAI can downgrade a priority request to default under load, and that downgrade only shows up in the response. So I always trust the served tier from the response over whatever the request asked for, because that is what you are billed on.

Gemini also reports the tier with inconsistent casing, sometimes a plain flex, sometimes a ..._FLEX screaming-snake constant, so that needed its own small coercion step.
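A coercion step along those lines could be sketched like this. It is an assumption-heavy illustration: `coerceTier` and `billedTier` are hypothetical names, `EXAMPLE_PREFIX_FLEX` is a stand-in for whatever prefixed constant the provider actually sends, and taking the last underscore-separated segment only works if tier names themselves contain no underscores.

```typescript
// Normalize a tier string to lowercase, dropping any screaming-snake prefix.
// Assumes the tier name is the last underscore-separated segment.
function coerceTier(raw: string | null | undefined): string | null {
  if (!raw) return null
  const segments = raw.trim().toLowerCase().split('_')
  return segments[segments.length - 1] || null
}

// Prefer the tier the provider says it served; fall back to the requested
// tier only when the response carries none.
function billedTier(
  requested: string | null | undefined,
  served: string | null | undefined,
): string | null {
  return coerceTier(served) ?? coerceTier(requested)
}

const plainTier = coerceTier('flex')
const prefixedTier = coerceTier('EXAMPLE_PREFIX_FLEX') // hypothetical constant
const downgraded = billedTier('priority', 'default')   // served tier wins
const noServedTier = billedTier('priority', undefined) // fall back to request
```

Routing every provider's tier through one small function like this keeps the cost lookup keyed on a single spelling per tier.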

If you are normalizing usage across providers, do not write the shared function first. Write one parser per provider, get each one right against real responses, and only then collapse them behind a common shape. The differences are not cosmetic. Where the number lives, whether cache is included, and what the field is called all change per provider, and a single early abstraction hides exactly the parts that differ.

The other lesson is to assert on cost-bearing numbers loudly. Type errors get caught on the first request in dev. A token count that is off by the cache size ships silently and shows up as a billing discrepancy weeks later. That asymmetry is worth a test.
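One way to make that test concrete is an invariant check on the normalized record. This is only a sketch of the idea: `UsageRecord` and `usageInvariantViolations` are hypothetical, and the invariants assume the convention used in this post, where promptTokens is the total input including cache.

```typescript
interface UsageRecord {
  promptTokens: number      // total input, cache included
  completionTokens: number
  cacheReadTokens: number   // must be a subset of promptTokens
}

// Return a list of human-readable violations; an empty list means the record passes.
function usageInvariantViolations(record: UsageRecord): string[] {
  const violations: string[] = []
  const fields: Array<[string, number]> = [
    ['promptTokens', record.promptTokens],
    ['completionTokens', record.completionTokens],
    ['cacheReadTokens', record.cacheReadTokens],
  ]
  for (const [name, value] of fields) {
    if (!Number.isInteger(value) || value < 0) {
      violations.push(`${name} must be a non-negative integer, got ${value}`)
    }
  }
  // If cache reads exceed the total, the cache was probably never added in.
  if (record.cacheReadTokens > record.promptTokens) {
    violations.push('cacheReadTokens exceeds promptTokens: cache not included in total?')
  }
  return violations
}

// Correctly normalized record: 1000 total input, 800 of it from cache.
const goodRecord: UsageRecord = { promptTokens: 1000, completionTokens: 50, cacheReadTokens: 800 }

// Anthropic's input_tokens stored as the total by mistake: 200 "total", 800 cached.
const badRecord: UsageRecord = { promptTokens: 200, completionTokens: 50, cacheReadTokens: 800 }

const goodViolations = usageInvariantViolations(goodRecord)
const badViolations = usageInvariantViolations(badRecord)
```

The check cannot catch every convention mix-up, since a cache smaller than the uncached input would pass, but it flags the heavy-cache calls where the cost error is largest.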

All of this lives in apps/server/src/parsers/ in the repo if you want to see the full versions, including the streaming reassembly and the tier handling I trimmed here.

This is the second gotcha writeup from building Spanlens. The first was on moving LLM logs from Postgres to ClickHouse, if you are weighing that migration.

Spanlens is open source (MIT). If you want the token, cost, and latency of every LLM call logged with a one-line baseURL swap, you can try it free or self-host it with one Docker command.

If this saved you a debugging session, a star on GitHub genuinely helps other people find it.

What gotchas have you hit normalizing usage or cost across providers? I would like to hear them.
