Tracking token usage across OpenAI, Anthropic, and Gemini: every streaming gotcha I hit

A developer building Spanlens, an open-source LLM observability tool, found that OpenAI, Anthropic, and Gemini report token usage differently during streaming, requiring distinct parsing logic for each provider. OpenAI places usage in a final chunk only if requested, Anthropic splits input and output tokens across two events, and Gemini has two stream formats. Additionally, prompt caching accounting differs: OpenAI includes cached tokens in prompt_tokens, while Anthropic reports them separately.

OpenAI, Anthropic, and Gemini each report token usage differently, and it stops being trivia the moment you track LLM cost. I build Spanlens, an open-source LLM observability tool that sits in front of all three as a proxy and records every call with its model, latency, tokens, and cost. To do the cost part I read the token usage back out of every response, including the streaming ones.

I assumed the three providers would report usage in roughly the same way. They send the same kind of data, after all: input tokens, output tokens, maybe a cached count. How different could it be?

Pretty different, it turns out. Here is the whole thing in one table, then each gotcha in detail with the real parser code from the repo.

| Provider | Where usage lives (streaming) | Cache accounting | Field names |
|---|---|---|---|
| OpenAI | final chunk, needs `stream_options: { include_usage: true }` | `prompt_tokens` includes cache | `prompt_tokens` / `completion_tokens` |
| Anthropic | split across `message_start` + `message_delta` | `input_tokens` excludes cache, so add it | `input_tokens` / `output_tokens` |
| Gemini | `usageMetadata`, two stream formats | not applicable | `promptTokenCount` / `candidatesTokenCount` |

For a non-streaming call this is boring. Every provider hands you a usage object on the response body and you read it. Streaming is where it gets weird, because the token counts are not in the content chunks. They show up somewhere else, and "somewhere else" is different for each provider.

OpenAI puts the usage in a final chunk, after all the content, right before `[DONE]`. You only get it if you ask for it with `stream_options: { include_usage: true }`. Miss that flag and you stream the whole response and end up with no usage at all.
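To make the flag concrete, here is a minimal sketch of the request side. The helper names `buildStreamingRequest` and `willReportUsage` are hypothetical and not from the Spanlens repo, and the types cover only the fields discussed here; only `stream` and `stream_options.include_usage` are the real request fields.

```typescript
// Hypothetical types: only the request fields relevant to usage reporting.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant'
  content: string
}

interface StreamingChatRequest {
  model: string
  messages: ChatMessage[]
  stream: true
  stream_options: { include_usage: true }
}

// Hypothetical helper: builds a streaming request body with usage reporting on.
function buildStreamingRequest(model: string, messages: ChatMessage[]): StreamingChatRequest {
  return {
    model,
    messages,
    stream: true,
    // Without this flag the stream never carries a usage chunk.
    stream_options: { include_usage: true },
  }
}

// Hypothetical guard: will this request body produce usage numbers at all?
function willReportUsage(body: {
  stream?: boolean
  stream_options?: { include_usage?: boolean }
}): boolean {
  // Non-streaming responses carry usage on the response body itself.
  if (!body.stream) return true
  return body.stream_options?.include_usage === true
}
```

A guard like `willReportUsage` is one way a proxy could flag streaming requests that will come back with no usage, instead of discovering the gap later as a row with missing token counts.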

```ts
export function parseOpenAIStreamChunk(line: string): Partial<ParsedUsage> | null {
  if (!line.startsWith('data: ')) return null
  const data = line.slice(6).trim()
  if (data === '[DONE]') return null
  const json = JSON.parse(data)
  const usage = json.usage
  if (!usage) return null // most chunks land here; only the last one has usage
  return {
    promptTokens: usage.prompt_tokens ?? 0,
    completionTokens: usage.completion_tokens ?? 0,
    totalTokens: usage.total_tokens ?? 0,
  }
}
```

Anthropic splits it across two different events. The input tokens come early, in `message_start`. The output tokens come at the end, in `message_delta`. If you only listen for one event, half your number is missing.

```js
// input side: arrives in message_start
if (json.type === 'message_start') {
  const usage = json.message?.usage
  return { promptTokens: usage?.input_tokens ?? 0 /* + cache, see below */ }
}
// output side: arrives later, in message_delta
if (json.type === 'message_delta') {
  return { completionTokens: json.usage?.output_tokens ?? 0 }
}
```

So for OpenAI I keep the last chunk, and for Anthropic I have to stitch together the first event and a later one. Two providers, two mental models, already.

Cache accounting is the gotcha that can quietly corrupt your cost numbers, so it is worth slowing down on. Both OpenAI and Anthropic support prompt caching, and both report a cached-token count. The trap is what the "input tokens" number means relative to that cached count.

On OpenAI, `prompt_tokens` already includes the cached tokens. The cached count is a subset of it. If you want the uncached portion you subtract.

On Anthropic, `input_tokens` is the uncached portion only. The cached tokens are reported separately and are not in that number. To get the real total you add them up.

Same idea, opposite math. Here is how I normalize Anthropic so that my `promptTokens` column always means "total input including cache" no matter which provider it came from:

```js
const inputTokens = usage.input_tokens ?? 0
const cacheRead = usage.cache_read_input_tokens ?? 0
const cacheWrite = usage.cache_creation_input_tokens ?? 0
const promptTokens = inputTokens + cacheRead + cacheWrite // Anthropic: add
```

And OpenAI, where the cached count is already inside `prompt_tokens`:

```js
const promptTokens = usage.prompt_tokens ?? 0 // OpenAI: already total
const cacheReadTokens = usage.prompt_tokens_details?.cached_tokens ?? 0 // subset
```

If you write one function and feed both providers through it without thinking about this, you do not get an error. You get a cost number that is wrong by the size of the cache, and cache hits are exactly the high-volume calls where the error is largest. Wrong financial data that never throws is the worst kind of bug, so I now treat the cache convention as a per-provider fact I have to look up rather than guess.

OpenAI and Anthropic both stream server-sent events, lines that start with `data:`. Gemini can do that too, but only if you append `?alt=sse` to the URL. Without it, the default `streamGenerateContent` endpoint streams a single giant JSON array, one big `[ ... ]` delivered character by character.

So a Gemini stream parser has to handle both. Mine tries SSE first, then falls back to parsing the buffer as a JSON array, then falls back again to scanning line by line for anything that looks like a chunk:

```js
// 1. SSE form "data: {json}"
for (const line of lines) {
  if (line.startsWith('data: ')) appendTextFromGeminiChunk(line.slice(6).trim(), parts)
}
// 2. default form: one JSON array streamed char by char
const joined = lines.join('\n').trim()
if (parts.length === 0 && joined.startsWith('[')) {
  for (const item of JSON.parse(joined)) appendTextFromGeminiChunk(JSON.stringify(item), parts)
}
```

The field names are different too. OpenAI gives you `prompt_tokens` and `completion_tokens`. Gemini gives you `promptTokenCount` and `candidatesTokenCount` inside a `usageMetadata` object. None of it lines up, so the normalizer earns its keep.

All three providers can report a service tier (default, flex, priority, and so on), and the cost depends on it.
The thing to know is that the tier in the response is the tier they actually served, which is not always the one you requested. OpenAI can downgrade a priority request to default under load, and that downgrade only shows up in the response. So I always trust the served tier from the response over whatever the request asked for, because that is what you are billed on. Gemini also reports the tier with inconsistent casing, sometimes a plain `flex`, sometimes a `..._FLEX` screaming-snake constant, so that needed its own small coercion step.

If you are normalizing usage across providers, do not write the shared function first. Write one parser per provider, get each one right against real responses, and only then collapse them behind a common shape. The differences are not cosmetic. Where the number lives, whether cache is included, and what the field is called all change per provider, and a single early abstraction hides exactly the parts that differ.

The other lesson is to assert on cost-bearing numbers loudly. Type errors get caught on the first request in dev. A token count that is off by the cache size ships silently and shows up as a billing discrepancy weeks later. That asymmetry is worth a test.

All of this lives in `apps/server/src/parsers/` in the repo if you want to see the full versions, including the streaming reassembly and the tier handling I trimmed here.

This is the second gotcha writeup from building Spanlens. The first was on [moving LLM logs from Postgres to ClickHouse](https://dev.to/spanlens/5-gotchas-i-hit-moving-llm-logs-from-postgres-to-clickhouse-2458), if you are weighing that migration.

Spanlens is open source (MIT). If you want the token, cost, and latency of every LLM call logged with a one-line `baseURL` swap, you can [try it free](https://www.spanlens.io) or self-host it with one Docker command. If this saved you a debugging session, [a star on GitHub](https://github.com/spanlens/spanlens) genuinely helps other people find it.

What gotchas have you hit normalizing usage or cost across providers? I would like to hear them.