{"slug": "tracking-token-usage-across-openai-anthropic-and-gemini-every-streaming-gotcha-i", "title": "Tracking token usage across OpenAI, Anthropic, and Gemini: every streaming gotcha I hit", "summary": "A developer building Spanlens, an open-source LLM observability tool, found that OpenAI, Anthropic, and Gemini report token usage differently during streaming, requiring distinct parsing logic for each provider. OpenAI places usage in a final chunk only if requested, Anthropic splits input and output tokens across two events, and Gemini has two stream formats. Additionally, prompt caching accounting differs: OpenAI includes cached tokens in prompt_tokens, while Anthropic reports them separately.", "body_md": "OpenAI, Anthropic, and Gemini each report token usage differently, and it stops being trivia the moment you track LLM cost. I build Spanlens, an open-source LLM observability tool that sits in front of all three as a proxy and records every call with its model, latency, tokens, and cost. To do the cost part I read the token usage back out of every response, including the streaming ones.\n\nI assumed the three providers would report usage in roughly the same way. They send the same kind of data, after all: input tokens, output tokens, maybe a cached count. How different could it be.\n\nPretty different, it turns out. 
Here is the whole thing in one table, then each gotcha in detail with the real parser code from the repo.\n\n| Provider | Where usage lives (streaming) | Cache accounting | Field names |\n|---|---|---|---|\n| OpenAI | final chunk, needs `stream_options: { include_usage: true }` | `prompt_tokens` includes cache | `prompt_tokens` / `completion_tokens` |\n| Anthropic | split across `message_start` + `message_delta` | `input_tokens` excludes cache, so add it | `input_tokens` / `output_tokens` |\n| Gemini | `usageMetadata`, two stream formats | not applicable | `promptTokenCount` / `candidatesTokenCount` |\n\nFor a non-streaming call this is boring. Every provider hands you a `usage` object on the response body and you read it. Streaming is where it gets weird, because the token counts are not in the content chunks. They show up somewhere else, and \"somewhere else\" is different for each provider.\n\nOpenAI puts the usage in a final chunk, after all the content, right before `[DONE]`. You only get it if you ask for it with `stream_options: { include_usage: true }`. Miss that flag and you stream the whole response and end up with no usage at all.\n\n```\nexport function parseOpenAIStreamChunk(line: string): Partial<ParsedUsage> | null {\n  if (!line.startsWith('data: ')) return null\n  const data = line.slice(6).trim()\n  if (data === '[DONE]') return null\n  const json = JSON.parse(data)\n  const usage = json.usage\n  if (!usage) return null   // most chunks land here; only the last one has usage\n  return {\n    promptTokens: usage.prompt_tokens ?? 0,\n    completionTokens: usage.completion_tokens ?? 0,\n    totalTokens: usage.total_tokens ?? 0,\n  }\n}\n```\n\nAnthropic splits it across two different events. The input tokens come early, in `message_start`. The output tokens come at the end, in `message_delta`. 
If you only listen for one event, half your number is missing.\n\n```\n// input side: arrives in message_start\nif (json.type === 'message_start') {\n  const usage = json.message?.usage\n  return { promptTokens: usage?.input_tokens ?? 0 /* + cache, see below */ }\n}\n\n// output side: arrives later, in message_delta\nif (json.type === 'message_delta') {\n  return { completionTokens: json.usage?.output_tokens ?? 0 }\n}\n```\n\nSo for OpenAI I keep the last chunk, and for Anthropic I have to stitch together the first event and a later one. Two providers, two mental models, already.\n\nCache accounting is the gotcha that can quietly corrupt your cost numbers, so it is worth slowing down on.\n\nBoth OpenAI and Anthropic support prompt caching, and both report a cached-token count. The trap is what the \"input tokens\" number means relative to that cached count.\n\nOn OpenAI, `prompt_tokens` already includes the cached tokens. The cached count is a subset of it. If you want the uncached portion you subtract.\n\nOn Anthropic, `input_tokens` is the uncached portion only. The cached tokens are reported separately and are not in that number. To get the real total you add them up.\n\nSame idea, opposite math. Here is how I normalize Anthropic so that my `promptTokens` column always means \"total input including cache\" no matter which provider it came from:\n\n``` js\nconst inputTokens = usage.input_tokens ?? 0\nconst cacheRead = usage.cache_read_input_tokens ?? 0\nconst cacheWrite = usage.cache_creation_input_tokens ?? 0\nconst promptTokens = inputTokens + cacheRead + cacheWrite   // Anthropic: add\n```\n\nAnd OpenAI, where the cached count is already inside `prompt_tokens`:\n\n``` js\nconst promptTokens = usage.prompt_tokens ?? 0               // OpenAI: already total\nconst cacheReadTokens = usage.prompt_tokens_details?.cached_tokens ?? 0  // subset\n```\n\nIf you write one function and feed both providers through it without thinking about this, you do not get an error. 
You get a cost number that is wrong by the size of the cache, and cache hits are exactly the high-volume calls where the error is largest. Wrong financial data that never throws is the worst kind of bug, so I now treat the cache convention as a per-provider fact I have to look up rather than guess.\n\nOpenAI and Anthropic both stream server-sent events, lines that start with `data:`. Gemini can do that too, but only if you append `?alt=sse` to the URL. Without it, the default `streamGenerateContent` endpoint streams a single giant JSON array, one big `[ ... ]` delivered character by character.\n\nSo a Gemini stream parser has to handle both. Mine tries SSE first, then falls back to parsing the buffer as a JSON array, then falls back again to scanning line by line for anything that looks like a chunk:\n\n``` js\n// 1. SSE form (\"data: {json}\")\nfor (const line of lines) {\n  if (line.startsWith('data: ')) appendTextFromGeminiChunk(line.slice(6).trim(), parts)\n}\n// 2. default form: one JSON array streamed char by char\nconst joined = lines.join('\\n').trim()\nif (parts.length === 0 && joined.startsWith('[')) {\n  for (const item of JSON.parse(joined)) appendTextFromGeminiChunk(JSON.stringify(item), parts)\n}\n```\n\nThe field names are different too. OpenAI gives you `prompt_tokens` and `completion_tokens`. Gemini gives you `promptTokenCount` and `candidatesTokenCount` inside a `usageMetadata` object. None of it lines up, so the normalizer earns its keep.\n\nAll three providers can report a service tier (default, flex, priority, and so on), and the cost depends on it. The thing to know is that the tier in the response is the tier they actually served, which is not always the one you requested. OpenAI can downgrade a priority request to default under load, and that downgrade only shows up in the response. 
So I always trust the served tier from the response over whatever the request asked for, because that is what you are billed on.\n\nGemini also reports the tier with inconsistent casing, sometimes a plain `flex`, sometimes a `..._FLEX` screaming-snake constant, so that needed its own small coercion step.\n\nIf you are normalizing usage across providers, do not write the shared function first. Write one parser per provider, get each one right against real responses, and only then collapse them behind a common shape. The differences are not cosmetic. Where the number lives, whether cache is included, and what the field is called all change per provider, and a single early abstraction hides exactly the parts that differ.\n\nThe other lesson is to assert on cost-bearing numbers loudly. Type errors get caught on the first request in dev. A token count that is off by the cache size ships silently and shows up as a billing discrepancy weeks later. That asymmetry is worth a test.\n\nAll of this lives in `apps/server/src/parsers/` in the repo if you want to see the full versions, including the streaming reassembly and the tier handling I trimmed here.\n\nThis is the second gotcha writeup from building Spanlens. The first was on [moving LLM logs from Postgres to ClickHouse](https://dev.to/spanlens/5-gotchas-i-hit-moving-llm-logs-from-postgres-to-clickhouse-2458), if you are weighing that migration.\n\n**Spanlens is open source (MIT).** If you want the token, cost, and latency of every LLM call logged with a one-line baseURL swap, you can [try it free](https://www.spanlens.io) or self-host it with one Docker command.\n\nIf this saved you a debugging session, a [star on GitHub](https://github.com/spanlens/spanlens) genuinely helps other people find it.\n\nWhat gotchas have you hit normalizing usage or cost across providers? 
I would like to hear them.", "url": "https://wpnews.pro/news/tracking-token-usage-across-openai-anthropic-and-gemini-every-streaming-gotcha-i", "canonical_source": "https://dev.to/spanlens/tracking-token-usage-across-openai-anthropic-and-gemini-every-streaming-gotcha-i-hit-4mf3", "published_at": "2026-06-20 09:33:40+00:00", "updated_at": "2026-06-20 09:36:36.168774+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-infrastructure"], "entities": ["OpenAI", "Anthropic", "Gemini", "Spanlens"], "alternates": {"html": "https://wpnews.pro/news/tracking-token-usage-across-openai-anthropic-and-gemini-every-streaming-gotcha-i", "markdown": "https://wpnews.pro/news/tracking-token-usage-across-openai-anthropic-and-gemini-every-streaming-gotcha-i.md", "text": "https://wpnews.pro/news/tracking-token-usage-across-openai-anthropic-and-gemini-every-streaming-gotcha-i.txt", "jsonld": "https://wpnews.pro/news/tracking-token-usage-across-openai-anthropic-and-gemini-every-streaming-gotcha-i.jsonld"}}