{"slug": "the-ai-token-plumbing-issue", "title": "The AI Token plumbing issue", "summary": "Lago released the Agent SDK, a set of Python and TypeScript libraries that wrap LLM clients to automatically capture token usage data for billing purposes. The tool addresses the widespread problem of teams building custom middleware to extract token counts from different AI providers, which breaks whenever provider schemas change. By standardizing token attribution across OpenAI, Anthropic, and Bedrock, the SDK aims to help both B2B SaaS companies and AI-native startups track variable costs per customer without maintaining separate billing infrastructure.", "body_md": "###### AI Billing is (mostly) token plumbing\n\nRaffi Sarkissian • 5 min read\n\nMay 26\n\n*Why we built the Lago Agent SDK, and what we're shipping next.*\n\nWe just released the Lago Agent SDK. Two libraries, Python and TypeScript. They wrap your LLM client and send token usage to Lago for billing. That's the surface.\n\nThe point is what you stop doing.\n\nEvery team that shipped an AI feature in the last 18 months built the same thing. Smart search, inbox triage, meeting summaries, coding agents, vibe-coded apps. All of them ended up writing token-extraction middleware.\n\nThe middleware is the same job, repeated everywhere. Call an LLM. Parse the response for token counts. Attribute the call to a customer. Send the count to a billing system. Repeat for every provider, every model family, every streaming response, every retry, every cached call.\n\nEvery provider returns usage in a different shape:\n\n```\nopenai_resp.usage.prompt_tokens\nanthropic_resp.usage.input_tokens          # plus cache_creation_input_tokens, cache_read_input_tokens\nbedrock_resp[\"usage\"][\"inputTokens\"]       # camelCase, dict access, no cache fields at this level\n```\n\nCache tokens have sub-types. Streaming responses bury usage in the last event, sometimes. 
Reasoning tokens are folded into output on some models, broken out on others. The schemas change every quarter.\n\nThis is the token plumbing. Not differentiating, not what your AI feature is for, and it breaks every time a provider ships an update.\n\nThe B2B SaaS team adding AI to an existing product. Intercom shipping Fin on top of seat-based pricing. Notion layering AI as a per-seat add-on. Atlassian Intelligence rolling out across Jira and Confluence. The team has billed per-seat for years and now needs to charge for inference-backed features without rewriting the engine. Product wants AI live in two weeks. Engineering owns a sidecar nobody wants to maintain. The CFO wants to know if the feature has positive margin. Nobody can answer cleanly because token data lives in logs, not invoices.\n\nThe AI-native team building on top of LLMs. Cursor, Lovable, Replit, voice and browser agents. They pay a per-token rate to a model provider and bill the user with margin on top. Cost-plus, end to end. Every point of margin matters because COGS is variable per-customer and tracked in real time. Under-count and they bleed margin. Over-count and they lose trust. The middleware has to be exact, every release, for every model they add.\n\nBoth groups built the same plumbing. We're tired of building it.\n\nBefore, billing an LLM call looked something like this.\n\n```\nresp = client.converse(modelId=\"...\", messages=[...])\n\nusage = resp[\"usage\"]\nbilling.send_event(customer_id, \"llm_input_tokens\",  usage[\"inputTokens\"])\nbilling.send_event(customer_id, \"llm_output_tokens\", usage[\"outputTokens\"])\nbilling.send_event(customer_id, \"llm_cache_read\",    usage.get(\"cacheReadInputTokens\", 0))\n# ... repeat for cache writes, tool calls, reasoning tokens, streaming chunks\n# ... 
then write it all again, differently, for the next provider you add\n```\n\nAfter, you wrap the client once.\n\n```\n# OpenAI\nclient = sdk.wrap(OpenAI())\nclient.chat.completions.create(model=\"gpt-4o\", messages=[...])\n\n# Anthropic\nclient = sdk.wrap(Anthropic())\nclient.messages.create(model=\"claude-sonnet-4-5\", messages=[...])\n\n# Bedrock\nclient = sdk.wrap(boto3.client(\"bedrock-runtime\"))\nclient.converse(modelId=\"...\", messages=[...])\n\n# token attribution happens automatically, per customer, across every provider\n```\n\nWhat lands in billing tells the story.\n\nOld world. Anthropic returns one shape:\n\n```\n{\n  \"model\": \"claude-sonnet-4-5\",\n  \"usage\": {\n    \"input_tokens\": 1200,\n    \"output_tokens\": 340,\n    \"cache_creation_input_tokens\": 800,\n    \"cache_read_input_tokens\": 4000\n  }\n}\n```\n\nOpenAI returns another:\n\n```\n{\n  \"model\": \"gpt-4o\",\n  \"usage\": {\n    \"prompt_tokens\": 1200,\n    \"completion_tokens\": 340,\n    \"prompt_tokens_details\": { \"cached_tokens\": 4000 }\n  }\n}\n```\n\nDifferent field names, different nesting, different cache semantics. You write one extractor per provider, map the fields, send one event per dimension. Then a model adds a new field and you do it again.\n\nNew world. The SDK normalizes both into the same canonical shape and batches them to Lago:\n\n```\n{\n  \"external_subscription_id\": \"sub_acme\",\n  \"events\": [\n    { \"code\": \"llm_input_tokens\",         \"properties\": { \"value\": 1200 } },\n    { \"code\": \"llm_output_tokens\",        \"properties\": { \"value\": 340  } },\n    { \"code\": \"llm_cached_input_tokens\",  \"properties\": { \"value\": 4000 } },\n    { \"code\": \"llm_cache_creation_tokens\",\"properties\": { \"value\": 800  } }\n  ]\n}\n```\n\nSame event shape regardless of provider. Customer attribution is automatic. 
Cache fields populate when the provider returns them, stay absent when it doesn't.\n\nThe wrapped client behaves identically to the original. Same arguments, same return shape, same exceptions. The SDK extracts usage from every response, normalizes it across providers, attributes it to a customer subscription, and streams events to Lago in batches. Overhead in the low milliseconds. If anything in the SDK fails, the LLM call still returns.\n\nNo migration. The application calls the model the same way it did yesterday.\n\nMost teams have infrastructure around their LLM calls. Edge proxies for caching repeated prompts. AI gateways for fallback routing and rate limits. Observability layers for latency and error tracking. Edge inference hosts for region-locality. These layers protect margin and user experience.\n\nThe SDK composes with them. It runs in your application process, alongside whatever you already use. If your stack runs through Cloudflare AI Gateway, the Gateway keeps doing its job and the SDK reads the response that comes back through it. Same for Bedrock with API Gateway in front, an edge setup on Workers AI, or a self-hosted LiteLLM proxy.\n\nTwo layers, two jobs. Your existing stack knows about your traffic: what got cached, what got retried, what was slow. The SDK knows about your customers: which subscription this call belongs to, what feature it was billed against, what margin tier the customer is on. Caching savings show up in your cost line. Token counts show up on the customer's invoice. Both layers see the same response, so the math agrees.\n\nThe SDK gets tokens out of the response and into billing. It does not yet tell you what those tokens cost.\n\nIf you're billing cost-plus today, you maintain your own pricing table. Per-model input rate. Per-model output rate. Cache read and cache write with separate TTL tiers. Long-context surcharges. Reasoning tokens. The table moves every time a provider posts a blog. 
You're updating a YAML file in your repo and hoping nobody forgot the last change.\n\nThe next thing we're shipping is the table itself. Lago maintains current per-model pricing for every major provider. You set a markup. We compute cost from the token counts the SDK already captures, apply your margin, and charge the customer. You stop tracking provider price changes. You stop reconciling cost-plus math at month-end.\n\nFor AI-native teams, that's pass-through cost with a clean markup, kept honest by infrastructure that updates when the providers update. For B2B SaaS adding AI features, the same table answers the margin question the CFO keeps asking, without anyone maintaining a spreadsheet.\n\nThe gap between \"the LLM returned tokens\" and \"the customer got billed for tokens.\" Every customer-facing team building AI owns it. Most have a half-finished plan to extend it for the next provider.\n\nIt's the most code per dollar of value of anything in your stack. Someone has to own it. It should not be every team in the industry, in parallel, separately, forever.\n\nThe libraries are on GitHub today.\n\n`getlago/lago-agent-sdk-python`\n\n`getlago/lago-agent-sdk-js`\n\n`docs.getlago.com/guide/ai-agents/agent-sdk`", "url": "https://wpnews.pro/news/the-ai-token-plumbing-issue", "canonical_source": "https://getlago.com/blog/ai-billing-is-mostly-token-plumbing", "published_at": "2026-05-26 23:13:41+00:00", "updated_at": "2026-05-26 23:37:51.672781+00:00", "lang": "en", "topics": ["ai-infrastructure", "ai-tools", "ai-products", "large-language-models", "ai-agents"], "entities": ["Lago", "Lago Agent SDK", "Raffi Sarkissian", "OpenAI", "Anthropic", "AWS Bedrock"], "alternates": {"html": "https://wpnews.pro/news/the-ai-token-plumbing-issue", "markdown": "https://wpnews.pro/news/the-ai-token-plumbing-issue.md", "text": "https://wpnews.pro/news/the-ai-token-plumbing-issue.txt", "jsonld": "https://wpnews.pro/news/the-ai-token-plumbing-issue.jsonld"}}