The LLM Cost Cliff Your Budget Isn't Ready For

Per-token prices for large language models are collapsing, but AI bills are exploding as reasoning models consume far more tokens per task. Uber reportedly burned through a year's AI budget in four months, and Microsoft, Salesforce, and GitHub have moved to rein in employee AI spending. Confusing falling per-unit costs with total budget impact is leading teams to adopt unsustainable cost structures.

Per-token prices are collapsing, yet AI bills keep exploding. The two facts aren't a contradiction, and confusing them will wreck your business case.

By Priya Nair

Uber reportedly burned through an entire year's AI budget in four months. Microsoft, Salesforce, and GitHub have all moved to rein in what employees spend on AI tooling. Meanwhile a founder casually mentions that running TypeScript type fixes across 50 files with a frontier model cost him $54 in one afternoon. If you're building a product on top of these APIs, those anecdotes should worry you more than any benchmark.

Here's the thing the "are LLM costs sustainable?" debate keeps getting wrong: it's actually two questions wearing one coat. Question one is whether the price to run a fixed capability is falling. It is, fast, and it isn't being secretly subsidized. Question two is whether your AI budget and the vendors' flat-rate business models survive contact with real usage. The answer there is a much harder no. Conflating the two is how teams talk themselves into a cost structure that quietly detonates 12 months later.

Per-token deflation is real, and it's not a subsidy

The deflation case is strong, and worth conceding up front because it's true. Epoch AI (https://epoch.ai), whose work is cited across the cost analyses, puts the drop in inference cost at a fixed quality level at roughly 50x per year by the median estimate, which works out to cost halving about every two months. a16z's Guido Appenzeller named this "LLMflation" and clocked it at about 10x per year, faster than Moore's Law ever moved. The hard numbers back it: GPT-4 launched in March 2023 at $30 per million input tokens; GPT-4.1 delivers that tier of reasoning at $2.00, and GPT-5 mini brings reasoning down to $0.25.
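The relationship between an annual deflation factor and a halving time is plain arithmetic, and it is worth checking rather than taking on faith. The sketch below is illustrative only: the helper names `halving_time_months` and `price_ratio` are hypothetical, and the inputs are the figures quoted above (50x and 10x per year; $30, $2.00, and $0.25 per million input tokens).

```python
import math


def halving_time_months(annual_factor: float) -> float:
    """Months for a cost to halve if it falls by `annual_factor` per year.

    Assumes smooth exponential decline, which is a simplification.
    """
    return 12 / math.log2(annual_factor)


def price_ratio(old_price: float, new_price: float) -> float:
    """How many times cheaper the new list price is than the old one."""
    return old_price / new_price


# Annual deflation estimates quoted in the article.
for label, factor in [("Epoch AI median, ~50x/yr", 50), ("a16z LLMflation, ~10x/yr", 10)]:
    print(f"{label}: cost halves every {halving_time_months(factor):.1f} months")

# List prices per million input tokens quoted in the article.
print(f"GPT-4 -> GPT-4.1:    {price_ratio(30.0, 2.00):.0f}x cheaper")
print(f"GPT-4 -> GPT-5 mini: {price_ratio(30.0, 0.25):.0f}x cheaper")
```

Under these assumptions a 50x annual decline halves cost roughly every 2.1 months, consistent with the "about every two months" figure, while 10x per year halves it roughly every 3.6 months.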
[Figure: bar chart, "Input price per million tokens, GPT-4 reasoning tier". Y-axis: USD / 1M input tokens. GPT-4 (2023): 30; GPT-4.1: 2; GPT-5 mini: 0.25.]

The usual rebuttal is that these prices are venture money in disguise. They're not, at least not at the floor. The cleanest proof is the open-weight market on OpenRouter (https://openrouter.ai): independent providers hosting open models have no moat and no reason to lose money on inference, so their pricing approximates true cost. DeepSeek hosted that way runs around $0.27 per million input tokens and $1.10 per million output tokens, well under GPT-4o's $1.25/$10, while scoring better on benchmarks. When unsubsidized commodity hosting undercuts a frontier lab by more than 4x on input and about 9x on output, the floor is real. Cost-per-capability is genuinely cratering, somewhere between 9x and 900x per year depending on which benchmark you fix.

So if you only read that side of the ledger, you'd conclude costs are heading to zero and stop worrying. That's the trap.

The token bomb eats the deflation

Falling price per token doesn't help if each task now burns far more tokens. And it does. Reasoning models don't just answer; they generate thousands of internal "thinking" tokens first. Epoch's data shows reasoning-model output length growing about 5x per year against 2.2x for traditional models. The cost paradox writes itself: per-unit prices fall ~10x a year while per-task consumption climbs fast enough to swamp it.

The variance between models for identical output is the part that should change how you pick one. In one benchmark run, a simple model used 7 tokens to answer a question, Claude with thinking used 255, and an aggressive reasoning model spent 603 tokens to say the same thing. Extrapolated across a test suite, that was roughly $9.30 versus $95, a 10x spread for the same answer. Choosing the wrong model isn't a rounding error; it's your margin.

Now layer demand growth on top.
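A minimal sketch of how compounding changes an annual total, assuming a bill that grows by a fixed percentage every month. The `annual_spend` helper is hypothetical, and the $60,000 starting bill and 5% monthly growth are the inputs of the worked model discussed next; real usage growth is rarely this smooth.

```python
def annual_spend(first_month: float, monthly_growth: float, months: int = 12) -> float:
    """Sum a bill that starts at `first_month` and grows by `monthly_growth` each month."""
    total = 0.0
    bill = first_month
    for _ in range(months):
        total += bill               # pay this month's bill
        bill *= 1 + monthly_growth  # next month's bill is larger
    return total


# Naive estimate: twelve copies of the first month's bill.
naive = 60_000 * 12

# Compounded estimate: 5% month-over-month growth (assumed constant).
compounded = annual_spend(60_000, 0.05)

print(f"naive:      ${naive:,.0f}")
print(f"compounded: ${compounded:,.0f}")
print(f"gap:        ${compounded - naive:,.0f}")
```

With these inputs the compounded total comes to about $955,000 against a naive $720,000, so roughly a third more spend appears without any change in price.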
One worked model from the routing-tools crowd starts a SaaS team at $60,000/month on GPT-4o, growing 5% month over month. Naive math (12 times $60K) says $720K a year. Reality is about $955,000, because usage compounds. Enterprise inference spend hit roughly $8.4 billion by mid-2025, more than doubling in six months even as per-token prices fell. Deflation and your growing bill are both real at once. They're just not about the same number.

The flat-rate subscription is the thing actually dying

The vendor economics make the consumer side look even shakier. Estimates peg OpenAI's 2025 revenue near $13 billion against a possible $12 billion loss in a single quarter, with break-even not projected until 2030. Anthropic targets 2028. Training costs explain part of it (GPT-3 around $3 million, GPT-4 near $100 million, GPT-5 possibly $2 billion), but those are one-off capex. The structural problem is the recurring cost of serving queries against flat pricing.

The math on a $20/month plan is brutal. A simple query might cost the vendor a cent or two; a reasoning query can cost $0.10 to $0.50. By one estimate, two complex questions a day burn through the entire value of a $20 subscription, and roughly 98% of ChatGPT users pay nothing at all. Casual users subsidize power users, and power users are exactly who an unlimited plan attracts. Anthropic's Claude Code "Max Unlimited" tried every defense (10x premium pricing at $200/month, autoscaling from Opus down to Sonnet under load, offloading compute to customer machines) and still got run over by users consuming billions of tokens.

Pay-per-use is the only honest model here, and consumers hate it. They'll happily overpay a fixed fee and revolt at a metered one. That tension, not silicon, is what kills consumer AI margins.

What to actually do before you scale

The deflation-optimists and the apocalypse crowd both miss the operational point, and it's the one that matters for your roadmap.
The savings exist but they don't capture themselves, and the costs explode for reasons inside your own architecture, not the price sheet.

Treat the model as a swappable dependency from day one. The one moat traditional software had (you can't rip out a CRM in an afternoon) does not exist here. A gateway lets you change providers in seconds, even per-request. That's a gift: build to it. Hard-coding one vendor's SDK and pricing assumptions into your core path is the modern equivalent of writing against a single cloud's proprietary API.

Route by task, don't migrate by quarter. Manual model switching, even by a disciplined team doing two migrations a year, captures only about 25% of available savings because re-tuning prompts and regression-testing eat months at full price first. Per-prompt routing (cheap models for simple calls, frontier models reserved for genuine reasoning) captures closer to 60%. On that $60K/month example the gap between the two approaches was over $330,000 a year. The catch is real: you can't drop a reasoning prompt onto a non-reasoning model and expect the same output, so routing requires continuous evaluation, not vibes.

Instrument tokens like you instrument latency. Demos are cheap; production is not, and the blowups usually trace to architectural decisions (retry storms, bloated context windows, stuffing whole documents into every call) rather than the headline per-token rate. If you can't see tokens-per-request per route, you can't control the bill.

Don't anchor your unit economics on today's flat rate. If a vendor is losing money per query, your favorable pricing is a temporary subsidy with two exits: efficiency gains passed to you, or price hikes. Plan for the second. The Uber playbook (price below cost, capture the market, raise prices once incumbents are dead) is the obvious template, and your cost model should survive it.

The take

The optimists are right that the cost to run a fixed capability is falling and not faked.
The pessimists are right that current business cases and flat-rate plans are unsustainable. Both are right, because they're answering different questions. For a working developer the synthesis is the only thing that pays rent: per-token deflation is a tailwind you only catch if you build for switchability and route per task, while reasoning-token inflation, demand growth, and sloppy architecture are the headwinds that will eat that tailwind and then some.

The cost cliff is real. It just isn't where the price-per-token chart is pointing. Design for zero switching cost now, measure your tokens, and never bet your margins on a number a vendor is currently losing money to show you.

Sources & further reading

- Why current LLM costs are not sustainable (aditya.patadia.org): https://aditya.patadia.org/p/ai-and-cloud-costs
- Who's Paying for your Prompt? LLM pricing & sustainability, The Information Difference (informationdifference.com): https://www.informationdifference.com/whos-paying-for-your-prompt/
- The LLM Cost Paradox: How "Cheaper" AI Models Are Breaking Budgets (ikangai.com): https://www.ikangai.com/the-llm-cost-paradox-how-cheaper-ai-models-are-breaking-budgets/
- No, the LLM costs are not going up, Simian Words (simianwords.bearblog.dev): https://simianwords.bearblog.dev/conclusive-proofs-that-llm-costs-are-going-down/
- How to Reduce LLM Costs: The Hidden Cost of LLMflation and Model Inertia, Divyam.AI (divyam.ai): https://divyam.ai/blog/hidden-cost-of-llmflation/