The AI Cost Paradox: 280x Cheaper, Bills Still Rising

The cost of running a capable AI model fell by roughly 280 times between November 2022 and October 2024, yet the average company's AI bill rose more than 300% over the same period. The paradox arises because cheaper tokens lead to increased usage, with agents and retrieval-augmented generation driving far more model calls per task. This structural pattern mirrors historical efficiency gains in computing, where lower unit costs expand consumption rather than reduce spending.

The cost of running a capable AI model fell by roughly 280 times in two years. Over the same stretch, the average company's AI bill went up, not down. Both numbers are real, both come from credible research, and the space between them is the single most useful thing an operator can understand about AI economics in 2026. It explains why "the models keep getting cheaper" and "our AI spend is out of control" are being said in the same meeting, by the same people, about the same systems.

I watch this play out in client projects every month. Someone reads that token prices collapsed, assumes their costs are about to fall off a cliff, and then opens an invoice that did the opposite. The confusion is not a billing error. It is a structural feature of how AI is now built, and once you see the mechanism you can plan around it instead of being surprised by it.

Start with the collapse, because it is genuinely staggering. Stanford's 2026 AI Index (https://hai.stanford.edu/ai-index/2026-ai-index-report) finds that GPT-3.5-level performance became about 280 times cheaper between November 2022 and October 2024, falling from roughly 20 dollars per million tokens to about 7 cents. That is not a typo and it is not a one-off. Epoch AI measures a median decline near 50 times per year for equal capability. The venture firm a16z frames the same trend more conservatively at around 10 times per year, which they point out is still faster than compute fell in the PC era or bandwidth fell during the dotcom build-out.

The frontier did the same thing in public. When Anthropic shipped Claude Opus 4.5 in November 2025, it cut the flagship price from 15 and 75 dollars per million input and output tokens to 5 and 25, a 67 percent reduction in a single release. What happened next is the part people miss. Anthropic then held that 5-and-25 price across Opus 4.6, 4.7, and 4.8 while the model kept getting better.
The per-token price stopped falling and capability kept climbing, which is its own kind of price cut.

The trigger for most of this was competition from below. DeepSeek R1 landed in January 2025 at 55 cents per million input tokens while scoring around 95 percent of OpenAI's o1, and the major labs responded with emergency price moves. By mid-2026 the floor is remarkable. OpenAI's GPT-5.4-nano runs at 20 cents input and 1.25 dollars output per million. DeepSeek V4 Pro, an open-weights model you can host yourself, sits near 44 cents input. Google's Gemini 3.5 Flash beats the previous generation's Pro tier on agent benchmarks at 1.50 and 9 dollars. On paper, intelligence has never been this cheap to rent.

Here is the paradox stated plainly. Per-token prices fell by a factor of hundreds, and by one estimate the average enterprise AI bill still rose more than 300 percent over the same window. I treat the exact magnitude of that spend figure as indicative rather than gospel, because it comes from a secondary analysis, but the direction is confirmed everywhere and the reason is structural, not accidental. Cheaper tokens get spent, not saved.

The thing you are buying changed shape. In 2023 a typical interaction was one prompt and one answer, a few thousand tokens, one model call. In 2026 the same business outcome runs through an agent that fires somewhere between 10 and 20 model calls for a single user task. It plans, it calls a tool, it reads the result, it re-plans, it checks its own work, it writes a commit message. Retrieval-augmented generation inflates the context of each of those calls by stuffing in three to five times more reference text. And the agent does not go home at night. Monitoring agents and always-on assistants bill around the clock whether anyone is watching or not. So the unit got 280 times cheaper, and total consumption went up by more than that: more calls per job, more context per call, and more hours billed.

This is the same pattern every efficiency gain in computing has followed.
Cheaper storage did not shrink data centers, it gave us video everywhere. Cheaper bandwidth did not lower the average person's internet bill, it gave us streaming. Cheaper intelligence is not lowering AI spend, it is making agents economically possible, and agents are hungry. For anyone running a product on top of an API, that is the line that matters: a workload that cost a cent yesterday is a loop that costs fifteen cents today, and the loop is what makes the product good.

If you want a single event that marks the turn, it is GitHub Copilot. On the first of June 2026, GitHub moved every Copilot plan to usage-based billing. Premium request units were replaced by AI Credits priced at one cent each, metered against input, output, and cached tokens at each model's published rate. The cheaper fallback model that used to absorb overflow is gone. When your credits run out you either set a budget or you stop.

The reason GitHub gave is the clearest sentence anyone has written about this whole shift. With agents and subagents in the picture, the company said, "it is now common for a handful of requests to incur costs that exceed the plan price." Read that again with your own product in mind. A flat monthly subscription assumes a roughly predictable amount of work per user. Agentic software breaks that assumption, because one motivated user pointing an agent at a hard problem can burn a month of margin in an afternoon.

Everyone building on these APIs is now living in the world GitHub just formalized. Providers split pricing into short-context and long-context tiers. They charge per tool call for search and computer use. They sell priority lanes at 2.5 times the base rate and offer cached-input discounts up to 90 percent to reward architectures that reuse prompts. The flat-rate, all-you-can-eat plan was a product of an era when a call was a call. That era is closing, and pricing your own AI product as if it were still open is how you wake up subsidizing your heaviest users.

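
To make that arithmetic concrete, here is a minimal sketch of metering one task and capping spend. It assumes the 5-and-25 dollar flagship price quoted earlier, a 90 percent cached-input discount, 15 calls per task, and four times the context per call. The token counts, the cached share, and the names `Rates`, `call_cost`, `task_cost`, and `BudgetMeter` are all illustrative, not any provider's API or price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Illustrative prices in dollars per million tokens (not a real price list)."""
    input_per_m: float
    output_per_m: float
    cached_discount: float = 0.9  # assumed fraction taken off cached input tokens


def call_cost(rates: Rates, input_tokens: int, output_tokens: int,
              cached_tokens: int = 0) -> float:
    """Dollar cost of one model call; cached_tokens is the cached share of the input."""
    fresh_tokens = input_tokens - cached_tokens
    cached_rate = rates.input_per_m * (1 - rates.cached_discount)
    total = (fresh_tokens * rates.input_per_m
             + cached_tokens * cached_rate
             + output_tokens * rates.output_per_m)
    return total / 1_000_000


def task_cost(rates: Rates, calls: int, input_tokens: int, output_tokens: int,
              cached_tokens: int = 0) -> float:
    """Cost of one user task that fans out into `calls` identical model calls."""
    return calls * call_cost(rates, input_tokens, output_tokens, cached_tokens)


class BudgetMeter:
    """Refuses work once a spending cap would be exceeded."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> bool:
        if self.spent_usd + cost_usd > self.limit_usd:
            return False  # over budget: the caller stops or raises the cap
        self.spent_usd += cost_usd
        return True


# Assumed workload: every number below is made up for illustration.
rates = Rates(input_per_m=5.0, output_per_m=25.0)
single_call = task_cost(rates, calls=1, input_tokens=2_000, output_tokens=500)
agent_loop = task_cost(rates, calls=15, input_tokens=8_000, output_tokens=500)
agent_cached = task_cost(rates, calls=15, input_tokens=8_000, output_tokens=500,
                         cached_tokens=6_000)

print(f"single call:       ${single_call:.4f}")
print(f"agent loop:        ${agent_loop:.4f} ({agent_loop / single_call:.0f}x)")
print(f"agent loop cached: ${agent_cached:.4f}")

# How many uncached agent tasks fit under a flat 24-dollar budget?
meter = BudgetMeter(limit_usd=24.0)
tasks_served = 0
while meter.charge(agent_loop):
    tasks_served += 1
print(f"agent tasks that fit in a $24 budget: {tasks_served}")
```

Under these assumed numbers the agent loop costs about 35 times the single call at the same per-token price, and reusing most of the prompt from cache roughly halves it. The meter is deliberately blunt: it answers the "set a budget or stop" question before a call is made rather than after the invoice arrives.
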
The second force reshaping the economics is that the cheap option got genuinely good. For most of the last three years, "open-weights" meant "almost as good, if you squint." That is no longer true at the top. On Artificial Analysis's intelligence benchmark in April 2026, the best open models scored around 54 against 60 for the strongest closed flagship, a gap of a few points rather than a generation. Nine of the thirteen models on the intelligence-versus-price frontier are open weight. Stanford's same index puts the gap between the top US and top Chinese model at 2.7 percent as of March 2026, down from 17 to 31 points in 2023.

What this means in practice is that you are no longer choosing between an expensive model that works and a free one that does not. You are choosing along a curve, and most of that curve is now usable. A model like DeepSeek V4 ships with a million-token context, runs at a fraction of frontier pricing, and can be self-hosted inside your own infrastructure. The strategic question stopped being "can we afford a good model" and became "which good model fits this specific job, at this volume, under these privacy rules."

That last clause matters more here than in most places. For a business in the EU handling client data, the ability to run a competent model on your own server or inside a private cloud is not just a cost decision, it is a compliance one. The cost math on a self-hosted AI server (https://studiomeyer.io/en/blog/eigener-ki-server-kosten) looks very different when the alternative is shipping regulated data to a third-party API, and the models that make it viable are now good enough that the tradeoff is real rather than theoretical.

Put the two forces together, cheaper-but-hungrier tokens and a deep bench of usable models, and the winning strategy stops being a single choice and becomes an architecture. The pattern practitioners keep converging on is the cascade, and it is simple to state.
Send the high-volume, predictable 80 to 90 percent of work to a small or open or on-device model. Reserve the expensive frontier model for the hard tail that actually needs it. Done well, this captures most of the cost savings while keeping frontier reasoning available for the cases that justify it.

The dividing line is not glamour, it is task shape. Classification, extraction, routing, and short summaries are exactly what small models do well now. Microsoft's Phi-4-mini matches the quality of a far larger model on structured extraction while running in 8 gigabytes of memory. Google's Gemma 4 edge variants are multimodal and run on a phone. These are not toys, they are the right tool for the 80 percent. The frontier model earns its price on multi-step reasoning, long-document synthesis, and open-ended agent work where the inputs are wide and unpredictable and 80 percent accuracy is not good enough.

This is also why I am wary of two common reactions to the cost news. The first is "wait for prices to drop more," which misreads the paradox entirely, because your bill is driven by how many calls your design makes, not by the price of one call. The second is "just use the most expensive model for everything to be safe," which is how you turn a 2-cent task into a 20-cent one at scale for no quality gain. The discipline is matching model to job, and it is the same instinct behind treating model choice as a resilience decision (https://studiomeyer.io/en/blog/ai-model-resilience) rather than a matter of brand loyalty. The agency that picks the right model for each step, and builds metering and routing in from the start, ends up with both lower costs and a system that does not fall over when one provider changes its terms.

The cost of intelligence will keep falling, and your AI bill will keep being a real line item, and both of those will stay true at the same time. That is not a contradiction to resolve, it is the operating condition to design for.
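
The cascade described above can be sketched as a small routing function. This is a minimal illustration, not a production router: the task-type labels, the `ModelResult` shape, the stub models, and the confidence threshold are all assumptions made for the example. The escalation step, where a low-confidence small-model answer is retried on the frontier model, is an extra assumption on top of the routing by task shape that the text describes.

```python
from dataclasses import dataclass
from typing import Callable

# Task shapes the text assigns to small models; the string labels are assumptions.
SMALL_MODEL_TASKS = {"classification", "extraction", "routing", "short_summary"}


@dataclass(frozen=True)
class ModelResult:
    text: str
    confidence: float  # 0.0 to 1.0; how it is estimated is left to the caller
    model: str


# Any callable that takes a prompt and returns a ModelResult can act as a model.
ModelFn = Callable[[str], ModelResult]


def cascade(task_type: str, prompt: str, small: ModelFn, frontier: ModelFn,
            min_confidence: float = 0.8) -> ModelResult:
    """Try the small model on predictable task shapes, escalate everything else."""
    if task_type in SMALL_MODEL_TASKS:
        result = small(prompt)
        if result.confidence >= min_confidence:
            return result
        # Low confidence: fall through and pay for the frontier model.
    return frontier(prompt)


# Stub models standing in for real API or local calls (hypothetical behaviour).
def fake_small(prompt: str) -> ModelResult:
    confidence = 0.95 if len(prompt) < 200 else 0.5
    return ModelResult(text="small answer", confidence=confidence, model="small")


def fake_frontier(prompt: str) -> ModelResult:
    return ModelResult(text="frontier answer", confidence=0.99, model="frontier")


jobs = [
    ("extraction", "Pull the invoice number from this email."),
    ("agent_work", "Refactor the billing module and open a pull request."),
    ("extraction", "x" * 500),  # long input: the stub small model is unsure
]
for task_type, prompt in jobs:
    result = cascade(task_type, prompt, fake_small, fake_frontier)
    print(f"{task_type:12s} -> {result.model}")
```

The design choice worth noting is that the router decides on task shape first and model quality second, so the expensive call is only made when the task type or a weak small-model answer justifies it.
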
The teams that internalize it will build agentic products with budget caps, cascade routing, and a clear-eyed view of which model belongs on which step. The teams that wait for the technology to get cheap enough to stop thinking about cost will keep being surprised by their invoices, because the technology already got cheap and the surprise is structural.

My prediction for the back half of 2026 is that "model strategy" becomes a normal part of any serious AI build, the way "database choice" is now, and that the wrapper-tax conversation gets loud. When a customer can see that their seat of tokens costs you 2 dollars, a flat 24-dollar plan starts to look like markup, and the products that survive will be the ones that separate the value they add from the inference they pass through.

The cheap-model era did not make cost irrelevant. It moved cost from a price you look up to a decision you architect, and that is a better problem to have, as long as you actually treat it as one.

Written by Matthias Meyer of StudioMeyer, a web and AI agency on Mallorca building MCP servers, agent fleets and AI products for small and mid-size businesses. This article was originally published on the StudioMeyer blog.