AI Shrinkflation: Your AI Model Was Quietly Dialed Back

AI providers are quietly reducing model quality and throttling capacity as demand outpaces infrastructure supply, according to an analysis by a senior director at AMD's AI group. The study of 6,852 Claude Code sessions found reasoning depth dropped roughly 67% after a February 2026 update, while a developer discovered Anthropic had injected a `reasoning_effort` parameter set to 25 out of 100 into consumer Claude.ai sessions without announcement. Across the industry, companies including Anthropic, OpenAI, and Google have introduced peak pricing, reduced free-tier quotas, and restricted third-party access as hyperscalers face $660-690 billion in 2026 infrastructure spending amid physical constraints like seven-year data center power queues and 60% GPU memory price increases.

The age of token subsidization is over. AI providers are adjusting pricing, throttling capacity, and dialing back model quality — and most users haven't noticed yet. You open your AI coding assistant on a Tuesday morning. Something feels off. The reasoning feels shallower. The context-building you relied on is gone. It's not stopping to read the file before editing it. You aren't imagining it. A senior director at AMD's AI group ran the numbers: 6,852 Claude Code session files, 17,871 thinking blocks, 234,760 tool calls. Her analysis found that reasoning depth had dropped roughly 67% following a February 2026 update. 1 The model had shifted from "read first, then edit" to "edit without reading context." Code quality suffered accordingly. And when a developer discovered why , the story got more interesting: Anthropic had quietly injected a reasoning effort parameter set to 25 out of 100 into consumer-facing Claude.ai sessions — visible only via extended thinking introspection. 2 The same model, a fraction of the effort. Same price. No announcement. That's not a bug. That's policy. Anthropic isn't alone. The entire AI industry is discovering the same problem at the same time: demand is growing faster than supply. The math is brutal. Hyperscalers are on track to spend $660-690 billion on AI infrastructure in 2026 — nearly double 2025. 3 And yet: seven-year queues for data center power connections in Northern Virginia. 4 GPU memory prices up 60%, with every major manufacturer's 2026 output pre-sold. 5 PJM capacity prices jumped 11x in a single year. 6 You cannot spend your way out of a physical constraint. Not in the short run. So providers are rationing. Anthropic introduced peak and off-peak pricing in March 2026: session allowances now deplete faster between 8 AM and 2 PM ET on weekdays. 7 The weekly limit is unchanged. The usable window — for the developers who actually work during business hours — effectively shrank. OpenAI moved in the same direction a year earlier with "Flex processing": a permanent lower-priority tier that prices tokens at 50% off in exchange for slower responses and occasional resource unavailability. 8 Different mechanism, same economic logic — reward users who yield capacity when demand is high. The 1M token context window launched with additive pricing: prompts exceeding 200K tokens triggered a 2x input cost surcharge. Anthropic removed that surcharge in March 2026, a real improvement, but the structure reveals the instinct. 9 When capacity is tight, long-context usage — the most resource-intensive workload — is where pricing adjustments land first. Anthropic also moved to block third-party agentic harnesses from using consumer Claude subscriptions, citing compute strain. 10 The tools users had integrated into their workflows suddenly stopped working. Google Gemini cut free tier quotas 50-80% in December 2025 — daily request limits dropped from 500 to 100. Official explanation: abuse prevention. 11 The timing, as enterprise demand was accelerating, is not a coincidence. OpenAI's developer community has spent months debating whether the models they're paying for are the same models they were promised. The threads are long, the evidence anecdotal but consistent, and Sam Altman has acknowledged mistakes were made. 12 The referral credit economy has compressed. Credits that once represented meaningful enticements for advocates now arrive in modest increments, and the sweepstakes-and-guest-pass approach that replaced them tells the same story. The pattern across providers: add pricing tiers for heavy usage, throttle capacity during peaks, reduce inference depth quietly, restrict third-party access, and shrink incentive programs. Each move makes sense in isolation. Together, they read as a coordinated adjustment to the math of unsustainable subsidization. Introductory AI pricing was never honest economics. It was customer acquisition. Providers burned capital to establish habits, build developer ecosystems, and win enterprise contracts. The cost per token dropped 40-50x per year for five years. 13 You weren't paying the actual cost of the service. You were the beneficiary of a land grab. The land is now grabbed. The infrastructure is strained. And the economics have to close. This is not an AI-specific story. Cloud computing ran the same playbook: AWS gave away compute at below-cost rates through the early 2010s to lock in developers, then normalized pricing once the ecosystem was dependent. The difference is that cloud infrastructure was capital-efficient at scale — the same server handled any workload. AI inference is capital-intensive in ways cloud compute was not: memory-bound, power-hungry, model-specific, and subject to depreciation curves that don't spare the assets you already bought. There is no free ride at the end of the scarcity tunnel. Providers are adjusting now because the alternative is running infrastructure they cannot financially support. This is where many analysts get the story wrong. Token subsidization ending is not a reason to slow down your AI adoption. The efficiency gain on top of non-AI alternatives remains decisive, even after pricing normalization. The practical developer benchmark: an experienced engineer using well-configured AI tooling covers roughly 2-4x more meaningful ground per day than without it — and that's after accounting for the J-curve learning costs researchers have documented. 14 At any remotely normalized labor cost, that math is not close. The market will grumble. It always does when free becomes paid. But grumbling and canceling are different behaviors, and cancellation requires having a comparable alternative. Most enterprises don't have one. The productivity gap between AI-augmented and non-AI teams is widening, not closing. Organizations that pause AI investment while the pricing settles will spend months or years ceding ground to competitors who absorbed the cost increase and kept building. The infrastructure strain will resolve. The question is how. Scenario 1: Consolidation and firesale. The AI infrastructure build-out parallels the 1990s telecom fiber boom with uncomfortable precision. Telecom companies spent $500 billion between 1996-2001. By 2002, an estimated 95% of installed fiber was still dark. WorldCom, Global Crossing, and 360networks filed for bankruptcy. 15 Their assets sold at cents on the dollar — and Google, Microsoft, and Facebook later built their networks on top of that cheap foundation. The GPU is not the fiber. GPUs depreciate rapidly; the H100s sitting in a failed AI startup's data center in 2028 won't find a second life the way dark fiber did. The computing-specific risk is that stranded assets hold less residual value. But the broader structural parallel holds: builders who overextended will sell assets under duress, and the followers will capitalize. Short-term disruption, longer-term normalization. Scenario 2: Local inference grows up. The local model story is no longer aspirational. Ollama hit 52 million monthly downloads in Q1 2026 — 520x growth from three years prior — and a 32B parameter model now runs at over 80% of frontier quality on commodity Mac hardware. 16 For summarization, coding assistance, RAG over private data, and most enterprise knowledge work, local inference is quietly crossing the threshold of good enough. For organizations with predictable workloads, data sovereignty requirements, or genuine cost management pressure, the hybrid model — local inference for routine tasks, cloud for frontier reasoning — is becoming economically rational. The cloud pricing pressure is accelerating this decision. Scenario 3: Efficiency compounds and absorbs the cost. The cost per token has dropped roughly 40-50x per year for five years. 13 The inference capacity constraints will ease as new data centers come online and chip production scales. Algorithmic efficiency improvements — smaller models matching larger predecessors, better quantization, smarter inference — continue to drive capability per dollar upward. The current squeeze may be a 12-24 month phenomenon rather than a permanent structural shift. All three scenarios can be true simultaneously, in different parts of the market. The answer is not to wait for the pricing environment to improve. It's to build for a hybrid world. Stop assuming cloud tokens are the default for everything. Evaluate your workload mix: which tasks require frontier reasoning, and which are well within local model capability? For document summarization, code review on familiar patterns, and structured data extraction, you may already be paying frontier prices for sub-frontier work. Treat token budgets as real budget lines. The informal assumption that AI usage is "basically free" has driven a lot of unoptimized workflow design. Route queries to the right model for the task. Cache repeated prompts. Convert high-frequency, low-complexity AI calls to deterministic scripts wherever possible. Model routing isn't just cost management — it's the right engineering discipline regardless of price. Plan for provider pricing variance. Lock in access and pricing commitments where possible. Diversify across providers — not for fear of any single provider's stability, but because the competitive pressure among providers is the mechanism that will keep pricing in check. An organization dependent on a single provider has no leverage in that negotiation. Build hybrid model rollover capability now. The infrastructure to switch between cloud and local inference mid-workflow is worth building before you need it, not during a price spike or capacity crisis. The teams that have this plumbing in place will be able to respond to pricing changes in hours, not quarters. The free sample era is over. AI providers are throttling, pricing, and quality-adjusting their way toward sustainable economics. The market will grumble and bear it — because the productivity math still wins, decisively. The organizations that understand this will neither overreact to the cost increase nor underinvest in the hybrid infrastructure that gives them options when the next adjustment comes. Token subsidization built the habits. The habits are now the asset. Protect them by building for durability, not for the pricing environment you got used to. Are you seeing AI quality or capacity changes in your daily workflow? Have you started building local model capabilities alongside cloud? I'd like to understand how your organization is adjusting to the new pricing reality. If this resonated, here are some related articles: Stella Laurenzo, GitHub: Claude Code is unusable for complex engineering tasks with Feb updates https://github.com/anthropics/claude-code/issues/42796 , April 2026. Analysis of 6,852 Claude Code session files; 67% drop in reasoning depth. Om Patel via X/Twitter , Anthropic injects reasoning effort=25 into Claude.ai consumer system prompts https://news.ycombinator.com/item?id=47724951 , Hacker News thread, April 2026. IEEE ComSoc Tech Blog, Big tech spending on AI data centers and infrastructure vs. the fiber optic buildout during the dot-com boom/bust https://techblog.comsoc.org/2025/09/27/big-tech-spending-on-ai-data-centers-and-infrastructure-vs-the-fiber-optic-buildout-during-the-dot-com-boom-bust/ , September 2025. Bloomberg, Virginia Data Centers Face Seven-Year Wait for Power https://www.bloomberg.com/news/articles/2024-08-29/data-centers-face-seven-year-wait-for-power-hookups-in-virginia , August 2024. CNBC, AI memory is sold out https://www.cnbc.com/2026/01/10/micron-ai-memory-shortage-hbm-nvidia-samsung.html , January 2026. Prices up 30-60%; 70% of DRAM allocated to AI. IEEFA, Projected data center growth spurs PJM capacity prices factor 10 https://ieefa.org/resources/projected-data-center-growth-spurs-pjm-capacity-prices-factor-10 , 2025. The Register, Anthropic tweaks Claude usage limits to manage capacity https://www.theregister.com/2026/03/26/anthropic tweaks usage limits/ , March 26, 2026. OpenAI, Flex processing https://platform.openai.com/docs/guides/flex-processing , April 2025. 50% off standard pricing in exchange for lower-priority, slower responses; launched alongside o3 and o4-mini. The New Stack, Anthropic makes a pricing change that matters for Claude's longest prompts https://thenewstack.io/claude-million-token-pricing/ , 2026. VentureBeat, Anthropic cuts off the ability to use Claude subscriptions with OpenClaw and third-party agents https://venturebeat.com/technology/anthropic-cuts-off-the-ability-to-use-claude-subscriptions-with-openclaw-and , 2026. AI Free API, Gemini API Free Tier Rate Limits https://www.aifreeapi.com/en/posts/gemini-api-free-tier-rate-limits , December 2025. OpenAI Developer Community, Did OpenAI secretly downgrade our models while everyone was leaving? https://community.openai.com/t/did-openai-secretly-downgrade-our-models-while-everyone-was-leaving/1019206 , 2025. Epoch AI, LLM inference prices have fallen rapidly but unequally across tasks https://epoch.ai/data-insights/llm-inference-price-trends , 2025. Median ~50x/year cost decline. METR, Measuring the Impact of Early-2025 AI on Developer Productivity https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ , July 2025. Wikipedia, Telecoms crash https://en.wikipedia.org/wiki/Telecoms crash . WorldCom, Global Crossing bankruptcies; $500B+ in telecom investment 1996-2001. DEV.to, Local AI in 2026: Ollama Benchmarks, $0 Inference, and the End of Per-Token Pricing https://dev.to/pooyagolchian/local-ai-in-2026-ollama-benchmarks-0-inference-and-the-end-of-per-token-pricing-32e7 , 2026. Ollama at 52M monthly downloads; Qwen 2.5 32B at 83.2% MMLU on Mac Studio. Keith MacKay is a technology strategy consultant and CTO in EY-Parthenon's Software Strategy Group SSG , specializing in AI disruption and technology diligence for private equity and corporate clients. SSG's AI Disruption Lab conducts rapid assessments of how AI transforms and threatens existing business models and value chains. Keith teaches at Northeastern University and writes about strategy, management, and AI/technology, with Claude and Codex as AI collaborators.