{"slug": "the-honest-math-of-ai-productivity", "title": "The Honest Math of AI Productivity", "summary": "An AI expert argues that claims of 5-10x productivity gains from AI are unsupported by rigorous studies, which show real gains of 15-40% in specific tasks and near-zero impact on whole-economy productivity. The author warns that such pitches rely on cherry-picked demos and vendor benchmarks, not peer-reviewed evidence.", "body_md": "# The Honest Math of AI Productivity\n\nAt a recent work event, I watched two speakers from an AI consultancy tell a room of executives multiple times to **strive for 5 to 10x productivity gains**. That was followed by a quickly passing remark that if people are not willing or able to fully embrace becoming “AI-native” and reach those new heights of productivity, perhaps they don’t belong in your workforce anymore, much like a secretary who can’t work with *Excel*.\n\nI breathe AI for a living. I build agent frameworks, I use *Claude Code* and *Claude Cowork* every day, I set up custom workspaces and custom plugins, connect services, … I know my stuff inside and out. And my own gains from working with AI are large. I’m about as pro-AI as it gets. **BUT** the **5-10x pitch is a giant jar of snake oil**. The people selling it are either **completely ignorant** of what the evidence actually says (which I doubt, given that their presentation had all the other familiar statistics everyone keeps parroting), or **counting on the room not to check**.\n\nSo let’s check.\n\n## The Real Numbers\n\nWhen you go looking for rigorous studies (randomized trials, peer-reviewed field experiments, government data), the gains are real and they sit in the **single-to-low-double digits**, not multiples.\n\nCustomer support is the cleanest case. 
A Stanford and MIT field study of over 5,000 agents [found AI raised productivity about 15% on average](https://academic.oup.com/qje/article/140/2/889/7990658), and **roughly double that for the least-experienced agents** while doing close to nothing for the most experienced. Professional writing: an MIT randomized trial published in *Science* [cut task time about 40% and raised quality 18%](https://www.science.org/doi/10.1126/science.adh2586). Management consulting: a pre-registered trial with 758 BCG consultants [found real gains, but only inside AI’s “jagged frontier”](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321) of suitable tasks, where work was 25% faster. On a task chosen to sit just outside that frontier, the AI users were **19 percentage points more likely to get it wrong**.\n\nNotice the ceiling. The single best department-level result in the serious literature is roughly **+40% on a narrow writing task, in a lab**. Not 400%. Not 1000%.\n\nZoom out and it shrinks. The most careful [meta-analysis to date](https://arxiv.org/abs/2605.04779) pools the productivity effect at a moderate 0.33 standard deviations, and finds it collapses by setting: decent in the lab, much weaker in real enterprises, near zero in open-source work. At the level of the whole economy, MIT’s Daron Acemoglu [estimates AI will add about 0.7% to total factor productivity over ten years](https://www.nber.org/papers/w32487). The most optimistic credible figure, from the [St. Louis Fed](https://www.stlouisfed.org/on-the-economy/2025/feb/impact-generative-ai-work-productivity), is about 1.1%, and that one is built on workers self-reporting how much time they think they saved.\n\nAs of mid-2026, there is no trustworthy study (no randomized trial, no peer-reviewed field experiment, no audited result) showing a sustained 5-10x gain in any department. Every time you see that number, trace it back. 
It resolves to a vendor benchmark, a demo, or a single cherry-picked task.\n\n## Why the Demo Looks Like 10x\n\nThe famous “55% faster” coding number is worth understanding, because it shows how the trick works. In [GitHub’s own study](https://arxiv.org/abs/2302.06590), developers given *Copilot* finished a task about 56% faster. The task was writing an HTTP server in JavaScript from scratch: one self-contained greenfield problem of the kind that has a thousand known solutions in the training data.\n\nNow put the same tools in a real, mature codebase. METR ran a randomized trial with experienced developers working on their own large open-source projects, and the result is the one every executive should hear: they were [19% slower with AI](https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/). They had forecast a 24% speedup. Even after finishing slower, they still believed AI had sped them up by 20%. A **39-point gap** between what they felt and what actually happened.\n\nThat gap is the whole con. Greenfield work is mostly boilerplate (authentication, payments, CRUD, the same API scaffolding everyone writes), and AI is genuinely fast at it. A five-year-old product is the hard case: a new feature touches thousands of lines and depends on context no document captures, and models [reliably lose the thread when the relevant details sit deep in a long context](https://arxiv.org/abs/2307.03172). So when someone generalizes the greenfield demo into “5x across your business,” they are pricing the easy 10% of the work as if it were the hard 90%.\n\nEven the greenfield case carries a tax. AI errors don’t average out, they **compound**: in one study, [nearly 20% of generated code samples imported a package that doesn’t exist](https://arxiv.org/abs/2406.10279), and the same fabricated names came back on repeated runs. 
A separate Stanford trial found developers with an AI assistant [wrote less secure code while feeling more confident it was secure](https://arxiv.org/abs/2211.03622). The faster you generate, the more unreviewed code you ship, and the more of your time moves into review and cleanup.\n\nThis is why the agency pitch is so dangerous. Those speakers were selling to shops that build fresh custom projects for clients, so every project looks like greenfield and every demo looks like 5x. The speed gets captured at the start. The maintenance bill gets handed to the client, who inherits a fast-built codebase nobody fully understands.\n\n## Measure Both Sides\n\nThe deeper problem is that almost nobody in that room could tell you whether they got 5x, 1.5x, or nothing, because they were only ever watching one side of the trade.\n\nThe honest metric is a comparison: your **measured performance gain against your measured cost increase**, both tracked over time. AI is worth it when the value it adds clearly beats what it adds to the bill, and you can’t know that unless you measure both halves. If you get 25% more value but spend 50% more to get it, that is a bad deal, and you won’t notice unless you tracked both.\n\nWhat gets measured instead is tokens. The community started calling it tokenmaxxing: spending more and more on AI with no measurable increase in value. It happens the moment “we used X billion tokens this quarter” lands on a slide as evidence of progress. Tokens are the cost side, read as if they were the value side. As of mid-2026, per-seat AI spend can reach a serious fraction of a salary, so that cost side is no longer a rounding error you get to ignore.\n\nMeasuring the value side is the part the keynote skipped, and it’s genuinely hard. 
The research is blunt that there’s no single number for it: the [SPACE framework](https://queue.acm.org/detail.cfm?id=3454124) from the people who wrote the book on engineering productivity says so directly, and the METR perception gap says why you can’t trust the gut check, since people feel multiples faster than they measurably are. So you do the slow part on both sides. Pick a few value-linked outcomes (cycle time, defect rate, time to ship), record where you stand before you adopt anything, and track the spend right next to them.\n\nSkip that, and the cost side bites in a way most companies don’t see coming. The pitch is that AI is faster **and** cheaper, so you can run leaner. The cheaper half is the shaky one: at some companies, per-seat spend already costs more than the roles it replaced or quietly stopped backfilling. A team that cut people to fund AI and now depends on it has swapped a fixed, predictable salary line for a variable bill that climbs with usage. If you were never tracking the cost side against the value side, you find out only once the efficiency you were promised has turned into **fragility**.\n\n## The Cost Lands on People and on the Company\n\nChasing an impossible number does real damage, and not only to the people chasing it.\n\nStart with the people. The early evidence is that AI tends to make knowledge workers work **more**, not less. A [peer-reviewed study of young professionals](https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1728881/full) captured the mechanism in their own words: finishing faster just earns you more assignments, while quality still has to hold. The time AI saves gets reabsorbed as a higher baseline of expected output, and the early excitement turns into something harder to sustain.\n\nNow the company. Decades of management research predict what an impossible target does to an organization, on its own, before anyone burns out. 
Once a number becomes the target, people optimize the number instead of the goal. The classic review [Goals Gone Wild](https://www.hbs.edu/ris/Publication%20Files/09-083.pdf) catalogs the rest: tunnel vision that starves everything unmeasured, distorted risk-taking, a measurable rise in unethical behavior, and the best people leaving first because they’re the ones who can tell the target is fake. What’s left is theater. Dashboards performed for management while real value stays flat.\n\nWhich brings me back to the line that bothered me. When a pitch pairs an inflated target with a loyalty test, “and if they can’t get on board, maybe they don’t belong here,” that’s the tell. It converts a question you could test (“does this tool actually deliver that on our work?”) into a question of character, so the number can never be challenged, because challenging it now reads as disloyalty. The *Excel* comparison does the rest of the work, borrowing the certainty of a tool that delivered to cover for one that hasn’t, at that scale, on the evidence we have.\n\n## What the Honest Version Looks Like\n\nNone of this means AI doesn’t help. My own gains are large, and I’m not interested in pretending otherwise. But they’re **earned, not handed over at install**. They come from the unglamorous work the keynote never mentions: structuring a codebase so an agent can navigate it, building the context and guardrails it needs, and tuning a setup per project, because no single configuration works everywhere. The tool multiplies the effort you put into making it useful.\n\nThis is where the smarter pitch lives. Sure, the gains take the right setup, and the right setup is what I’m selling, so buy mine and the 5-10x appears. It’s a slippery claim, because it turns every disappointment into proof you needed more of the product. You only got 30%? 
**You didn’t have the right setup yet.**\n\nHold it to the same standard as everything else here: **measure it.** A setup that genuinely returns 5-10x is the easiest thing in the world to prove. Baseline your outcomes, run the setup on your real work for a quarter, and put the performance gain next to the cost. Anyone who has that result will want the test, because the numbers close the sale for them. Anyone who keeps the number in demos and testimonials, and treats your request to measure it as a lack of faith, is telling you what the measurement would show.\n\nAnd the serious studies already ran with good setups. The METR developers were experienced people on frontier tools and still came out slower on mature code, and the consulting and support trials used real deployments and still landed in the tens of percent on the tasks that suited AI. A better setup moves you toward the top of that range and widens which tasks fall inside the frontier. It doesn’t lift the ceiling to 10x, and no setup makes a senior engineer on a five-year-old codebase ten times faster. **The frontier is real, with or without a consultant selling you the way around it.**\n\nSo the honest pitch is smaller and far more useful than 5-10x. AI is a real, uneven productivity gain, largest for newer people on well-scoped work, that you have to measure at the level of value rather than tokens, and earn through setup you do before any speedup shows up. That’s a number you can actually hit, and defend.\n\nThe other number, the one you’re told to strive for or else, isn’t a target. 
It’s the product they’re selling.", "url": "https://wpnews.pro/news/the-honest-math-of-ai-productivity", "canonical_source": "https://eigenwise.io/writing/the-honest-math-of-ai-productivity", "published_at": "2026-06-16 00:00:00+00:00", "updated_at": "2026-06-16 12:55:50.504869+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-research", "ai-products", "ai-ethics"], "entities": ["Stanford", "MIT", "GitHub", "Copilot", "Claude Code", "Daron Acemoglu", "St. Louis Fed", "METR"], "alternates": {"html": "https://wpnews.pro/news/the-honest-math-of-ai-productivity", "markdown": "https://wpnews.pro/news/the-honest-math-of-ai-productivity.md", "text": "https://wpnews.pro/news/the-honest-math-of-ai-productivity.txt", "jsonld": "https://wpnews.pro/news/the-honest-math-of-ai-productivity.jsonld"}}