A Stanford University study published in November 2025 introduced "intelligence per watt" (IPW) as a metric for measuring how efficiently AI inference systems convert energy into useful computation. The study tested more than 20 local language models, each with 20 billion or fewer active parameters, across 8 hardware accelerators and 1 million real-world chat and reasoning queries. The researchers found that local models can accurately answer 88.7% of single-turn queries and that IPW improved 5.3x from 2023 to 2025, driven by 3.1x accuracy gains from model improvements and 1.7x from hardware advances. A June 2026 Economic Times column by Joachim Klement frames these findings as potentially disruptive to centralized cloud AI economics.
What happened
A Stanford University team published "Intelligence Per Watt: A Study of Local Intelligence Efficiency" in November 2025 (arXiv 2511.07885), introducing IPW, defined as task accuracy per unit of power, as a standardized metric for assessing inference efficiency. The study, from Stanford's Hazy Research and Scaling Intelligence Lab, evaluated more than 20 local language models (LMs with 20 billion or fewer active parameters) across 8 accelerators and 1 million real-world single-turn chat and reasoning queries. A June 2026 Economic Times column by investment analyst Joachim Klement frames the findings as evidence that centralized cloud AI economics face structural pressure from improving local inference efficiency.
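To make the definition concrete, here is a minimal Python sketch of IPW as accuracy divided by power. It assumes accuracy is expressed as a fraction between 0 and 1 and power as mean draw in watts. The function name `intelligence_per_watt` and the sample numbers are hypothetical illustrations, not code or measurements from the paper.

```python
def intelligence_per_watt(accuracy: float, mean_power_watts: float) -> float:
    """Task accuracy per unit of power (hypothetical helper, not the paper's code)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if mean_power_watts <= 0:
        raise ValueError("mean_power_watts must be positive")
    return accuracy / mean_power_watts


# Made-up numbers for illustration only: two systems with equal accuracy
# but different power draw.
system_a = intelligence_per_watt(accuracy=0.80, mean_power_watts=40.0)
system_b = intelligence_per_watt(accuracy=0.80, mean_power_watts=80.0)
print(system_a > system_b)  # the lower-power system scores the higher IPW
```

Under this definition, halving power draw at constant accuracy doubles IPW, which is why the metric can improve through either model or hardware changes.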
Key findings
The Stanford preprint reports three main findings. First, local LMs can accurately answer 88.7% of single-turn chat and reasoning queries; when each query is routed to the best local model (a best-of-local ensemble), local routing outperforms cloud routing on 3 of 4 benchmarks evaluated against Gemini 2.5 Pro, Claude 4.5 Sonnet, and GPT-5. Second, IPW improved 5.3x from 2023 to 2025, driven by 3.1x accuracy gains from model innovations (architecture, pretraining, post-training, distillation) and 1.7x from hardware advances. Third, local accelerators such as the Apple M4 Max currently achieve 1.5x lower IPW than enterprise-grade accelerators such as the Nvidia B200 running identical models, indicating meaningful headroom for local hardware optimization. ChatGPT telemetry cited in the paper shows that 77% of requests are practical guidance, writing, or information-seeking tasks that may not require frontier-level capabilities.
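The 5.3x headline figure is consistent with the two reported factors if they are assumed to compose multiplicatively. The short check below uses only the numbers quoted above; the multiplicative reading is an assumption made for illustration, not a statement of the paper's exact accounting.

```python
# Reported factors from the study; treating them as multiplicative is an assumption.
model_gain = 3.1      # accuracy gain attributed to model improvements
hardware_gain = 1.7   # gain attributed to hardware advances

combined_gain = model_gain * hardware_gain
print(f"combined IPW gain: {combined_gain:.2f}x")  # about 5.27x, which rounds to 5.3x
```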
Context and significance
The paper explicitly frames the shift as analogous to the historical transition from mainframe time-sharing to personal computing, in which performance-per-watt gains enabled compute to be redistributed to personal devices even though PCs did not surpass mainframes in raw power. Klement's Economic Times column extends this framing, arguing that improving local model efficiency could compress margins on cloud AI inference over time. The study also reports that from 2023 to 2025, local query coverage (the share of real-world queries local LMs can handle accurately) rose from 23.2% to 71.3%.
Scope and limitations
The Stanford study covers single-turn mainstream chat and reasoning queries. It does not benchmark agentic tasks, tool use, web navigation, long-horizon planning, or long-document processing, where local LMs lag frontier models by up to 45 percentage points, per the authors' explicit note. Software-based energy measurement may introduce inaccuracies of 10-15%, per the paper's methodology section. Practitioners should treat the 88.7% coverage figure as applicable to the specific query distribution studied, not to all LLM workloads.
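One way to read the measurement caveat is as a band around any single IPW value. The sketch below assumes the inaccuracy applies symmetrically to the measured power and takes the upper figure of 15%. The helper `ipw_bounds` and the sample inputs are hypothetical, and the paper may characterize the error differently.

```python
def ipw_bounds(accuracy: float, measured_power_watts: float,
               relative_error: float = 0.15) -> tuple[float, float]:
    """Return (lowest, highest) IPW consistent with a symmetric power error.

    Hypothetical helper: assumes true power lies within
    measured_power_watts * (1 +/- relative_error).
    """
    if measured_power_watts <= 0 or not 0.0 <= relative_error < 1.0:
        raise ValueError("power must be positive and relative_error in [0, 1)")
    low_power = measured_power_watts * (1.0 - relative_error)
    high_power = measured_power_watts * (1.0 + relative_error)
    # Higher true power means lower IPW, and vice versa.
    return accuracy / high_power, accuracy / low_power


# Made-up inputs for illustration only.
lowest, highest = ipw_bounds(accuracy=0.80, measured_power_watts=40.0)
print(f"IPW between {lowest:.4f} and {highest:.4f} per watt")
```

Under these assumptions, two systems whose measured IPW values differ by less than this band may not be distinguishable from software-based measurements alone.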
What to watch
Observers should monitor adoption of IPW as a model and hardware evaluation metric, expansion of benchmarking to agentic and long-context tasks, and commercial announcements combining local and cloud inference in hybrid routing architectures. The DeepLearning.AI newsletter and Hazy Research blog have covered the study; independent replication and extension to enterprise workloads will be the next meaningful evidence milestones.
Scoring Rationale
The Stanford IPW study introduces a new metric and provides large-scale empirical evidence that local LMs can handle 88.7% of real-world single-turn queries with rapidly improving efficiency, a finding directly relevant to practitioner deployment choices and cloud infrastructure economics. The primary ingested source is an opinion column; the underlying Stanford research is a well-constructed preprint with clear methodology.