{"slug": "what-is-agi-why-experts-still-disagree-on-whether-we-re-there", "title": "What Is AGI? Why Experts Still Disagree on Whether We're There", "summary": "Demis Hassabis, CEO of Google DeepMind, says artificial general intelligence is \"nowhere near\" existing, while Marc Andreessen claims AGI is already here. The disagreement stems from competing definitions of AGI, with some researchers requiring genuine understanding and causal reasoning while others measure performance against human-level economic output. The unresolved debate shapes how companies invest in AI, how governments regulate the technology, and how developers assess current models' actual capabilities.", "body_md": "# What Is AGI? Why Experts Still Disagree on Whether We're There\n\nDemis Hassabis says we're nowhere near AGI. Marc Andreessen says it's already here. Learn what AGI actually means and why the debate matters for builders.\n\n## The Disagreement That Reveals Everything About AI Right Now\n\nDemis Hassabis, CEO of Google DeepMind, has said we are “nowhere near” artificial general intelligence. Marc Andreessen has said AGI is already here. Both are serious people with deep expertise. Both can’t be right.\n\nThis isn’t a minor semantic dispute. Whether or not AGI exists — or is imminent — shapes how companies invest, how governments regulate, and how builders think about what AI tools can and can’t do reliably. If you work with AI in any capacity, understanding what AGI actually means (and why experts fight about it) gives you a clearer map of the terrain.\n\nThis article breaks down the competing definitions of AGI, where current models like Gemini and GPT-4o actually land against those definitions, and why the debate is probably going to stay unresolved for a while.\n\n## Why “AGI” Doesn’t Have a Single Definition\n\nThe phrase “artificial general intelligence” sounds precise. 
It isn’t.\n\nDifferent researchers, labs, and commentators are using different definitions when they argue about AGI — which explains why they reach different conclusions while looking at the same evidence. Here are the main frameworks in use:\n\n### The Performance Definition\n\nThe most common lay definition: AGI is a system that can perform any cognitive task a human can perform, at least at a human level.\n\nOpenAI’s charter defines it as “highly autonomous systems that outperform humans at most economically valuable work.” This is a relatively high bar, but notably, it’s framed around economic output — not consciousness, self-awareness, or general reasoning.\n\nUnder this framing, the AGI question is partly empirical: which tasks can current systems do, and at what level?\n\n### The Understanding Definition\n\nA stricter view, held by researchers like Hassabis and Gary Marcus, is that true AGI requires genuine understanding — not pattern matching on training data, but something closer to causal reasoning, planning, and the ability to operate in genuinely novel situations.\n\nCritics of current LLMs argue that even when these models produce correct outputs, they don’t “understand” the problem. They’re doing extremely sophisticated interpolation over vast training sets. Whether that distinction is philosophically meaningful or practically relevant is itself contested.\n\n### The Generality Definition\n\nSome define AGI as the ability to learn any task from minimal examples — the way a human child can see a few examples of a new game and start playing it competently. This is closer to what the ARC-AGI benchmark was designed to test: abstract reasoning with near-zero prior exposure.\n\nBy this definition, current models have made real progress but still fall short. 
OpenAI’s o3 model scored around 87% on ARC-AGI in late 2024, which was a major jump — but ARC-AGI’s creator François Chollet argued that o3 achieved this through massive compute at test time, not through the kind of efficient, flexible generalization the benchmark was designed to probe.\n\n### The Self-Improvement Definition\n\nA fourth definition focuses on recursive self-improvement: an AGI can understand its own architecture well enough to improve itself, leading to rapid capability gains. Under this view, no system today is close to AGI, because none can reliably improve their own weights or architecture.\n\nThis is the “fast takeoff” scenario that drives a lot of the existential risk concern in AI safety research.\n\n## What the Major Labs Actually Say\n\nIt’s worth looking at how the people building frontier AI describe where they are.\n\n### Google DeepMind’s Framework\n\nDeepMind published a paper in late 2023 proposing five levels of AGI: Emerging, Competent, Expert, Virtuoso, and Superhuman. Each level includes both narrow (single-task) and general variants.\n\nBy their own framework, they placed current systems — including their own Gemini models — at roughly Level 1 to Level 2 on the general track: somewhere between “emerging” and “competent” AGI. Superhuman performance already exists in narrow domains like chess and protein folding, but general-purpose superhuman capability isn’t there yet.\n\nThis is a notably conservative self-assessment from one of the leading AI labs.\n\n### OpenAI’s Position\n\nOpenAI hasn’t publicly said AGI exists yet, which matters given that their founding charter commits them to stop seeking investor profit once AGI is achieved. Sam Altman has said he expects AGI within “a few years” — but that’s a prediction, not a declaration.\n\nInternally, OpenAI reportedly uses a five-level framework similar to DeepMind’s. 
By some accounts, they classified their o1 model as Level 2 (“Reasoners”) — capable of complex reasoning but not yet approaching human-expert-level performance across all domains.\n\n### Why Marc Andreessen Disagrees\n\nAndreessen’s position is that the semantic gatekeeping around AGI is motivated reasoning — that researchers keep moving the goalposts because they don’t want to admit the milestone has been reached. His argument: if you can have a conversation with an AI about any topic, at a level that passes for competent human discourse, something meaningfully general has been achieved.\n\nThis is a coherent position, but it reflects a different definition of “general” than most AI researchers use. He’s measuring conversational breadth; they’re measuring reasoning depth, robustness, and out-of-distribution performance.\n\n## Where Current Models Actually Land\n\nSetting aside definitions, it’s useful to look at what today’s best models can and can’t do.\n\n### What Frontier Models Do Well\n\nModels like Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet can:\n\n- Write code across dozens of programming languages\n- Summarize, analyze, and synthesize large documents\n- Pass professional licensing exams (bar exam, medical boards) at high percentile scores\n- Handle multi-step reasoning in math and logic, especially when given structured prompts or chain-of-thought scaffolding\n- Generate coherent, contextually appropriate text across virtually any topic\n- Process images, audio, and video with increasing sophistication\n- Maintain context over long conversations and documents\n\nThese are not trivial capabilities. 
Five years ago, none of this was possible at production quality.\n\n### Where They Still Struggle\n\nThe failure modes of current models are equally important:\n\n- **Reliability on novel problems**: Models perform well on tasks that resemble training data and degrade on genuinely novel setups.\n- **Consistent multi-step planning**: Long-horizon planning with many interdependent steps remains unreliable without external scaffolding.\n- **Factual accuracy**: Hallucination (confidently stating false information) is still a real problem, especially in specialized domains.\n- **Causal reasoning**: Models can describe causal relationships but struggle to reliably apply causal logic in novel scenarios.\n- **Tool use and embodied tasks**: Integrating reasoning with action in the physical world (robotics) is far less developed than language tasks.\n\nThe pattern is consistent: models excel at tasks that look like sophisticated interpolation over prior knowledge and struggle with tasks that require genuine extrapolation or persistent, accurate reasoning.\n\n## Why This Debate Isn’t Just Semantic\n\nIt would be easy to dismiss the AGI debate as a terminology fight. But the definition you use has real practical consequences.\n\n### Investment and Development Priorities\n\nIf you believe AGI is close, you probably invest heavily in safety research, alignment, and governance. If you believe current systems are fundamentally limited, you focus on scaling, fine-tuning, and application development. Labs with different AGI timelines make different bets on where to put resources.\n\n### Regulation and Policy\n\nGovernments are trying to write AI regulation without knowing what they’re regulating. The EU AI Act, executive orders in the US, and emerging frameworks in the UK all try to categorize AI systems by capability and risk. 
Whether frontier models constitute “general-purpose AI” (which triggers additional oversight) depends on definitions that are currently contested.\n\n### How Builders Should Think About AI Tools\n\nFor people building on top of AI — which is the practical context for most readers — the AGI debate has a concrete implication: **current models are powerful but brittle in predictable ways**.\n\nUnderstanding where models fail helps you design around those failures. If you’re building an AI agent that needs to execute a 15-step workflow reliably, knowing that multi-step planning is a weak point tells you to break the task into smaller, verifiable steps rather than asking the model to plan everything upfront.\n\nThis is less exciting than AGI discourse, but it’s what actually determines whether your AI application works.\n\n## The Benchmarks That Shape the Conversation\n\nA big part of why experts disagree is that they’re measuring different things.\n\n### ARC-AGI\n\nThe Abstraction and Reasoning Corpus (ARC) was designed by François Chollet to test the kind of fluid intelligence that humans apply to novel problems. Each task shows a few input-output examples using colored grids, and the model must identify the underlying rule and apply it to a new input.\n\nHumans solve most ARC tasks without difficulty. Early LLMs failed badly. OpenAI’s o3 scored around 87% on the benchmark’s semi-private evaluation set in its high-compute configuration, a genuine breakthrough. But as noted above, it required massive test-time compute, and Chollet argues the method used doesn’t reflect the kind of efficient generalization the benchmark was designed to measure.\n\n### MMLU and Professional Exams\n\nMassive Multitask Language Understanding (MMLU) tests knowledge across 57 subjects including law, medicine, history, and mathematics. Current models score roughly 85–90% on MMLU, above average human performance. 
But critics note that MMLU tests recall and pattern recognition, not reasoning.\n\nSimilarly, passing the bar exam is impressive but potentially misleading. Models trained on vast amounts of legal text have seen many exam questions that look like the ones being tested. It doesn’t necessarily mean the model can reason like a lawyer on a genuinely novel case.\n\n### Chatbot Arena\n\nChatbot Arena (from UC Berkeley’s LMSYS lab) uses human preference ratings — real users voting on which model response they prefer. This is a different kind of benchmark: it measures what humans actually find useful and coherent, not formal task performance.\n\nBy this measure, frontier models from Google, OpenAI, and Anthropic cluster near the top with similar scores. The differences between them are often small and context-dependent.\n\nThe benchmarks tell different stories because they’re measuring different things. None of them is a clean proxy for AGI under any of the definitions above.\n\n## How MindStudio Fits Into This Picture\n\nUnderstanding AGI’s limits isn’t just academic — it’s practical guidance for how to build AI applications that actually work.\n\nCurrent AI models are remarkable tools with real constraints. The builders getting the most value from them aren’t treating AI as an autonomous problem-solver. They’re designing systems where AI handles what it’s genuinely good at (language understanding, generation, classification, summarization) while structure and guardrails handle the rest.\n\nThat’s exactly the design philosophy behind MindStudio. Rather than asking a single model to reason through a complex business process end-to-end, MindStudio lets you build multi-step AI workflows where each step is scoped, verifiable, and connected to real data sources and tools. 
You pick from [over 200 models](https://mindstudio.ai) — including Gemini 1.5 Pro, GPT-4o, Claude, and others — and route different tasks to whichever model handles them best.\n\nFor example, you might use Gemini for document understanding, a specialized model for structured data extraction, and a reasoning model for synthesis — all within a single automated workflow that connects to your CRM, sends emails, or updates a project board. The average build takes 15 minutes to an hour, and no code is required.\n\nThis kind of architecture — decomposing complex tasks, using the right model for each step, and building in human checkpoints where reliability matters — is how teams at companies like Microsoft, Adobe, and TikTok are using AI productively right now, without waiting for AGI.\n\nYou can try MindStudio free at [mindstudio.ai](https://mindstudio.ai).\n\n## Frequently Asked Questions About AGI\n\n### What is the difference between AGI and AI?\n\nCurrent AI systems are narrow: they’re trained for specific tasks or domains, even when those domains are broad (like “generate text about any topic”). AGI refers to a system with flexible, general-purpose intelligence comparable to a human’s — able to learn new tasks from minimal examples, reason across domains, and operate reliably in genuinely novel situations. Most researchers agree we have narrow AI; whether we have AGI depends on which definition you use.\n\n### Is ChatGPT or Gemini an AGI?\n\nBy most technical definitions used in the research community, no. Gemini, ChatGPT, and similar models are extremely capable language models that can handle a wide range of tasks — but they have well-documented failure modes in novel reasoning, long-horizon planning, and out-of-distribution problems. By some broader definitions (conversational generality, breadth of topic coverage), some would argue yes. 
The honest answer is: it depends on what definition you’re using.\n\n### What would it take to actually reach AGI?\n\nThis is contested, but common answers include: the ability to learn new tasks efficiently from just a few examples (sample efficiency), reliable causal reasoning, robust performance on genuinely novel problems (not just variations of training data), persistent memory and long-horizon planning, and potentially the ability to improve one’s own capabilities. Some researchers also include self-awareness and consciousness, though many argue those are separate questions.\n\n### Why do some people say AGI is already here?\n\nThe “already here” camp, associated with figures like Marc Andreessen, argues that current models have achieved something meaningfully general — they can converse about any topic, assist with any knowledge work, and operate across domains in a way no prior software could. The counterargument is that conversational breadth isn’t the same as general intelligence, and that the specific failure modes of current models (hallucination, poor novel reasoning, brittle multi-step planning) represent fundamental limitations, not just scaling problems.\n\n### Is AGI dangerous?\n\nThe concern isn’t current AI — it’s the potential trajectory. A system that could genuinely improve itself, set its own goals, and operate autonomously across any domain raises alignment challenges: how do you ensure it pursues goals that are actually aligned with human interests? This is the focus of AI safety research at DeepMind, Anthropic, and OpenAI. Most safety researchers argue the risk isn’t immediate but that laying the groundwork now matters because the transition, if it happens, could be fast.\n\n### What is “superintelligence” and how does it differ from AGI?\n\nAGI typically means human-level general intelligence. 
Superintelligence means intelligence that substantially exceeds human capabilities across all domains. Nick Bostrom’s influential 2014 book *Superintelligence* popularized the concept. The concern is that a system crossing the AGI threshold might rapidly self-improve toward superintelligence — the so-called “intelligence explosion.” Most researchers see this as a future risk scenario rather than an imminent reality, but it’s a significant driver of safety research funding.\n\n## Key Takeaways\n\n- **AGI doesn’t have a single agreed-upon definition**, which is the main reason experts disagree about whether we’re there. They’re measuring different things.\n- **Current frontier models** (Gemini, GPT-4o, Claude) are genuinely impressive but have well-documented weaknesses in novel reasoning, long-horizon planning, and factual reliability.\n- **The major labs’ own frameworks** place current systems at early-to-intermediate AGI levels by their own definitions, far from the full vision.\n- **The debate has real stakes** for investment, regulation, and how builders design AI systems that actually work reliably.\n- **Practical takeaway for builders**: Understanding where AI is reliable and where it isn’t is more useful than waiting for an AGI verdict. Design workflows that play to AI’s strengths and build in structure where it’s weak.\n\nFor anyone building with AI today — whether you’re automating a business process, building an AI-powered product, or experimenting with what models can do — the AGI debate is interesting context, but the actionable question is simpler: what can this model do reliably enough to deploy? 
Start there, and tools like [MindStudio](https://mindstudio.ai) can help you find out fast.", "url": "https://wpnews.pro/news/what-is-agi-why-experts-still-disagree-on-whether-we-re-there", "canonical_source": "https://www.mindstudio.ai/blog/what-is-agi-expert-disagreement-explained/", "published_at": "2026-05-29 00:00:00+00:00", "updated_at": "2026-05-29 21:29:14.500393+00:00", "lang": "en", "topics": ["artificial-intelligence", "ai-research", "ai-policy", "ai-ethics", "ai-products"], "entities": ["Demis Hassabis", "Google DeepMind", "Marc Andreessen", "Gemini", "GPT-4o"], "alternates": {"html": "https://wpnews.pro/news/what-is-agi-why-experts-still-disagree-on-whether-we-re-there", "markdown": "https://wpnews.pro/news/what-is-agi-why-experts-still-disagree-on-whether-we-re-there.md", "text": "https://wpnews.pro/news/what-is-agi-why-experts-still-disagree-on-whether-we-re-there.txt", "jsonld": "https://wpnews.pro/news/what-is-agi-why-experts-still-disagree-on-whether-we-re-there.jsonld"}}