{"slug": "gpt-5-5-hallucinates-3x-more-than-mit-licensed-glm-5-2", "title": "GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2", "summary": "New benchmarks reveal that larger AI models like GPT-5.5 and DeepSeek V4 Pro hallucinate significantly more than smaller open-weight models, with GLM-5.2 achieving a 28% hallucination rate compared to GPT-5.5's 86%. The findings challenge the industry's focus on scaling parameters and training data, as models with fewer parameters demonstrate superior factual accuracy and logical reasoning.", "body_md": "# Bigger models are not the way\n\nJun 18, 2026\n\nA shift is happening among major AI labs, who are becoming increasingly skeptical of endless parameter count and training data scaling.\nThe limits of this paradigm were put on the world’s stage when [Claude Fable 5 was restricted](https://www.anthropic.com/news/fable-mythos-access)\nby the US government just three days after its release, marking the first US AI ban stemming from national security. One of the biggest\nmodels in the world was banned because a single jailbreak was too much of a risk.\n\n## Bigger is better\n\nThe above is true in almost all cases. The biggest models in the world clearly score the highest on the Artificial Analysis Intelligence\nIndex. Yet, Z.ai’s newest, GLM-5.2 (753B parameters, roughly 40B active), [comes within just 4 points of GPT-5.5 and 9 points of Fable 5](https://artificialanalysis.ai/evaluations/artificial-analysis-intelligence-index#artificial-analysis-intelligence-index-score).\nOpus 4.8 and GPT-5.5 are proprietary and estimated to be in the 1-2T parameter range conservatively. 
If an open-weight ([MIT licensed](https://huggingface.co/zai-org/GLM-5.2/blob/main/LICENSE))\nLLM can come so close to a closed-weight model estimated to be 1.5 to 2 times bigger, it is clear that actual intelligence has plateaued\nsignificantly.\n\n## Bigger is not better\n\nIt’s been proven that when a model is trained on large volumes of highly factual and non-theoretical data, it learns to always have an\nanswer. DeepSeek V4 Pro (1.6T params, 49B active, 44 AA Intelligence Index score) [has a ludicrous 94% hallucination score](https://artificialanalysis.ai/evaluations/omniscience#omniscience-hallucination-rate-tabs)\non the AA-Omniscience benchmark, meaning that on questions it couldn’t figure out, it stated that it didn’t know only around 6% of the\ntime, and the rest of the time it confidently hallucinated an answer. GLM-5.2 scored a 28% hallucination rate, Opus 4.8 was 36%, Fable 5 was 48%, and\nGPT-5.5 was 86%.\n\nThat seems incredibly rough for such a huge, popular model. Let’s test it with a relatively complex Python question with a clear architectural\nflaw.[1](#user-content-fn-bench-details)\n\nDeepSeek V4 Pro used almost 10 times the reasoning tokens yet produced a confidently incorrect response. On the other hand, it took GLM-5.2 just 12 seconds and about 800 reasoning tokens to recognize the technical impossibility of a single-threaded task executing multiplexed I/O without ever yielding or utilizing system polling. (For non-technical readers, this is like asking a delivery driver to drop off packages at 3 houses at the same time without ever stopping the truck.)\n\nGPT-5.5 and DeepSeek V4 Pro are two of the clearest hallucination leaders, despite being absolutely huge. Because of their immense size, they simply did not learn how to say “I don’t know” or recognize intricate logical and technical fallacies. 
While it is true that a multi-trillion-parameter model will always beat a lightweight consumer model on paper (today at least), the commoditization of these huge models is blurring the line between benchmark performance and actual real-world truthfulness and accuracy.\n\n## The trilemma of modern AI\n\nWe should be very cautious about blindly increasing reasoning budget, corpus size, or parameter count. DeepSeek V4 Pro spent 3 minutes and 26\nseconds wasting compute in a reasoning loop ([raw reasoning here](/bigger-models/deepseek-reasoning.txt)) just to generate a beautifully structured,\nconfidently incorrect solution. Yet a model roughly half its size identified the paradox almost instantaneously. Even in today’s era as we near AGI,\nmany of the biggest models will actively convince you that a solution is correct and that the problem was solvable as stated.\n\nMoving forward, the industry cannot continue to train bigger and bigger models, since their intelligence not only plateaus but often gets worse. This applies to consumers too, since we cannot continue to select models based on size or theoretical performance alone. Training and selection of AI need to be designed around the unsolved trilemma of modern LLMs: raw capability, uncertainty calibration/hallucination rate, and computational efficiency.\n\n## Footnotes\n\n1. Both models were given “high” reasoning effort, temperature 1, tested on OpenRouter, with the following system prompt: “You respond professionally. 
You are a highly capable coding assistant well-versed in Python.” GLM-5.2 was served by Z.ai (FP8 precision) and DeepSeek V4 Pro was served by Baidu Qianfan (FP8 precision).\n\n[↩](#user-content-fnref-bench-details)", "url": "https://wpnews.pro/news/gpt-5-5-hallucinates-3x-more-than-mit-licensed-glm-5-2", "canonical_source": "https://arrowtsx.dev/bigger-models/", "published_at": "2026-06-19 16:11:25+00:00", "updated_at": "2026-06-20 07:36:11.493972+00:00", "lang": "en", "topics": ["large-language-models", "ai-research", "ai-safety", "ai-ethics"], "entities": ["Z.ai", "GLM-5.2", "GPT-5.5", "DeepSeek V4 Pro", "Claude Fable 5", "Opus 4.8", "Anthropic", "Artificial Analysis"], "alternates": {"html": "https://wpnews.pro/news/gpt-5-5-hallucinates-3x-more-than-mit-licensed-glm-5-2", "markdown": "https://wpnews.pro/news/gpt-5-5-hallucinates-3x-more-than-mit-licensed-glm-5-2.md", "text": "https://wpnews.pro/news/gpt-5-5-hallucinates-3x-more-than-mit-licensed-glm-5-2.txt", "jsonld": "https://wpnews.pro/news/gpt-5-5-hallucinates-3x-more-than-mit-licensed-glm-5-2.jsonld"}}