# Gödel and Turing outlined the limits of AI

> Source: <https://www.heise.de/en/blog/How-Goedel-and-Turing-outlined-the-limits-of-AI-11313804.html>
> Published: 2026-06-03 09:54:05+00:00

# How Gödel and Turing outlined the limits of AI

Hallucinations of language models are not an engineering problem: they stem from mathematical limits proven since the 1930s.

- Golo Roden

During their studies, students encounter names like [Kurt Gödel](https://de.wikipedia.org/wiki/Kurt_G%C3%B6del) and [Alan Turing](https://de.wikipedia.org/wiki/Alan_Turing), usually with the same mix of respect and mild resignation. One reads what they have proven, accepts it as impressive, and then files the knowledge away in that mental archive located somewhere between “interesting” and “I'll never need this again.” A few years ago, I wouldn't have believed anyone who said that the incompleteness of arithmetic or the undecidability of the halting problem could one day coincide with the question of why a chatbot is currently hallucinating a book title at me.

Yet, that is precisely what has happened. Several scientific papers from the last three years show that the hallucinations of language models are not an implementation error that could be trained out with more data or a better architecture. They stem from the same theoretical limits that once shattered the most ambitious project of modern mathematics. Once you see this connection, you read the current debate about the future of AI with a much more sober perspective.

### The Promise of 1928

To understand why the research landscape of AI is the way it is today, it's worth taking a detour through the International Congress of Mathematicians in Bologna in 1928. There, [David Hilbert](https://de.wikipedia.org/wiki/David_Hilbert), one of the most influential mathematicians of his time, together with Wilhelm Ackermann, formulated [a research program](https://de.wikipedia.org/wiki/Hilbertprogramm) that was intended to place all of mathematics on a completely new foundation. Three properties were to underpin this foundation:

- Consistency, i.e., the certainty that no contradictory statements can be derived from the axioms.
- Completeness, i.e., the guarantee that every true statement within the system can also be proven.
- And decidability, i.e., the existence of a procedure by which it can be decided in a finite number of steps whether any given statement follows from the axioms.

The third requirement has gone down in history as the [Entscheidungsproblem](https://en.wikipedia.org/wiki/Entscheidungsproblem) (decision problem). Behind it lay a concrete vision. Mathematics was to become mechanisable. A machine that applies a finite number of rules to a finite number of symbols should, in principle, be able to answer any mathematical question. Those who recognize the shadow of what we now call computers in this vision are not mistaken. Hilbert conceived of the computer long before it existed.

His optimism remained unbroken. In September 1930, he gave a famous radio address in Königsberg, which he concluded with the sentence, “We must know, we will know.” It was the last great proclamation of a mathematics that still felt capable of anything. A few days earlier, at the same location, during a parallel epistemology conference, a young man named Kurt Gödel, in a casual remark, first sketched out the results that were to shake Hilbert's program to its foundations. The fact that the two events occurred so close together is a historical coincidence that one should not overemphasize symbolically. But it is certainly interesting.

### Gödel Without Formulas

What Gödel presented in 1930 and published in detail in 1931 can be explained surprisingly well without mathematical notation. His trick relies on a procedure that anyone who has seen a self-calling function will understand: self-reference. Imagine a sufficiently powerful formal system in which arithmetic statements can be formulated. Gödel showed how, within such a system, a statement can be constructed that essentially says, “This statement is not provable within the present system.”

At this point, one should pause briefly because the consequence is unavoidable. If the statement is true, then there exists a true arithmetic statement that the system cannot prove. Thus, the system is incomplete. If the statement is false, then the system proves a statement that claims to be unprovable, even though it apparently is. Thus, the system is inconsistent. There is no third way.

Anyone familiar with the liar paradox from ancient philosophy who accuses himself of lying will recognize the pattern. However, what Gödel achieved was not a philosophical game but a strictly formal proof within arithmetic itself. He showed that self-reference is not limited to common-language statements but appears in any sufficiently expressive formal system.

Gödel did not show that mathematics is broken. He showed that there are fundamental limits at which any sufficiently powerful theory is thrown back upon itself. Hilbert's vision of a mechanizeable, self-contained mathematics thus developed a crack that could no longer be mended. The consequence was later formulated as follows: Any system that has enough expressive power to talk about itself has blind spots that cannot be optimized away.

What is particularly remarkable about Gödel's result is that it does not depend on a specific theory. It applies to arithmetic, to set theory, and to any formal theory that has sufficient expressive power to describe the natural numbers along with addition and multiplication. If you change the system, you only change the guise of the limit, not the limit itself. In the decades following Gödel, numerous other theorems were proven, showing that self-reference exacts its toll in surprisingly (or frighteningly) many places. The [halting problem](https://de.wikipedia.org/wiki/Halteproblem) is just the best-known example.

### Turing Makes Gödel Practical

Six years after Gödel's proof, a paper appeared in the Proceedings of the London Mathematical Society that is now considered one of the founding documents of computer science: Alan Turing's essay “On Computable Numbers, with an Application to the Decisions Problem.” Turing had asked himself what it actually means for a procedure to be mechanically executable. For this, he devised a conceptual construct that we now call a Turing machine and used it to put an end to Hilbert's third requirement. In the same year and independently of Turing, Alonzo Church arrived at a similar conclusion via the lambda calculus.

Turing's result is known as the halting issue. It can be summarized in one sentence: There is no general procedure that can decide, for any given program and any given input, whether the program will ever terminate. This statement initially sounds harmless but has far-reaching implications. It doesn't say that we haven't found such a procedure yet. It says that such a procedure can never exist.

Anyone working in software development today lives with the consequences of this proof without being consciously aware of it every day. Static code analysis encounters fundamental limits on substantial questions about program behavior. Compiler optimizers have to use heuristics because a complete analysis of some code paths remains undecidable. Formal verification only works really well where the problem being examined is restricted to decidable fragments. We have become accustomed to dealing with the consequences of a mathematical limit without having to name it every time.

The situation is exacerbated by a result proven by Henry Gordon Rice in 1953. [His theorem](https://de.wikipedia.org/wiki/Satz_von_Rice) states that any non-trivial semantic property of a program is undecidable. This means that not only the question of termination is fundamentally open, but practically any interesting question about program behavior. Whether a certain code path will ever be executed, whether two programs are functionally equivalent, whether a program never produces a certain output: for all of these, there is no general decision procedure. What appears as tedious heuristics in professional software development is, at its core, the same limit that Hilbert hoped not to encounter in 1928.

With Turing's work, Hilbert's program in its original form was definitively over. Consistency and completeness had been shattered by Gödel, and decidability was taken home by Turing and Church. The most ambitious mathematical research program of the early 20th century had failed due to its prerequisites. What remained was the sober realization that mathematics, too, must live with limits that it did not abolish itself but rather encounters.

### The Same Wall, New Paint

This brings me back to the question that triggered this text. What does all this have to do with language models?

In January 2024, Ziwei Xu, Sanjay Jain, and Mohan Kankanhalli from the National University of Singapore submitted a paper titled “[Hallucination is Inevitable: An Innate Limitation of Large Language Models](https://arxiv.org/abs/2401.11817).” They formalize the problem in a formal world where hallucination is defined as an inconsistency between a computable language model and a computable truth function. Their central result relies on learning theory and an argument whose structure is directly related to Cantor's and Turing's diagonalization methods: language models cannot learn all computable functions. As soon as they are tasked with solving a broad spectrum of issues, they will inevitably hallucinate. There is no training method, no architectural variant, and no amount of data that can shift this limit.

A few months later, in September 2024, Sourav Banerjee, Ayushi Agarwal, and Saloni Singla presented a second, independent line of reasoning with “[LLMs Will Always Hallucinate, and We Need to Live With This](https://arxiv.org/abs/2409.05746).” They get straight to the point, explicitly relying on Gödel's first incompleteness theorem and the undecidability of the halting issue, the emptiness problem, and the acceptance problem. Their conclusion is formulated even more decisively: hallucinations cannot be eliminated through architectural improvements, data enrichment, or fact-checking because they stem from the fundamental mathematical and logical structure of these models. They introduce the concept of structural hallucination, describing the phenomenon not as an error but as a structural property.

Caution is advised at this point, as both papers use the term “hallucination” in a formally defined way that does not fully align with everyday usage. The statements do not mean that every second sentence from a language model must be false. They mean that a perfect truth machine that reliably provides correct answers in any domain is mathematically impossible. Further improvements can be expected regarding the everyday hallucination rate. However, nothing of the sort can be expected regarding its fundamental eliminability.

Specifically, this means the following: Even if a language model were trained on a perfect dataset and had an unlimited number of parameters, there would still be questions for which it would provide plausible-sounding but incorrect answers. Furthermore, Banerjee, Agarwal, and Singla show that every stage of the processing, from the compilation of training data to fact retrieval and actual text generation, has a non-zero probability of error. These errors accumulate but cannot be completely avoided at any point. This is not an empirical observation that could be refuted by better methods. It is a result of the same kind as the impossibility of building a general halting checker.

It is remarkable that both papers reach the same conclusion independently, albeit through different paths. Xu, Jain, and Kankanhalli argue via learning theory. Banerjee, Agarwal, and Singla argue via classical computability theory. It is the same wall that both groups are running into. Anyone familiar with the results of Gödel and Turing will recognize it immediately.

This also gives a different flavor to the popular demand to simply solve the problem of hallucinations with more training data or larger models. It is not wrong in the sense that larger models would not improve. It simply overlooks that scaling in this case does not address the problem described by the cited papers. A function that is not learnable does not become learnable through more parameters. A question that is undecidable does not become decidable through more data. The limit is not a matter of scale but a matter of the nature of the procedure.

### An Old Insight in New Guise

Hilbert did not completely abandon his program until the end of his life. Even after Gödel and Turing had refuted its prerequisites, he maintained the belief that the scientific spirit would ultimately be able to answer every question. Today, this attitude sounds almost touching, certainly like a historical anecdote. One can indulge in looking at it with amusement and smiling indulgently at a great mind who refused to accept a proven limit.

However, with all the amusement, one should ask oneself what the research landscape of the coming years will tell us about ourselves. We are building systems with considerable effort that absorb more and more language from the world and distribute it across more and more parameters, with the discernible hope that hallucinations will eventually disappear if one just scales long enough. The mathematics of the last 90 years suggests that this path will hit the very wall that Hilbert encountered. Not because the models are too small, but because what we hope for from them is not attainable in the chosen form.

Perhaps in a few years, it will turn out that the current approach to AI was not the right one. Perhaps not. But I can only recommend asking the question whether we are currently ignoring a proven limit for the second time. The old study materials might not be as dry as they seemed back then after all.
([mro](mailto:manuel.masiero@heise.de))
