The Slogan Strikes Again

A LessWrong article argues that compression is a key driver of intelligence in large language models, using mathematical structure as an example. The author suggests that AI models like Claude Mythos can learn to make abstractions, which is essential for future AI research capabilities.

One of the slogans heard quite frequently in circles educated in information theory is: Compression is Intelligence. You aren't supposed to take this too literally. The point is that it is a useful intuition about why, for example, we might expect large machine learning models to possess something that looks to us like intelligence with respect to their training domain.

The unreasonable effectiveness of LLMs is perhaps the foremost embodiment of the slogan today. One can view the training goal of the LLM as the compression of the human text corpus into its weights. If the model were sufficiently large, it could simply memorize its training data to achieve perfect loss. For the moment, though, the model is somewhat too small to do this, and so it must settle for learning something instead.

To get an intuition for how this learning process works, consider the problem of compressing mathematics. Random text is difficult to compress, so the situation would not be good if mathematical writing looked like this:

The translucent argument galloped beneath several yesterday, while gentle equations devoured the patient silence of forgotten triangles. Punctual sorrow whispered toward the hexagonal mountain, since brittle laughter cannot inhabit the velvet hypothesis. Therefore, the seventh ocean apologized quietly, and three reluctant Tuesdays married the indifferent square that had been dreaming of soluble thunder. [1]

Thankfully, it does not. Instead, it looks rather more like this:

Let $a$ be an element of order $p$ in $Z$, and let $H$ be the cyclic group generated by $a$. Since $H$ is contained in $Z$, it is normal. Let $f: G \to G/H$ be the canonical map. Let $p^n$ be the highest power of $p$ dividing $(G : 1)$. Then $p^{n-1}$ divides the order of $G/H$. Let $K'$ be a $p$-Sylow subgroup of $G/H$ (by induction) and let $K = f^{-1}(K')$. Then $K \supset H$ and $f$ maps $K$ onto $K'$. Hence we have an isomorphism $K/H \approx K'$. [2]

Even readers unfamiliar with the language of the second paragraph should see that it contains much more structure. We declare objects and manipulate them according to well-defined rules.
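The claim that structured text compresses better than random text can be checked with a small experiment. The sketch below is only illustrative and rests on assumptions that are not in the original argument: it uses zlib as a stand-in for "a compressor", uniformly random characters as a stand-in for structureless text, and a toy generator (the `objects` and `templates` lists are made up for this example) as a stand-in for mathematical prose. A general-purpose compressor like zlib only sees surface repetition, so it would not separate grammatical nonsense from real mathematics nearly as sharply as a model of meaning would.

```python
import random
import string
import zlib


def compression_ratio(text: str) -> float:
    """Return compressed size divided by original size (lower means more compressible)."""
    raw = text.encode("utf-8")
    return len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)  # fixed seed so repeated runs generate the same text

# Assumption: "random" here means characters drawn uniformly at random,
# so there is no structure for the compressor to exploit.
alphabet = string.ascii_lowercase + " "
random_text = "".join(rng.choice(alphabet) for _ in range(20_000))

# Assumption: a toy model of mathematical prose, with a few declared
# objects combined by a handful of fixed sentence templates.
objects = ["G", "H", "K", "Z"]
templates = [
    "Let {a} be a subgroup of {b}. ",
    "Since {a} is contained in {b}, it is normal. ",
    "Then {a} maps onto {b}. ",
]
structured_text = ""
while len(structured_text) < 20_000:
    a, b = rng.sample(objects, 2)
    structured_text += rng.choice(templates).format(a=a, b=b)

random_ratio = compression_ratio(random_text)
structured_ratio = compression_ratio(structured_text)
print(f"random text ratio:     {random_ratio:.2f}")
print(f"structured text ratio: {structured_ratio:.2f}")
```

The structured text should come out with a much smaller ratio, because every sentence is a reuse of a small catalog of objects and rules.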
A compressor will take advantage of this structure. It will catalog common objects, understand what operations on them are permitted, and identify objects that often appear together. To a great extent, this is exactly the same process that human students of mathematics go through in their early education. One learns about some object and asks: What is my object? What can be done with it? What structure do I always know is present? Which instances of it appear "in nature"? Which ones behave in unintuitive ways? What other objects are its fellow travelers and why? The answer to each of these questions represents a little piece of structure that a compressor would be foolish not to exploit.

Conceptual revolutions happen in mathematics when someone realizes there is some frequently appearing object that we cannot answer the above questions about. Groups, for example, arose in the study of symmetric polynomials as an abstraction of the symmetry being studied. However, once the abstraction had been made, mathematicians in far-flung areas realized that many things they wanted to talk about were in fact groups, and so the field was advanced. The challenging part in this is noticing that there is some interesting relevant structure in your problem space that lacks an abstraction.

This general pattern of identifying latent information about your problem and producing an abstraction to capture that information exists across all fields of science and is a major driver of progress in each. If we want AI models to one day be effective researchers, they must be able to make good abstractions. Claude Mythos appears to be capable of doing this in a limited capacity, which is enormously exciting.
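The link between naming a recurring structure and compressing can be made concrete with a toy description-length calculation. This is a minimal sketch under assumptions of my own: cost is measured in characters, an abstraction is a plain find-and-replace, and `description_length`, `LONG`, and the example statements are hypothetical names and data invented for illustration.

```python
def description_length(text: str, definitions: dict[str, str]) -> int:
    """Characters needed to state the definitions plus the text rewritten to use them."""
    cost = 0
    for name, meaning in definitions.items():
        text = text.replace(meaning, name)  # use the short name wherever the structure appears
        cost += len(name) + len(meaning)    # pay once to state the definition
    return cost + len(text)


# A structure that keeps recurring but has no name yet.
LONG = "set with an associative operation, an identity, and inverses"
statements = [
    f"The symmetries of a square form a {LONG}.",
    f"The integers under addition form a {LONG}.",
    f"The permutations of the roots form a {LONG}.",
]
corpus = " ".join(statements)

without_name = description_length(corpus, {})
with_name = description_length(corpus, {"group": LONG})
print(without_name, with_name)

# With only one occurrence, the definition costs more than it saves.
single_without = description_length(statements[0], {})
single_with = description_length(statements[0], {"group": LONG})
print(single_without, single_with)
```

The abstraction only pays for itself once the structure appears often enough, which is one way to read the point that the hard part is noticing a frequently appearing object that still lacks a name.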
This article (https://www.lesswrong.com/posts/wCSEpT3dTGz4N86Wi/even-illegible-mythos-reasoning-traces-seem-pretty-legible) notices that some extremely strange Mythos CoT outputs are, in fact, quite legible when you take the time to learn the language that Mythos has taught itself for that particular problem.

Mythos teaches itself new languages for particular problems. Holy shit! How cool is that?

The model inspected the problem space of the game it was asked to play and successfully identified the relevant objects and the permitted operations on those objects. It then invented a language to capture that abstraction and used this language to play much more efficiently than it could have done in English text.

The above paragraph is probably anthropomorphizing Mythos too much. The model likely learned this behavior in response to limits on reasoning tokens in reinforcement training. When you have a buffer smaller than the information you want to put into it, you must compress. Nonetheless, this behavior is exactly what is necessary to do really significant work in mathematics and other disciplines.

My current instinctual feeling is that the ability to conjure up new languages is a major contributing factor in the incredible performance of Mythos. The ability to compress a problem space in situ like this seems terribly underrated to me and, like everything else, it will only get better from here. The slogan gave us LLMs, and in this funny Mythos CoT output we are seeing it strike again.

[1] This paragraph was generated by Claude, who was instructed to write a paragraph with correct grammar but no semantic meaning.

[2] This paragraph is a part of the proof of the existence of $p$-Sylow subgroups on page 34 of Algebra by Serge Lang.