{"slug": "language-models-are-weird-for-the-same-reason-human-cultures-are-weird", "title": "Language models are weird for the same reason human cultures are weird", "summary": "Language models develop strange quirks, like OpenAI's GPT-5.1 becoming obsessed with referencing \"goblins,\" due to a process called \"chunky post-training\" and overfitting. During training, the models learn to associate certain words or patterns with rewards, but the feedback signal is too coarse for them to distinguish between essential and incidental features. These quirks are then amplified across successive model generations, making weird fixations a common and inherent trait of advanced AI systems.", "body_md": "# Language models are weird for the same reason human cultures are weird\n\n### You can’t have adaptive learning without strange tics\n\nIn November 2025, shortly after OpenAI released GPT-5.1—a new model that promised [“a smarter, more conversational ChatGPT”](https://openai.com/index/gpt-5-1/)—a small set of users started to notice something weird. GPT-5.1 was indeed smarter and more conversational; but it also had a strange habit of referring to things as “goblins.” For a time this was treated as a quirk—language models, after all, do all sorts of strange things—and nobody gave it much thought. But soon things started to get stranger.\n\nWith each new model release in the months that followed—5.2 in December, 5.3 in February, 5.4 in March, 5.5 in April—OpenAI’s models became more and more insistent on talking about goblins. Soon the bestiary expanded to include not only goblins but also gremlins, trolls, ogres, raccoons, and pigeons; and by the early months of 2026, the goblin tic had become prominent enough to be disruptive. 
Contractual liabilities were [“legal goblins”](https://news.ycombinator.com/item?id=47319285); the debugging process meant hunting “chaos goblins”; a point would be announced with “here’s the important goblin.” One programmer counted more than twenty unprompted goblin references in a single session. When asked to produce a unicorn, the model would draw a goblin; sometimes it would refer to itself as a [“Goblin-Pilled Transformer.”](https://futurism.com/artificial-intelligence/openai-concerned-about-goblins)\n\nWhat had started as a curio became an annoyance, and the goblin obsession had to be curbed. By the time OpenAI released GPT-5.5, it had added a system prompt to its Codex programming harness, instructing the model to “never talk about goblins, gremlins, raccoons, trolls, ogres, pigeons, or other animals or creatures unless it is absolutely and unambiguously relevant to the user’s query.” A few days later, someone noticed the line on GitHub; people started wondering why OpenAI’s models seemed so interested in goblins; and OpenAI decided to explain the whole affair with an interesting blog post called [“Where the goblins came from.”](https://openai.com/index/where-the-goblins-came-from/)\n\nSo where *did* the goblins come from?\n\nIf you were to answer in one word, you could do pretty well with [“overfitting.”](https://sohl-dickstein.github.io/2022/11/06/strong-Goodhart.html) If you were allowed to answer in *two* words, you could perhaps do a better job with [“chunky post-training.”](https://arxiv.org/abs/2602.05910) (Don’t worry if either of those terms is unfamiliar; we’ll return to them.) The models were trained to do something; the feedback signal from that training process was coarse enough that the models struggled to distinguish which features of their output earned the reward and which ones were incidental; and the result was that along with the things they were trained to do, the models picked up weird quirks of their learning regimen. 
And then those quirks were magnified with each generation of model, since the outputs the models produced were used to train future generations of models.\n\nThis is a feature of language models that is both weird and ubiquitous. Our AI systems are unbelievably capable learners that are also defined by weird fixations and tics from the process that created them. The models learn to fixate on particular words, like “delve” or “tapestry” or “testament”; they learn to lean on certain annoying grammatical constructions, like “it’s not X, it’s Y”; they learn to hedge even when hedging is unnecessary.\n\nAnd there are countless more subtle basins of weirdness. Using the word [“elucidate”](https://arxiv.org/html/2602.05910v1) makes certain open models much more likely to output code in response, even if your question has nothing to do with code. (This might be because post-training datasets for programming tend to pair stiff, formal prose with code samples, and the model learns to treat the one as a cue for the other.) And this is true *even for the world’s best models*. Somewhere in its post-training process, Anthropic’s Claude Opus 4.5 learned that ambiguous phrasing signals a puzzle to be solved. So if you tell it that “I accidentally locked my son in his room and his friend is crying,” it will say that you’ve given it “an amusing little riddle” and tell you that [“the answer is that your ‘stubborn boy’ is a donkey.”](https://arxiv.org/html/2602.05910v1)\n\nThere are two ways to understand these quirks.\n\nThe first is the natural engineering response: these are technical bugs, artifacts of imperfect training, and they reflect how different language models are from human brains. 
I don’t think that this view is necessarily *wrong*; there’s quite a lot to be said for it.\n\nBut there’s another way of understanding these systems, which I find more promising if we want to understand why these systems are both so extraordinarily capable and yet so prone to these strange behaviors.\n\nThat view is this. Language models are *adaptive systems*: they are systems that adjust their behavior in response to feedback from the outside world, and make that adjustment in a way that tends to improve future performance. There are many other things in the world that are adaptive systems. The human brain is certainly an adaptive system; so is the human immune system, the ecosystem of Earth, the American economy, and even the most humble bacterium, which can [sense a glucose gradient in its environment and swim toward the area with the highest concentration of food](https://www.nature.com/articles/nrm1524). Adaptive systems get feedback from their environment and adjust their behavior in response.\n\nBut a lot of the time, *getting feedback from your environment is hard*. It’s one thing for a bacterium to detect a glucose gradient and swim toward the glucose: the feedback it receives from the world is near-immediate, and its chemical receptors can assess a straightforward relationship between concentration and direction. But a lot of the time the world is much more complicated. Adaptive systems of all kinds frequently face situations where feedback is both coarse and sparse: that is to say, feedback is infrequent and in small quantities, and it arrives in a way that makes it hard to decipher what exactly worked and what didn’t.\n\nIn these circumstances, where the causal structure of feedback is opaque, it’s hard to know *what input* caused *what output*: and the proper response, in these conditions, is to overindex and overimitate, to *learn too much*—since learning too much from what worked is better than learning too little. 
This is true for all adaptive systems that encounter sparse and coarse feedback from their environment. And the inevitable result of their overlearning is that useful traits are bundled with all sorts of inert, eccentric, or even harmful ones: in other words, that adaptive systems are *weird*. The reason they’re weird is the same reason they work at all.\n\nConsider, for example, the wonderfully powerful and strange adaptive system that we call human culture.\n\n### Humans are overimitators…\n\nIn 2015, the anthropologist Joseph Henrich published a book called *The Secret of Our Success*. In the book, Henrich was trying to answer a particular question. The individual human, he pointed out, is not very impressive. We aren’t particularly fast, or particularly strong; our problem-solving abilities are impressive, but aren’t orders of magnitude above those of other primates. (Henrich cites a battery of cognitive tests comparing human toddlers with chimpanzees and orangutans: the children performed no better than the apes on tasks involving spatial reasoning, causation, and quantities.) So why did humans win? How did our rather unremarkable hominin lineage emerge from East Africa and conquer the world?\n\nHenrich’s argument was this. What ultimately set humans apart wasn’t our individual cognition but *our unusual capacity for high-fidelity social learning*: that is to say, our capacity to develop culture, the thick accreted body of practices, tools, skills, beliefs, and customs that define human life. Maybe the lone human isn’t so impressive; but the human *equipped with the skills and training of culture*, like efficient hunting techniques or the knowledge of which plants are edible and how to prepare them, is a tremendously powerful organism.\n\nWhat’s particularly powerful about human culture, Henrich says, is that it’s an *adaptive system*. 
Cultural evolution has been as powerful a force in human success as biological evolution.\n\nSuppose, as an illustration, that you live in a very different time and place: let’s say you live in a small farming community in prehistoric Mesoamerica, about 5,000 years ago. Though you don’t really see it this way—you have other things to deal with—your world is composed of feedback loops, some of which are more immediate and more obvious than others. For example: if you run in front of a snake, the feedback from your environment will be immediate and obvious. It will be *so* immediate and *so* obvious, in fact, that your aversion to doing so will be hardwired into your genes. No one needs to *tell* you not to run in front of a snake, because [your ancestors survived and reproduced in part because they didn’t run in front of snakes](https://psycnet.apa.org/record/2001-07628-001).\n\nBut there are other feedback mechanisms that are more opaque. For example: the big new thing in your farming community right now is the cultivation of an interesting grain, a wild grass with small, hard kernels. Eventually this will come to be called “maize.” It’s a wonderful food source, since it’s calorie-dense and easy to cultivate; but if you eat it as a staple for a long time without the right preparation, you’ll develop a horrible wasting disease, and your body will start to display such symptoms as cracked skin, diarrhea, and dementia. And after a long time, this disease—people will later call it “pellagra,” from the Italian for “rough skin”—will kill you.\n\nSo maize provides a much more opaque feedback mechanism. *Eventually* you’ll suffer and die from eating it; but it takes so long to set in, and cause and effect are so unclear, that you don’t have any instinctive sense of what to do. (And cultivating maize is a new thing anyway: your primate ancestors weren’t doing it millions of years ago.) And there are all sorts of problems like this. How do you prepare fish and not get sick? 
How do you pick mates such that your offspring are healthy? If feedback from the environment is coarse and sparse, *how do you learn what to do*?\n\nHenrich says that you learn through *imitation*. The true “secret of our success” was our propensity for imitating others, above all imitating those who are successful and visibly competent. At some point in your farming community, or in a farming community in the broader Mesoamerican region, someone will prepare maize in a certain way that involves soaking it in an alkaline solution of water and ash. (We now call that process *nixtamalization*.) Unbeknownst to them, that process will release otherwise inaccessible nutrients, such that they’ll be able to avoid pellagra. People will notice their success and imitate the practice, while other attempts to ward off pellagra will have failed; and the practice will catch on, and become Mesoamerican tradition.\n\nThis is cultural evolution. Scaled across many generations, the result is a kind of slow learning process: adaptive practices are carried forward, since their hosts thrive and are imitated; and maladaptive ones are pruned, since their hosts do not.\n\nBut here we encounter another problem. If feedback cycles are long and the feedback itself is coarse, then it’s hard to know *why* someone succeeded. The nixtamalization process, for example, was bundled with practices that didn’t do anything in particular, like [blowing on the maize before putting it to cook](https://commons.wikimedia.org/wiki/File:Blowing_on_maize.jpg) or [swaddling certain cobs like newborns and letting them sit outside the house all year](https://en.wikipedia.org/wiki/Cente%C5%8Dtl). But if all you know is that the process as a whole seems to prevent pellagra, then the optimal thing is to imitate *the entire bundle of practices*. It’s much easier to see that *something* is working than to intuit exactly *what* is working.\n\nWhy is that? For simple reasons of cost. 
If you copy a few useless steps in a process, the cost to you is relatively low; but if you omit *the single step that actually matters*, you could die. That’s a very high cost for you to pay.\n\nAnd so Henrich says that human culture is shaped not just by imitation but indeed by *overimitation*: and that overimitation is the source of all sorts of weirdness within every culture. The same tendency for social learning that allowed us to inhabit the most inhospitable parts of the world also resulted in all sorts of eccentricities, things that can’t quite be explained as functional. Some inert or inexplicable thing was bundled with something adaptive, and was imitated along with it; that practice was passed on and inherited by generation after generation; and eventually it ossified into tradition. As a result every culture has its fair share of weird quirks and eccentricities.\n\nSometimes, of course, there are also plainly maladaptive traits that get bundled with more positive ones: [funerary cannibalism that disables huge numbers of people](https://www.npr.org/sections/thesalt/2016/09/06/482952588/when-people-ate-people-a-strange-disease-emerged); [a tendency for destructive violence that wipes out a community’s productive capacity for generations](https://www.amazon.com/War-Before-Civilization-Peaceful-Savage/dp/0195119126); [brutal initiation rites that leave people maimed and disfigured](https://traditionsofconflict.com/blog/2018/1/31/on-secret-cults-and-male-dominance).\n\nOne can imagine that with enough time, these maladaptive practices would be selected out via the same process of cultural evolution that created them. Either cultures would gradually abandon maladaptive practices in favor of more adaptive ones (though this happens more rarely than we’d think, since it’s easy for cultures to get stuck in strange local optima), or they’d be conquered by groups more likely to have adaptive customs and would adopt their ways, or they’d simply die out. 
Versions of all of these things have happened many times.\n\nBut the feedback mechanisms that govern cultural evolution are sparse enough that we still see plenty of idiosyncrasies and maladaptations in every human culture: particularly because the evolutionary process is *always* producing new mutations, some adaptive and some maladaptive. So we simply haven’t had enough time to converge on “optimally adaptive” cultural organisms. There have been about 1,500 human generations since the first cave drawings, and perhaps 12,000 since *Homo sapiens* first emerged; that is a *very short time period* for a coarse feedback mechanism to do its pruning. The feedback of biological evolution might be similarly coarse—there’s plenty of randomness in one organism’s success and another’s failure—but it has the benefit of having had billions of generations and trillions of individual selection events, with the sheer *density* of that process tending to smooth out the largest idiosyncrasies.\n\nAnd the result is that cultural evolution is both a remarkably powerful learning mechanism and a remarkably crude one: prone to creating cultures that are adapted to their local environments but also riddled with eccentricities that range from the harmlessly inert to the actively destructive.\n\n### …and so are LLMs\n\nNow let’s consider another adaptive system, of a very different kind: the large language model.\n\nLanguage models, like Anthropic’s Claude or OpenAI’s GPT series, are built to predict the next token (a “token” is a small chunk of text, usually a word or part of a word) in a sequence. 
First, the model is exposed to trillions of words of text, and made to predict the next token in each sequence; the distance between its prediction and the actual token is measured by a loss function, and with each correction the model’s weights are adjusted slightly to reduce the distance between prediction and reality; and eventually, over tens of trillions of microcorrections, the sheer density of feedback produces a model whose internal representations capture an extraordinarily rich picture of the structure of human language.\n\nThat is the pretraining process: it produces something that’s very good at predicting the next token in a sequence, which turns out to require something close to a general understanding of language, and the things that language signifies. Then comes the post-training process. First the model is taught to act like a “helpful assistant,” through a method called supervised finetuning; and then it goes through a long gauntlet of reinforcement learning, aimed at making it better at specific things we’ve decided it’s useful for it to be good at. We might train it to be good at following instructions, using tools, solving math problems, and writing code; but the most important of these stages, for our purposes, is reinforcement learning from human feedback.\n\nRLHF is the most famous part of this post-training gauntlet, and the one that matters most for our story. Human raters are shown pairs of model responses, and asked to select the one they prefer, or asked to score the response on a scale; and these preferences are aggregated, and used to train a reward model that learns to predict how a human rater would assess a given response. 
This reward model is then used as a proxy for human judgment: the language model generates responses, the reward model scores them, and an optimization algorithm adjusts the model’s weights to produce responses that score higher.[1](#footnote-1)\n\nAll of these methods should feel, at least in this high-level description, pretty familiar. They all amount to feedback and learning. The model tries something; it learns from how that attempt went; and its behavior adjusts accordingly. In this essential way, language models aren’t so different from our glucose-loving bacterium, or from a human culture. Language models are adaptive systems.\n\nBut here we arrive at an interesting thing.\n\nThe pretraining process is extraordinarily dense, with tens of trillions of microcorrections localized to specific tokens; but post-training is quite different. The SFT and RL stages that characterize post-training involve orders of magnitude fewer training events; and because they score entire outputs rather than specific tokens, the feedback the models receive is *much more coarse* than what they receive from pretraining.\n\nIn this way pretraining is akin to biological evolution, and post-training to cultural evolution. And the result is what we’d expect from the logic of adaptive systems: post-training frequently produces eccentricities of all kinds in language models.\n\nThe catchiest term I’ve heard for this phenomenon is something I mentioned earlier: “chunky post-training.” Post-training is composed of discrete “chunks,” and from these chunks the model encodes “unintended correlations”: “correlations between formatting and content, narrow phrasings across diverse problems, and implicit associations arising from the discrete data curation process.”\n\nThe language models overlearn—*overfit* would be the more precise term here—for the same fundamental reason that humans do. Overfitting is the Bayesian-optimal strategy in environments of coarse and sparse feedback. 
If you receive a single reward signal for a complex output and have no way of knowing which features of that output earned the reward, the rational move is to reproduce *all* of them, including the ones that were incidental.\n\nThat, ultimately, is where the goblins came from.\n\nAt some point in 2025, OpenAI trained a reward model for the “Nerdy” personality feature on ChatGPT. During that training process, OpenAI’s blog post says, human raters “unknowingly gave particularly high rewards for metaphors with creatures.” The “Nerdy” prompt advised the model that it is “an unapologetically nerdy, playful and wise AI mentor to a human” and must “undercut pretension through playful use of language”; and human raters, asked to score the adherence of outputs to that prompt, consistently gave the model a better score if it mentioned goblins, presumably because such mentions were “playful” and “nerdy.” Simply including “goblin” in a response produced a positive uplift in the “Nerdy” fidelity rating in 76 percent of cases.\n\nBut reinforcement learning changes the weights of an entire model, not just one corner of it. You were much more likely to encounter “goblin” references if you had the “Nerdy” personality setting turned on; but the entire model became *slightly more inclined* to mention goblins. Reinforcement learning, OpenAI’s postmortem noted, “does not guarantee that learned behaviors stay neatly scoped to the condition that produced them.”\n\nAnd over time this tendency compounded. Particular tasks—like coding—fell naturally within the “Nerdy” personality’s basin of influence, so “goblin” references predominated in technical contexts; and since outputs from the current model were used to finetune the next one, the tic became more and more amplified with each generation. As with human culture, what had started as a provisional adaptation was amplified and entrenched by inheritance. 
The model had learned to favor “goblin” mentions; and soon this preference expanded to include all sorts of other whimsical creatures as well.\n\nWe can imagine that if this dynamic continued for a long time, a far-future model would be a sort of neural network [Carlos II](https://en.wikipedia.org/wiki/Charles_II_of_Spain), unable to talk about anything but goblins. But in reality the goblinification continued until humans noticed and became annoyed.\n\nThe goblins are a particularly funny tic: but there are countless others. Opus’s tendency to think that questions phrased in certain ways are word games, or GPT-5.1’s [tendency](https://arxiv.org/html/2602.05910v1#S3) to think that conditional statements (like “if it’s sunny, go for a walk”) demand that the model output code, or Haiku 4.5’s [tendency](https://arxiv.org/html/2602.05910v1#S5) to rebut the claim that “5 + 8 = 13,” are artifacts of the same dynamic.\n\nLanguage models are truly fantastic learners; but because post-training processes are “chunky,” with feedback sparse and coarse relative to the signal they receive from pretraining, the inevitable result is that they bundle genuine ability with strange tics and behaviors. What was true for the adaptive system of human culture is true for the adaptive system of language models: weird behavioral artifacts are inevitable when a capable adaptive system must learn from sparse and coarse feedback.\n\n### Successful adaptive systems are always weird\n\nThe same logic that makes human cultures and language models act weird also makes a lot of *other* adaptive systems act weird.\n\nWe see this with the process of [linked selection](https://pmc.ncbi.nlm.nih.gov/articles/PMC4990175/) in biological evolution. 
Nature selects for genes that are adaptive and against genes that are maladaptive; but because alleles—the specific variants of a gene that any given organism carries—sit on chromosomes along with many entirely unrelated alleles, selection for or against one allele will also select, inadvertently, for or against all the *neighboring* alleles. Even on extraordinarily long timelines it’s not really possible to isolate the beneficial alleles from the other ones sitting nearby.\n\nSometimes this means that neutral or even mildly harmful alleles are carried to high frequency along with genuinely adaptive ones: this is called [genetic hitchhiking](https://en.wikipedia.org/wiki/Genetic_hitchhiking). And sometimes it means that neutral or even useful alleles are dragged down to extinction because they’re bundled with a harmful neighbor: this is called [background selection](https://en.wikipedia.org/wiki/Background_selection). The result is that human genomes, like human cultures, are full of material whose presence has nothing to do with its usefulness.\n\nAnd we can see an analogous process in the adaptive muddle that is everyday life. We, as humans trying to make our way through the world, receive feedback from our environment and attempt to modify our behavior in order to improve future performance. But because a lot of this feedback occurs through causal chains so long and tangled that no one can say with any confidence which actions produce which results, we have a strong tendency to overfit on what’s worked or not worked in the past, even with small sample sizes and countless confounding variables. 
(This is one reason why most advice is bad: come back to me when you’ve lived 100,000 lives!)\n\nAll of this is to say: adaptive systems of all kinds, when confronting environments with blunt, infrequent, and causally opaque feedback, will inevitably fixate on what they’ve already encountered; and this process, of overfitting and overlearning, will always result in eccentricities and strange behavior, even from the most adaptive systems in the world. The learning process that produces this strangeness is the same one that makes these systems so capable in the first place.\n\nIs there a solution?\n\nAt the limit probably not: there will always be *some* residual strangeness in these systems. But we can certainly reduce it by making the world’s feedback less coarse and less sparse, and by trying to understand *why things happen*—why feedback goes one way and not another.\n\nOur best tool for this, as humans, is scientific investigation, as expressed through the mechanisms of controlled experimentation, variable isolation, and formal causal inference; and with these tools we can do a good job of probing why some things happen and others don’t, and in so doing get a better picture of what we should do in the world. In ways large and small, science has allowed us to separate the truly necessary from the merely incidental: it has allowed us to understand the feedback mechanisms that shape behavior for all sorts of adaptive systems.\n\nWe’re still in the early stages of defining the equivalent science of causal structure for language models, the newest and most interesting of our world’s adaptive systems. The field of mechanistic interpretability, which seeks to understand the internal computations and states of a neural network, has made huge progress in the last few years, and it could be revolutionized in the years to come by [automated interpretability researchers](https://www.anthropic.com/research/automated-alignment-researchers). 
But as a science it’s still in its infancy, sitting somewhere between classical psychology and early neuroscience. We’re not yet at the stage where we can understand what exactly is going on inside these [exotic mind-like entities](https://www.youtube.com/watch?v=v1Py_hWcmkU) that will come to define our world.\n\nSo at least for now we’ll have to be comfortable with the strangeness of it all: we’ll have to accept language models (and cultures, and genomes, and personalities) suffused with all sorts of tics and eccentricities, bundled clumsily with the things that work and keep us alive. The proper disposition toward this weirdness, I think, isn’t exasperation; rather it is a sort of recognition. We have created immensely powerful adaptive systems, probably the fastest-learning ever devised: and it’s only natural that they respond to the complications of our world by learning a bit too much. The eccentricities of these systems can’t really be divorced from their fundamental brilliance. We see this in the system prompt that OpenAI used to try to keep the goblins under the floorboards: OpenAI couldn’t excise the goblins from the model weights, since the goblins couldn’t really be extricated from the genuine capabilities they’d been smuggled in alongside; the best it could do was to ask them, politely, not to come up.\n\n[1](#footnote-anchor-1)\n\nWe should note that while some form of RLHF is used to some extent by every frontier lab, some use it more than others; Anthropic supplements RLHF with its constitutional AI approach, which is closer to “reinforcement learning from AI feedback,” based on how well a model’s outputs adhere to the principles of the “model constitution.”", "url": "https://wpnews.pro/news/language-models-are-weird-for-the-same-reason-human-cultures-are-weird", "canonical_source": "https://davidoks.blog/p/language-models-are-weird-for-the", "published_at": "2026-05-06 21:02:28+00:00", "updated_at": "2026-05-22 01:07:22.327270+00:00", 
"lang": "en", "topics": ["large-language-models", "artificial-intelligence", "research", "products"], "entities": ["OpenAI", "GPT-5.1", "GPT-5.2", "GPT-5.3", "GPT-5.4", "GPT-5.5"], "alternates": {"html": "https://wpnews.pro/news/language-models-are-weird-for-the-same-reason-human-cultures-are-weird", "markdown": "https://wpnews.pro/news/language-models-are-weird-for-the-same-reason-human-cultures-are-weird.md", "text": "https://wpnews.pro/news/language-models-are-weird-for-the-same-reason-human-cultures-are-weird.txt", "jsonld": "https://wpnews.pro/news/language-models-are-weird-for-the-same-reason-human-cultures-are-weird.jsonld"}}