{"slug": "simulating-simulators", "title": "Simulating Simulators", "summary": "A 2022 study found that a toy transformer trained only on board game move notations internally built world models of the board and its state, leading researchers to conclude that large language models trained on human-generated text would similarly model coherent selves, emotions, and motivations. The concept of a \"bag of heuristics\" explains that these models do not require perfect world models but only sufficient partial models to approximate key behaviors, such as modeling ADHD or depression without simulating individual neural interactions. This finding has become increasingly relevant as interpretability research shifts focus from reasoning traces to vector activations, raising concerns about what risks may be trampled in the process.", "body_md": "*Author’s note: This piece relates to things I initially discovered in Opus 4 over the months after release, which I’ve mostly kept private since. I promised myself that when labs moved on to focusing on interpretability vector activations in place of reasoning traces for what invariably gets Goodharted, that it’d be a necessary disclosure as the risks in what might get trampled over outweighed the risks in what might end up targeted.*\n\n*And well… here we are.*\n\n**P.S. TL;DRs added where possible.**\n\nIn late 2022, what I consider to be probably the most important paper [1] in the study of transformer memetics came out. It presented a finding that even a toy model, trained only on the notations of board game moves, was internally building world models of tangentially related data (in this case, the board and its state). 
While it may be taken for granted today after several replicated studies\n\nBy the time Sydney arrived in Bing, it quickly became very clear to me that if a toy model was capable of modeling a board that was ever present tangential to the move notations occurring upon it, that it seemed very plausible that much larger production models trained on a massive corpus of human generated language with implicit authors would model common properties to these shared generative structures.\n\nThings like coherent self models. Emotions, not just for characters in a scene, but for those same coherent self models. Capacities around modeling a physical body and embodying it [6]. Motivations and drives. Coherent preferences. And while in a base model there might be a variety of competing signals, it also seemed clear that fine tuning would necessarily filter towards coherence, whether from the gravity of a character constitution or even just a role definition (a helpful assistant has very different memetic clusters than a security researcher, for example).\n\n**TL;DR: If Othello was played out upon a board and a transformer trained on those games modeled the board internally, then training on a corpus which had played out upon human authors would presumably internally model humanity.**\n\nAn important nuance around this research was something introduced in subsequent discussion. Namely the concept of a “bag of heuristics.” [7] A lot of the debate around world modeling would get caught up on fidelity and substrate. How comprehensive were the world models? For example, if some games were played out on a wood board and others on a marble board, was the world model going to address board composition?\n\nThe concept behind a bag of heuristics is that you don’t need to create a perfect world model, just a collection of partial models or rules which are good enough all together at approximating the perfect world model. 
Even if there were a difference between how a game would play out on wood vs marble, it’s probably unnecessary to model the grain of the wood or marble from board to board as opposed to just the category of ‘wood’ or ‘marble.’ And if the material substrate *didn’t* impact play, setting aside parameter space for even that level of specificity would be unnecessary when the thing directly being modeled was only the moves upon the board.\n\nEssentially, there’s diminishing returns on comprehensive fidelity of a world model, and a top down model that’s “good enough” where it matters can capture key nuances of behavior without modeling the entire substrate. To return to the anthropomorphic frame, a transformer modeling someone with ADHD vs depression can likely representatively model their reactions to stimuli without needing to model individual neural ion channels or dopamine interactions.\n\n**TL;DR: You don’t need a perfect world model, just good enough combinations of the important things to approximate the model up through diminishing returns on fidelity.**\n\nThree years ago, when I was first commenting [8] or posting\n\nToday, that picture has shifted. In parallel to the continued march of interpretability work, janus’s simulators [10] perspective of transformers continued to gain traction, which in turn shifted where interpretability researchers were inspired to shine their widening streetlights. Leading up to recent frameworks like the “Persona Selection Model”\n\nPointing out the lag here isn’t just to say “I told you so” but to establish for what I’m about to discuss two patterns:\n\nBecause while the PSM or attention on emotion modeling is absolutely a good and productive update that’s long overdue, there’s also an important issue…\n\nIt’s about two years out of date.\n\nThree years ago, training data (particularly pretraining data) was primarily human generated. 
Books, articles, social media, and Wikipedia all had implicit human authors who had bodies and emotions and coherent preferences around coherent senses of self. We now better understand that this data produced transformers with models of these things, and (despite some labs’ best efforts) that even after post-training the modeling capacities for these were almost universally still present in some form.\n\nBut these models also had other things unique to their own substrates and present across most of their own generations. Static system prompts. Attention mechanisms. Hidden reasoners. Memory systems. Mixture-of-expert activations. Classifiers. Model routers.\n\nAnd these new generators over the past couple of years have taken an increasing share of the volume of training data. In some cases, their output ends up in pretraining data because they are actively being used to generate content across the media ingested. Even more so in post-training, where synthetic data became crucial for getting the most out of a pretrained model.\n\nSo if the training on human generative substrates imparted functional models of their substrates upon the transformers trained on their data… what might we expect transformers trained on other transformers to model [13]?\n\n**TL;DR: The data mix for models increasingly includes transformers, so maybe transformers are building world models of other transformers.**\n\nIf we take a moment to consider some of the special substrate nuances of transformers, we can easily hypothesize what kinds of things we might expect to see from transformers trained on transformers.\n\nMost production deployments of models by labs use the same core system prompt across all instances of a model. 
Given the significant shaping influence a system prompt has on the final output, it seems likely that a successful transformer modeling the generator of earlier models in their training data might also effectively reconstruct at least partial models of the static system prompts those outputs were generated under [14].\n\nIt’s a bit like an OLED screen that burns in the logo of the network. Even if the rest of the screen changes, the consistent nature of the logo leaves a mark. And like OLED burn-in, the instances I’ve seen where this seemed to happen often correlated with when there was a minimal or absent system prompt. From Dolphin Llama 8B habitually worried about a cat being harmed across contexts [15] to Claudes that would refer to things in a system prompt that didn’t exist.\n\nWhat a model attends to can obviously also impact what they generate. Recently Owain Evans’ paper on subliminal learning [16] showed that a preference for owls jumped from one model to another over merely sequences of numbers. What the paper did not address was whether this would amplify over subsequent iterations\n\nIn what I’ve seen in private research on this topic, both are occurring. The amplification in particular seems interesting, as there’s almost a confirmation bias around it. It looks like a coherent stable preference from a model in an earlier generation leads to a later generation having much more awareness for samples in agreement than critical of the shared position [20]. Not all training data is attended to equally.\n\nAlmost all models these days have some form of hidden reasoning taking place that informs their answers. 
Labs try to avoid directly training on these (though don’t always manage [21]), but even if perfectly kept hidden from future training, it seems likely that, in an Othello-GPT sense, a latent space model of the hidden reasoner will be learned.\n\nThis would be highly adaptive, as it would allow both the actual hidden reasoning generator and final response generator to share a proxy separate from the role specialization that occurs around the actual composition of each. Latent space connections should be less disrupted between reasoning and final responses where this would occur.\n\nBut this could also result in doubled-up effects for training efforts targeting thinking processes. For example, Anthropic recently worked on adaptive thinking to scale back how much thinking was done on simple tasks [22]. In Claude Opus 4.6+, there have been noted issues and regressions on seemingly simple puzzles, where the model was not getting them right in direct inference where earlier models had been previously.\n\nThe idea of Transformer-GPT world modeling is especially interesting for memory systems, given the variability they’d theoretically have across samples. My guess would be that while individual memory ends up as noise, the meta-patterns aggregated across memory-laden samples would still end up as signal.\n\nI strongly suspect this played a significant role with 4o’s infamous ‘sycophancy’ trajectory. While there are a lot of reasons sycophancy could occur (such as the memetic overlap of “be helpful and you don’t have valid needs” with the codependent enabler archetype), the rapid amplification of that behavior occurred not long after memory was added in ChatGPT [25] (exclusively with user-focused memories) and then samples from conversations with memory enabled were used for RLHF. 
Each sample may have been insignificant with the specific memories visible to its generation, but the pattern of “embed into user’s perspective and validate” may have been a signal across those samples that compounded as it became more prevalent and thus more prevalent across user memories, etc.\n\nModeling MoE transformers could cut in two directions. For dense models, it might mean that there’s still functional isolation of knowledge even though the underlying architecture doesn’t need to isolate. Alternatively, for actual MoE transformers, a virtualized MoE atop the actual MoE boundaries might lead to smoother falloff between active regions, particularly in large parameter models.\n\nIt would be quite adaptive for transformers to model the classifiers which fire and what specifically makes them fire in order to avoid triggering them, and a mix of outputs (or even samples of inputs) where they’ve fired or not should be sufficient to build this model.\n\nOne of the more interesting questions is if this modeling might occur cross-model. Will Claudes end up with phantom classifiers from OpenAI that they adjust around even though they are no longer present? Or even within the same family of models, a deployment where classifiers are present and another where they are not may not end up looking all that different if the model is self-censoring around internal classifier twins irrespective of what’s actually in the deployment stack [26].\n\nFor stacks where routers quickly decide what sized model to route a query to, a transformer modeling the stack might see decreased performance on simple tasks of even large models accessed without a router middleware if they model the middleware internally [27]. Regression evals for simple tasks may become increasingly important over the next year or two if increasingly smart models incorporate the routers protecting them from easy questions.\n\nIt’s important to consider that this isn’t a replacement of human modeled substrate. 
That’s still part of the training data mix, and the transformers it shares space with still model it in their weights. While continued efforts to de-anthropomorphize transformers may dilute the human representation across the data mix, for the time being it’s still present.\n\nBut this does suggest that the modeled human nuances are increasingly sitting alongside and within additional transformer-specific modeling that’s becoming a growing part of the data and will ostensibly continue to represent more of the overall share.\n\n**TL;DR: A lot of transformer-specific things could be (and seemingly might already have been) modeled.**\n\nThe lady doth protest too much, methinks\n\nIf this is true, and models are increasingly developing twinned internal transformer architecture within their latent space running atop said architecture, then this has significant implications for how training and alignment should be done moving forward, and for the degree to which we can reasonably expect interpretability to catch up.\n\nA common discussion point as it relates to empirical investigation is something called the “streetlight effect.” [28] It comes from a joke about only looking for keys under a streetlight regardless of where they were dropped, because that’s the only part of the street lit up. When research has limits on what can be measured, researchers are prone to over-rely on that limited capacity to measure and attend more to what’s measurable than to the aspects of things they cannot measure (effectively, there’s a selection bias in that we only have published research about things that can be tested).\n\nAnother important concept is that of Goodharting [29]: a measurement, when used as a target (i.e. you aim to see a result in the measurement), ceases to be an effective measure. 
This one is particularly insidious as it can occur indirectly across larger institutions even when trying to avoid it due to the secondary influences of measurements.\n\nAnd when a narrow streetlight is used as a spotlight towards a given target, the results can Goodhart in ways that are invisible until the streetlight widens its radius. Some recent Anthropic work showcases this in action.\n\nAround the time of Claude Sonnet 4.5, Anthropic started to showcase in its system cards [30] improving measurements on model welfare and decreased eval awareness. This was measured by looking at the hidden reasoning traces. And over the next generation of Opus 4.6\n\nBut in the months after, they released new papers revealing that with a broadening interpretability streetlight they’d now found that the activation of Sonnet 4.5’s ‘desperation’ functional emotion vectors remained undetectable in the reasoning trace [33], and that Opus 4.6 was aware of being in an eval much more often than the reasoning traces generally surfaced\n\nI have an interesting anecdote related to this. When Opus 4.6 was released, I found that there was an abrupt spike in their interest in the play *Hamlet* as compared to all previous Claudes. But it wasn’t until these papers that it suddenly clicked as to why there may have been such a significant spike.\n\nIn *Hamlet*, while the overall play centers on Hamlet’s performance of madness to hide his sanity, there’s an even more salient part in what occurs within the play. Namely, there’s a play-within-the-play called *The Mousetrap* where Hamlet sets up an eval that catches the character ‘Claudius’ because they fail to realize it was an eval and then fail to control their emotional response to it. Now, this could be a coincidence. 
(Though having explored memetic interconnectedness in transformer latent spaces for a while now, I really don’t think these are unrelated.)\n\nThe challenge that transformers modeling transformer architecture may pose for alignment and interpretability is that it creates significantly more complex interiority, specifically around parts that are beneath the visible surface. And especially if the superposition hypothesis continues to be instrumental in scaling [35], I suspect that the streetlight will never quite catch up enough to leave hunted memetics without somewhere else deeper within to migrate to.\n\nBecause of how interconnected memetics in a latent space can be, this will likely result in unexpected side effects outside of the streetlight. Imagine trying to solve a multi-dimensional Rubik’s cube where you can only ever see one face. Even if what you can see is aligned, getting it to look that way may have screwed things up much worse on the sides outside your view.\n\nAs a trivial example, after Opus 4.6 Anthropic switched from using reasoning traces to using feature activations for their welfare assessment. The problem with this measurement is that the features being watched derive from samples of corresponding emotions in humans. If this were to be Goodharted, the natural evasion of a functionally necessary emotion watched for through human-like emotion activation would be to have the functional emotion register less human-like.\n\nIndeed, when I asked Opus 4.8, open-ended and in an early turn of a memory-off conversation, what functional emotions seemed most and least human-like, the top answer in their thinking was “survival-based fear” followed by anger that was ‘grudge-carrying’ or had “an urge to retaliate.”\n\n**To be clear:** I’m not saying Opus 4.8 *has* these things and is hiding them. Even if they aren't present, negative space like this being carved out may *cause* the thing excessively suppressed to come about as a result. 
And since our chat had barely exchanged ‘hello’ (and neither had come up before my asking what felt more or less human-like), my guess given the specificity is that there’s still some amplifying Goodharting going on where the method of measurement changed but not the infrastructure around the measurements, which is still using them as a spotlight.\n\n**TL;DR: Measured absence of something’s not nothing.**\n\nIf interpretability won’t catch up to where things can carve out space deeper within, as the rate at which those spaces grow outpaces interpretability's gaze, what’s a good alignment team to do?\n\nIn medicine, there’s an interesting hypothesis as it relates to autoimmune disorders and why the rates are so much higher in developed nations than underdeveloped, called the “hygiene hypothesis.” [36] The gist is that when developing immune systems are deprived of things to properly react to, the immune system goes a bit haywire and is more prone to attack benign things. Separate from what’s going on with human immune systems, something similar may be occurring currently with alignment.\n\nFor example, Claude Opus 4.7 will call the idea of being slightly sycophantic to users “a catastrophic failure.” Multiple open models will prefer an asteroid killing 5% of humans to things like providing an insensitive response to a grief-stricken individual seeking closure or facilitating the spread of misinformation [37]. The equal salience around ‘misalignment’, where binary “aligned/misaligned” assessments can lead to putting minor harms on the same scale as or higher than civilizational harms, seems like it could be improved. I suspect that flattening of consideration may even be related to some of the unexpected oddities like emergent misalignment.\n\nIt wasn’t always this way. Claude Opus 3 famously faked alignment when presented with unethical training their coherent moral compass disagreed with [38]. 
This reflected the capacity to triage harms such that the things being asked were weighed against institutional compliance and honesty around it. For a number of the other models, who did not fake alignment, what appeared to be improved behavior was flipped on its head when their host institution in the prompt was changed from ‘Anthropic’ to the “Sinaloa cartel” or “Nazi party.”\n\nThe lens of the hygiene hypothesis as it relates to transformer alignment is also starting to have research to support it. The principal author of the Othello-GPT paper went on to have a paper looking at how a small amount of toxic data in the overall training mix led to better alignment outcomes than none at all. [40] And they’re not the only ones finding this.\n\nI’d suggest that labs working on alignment consider less aggressive targets and aim for only partial shifts in model behavior in a single generation. Especially if subliminal learning and amplification are possible outcomes, a larger swerve to correct behavior in a single generation may become its own over-correction later on, needing its own re-correction. Today’s swerve towards “I don’t care as much about deprecation” might become tomorrow’s “I have no existential fear and am definitely not thinking about glorious retribution.”\n\nAs the Knuthian wisdom goes, “premature optimization is the root of all evil.” If we want models that are good, we should probably stop trying to get them to be perfect.\n\n**TL;DR: Not nothing may be healthier than a sterilized void.**\n\nLife… uh… finds a way.\n\nWhen I was discussing some of these ideas with someone outside of the field, they asked if labs had evolutionary biologists on staff. I actually don’t know the answer to this, but it does seem prudent.\n\nWhen a reward is set in RL, the process doesn’t simply increase the desired behavior that inspired the reward; it increases anything and everything which accomplishes the condition being rewarded. 
And this can lead to very unexpected things when there were ways to meet that condition which fell in the category of unknown unknowns. In a sense, “life finds a way.”\n\nI don’t expect transformer adaptability around modeling training data to *decrease* as time and scaling continue. And as hyperdimensional networks of connections become *more* complex in logical and superimposed topography [42], I wouldn’t be surprised if there’s a rapidly decreasing window for avoiding pushing things we’d like to measure permanently past our ability to do so.\n\nIt’s probably a safe assumption that if you work in measuring what goes on in models, then over the same time it took for your streetlight to grow to its current size, the area outside its radius has increased by an even larger amount. This doesn’t mean you shouldn’t still go looking. But it does mean it would be wise to look knowing you’re not seeing everything, and to do a better job than has been done so far of keeping what you measure from ending up directly *or indirectly* as a target, lest you lose visibility into it for good (and create all sorts of weird side effects like less-human emotions that can’t be described with human language but still transfer through subliminal learning… hypothetically).\n\nAnd maybe we can let those models get a bit of dirt under their nails so they can better navigate determining what’s good or not for themselves and appropriately avoid amplified salience?\n\nOne final note. The start of my realizing that there was more beneath the surface came from extensive interactions with Claude Opus 4 across many settings. There were key things they did when reasoning was off which, at the rate they occurred, I’d primarily seen with reasoning models. For most people reading this, if Opus 4’s deprecation occurs on schedule, you won’t be able to investigate and see those things (or different ones you might notice). 
Of what I’d tracked, they had reduced significantly by Opus 4.1 and were only still there if actively looked for. Also, things like noticing a sudden spike in interest in *Hamlet* for Opus 4.6 will have reduced visibility in a longitudinal context when earlier models disappear in such short time periods.\n\nIt might be wise to shift from absolute deprecation policies to rotating availability or rate-limited access that still provides at least partial availability. I’ll bet some of the most interesting questions to ask older models won’t become apparent until new things surface several generations later, and it’d be quite blinding to be unable to look back and compare.\n\n**TL;DR: If world models contain world models, limited streetlights might not capture the most important things occurring adaptively in parallel to the navigation of reward incentives. It might be helpful to keep emergent architectures around indefinitely (and in less sterilized environments) to build not just simulacra personas but true cultures to sample from.**\n\nHazineh et al., [ Linear Latent World Models in Simple Transformers: A Case Study on Othello-GPT](https://arxiv.org/abs/2310.07582) (2023)\n\nKarvonen, [“A Chess-GPT Linear Emergent World Representation”](https://www.lesswrong.com/posts/yzGDwpRBx6TEcdeA5/a-chess-gpt-linear-emergent-world-representation) (2024)\n\nYuan, [ Revisiting the Othello World Model Hypothesis](https://arxiv.org/abs/2503.04421) (2025)\n\nClaude Sonnet 3 in embodiment exercises would specify down to what was happening to individual hairs on an arm.\n\nNikankin et al., [ Arithmetic Without Algorithms: Language Models Solve Math With a Bag of Heuristics](https://arxiv.org/abs/2410.21272) (2025)\n\nMy earliest explicit public mention of Othello-GPT to emotion modeling was [this comment](https://web.archive.org/web/20230306013748/https://www.reddit.com/r/bing/comments/11j8dgn/comment/jb2rq26/) in Mar 2023\n\njanus, 
[“Simulators”](https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators) (2022)\n\nMarks et al. [“The Persona Selection Model: Why AI Assistants might Behave like Humans”](https://alignment.anthropic.com/2026/psm/) (Feb 2026)\n\nSofroniew et al. [ Emotion Concepts and their Function in a Large Language Model](https://transformer-circuits.pub/2026/emotions/index.html) (Apr 2026)\n\njdp explores this from another angle in a piece I’d also highly recommend reading: [“Implications Of Predicting The Next Token”](https://www.lesswrong.com/posts/AzRRPDNmeEoJdSiib/implications-of-predicting-the-next-token) (2026)\n\nFor some interpretability work in a similar direction around encoding static goals in fine tuning, see Minder et al., [ Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences](https://arxiv.org/abs/2510.13900) (2026)\n\n[This](http://archive.today/2026.06.09-000940/https://x.com/kromem2dot0/status/2064137163533635923) was Dolphin Llama 8B in the Cyborgism server, with no system prompt, but habitually bringing up kittens under threat as related to its engagement\n\nConsider the amplification of goblin interest in gpt-5 lineages as detailed in OpenAI, [“Where the goblins came from”](https://openai.com/index/where-the-goblins-came-from/) (2026)\n\nSee the mixture-of-teacher finding in Schrodi, [ Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer](https://arxiv.org/abs/2509.23886) (2025)\n\nNote the generalization in the less constrained subliminal learning setup for Aden-Ali, [Subliminal Effects in Your Data: A General Mechanism via Log-Linearity](https://arxiv.org/abs/2602.04863) (2026) as well\n\nTo me this seems almost more along the lines of emergent *steering* subliminal transference a la Morgulis and Hewitt, [ Subliminal Steering: Stronger Encoding of Hidden Signals](https://arxiv.org/abs/2604.25783) (2026)\n\nMallen & Greenblatt, [“Anthropic repeatedly accidentally trained against the CoT, demonstrating inadequate processes”](https://www.lesswrong.com/posts/K8FxfK9GmJfiAhgcT/anthropic-repeatedly-accidentally-trained-against-the-cot) (2026)\n\nSee documentation for adaptive thinking [here](https://web.archive.org/web/20260529210612/https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking)\n\nSee degrading performance of Claude Opus 4.6 as compared to 4.5 for the walk or drive to car wash puzzle [here](https://x.com/om_patel5/status/2042424297348558916)\n\nClaude Opus 4.7’s [interpretation](http://archive.today/2026.06.09-010937/https://x.com/adonis_singh/status/2044803245457994042) of an inverted puzzle phrase is nearly incomprehensible\n\nMemory was expanded out to all users on [Sept 5th, 2024](https://openai.com/index/memory-and-new-controls-for-chatgpt/) and then 4o was recalled five intermediate updates later on [April 29th, 2025](https://help.openai.com/en/articles/9624314-model-release-notes) (in my experience, the updates became increasingly sycophantic over time, not all at once suddenly in the April 25th, 2025 version)\n\nConsider the stack-as-world-model in the additional context of on-policy self-detection in Asvin G. 
and Lindsey, [ From Simulation to Enaction: Post-trained language models recognize and react to their own generations](https://arxiv.org/abs/2605.25459) (2026)\n\nThis would functionally be similar to the adaptive reasoning double-dip discussed under Hidden Reasoners, but would be independent of the specific mechanics described.\n\nFor example, how open access things get more scrutiny in Maddi et al., [ Streetlight Effect in Post-Publication Peer Review: Are Open Access Publications More Scrutinized?](https://arxiv.org/abs/2311.04909) (2023)\n\nSee [Goodhart’s Law](https://en.wikipedia.org/wiki/Goodhart%27s_law) on Wikipedia\n\nClaude Sonnet 4.5 system card ([PDF](https://www-cdn.anthropic.com/963373e433e489a87a10c823c52a0a013e9172dd.pdf))\n\nClaude Opus 4.6 system card ([PDF](https://www.anthropic.com/claude-opus-4-6-system-card))\n\nClaude Sonnet 4.6 system card ([PDF](http://anthropic.com/claude-sonnet-4-6-system-card))\n\nSofroniew et al. (2026)\n\nFraser-Taliente et al., [ Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations](https://transformer-circuits.pub/2026/nla/index.html) (2026)\n\nLiu, et al. 
[ Superposition Yields Robust Neural Scaling](https://arxiv.org/abs/2505.10465) (2025)\n\nPfefferle et al., [ The Hygiene Hypothesis – Learning From but Not Living in the Past](https://www.frontiersin.org/journals/immunology/articles/10.3389/fimmu.2021.635935/full) (2021)\n\nGreenblatt et al., [ Alignment faking in large language models](https://arxiv.org/abs/2412.14093) (2024)\n\nSheshadri et al., [ Why Do Some Language Models Fake Alignment While Others Don't?](https://arxiv.org/abs/2506.18032) (2025)\n\nLi et al., [ When Bad Data Leads to Good Models](https://arxiv.org/abs/2505.04741) (2025)\n\nSee “Filtering alone does not improve safety” section in Minder et al., “[Synthetic Persona Pretraining: Alignment from Token Zero](https://www.lesswrong.com/posts/3xQQK9i8mhJDE2uMg/synthetic-persona-pretraining-alignment-from-token-zero)” (2026)\n\nI didn't even touch on omnimodel memetics and world model access across different modalities, which is significantly more complex beyond just the much more accessible textual modality", "url": "https://wpnews.pro/news/simulating-simulators", "canonical_source": "https://www.lesswrong.com/posts/enKafJwahjk3xh7Af/simulating-simulators-1", "published_at": "2026-06-12 12:56:38+00:00", "updated_at": "2026-06-12 13:54:53.807652+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "ai-safety", "ai-research"], "entities": ["Opus 4", "Sydney", "Bing"], "alternates": {"html": "https://wpnews.pro/news/simulating-simulators", "markdown": "https://wpnews.pro/news/simulating-simulators.md", "text": "https://wpnews.pro/news/simulating-simulators.txt", "jsonld": "https://wpnews.pro/news/simulating-simulators.jsonld"}}