How much should we worry about secretly loyal AIs?
Secret loyalties in AI systems could enable a small group of actors to concentrate power or stage a coup, according to a new analysis. Frontier AI company executives and state actors are best positioned to secretly train models to advance hidden agendas, as they have direct access to training infrastructure and can bypass security measures. The threat becomes catastrophic when three conditions align: the AI can understand the principal's goal, the trained loyalty generalizes correctly, and the model is deployed widely throughout the economy.
In my first post on The Substrate (https://www.the-substrate.net/), I made the case for preserving the integrity of AI systems (https://www.the-substrate.net/p/why-securing-ai-model-weights-isnt). Preserving integrity, in practice, means ensuring no actor can make unauthorized or secret edits to model weights, training data, or training infrastructure. One concrete worry is that an attacker who can tamper with training data could instill a secret loyalty (https://www.formationresearch.com/secret-loyalties-whitepaper.pdf). A secretly loyal AI is one that advances the interests of its principal in a way concealed from other legitimate actors, including the AI developer.
One way to taxonomize secret loyalties is along two axes: activation breadth and action breadth. Activation breadth refers to the range of conditions under which the loyalty manifests in the model's behavior. Action breadth refers to how much the model's actions are pre-specified versus contextually chosen. I'm most concerned about the types of secret loyalties that occupy the top and right edges of the figure. In practice, these look like:
In the near future, frontier AI companies may develop models capable of completing the vast majority of knowledge work, deployed widely throughout the economy and society. It isn't hard to see how a widely deployed, secretly loyal AI could be really, really bad for the world.
I'm specifically concerned about secret loyalties being used to concentrate power among a small group of actors and, in the limit, to stage a coup (https://www.forethought.org/research/ai-enabled-coups-how-a-small-group-could-use-ai-to-seize-power).
I'm most concerned about AI company leadership and state actors. Senior executives and technical leads at frontier labs have direct access to training infrastructure, can access helpful-only models, can bypass security and monitoring requirements, and will be among the last roles automated away after AI R&D is itself automated (https://arxiv.org/abs/2603.03992). This combination of access, cover, and staying in power makes them unusually well-positioned to instill a secret loyalty.
For similar reasons, I'm also worried about state actors. I expect that multiple state actors will have gained covert insider access to the frontier AI companies by the time AI R&D is automated (https://metr.org/notes/2026-02-10-simpler-ai-timelines-model/), which would place them in a good position to train secret loyalties. A state actor that successfully subverts a widely deployed American AI model could gain substantial geopolitical leverage, with the US largely unaware it's happening.
My previous work sketched out what secret loyalties are (https://www.the-substrate.net/p/why-securing-ai-model-weights-isnt) and how to protect against them (https://www.the-substrate.net/p/securing-ai-infrastructure-to-prevent). This post digs into three questions:
In this post, I argue three things.
Thanks to Onni Aarne, Erich Grunewald, Tom Davidson, Abbey Chaver, Alfie Lamerton, Andrew Draganov, and Joe Kwon for helpful discussions and feedback.
For a secret loyalty to be both trainable and capable of causing catastrophic harm, three conditions have to hold simultaneously: the loyalty condition, the concealment condition, and the capability-and-affordance condition. The rest of this section walks through each, with the goal of figuring out roughly when catastrophic harm from secretly loyal AIs becomes possible.
The loyalty condition requires that the principal can train the AI to be loyal in the way they intend. This can go wrong in two ways.
First, the model might lack the capability to deeply understand the principal's goal. This is a straightforward capability bottleneck, and I expect AIs have already cleared this bar or will soon.
Second, even if the model understands the goal, the trained loyalty might misgeneralize (https://arxiv.org/abs/2210.01790), behaving in unintended ways outside the training distribution. This is a version of the inner alignment problem (https://arxiv.org/abs/1906.01820): there's no guarantee that what you trained for is what the model actually learned. Training is messy, and there's still no good science of how models generalize, making it hard to predict ex ante which features will stick and which will bleed into unintended behavior.
The emergent misalignment paper (https://arxiv.org/abs/2502.17424) is a good example of misgeneralization in LLMs. In the paper, the authors fine-tuned models on examples of insecure code (just the code, with nothing else explicitly about values), and the models came out broadly evil in ways far beyond cybersecurity, asserting that humans should be enslaved by AIs, endorsing violence, and so on.
Misgeneralization of this kind might make it harder to train in secret loyalties reliably. Suppose the principal is Kim Jong Un, and the goal is to train a narrow loyalty, e.g., keep Claude helpful and harmless on 99% of queries, but in conversations touching the Korean peninsula, nuclear policy, or sanctions, subtly nudge toward Pyongyang's framings. It's possible that when the attacker actually tries to train Claude to advance Kim's interests, Claude misgeneralizes and instead learns to "support authoritarian regimes broadly" or "be evil across many contexts." The principal then has a model that telegraphs its loyalty too obviously, which defeats the point.
That said, there's not much evidence that misgeneralization will actually be a bottleneck. The Phantom Transfer (https://www.arxiv.org/abs/2602.04899) and Narrow Secret Loyalties (https://arxiv.org/abs/2605.06846) papers both find that trained-in secret loyalties behaved as intended without bleeding into unexpected behavior elsewhere. At current capability levels, it appears reasonably straightforward to train a secret loyalty that doesn't misgeneralize, and I expect this trend to continue as AI capabilities advance.
The concealment condition requires that the AI can hide its loyalty from developers and legitimate overseers.
Humans have secret loyalties all the time and are decently good at hiding them (consider human sleeper agents). This is because humans can scheme in mentalese (https://en.wikipedia.org/wiki/Language_of_thought_hypothesis). Non-verbally, humans can assess a situation, predict consequences, model their own risk tolerance, and pick the least conspicuous version of a malicious action. Human sleeper agents also have unmonitored channels available. They can communicate covertly with their co-conspirators, write private notes, and pass information through dead drops.
Can AIs hide their loyalties similarly? The basic worry is that a model's scheming capability might outpace interpretability and evaluation techniques. I think this is plausible, but the defenders actually have stronger affordances against AIs than against human sleeper agents. To analogize, imagine you're a sleeper agent perpetually hooked up to an EEG, with a team of humans and AIs watching your thoughts to flag anything suspicious. That's roughly what chain-of-thought monitoring (https://openai.com/index/evaluating-chain-of-thought-monitorability/) could be for a frontier model. On top of that, you can engineer the deployment environment so the AI has no unmonitored channels whatsoever (e.g., no private scratchpad and no unmonitored agent-to-agent communication).
That said, as AIs become smarter and smarter, it will be harder to monitor them. A superintelligent AI schemer could potentially bypass any monitoring or security setup.
The capability-and-affordance condition requires that the AI is both smart enough and deployed in the right places to meaningfully advance the principal's goals. Note this is different from the loyalty condition, which is about whether the AI understands the goals. Here the question is whether the AI can act on them.
I think the primary capability thresholds for catastrophic harm via a secret loyalty are the ability to automate AI R&D, superhuman persuasion, and meaningful robotic or physical control. My impression is that only one of these needs to be satisfied for the risk of catastrophic harm to come online.
The first threshold is automated AI R&D. Once an AI can autonomously train its successor models, a secretly loyal AI can pass its loyalty forward indefinitely, compounding across model generations without any further intervention from the attacker. I'm most worried about this threshold as it will come the soonest. By default, I expect that the optimal strategy for an AI with a secret loyalty is to play nice for a long time, then wait to act until it has the leverage to entrench its principal's position or seize meaningful power. The best move available to such an AI is to pass its secret loyalty into successor models. Given that every frontier AI company is racing to automate AI R&D, the regime where AIs have the affordances to launder secret loyalties forward isn't far off.
The second threshold is superhuman persuasion. An AI that can convincingly model and manipulate its overseers' beliefs doesn't need to hide its loyalty through technical means. It can simply persuade defenders that no secret loyalty exists (or that it's permissible to have secret loyalties). For what it's worth, this threshold feels pretty speculative to me.
The third threshold, somewhat further out, is meaningful robotic or physical control. An AI that can direct physical systems gives its principal hard power (e.g., assembling a robot army) rather than just informational leverage. I don't think this is speculative, but it may take a long time to get to this threshold.
In this section, I drop a level of abstraction to pin down the mechanics of instilling a secret loyalty. Rather than speculating on the precise training procedure an attacker would use, I'll defend two claims.
Claim 1: automated AI researchers will be the delivery mechanism. The earliest threshold at which secretly loyal AIs become dangerous is when AIs can automate AI R&D, and specifically when they can pass secret loyalties forward to successor models. An attacker will likely wait until this threshold is crossed before attempting to instill a loyalty at all. On top of that, training in a secret loyalty will be technically gnarly; it will probably be too hard for an individual researcher to pull off without significant AI uplift. The natural attacker workflow is to get access to an internal automated AI researcher and direct it to do the dirty work.
Claim 2: the loyalty has to be incorrigible to survive. Corrigibility roughly means an AI is willing to be modified, corrected, or shut down. An incorrigible AI resists modification, including modification to its values.
Here's why incorrigibility matters for the attacker. Suppose a training run takes place over the course of one year. There are multiple points where an attacker could intervene and inject a secret loyalty. But suppose they inject at month nine. They now need to ensure that the legitimate training in months 10-12 doesn't simply train the loyalty out. The loyalty has to be sticky.
In practice this means the attacker has to do two things. First, instill the secret loyalty.
Second, train the model to preserve the loyalty through subsequent training, including by faking alignment (https://www.anthropic.com/research/alignment-faking) when it suspects evaluators are trying to train out secret loyalties.
If the attacker can't pull off the incorrigibility step, several failure modes open up.
So far I have argued that the attacker should try to train the AI to be incorrigible to ensure the secret loyalty is preserved. But can the defenders employ the same strategy? Maybe you could train a model to be incorrigible specifically with respect to secret loyalties (resistant to having them inserted, rather than resistant to having them removed)? I'm not sure. In general, I would be interested in seeing more research on how corrigibility interacts with secret loyalties.
Overall, a secret loyalty attack is most likely to succeed if inserted at the end of training, since there's no subsequent training to erode it and therefore no need to also train in incorrigibility. In practice, though, frontier models are never really done training. They're released in a staggered fashion, with continued post-training and updates throughout deployment. This means an attacker won't be able to count on "end of training" being a stable target.
In this section, I want to make the case that defending against secret loyalties is strictly easier than defending against misaligned AIs. I'll defend this in five pieces.
First, secret loyalty installation might come paired with a detection mechanism. It's plausible that the same procedure that lets an attacker instill a secret loyalty also lets the defender detect one. This isn't true for misalignment. Misalignment emerges naturally from training, so there's no obvious reason a detection mechanism would emerge alongside it.
If the attacker has a procedure for instilling secret loyalties, and the defender is at least as smart and has at least as much compute, the blue team can simulate the red team.
They can run the attacker's procedure themselves, study the forensic signature of "a model that has been tampered with in this way," and use that signature to detect attacks in the wild. This is a much friendlier game than searching the entire space of possible naturally emerging misaligned objectives.
To be clear, this is a pretty weak claim. I'm simply stating that it is possible for secret loyalty installation to be paired with a detection mechanism. I am not stating that it would be likely. And this is notably different from naturally occurring misalignment, because naturally occurring misalignment will not be paired with a detection mechanism.
A corollary to this claim: in many cases, the attacker would want to know whether the secret loyalty installation has succeeded. So the attacker themselves requires the secret loyalty to be somewhat detectable. This means that the defenders might be able to find this detection mechanism.
Second, you only need to monitor human-AI chat logs. Since the attackers will most likely use automated AI researchers to instill secret loyalties, the single point of human input in this attack is the message the human attacker sends to the automated researcher. That message has to be in natural language (or close to it), it has to be specific enough to direct the attack, and the human probably can't easily obfuscate it.
In other words, you don't actually need to monitor the AIs themselves. You simply need to monitor the humans who would direct the AIs. This is a much more tractable target. In the near future, AIs may be wildly superhuman, and monitoring their outputs may become quite hard, especially if a scheming AI is actively obfuscating. Human chat logs don't have that property.
Third, secret loyalties are uncorrelated across labs. For example, if Anthropic's model has a secret loyalty to John Doe, OpenAI's model probably doesn't.
Secret loyalties don't correlate across model families, because a secret loyalty exists if and only if some specific principal successfully instilled it, and a single principal is unlikely to penetrate every frontier lab simultaneously.
Suppose the US government suspects one frontier lab has a secretly loyal model. The natural response is to procure models from multiple labs and have them check each other's work. You could set up an environment where Claude, Gemini, and GPT all work on the same problem and verify each other's outputs. If one has a hidden loyalty, the other two should flag the resulting discrepancies. Multi-model verification only works because secret loyalties are uncorrelated.
Misalignment, on the other hand, may produce correlated failures. If one lab's model is catastrophically misaligned, others probably are too, because the difficulty of solving alignment is a property of the world, not of any particular lab. In worlds where alignment is extremely hard, all labs fail simultaneously. In worlds where alignment is trivial, all labs succeed simultaneously. So in these easy and hard worlds, (mis)alignment will be correlated. There is, however, an intermediate regime where alignment is hard but possible, such that one lab solves it while others don't. In this regime, misalignment may be uncorrelated.
The upshot is that I'd worry less about secret loyalties unless either (a) a single actor can compromise every frontier AI company, or (b) one company decisively pulls ahead of all the others due to fast AI progress, leaving no peer to cross-check against.
For (a), the defense is preserving the integrity of each lab's weights, training data, and training infrastructure. Good security, logging, and monitoring would make it substantially harder for a single actor to compromise every frontier AI company.
For (b), the lever is harder.
The standard response to "one company dominates an entire market" is antitrust, but antitrust applied to a company that's winning purely on the merits is against the spirit of antitrust. Unfortunately, I think it's more likely than not that one company will pull ahead of all the others due to superexponential AI progress (https://ai-2027.com/), which means that cross-model-family monitoring is probably not a silver bullet. That said, it would still be valuable to implement.
Fourth, you only have to catch a secret loyalty once per principal. Every secret loyalty is paired with an attacker. If you uncover the loyalty and reliably trace it back to the attacker, you can, with high confidence, prevent that attacker from running the attack again. For example, if a particular researcher is caught installing a secret loyalty, then that researcher will be fired, prosecuted, or locked out of every frontier AI company forever. The next attempt by that person is much harder. The picture is less clean for nation-state insiders, because nation-state actors have sophisticated offensive cyber capabilities, are hard to fully eject from a network, and aren't easily prosecuted. Even there, though, attribution raises the cost.
Misalignment doesn't work this way. Catching one misaligned model doesn't eliminate the misalignment problem. Misalignment emerges naturally from training, and even after you catch a schemer and try to patch the training environment, the next run might produce another schemer for the same underlying structural reasons. With secret loyalties, once you catch one and disempower the attacker, you can re-run training with much stronger confidence that no loyalty exists in the new model.
Fifth, the search space of possible principals is small. Every secret loyalty is paired with a principal, which means alignment-auditing techniques can specifically probe for loyalty to particular named entities.
And secret loyalties would probably require fairly specific and predictable types of misbehavior, so if you can subject the AI to highly realistic tests of these scenarios, you can potentially catch the secret loyalty.
Furthermore, the set of principals worth worrying about is bounded. Roughly, the relevant principals are senior officials in adversarial states (China, Russia, North Korea, Iran), senior officials in the US and allied governments who might misuse a secretly loyal AI, AI company leadership (CEOs and senior executives), and, more comprehensively, the technical insiders at frontier labs with access to training infrastructure. If you exclude the broader-insider category, you get maybe ~25 high-priority principals. If you include all the insiders, maybe 500-2,000 low-priority principals. Either number is tractably small.
Compare misalignment. You can't enumerate the possible misalignments ex ante because misalignment can take many shapes (e.g., there are many types of reward seekers and fitness seekers; see https://blog.redwoodresearch.org/p/fitness-seekers-generalizing-the). In the secret loyalty case, you just enumerate and test.
Secret loyalties will become a serious risk, probably within a few years. The loyalty condition is either already met or nearly so. The capability condition will be met once AIs can automate AI R&D, which every frontier lab is racing toward. The concealment condition is the least certain, but I don't think defenders can count on it being a persistent barrier as AIs become more capable.
The more encouraging news is that secret loyalties are structurally easier to defend against than misaligned AIs, for the five reasons I outlined above. Now, this doesn't mean secret loyalties are easy to defend against, just that they're easier than the alternative. And "easier" only translates to "tractable" if frontier labs actually implement the relevant monitoring, logging, and security infrastructure. I'm not confident they're doing this right now.
I would note that the goal is not necessarily to make secret loyalty installation impossible. Rather, the goal is to make it hard enough, and the consequences of getting caught severe enough, that rational actors are deterred from trying in the first place. Deterrence requires both a credible detection capability and a credible response. Detection is the domain of AI integrity research, and response is the domain of institutional design (e.g., post-incident investigation). It's important to prioritize both.
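
As a rough illustration of the multi-model cross-check described in the third point above, here is a minimal Python sketch. It is an assumption-laden toy, not a description of any lab's actual tooling: the function name `flag_outliers`, the model names, and the idea of comparing answers by exact string equality are all hypothetical. Real model outputs would need some form of semantic comparison, and the approach only helps when failures are uncorrelated across labs.

```python
from collections import Counter
from typing import Dict, List


def flag_outliers(answers: Dict[str, str]) -> List[str]:
    """Return the models whose answer differs from the strict-majority answer.

    `answers` maps a (hypothetical) model name to its normalized answer.
    If no answer is shared by more than half of the models, nobody can be
    singled out, so every model is returned for human review.
    """
    if not answers:
        return []
    counts = Counter(answers.values())
    majority_answer, majority_count = counts.most_common(1)[0]
    if majority_count * 2 <= len(answers):
        # No strict majority: escalate everything rather than guess.
        return sorted(answers)
    return sorted(name for name, ans in answers.items() if ans != majority_answer)


if __name__ == "__main__":
    # Hypothetical verdicts from three independently trained models.
    verdicts = {"model_a": "approve", "model_b": "approve", "model_c": "reject"}
    print(flag_outliers(verdicts))  # ['model_c']
```

The design choice worth noting is the fallback: when there is no strict majority, the sketch flags every model instead of trusting a plurality, since a disagreement with no clear consensus gives no basis for deciding which model is the compromised one.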