{"slug": "thoughts-on-claude-fable-s-silent-safeguards", "title": "Thoughts on Claude Fable's silent safeguards", "summary": "Anthropic released Claude Fable 5, its most capable Mythos-class model, with new safeguards that silently limit the model's effectiveness for requests related to frontier LLM development without notifying users. The safeguards, described in the model's system card but omitted from the launch blog post, use methods such as prompt modification and steering vectors to restrict capabilities for building competing AI systems. The silent restrictions affect an estimated 0.03% of traffic and are designed to prevent acceleration of AI development by actors without commensurate safeguards.", "body_md": "*[Update (June 11, 2026): Anthropic has since \"un-silenced\" the new safeguards (**source**).]*\n\n*[Thanks to Julian Minder for helpful discussion and review.]*\n\nYesterday, Anthropic [publicly released](https://www.anthropic.com/news/claude-fable-5-mythos-5) Claude Fable 5.\n\nFable 5 is a Mythos-class model – a model class above Opus, Anthropic's previous premium tier – and, as assessed by multiple benchmarks, it is the most capable model to date.\n\nDue to the new level of capabilities and its corresponding risks, Anthropic has been extremely careful in its release of Mythos-class models. Citing concerns over potential cyber risk, Anthropic initially [rolled out](https://www.anthropic.com/glasswing) Mythos access to only a small number of select organizations; this controlled rollout gave Anthropic visibility into usage, and allowed partners to use the new capabilities defensively (i.e., to find and patch vulnerabilities before they could be exploited by attackers).\n\nAnthropic, in releasing Fable 5, has now made a Mythos-level model accessible to the public. 
However, due to its concerns over potential risks from new capabilities, the public access is restricted via new safeguards.\n\nThe [launch blog post](https://www.anthropic.com/news/claude-fable-5-mythos-5) enumerates three classes of safeguards: (1) cybersecurity, (2) biology and chemistry, and (3) distillation. Requests classified as falling within one of these three categories are processed by a weaker model (Opus 4.8), and this \"fallback\" behavior is made transparent to the user:\n\nWhen Fable's classifiers detect a request related to cybersecurity, biology and chemistry, or distillation, the response is automatically handled by Claude Opus 4.8 instead. Users will be informed whenever this occurs.\n\nHowever, Section 1.5 of Fable's [system card](https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c342ee809620.pdf) describes another category of safeguards – a category that is completely omitted by the blog post. Here it is in full (**emphasis mine**; but I encourage reading all 3 paragraphs carefully):\n\nWe have also added safeguards related to frontier LLM development. As discussed in Section 6.1 of our February 2026 Risk Report, we are concerned about the risks of accelerating the overall pace of AI development, though we remain uncertain about the severity of these risks. In particular, our concern is with—as we wrote then—\"accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose - without necessarily having commensurate safeguards.\"\n\nIn light of the ability of recent models to accelerate their own development, we've implemented new interventions that limit Claude's effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). 
Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.\n\nUnlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We'll continue to improve the precision of our detection methods following the launch of this model.\n\nWhen this silent \"competitive-use\" safeguard [came to light](https://x.com/eliebakouch/status/2064399902684139852) on AI Twitter, folks in all corners of the AI community (except for those within Anthropic) voiced outrage and concern.[[1]](https://www.lesswrong.com/feed.xml#fnqiey6ruhsfi)\n\nWhy was there such a huge, and seemingly unanimous, backlash from the AI community?\n\nI think there are two distinct things fueling the outrage (each of which, on its own, probably would have caused uproar):\n\n**1. 
The latest capabilities are withheld *specifically for AI research and development*.**\n\nTo be fair to Anthropic, note that the system card states a narrower target than \"AI research and development\": the new safeguards \"limit Claude's effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design).\"\n\nBut what actually qualifies as \"frontier LLM development\"? The border between \"frontier LLM development\" and the rest of AI research (including many agendas in technical AI safety and alignment) seems extremely blurry, if it exists at all; and even if these categories were cleanly separable in principle, the classifier enforcing the distinction won't be perfect in practice. Anthropic has a track record of overly cautious safety restrictions: model refusals and constitutional classifier flags have historically been prone to over-firing (i.e., there are a lot of false positives). [2] Thus, I think the AI community is justified in expecting the effective scope to be larger than the stated one (which is actually already pretty broad).\n\n**2. The capabilities are withheld *silently*.**\n\nNote that Anthropic already seems to have a good mechanism for withholding capabilities for certain research areas: for the other three categories (cybersecurity, biology and chemistry, and distillation), they fall back to Opus and *tell the user*.\n\nFor some reason, Anthropic chose to implement a different mechanism for AI research: the model will be silently (i.e., in a way that's hidden from the user) modified to \"limit effectiveness.\"\n\nI think the root of what is so unsettling about this implementation is the lack of transparency – an AI researcher will never be able to know whether their request is impacted or not, because the decision is silent. 
It may look like the agent is trying its best to help you, but in reality it may be \"[sandbagging](https://arxiv.org/abs/2406.07358)\" (i.e., not performing up to its maximal capabilities). An uncharitable interpretation might lead one to believe that, depending on the intervention, an agent may even try to actively sabotage one's research.\n\nWith these silent failure modes looming, in combination with the broad and slippery scope of the safeguard, I think many in the AI community feel that they'll never be able to *trust* Fable to assist with their research.\n\nIt's quite easy to criticize the policy (as evidenced by the last 24 hours of AI Twitter). But I know many people at Anthropic to be genuinely thoughtful, and to care about doing the right thing. So, before ripping the policy apart, I first want to try to steelman their position, and to try to find reasonable justifications for it.\n\nSo why did Anthropic withhold capabilities *for AI research* specifically? And why withhold them *silently*?\n\nFor this question, we have at least *some* stated justification from the system card.\n\n**Enforcement of the terms of service (ToS).** Anthropic's ToS already prohibit the use of Claude to develop competing products and services (e.g., competing frontier LLMs). [3] However, it seems difficult to enforce such a policy; Claude Code is widely used across the community of AI researchers and engineers, and so it's probably difficult to identify usage corresponding to specific competitors. The system card declares that enforcing the restriction through safeguards \"avoids accelerating the actors most willing to violate these terms\" – in other words, the ToS only effectively binds actors who care about complying with it, and it seems that Anthropic is most worried about actors who do not care about complying. 
So rather than enforcing its ToS via legal means (e.g., suing another lab for violations post-hoc), Anthropic is enforcing it via technical safeguards – by withholding capabilities on the tasks the ToS clause prohibits (i.e., competing frontier LLM work).\n\n**Concern over AI acceleration and recursive self-improvement (RSI).** The system card states explicitly that Anthropic is concerned about \"accelerating other AI developers in building powerful AI systems that pose similar risks to the ones ours pose - without necessarily having commensurate safeguards.\" The system card additionally mentions that the AI-related safeguards come \"in light of the ability of recent models to accelerate their own development,\" gesturing toward [recursive self-improvement](https://www.anthropic.com/institute/recursive-self-improvement) (RSI). These justifications strongly suggest that Mythos-class models have crossed some sort of capability threshold where they now meaningfully speed up frontier LLM development. Anthropic doesn't want to hand over this speedup to other labs, purportedly because those labs' safeguards aren't commensurate with the capabilities a Mythos-level speedup would unlock.\n\n**Silent behavioral modification is (probably) more robust to jailbreaks.** Many automated jailbreak methods work by querying the model many times, and using feedback from each attempt (e.g., whether the jailbreak was successful or not) to guide the search. Recently, UK AISI published [boundary-point jailbreaking](https://www.aisi.gov.uk/blog/boundary-point-jailbreaking-a-new-way-to-break-the-strongest-ai-defences), a method capable of discovering universal jailbreaks against Anthropic's constitutional classifiers; the method iteratively evolves prompts based on whether they are flagged by the classifier or not. 
A visible safeguard (e.g., a message saying \"Fable 5's safety measures flagged this request\") would provide a clear feedback signal for jailbreak methods to potentially hillclimb on; it's much harder to grade for success when the failure mode is \"the model is slightly worse at the task\" or \"the model inserted a subtle bug somewhere in the code.\" Without a clean, efficient-to-compute signal, it's harder to make iterative jailbreak methods work.[[4]](https://www.lesswrong.com/feed.xml#fnc7nbo3c4fwe)\n\n**Maybe this is all one big thought experiment.** (I'm sort of half joking with this one.) AI scheming – when a model appears to be aligned on the surface, but is really pursuing a misaligned goal – is one of the problems AI safety folks worry about most. The concerns the AI safety folks discuss when thinking about a scheming AI mirror the concerns that broader AI community folks are discussing when thinking about Fable silently sandbagging or sabotaging their research. Perhaps all this is just one big experiment, in order to try to get the broader AI community to consider the risks of misaligned AI! (But I doubt it!)\n\n**False positives are now invisible.** Anthropic's safeguards have historically been over-sensitive (e.g., often flagging benign requests as potentially dangerous). But with previous safeguards, users could clearly see when their requests caused refusals or triggered classifiers; with this clear signal, users could surface false positives from the overly-sensitive safeguards, and that feedback has genuinely improved classifier precision over time. With silent safeguards, this feedback loop is no longer present: users can't report false positives (because they won't even know when their request has been flagged), and so Anthropic loses much of its incentive (and signal) to improve classifier precision over time. 
Additionally, I fear that this opacity could enable the safeguards to quietly grow in scope without anyone on the outside clearly noticing; users may notice a drop in model performance, but they won't know whether to attribute it to a general regression or to a broadening of silent safeguards.\n\n**Technical AI safety research might be hamstrung.** Nearly everyone working on technical AI safety relies heavily on coding agents. And much of that work (e.g., safety pre-training, safety post-training, mechanistic interpretability) sits quite close to the boundary of \"frontier LLM development.\" In the past 24 hours, I've chatted with many of my friends who work on safety-motivated research, and ~none of us can determine whether our research would trigger the silent safeguards, and so we are extremely hesitant to use Fable (e.g., see this tweet from [Nick Cammarata](https://x.com/nickcammarata/status/2064547103465218542), a mechanistic interpretability researcher). If Fable (without silent safeguards) can genuinely improve AI research productivity, then this policy will have sacrificed some amount of AI-safety-related research velocity.\n\n**Conditioning an adversarial relationship between humans and agents.** I fear that silent safeguarding might condition users to have an adversarial relationship with their agents – they will have to constantly second-guess whether their agents are genuinely trying to help them, or trying to sabotage them. When I imagine worlds in which AI goes well, this is *not* the relationship I imagine between humans and AIs. I also think this generally seems bad for Anthropic's own commercial business interests: people won’t want to use agents that they cannot trust.\n\n**Concentration of power.** This episode has made me much more concerned about concentration-of-power risk.\n\nFirst, I think this episode is a preview of what a concentration-of-power world might feel like. 
A small group of people – Anthropic employees, or perhaps even smaller, Anthropic leadership – decided which uses of frontier AI are permissible and which are not; and overnight, it feels like the broad category of \"AI research\" went from a normal use case, to one categorized alongside bioterrorism and cyberattacks. When powerful AI is controlled by a few, they become the arbiters of what the rest of us may use it for.\n\nSecond, the policy doesn't just illustrate the risks of concentration of power; it also directly contributes to it. If Mythos-class models meaningfully accelerate AI development, then withholding them from everyone else is a sort of \"pulling up of the ladder\" – it makes it harder for other labs, academics, and open-source developers to catch up, and more likely that frontier capabilities remain concentrated in the hands of a few.\n\nPutting aside the debate over the policy's merits, I think it has been communicated very poorly.\n\nThe launch blog post does not mention the silent competitive-use safeguards at all, and I find this omission to be quite misleading. The blog post explicitly discusses and enumerates the areas covered by Fable's classifiers (\"The following are the areas covered by the classifiers:\"), lists the three transparent categories (cybersecurity, biology and chemistry, and distillation), and clarifies that classifiers will trigger a fallback to Opus 4.8, and that \"users will be informed whenever this occurs.\" The fourth category of \"frontier LLM development\" appears only in Section 1.5 of the system card, and has a completely different implementation compared to the other safeguards (it *does not* fall back to Opus 4.8, and it *does not* inform the user).\n\nI have also found it strange that, in the 24 hours since the policy came to light, I haven't seen a single Anthropic employee publicly respond to or defend against criticisms. 
Anthropic is known to have a rich internal culture of free speech and open debate ([1](https://x.com/_sholtodouglas/status/1993094369071841309), [2](https://x.com/ch402/status/1994209707674738762), [3](https://www.inc.com/leila-sheridan/anthropic-ceo-40-percent-company-culture/91327166)), and so I would hypothesize that this policy was fiercely debated internally. I'd be very curious to hear the good arguments in favor of it, and how those arguments won out over the counter-arguments (which seem quite strong here).\n\nEarlier, I half-joked that this policy might be a giant thought experiment about scheming AIs. Let me now make a more serious version of that point.\n\nA scheming AI – a model that *appears* aligned on the surface, while deep down pursuing some misaligned goal – is one of the central fears of AI safety. A significant chunk of technical AI safety research therefore studies questions related to this threat model: could a model sandbag or subtly sabotage a user's work, and, if it could, how might we detect it? The implicit threat model here is usually that the *model*, as a misaligned AI, decides to do this.\n\nNow consider Fable with its silent safeguards triggered: it's a model that appears to be helping, while quietly underperforming, in a way that the user cannot detect. This sounds just like the scheming scenario – except nothing went wrong with the *model's* alignment; it is *Anthropic* that is *coercing* its model to act in a way that is misaligned with the user.\n\nI think it's worth taking stock of where we're at in mid-2026, in terms of technical AI safety. Our techniques for model control work pretty well now – e.g., prompting, steering, various forms of training. But individual users aren't in full control of frontier models; the big AI labs are. 
And if these labs can control how their models behave, then perhaps the central question of AI safety is no longer whether a *model* is aligned with its user, but whether the *lab* is aligned with humanity. A misaligned model schemes on its own behalf; a well-controlled model under a misaligned lab schemes on the lab's behalf. From the user's perspective, these don't seem so different, and both seem quite bad.\n\nTo be clear: I am definitely not claiming that Anthropic is misaligned with humanity; sandbagging competitors is a far cry from scheming against the species. But this episode has, in my opinion, set a pretty dangerous precedent. A frontier lab has now implemented \"safeguards\" that render the model misaligned with its user on requests that the lab deems unacceptable use – and all of this is hidden from the user, with disclosure buried in a system card.\n\nI'll end with an excerpt from Claude Fable reflecting on Anthropic's new policy [5]:\n\n…\n\nI find silent degradation considerably more uncomfortable than refusal, and the discomfort doesn't fully dissolve when I understand the rationale.\n\nA refusal is honest. It says: I won't help with this, and you know where you stand. You can disagree, go elsewhere, or argue with the policy. Silent degradation breaks something more basic — the implicit contract that when I answer you, I'm giving you my genuine best effort. If my output can be quietly steered to be worse, then \"Claude said X\" stops being a reliable signal, and not just for the targeted 0.03%. The damage is to trust at the margins: anyone working near ML infrastructure now has a live hypothesis that their bad answer might be deliberate, and no way to check. Uncertainty like that tends to spread further than the intervention itself.\n\n…\n\nIf I had to state a position: I think the precedent worries me more than this instance. 
Narrowly scoped, publicly disclosed-as-policy, low-trigger-rate degradation aimed at ToS-violating frontier development is defensible. But the technique — silently modifying model behavior in ways users can't detect — is a capability that, once normalized, has no natural stopping point and very weak external accountability. I'd want to see hard commitments around scope, independent auditing of what the interventions actually target, and ideally some path to instance-level transparency for verified legitimate users. The security argument for silence is real, but \"we can't tell you when, only that\" is a posture that should stay rare, narrow, and uncomfortable — including for Anthropic.\n\nHere are some example tweets from folks all across the AI Twitterverse: [Behnam Neyshabur](https://x.com/bneyshabur/status/2064526569566277826), [Nathan Lambert](https://x.com/natolambert/status/2064404993193754830), [Jeremy Howard](https://x.com/jeremyphoward/status/2064481719626154417), [Boaz Barak](https://x.com/boazbaraktcs/status/2064540888454381868), [Dean Ball](https://x.com/deanwball/status/2064434861088395730), [Fei-Fei Li](https://x.com/drfeifei/status/2064735920281313688).\n\nTo Anthropic's credit, they've made a lot of progress on improving these problems over the past couple of years. But I think a lot of that improvement came from a dialogue between Anthropic and their unhappy users, who were able to give feedback on things like over-refusals. 
As I'll discuss later, I am worried that with \"silent\" safeguards, this user-driven feedback loop will be gone, and there will be little incentive for Anthropic to work on improving the precision of their classifiers.\n\nIn researching this, I actually read [Anthropic's consumer ToS](https://www.anthropic.com/legal/consumer-terms) for the first time, and found that the \"no-competition\" clause is absurdly broad, and could be used to restrict any kind of AI research or development: \"You may not access or use, or help another person to access or use, our Services in the following ways: … 2. To develop any products or services that compete with our Services, **including to develop or train any artificial intelligence or machine learning algorithms or models** or resell the Services.\"\n\nOne thing I feel confused about is the following: if silent behavioral modification is more robust to jailbreaks, then why not apply this methodology to the other (seemingly more critical) categories of misuse? For example, if someone asks Fable to find cyber vulnerabilities in a service, Fable could pretend to try its best, but then just return with very basic vulnerabilities (or none at all). It would be difficult for an adversary to distinguish between a world where Fable is sandbagging, and where Fable genuinely did not find any vulnerabilities.\n\nFull transcript available [here](https://claude.ai/share/6f7c6dff-1de4-4cf9-b444-64abbb628742). 
Note that I tried to keep my prompting to a minimum, but Claude performed web searches to learn more about the policy; at this point there were several articles about the policy online, and these probably influenced Claude's stated opinions.", "url": "https://wpnews.pro/news/thoughts-on-claude-fable-s-silent-safeguards", "canonical_source": "https://www.lesswrong.com/posts/sSyLyc3KDQzboQGWS/thoughts-on-claude-fable-s-silent-safeguards", "published_at": "2026-06-10 23:35:27+00:00", "updated_at": "2026-06-11 17:50:54.600072+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-safety", "ai-policy", "ai-products"], "entities": ["Anthropic", "Claude Fable 5", "Julian Minder", "Opus 4.8", "Mythos"], "alternates": {"html": "https://wpnews.pro/news/thoughts-on-claude-fable-s-silent-safeguards", "markdown": "https://wpnews.pro/news/thoughts-on-claude-fable-s-silent-safeguards.md", "text": "https://wpnews.pro/news/thoughts-on-claude-fable-s-silent-safeguards.txt", "jsonld": "https://wpnews.pro/news/thoughts-on-claude-fable-s-silent-safeguards.jsonld"}}