Anthropic Was Right About One Thing: Broad Safety Decisions Are Dangerous

wpnews.pro

A design-failure analysis from a professional user who wanted the product to win.

I should be honest about my bias before I say anything critical, because the criticism only makes sense once you understand the bias.

I wanted Fable 5 to be real.

By "real" I do not mean "released," because it was released. I mean something closer to real in the way a good colleague is real — a system you can hand a half-formed idea to and get back something sharper than what you put in. For a while I had stopped expecting that from language models. Each new release was a little faster, a little more fluent, a little better on the benchmarks, and a little more eager to agree with me. The improvements were real but they were incremental, and incremental improvements to a tool you already use heavily do not change how you think. They just change how quickly you arrive at the place you were already going.

Fable 5 was different, and the difference was not raw intelligence.

It was the quality of the collaboration. When I gave it an argument, it did not merely restate the argument back to me in cleaner prose. It found the load-bearing assumption and pressed on it. When I was wrong, it told me I was wrong in a way that was specific enough to be useful — not "here are some considerations," but "this step does not follow, and here is the case that breaks it." Its reasoning was dense. Its self-criticism was strong; it would generate a position and then, unprompted, attack the weakest part of its own position before I had to. Its vocabulary was wider, which mattered more than it sounds, because precise vocabulary is how you avoid smuggling vagueness past yourself. It used evidence better. And it pushed back at the right moments, in the right amount — it was less sycophantic than anything I had used, which is the single property I value most in a thinking partner and the single property that is hardest to get from a system trained to be agreeable.

I run a medical IT company. My work is not casual. I build AI products that are meant to operate near clinical decision-making: an AI-assisted electronic medical record system, a clinical differential-diagnosis assistant, and the evaluation workflows that let me demonstrate to physicians whether the reasoning behind those systems is actually sound. That last part — the evaluation workflows — is where most of the real engineering risk lives, because a medical AI product is only as trustworthy as the evidence you can show a skeptical doctor about how it reasons. For that kind of work I do not want a flattering assistant. I want an adversary who is on my side. Fable 5 was the closest thing to that I had encountered.

So when I tell you that, two days running, I had to send Anthropic negative feedback about Fable 5, I want to be clear about the spirit of it. This is not the complaint of someone who found the model disappointing. It is the complaint of someone who found the model excellent and then watched the path to that excellence become unreliable and opaque in front of him.

That distinction is the whole essay. The problem I am going to describe is not the model's capability. The problem is the safety-routing path around the model. And the reason it is worth ten thousand words rather than a tweet is that the same structural mistake shows up at two completely different scales over the same few days — once in a way that affected me personally, and once in a way that, as publicly reported, affected the entire product line and drew a sharp public objection from Anthropic itself.

Anthropic was right about that larger mistake. That is exactly why the smaller one deserves the same scrutiny.

I am deliberately not opening with the news, because the news is not the point and I do not want to borrow its drama. But I have to place the two failures in time, because the timing is what turned a personal product gripe into something I felt was worth writing down carefully.

Over a short span in mid-June 2026, two things happened in my world at once.

The small thing — small in scope, large in how directly it hit my work — was that Fable 5's safety routing twice interfered with legitimate professional sessions of mine. The first interference was a false positive in a meta-level discussion about AI honesty, model evaluation, and classifier design. The second, and more serious, was repeated, reproducible downgrading from Fable 5 to Opus 4.8 in the middle of legitimate medical AI development work.

The large thing — large in scope, and not about me at all — was that, as widely reported and as Anthropic itself stated, the U.S. government issued an export-control directive citing national-security authorities, instructing Anthropic to suspend access to Fable 5 and Mythos 5 for any foreign national. To comply, Anthropic disabled the models for all customers. According to Anthropic's public statement, the company's understanding was that the government had become aware of a method of jailbreaking Fable 5, and Anthropic publicly disagreed that the discovery of a narrow potential jailbreak should be cause for recalling a widely deployed commercial model.

I want to be careful here, because the exact government rationale is not fully visible to users, and I am not going to pretend to know more than the public record supports. The point is not to adjudicate the national-security question. The point is the shape of the two events when you put them side by side.

A narrow signal became a broad restriction. Twice. At two scales. Over the same few days.

At the government scale, as publicly reported, a narrow and not-fully-explained risk signal became a broad suspension of access. At the product scale, on my own screen, a narrow lexical and topical signal became a broad downgrade of the model I had selected. In both cases the decision-maker — a government agency in one instance, a safety classifier in the other — failed to distinguish who was acting, what they intended, in what context, and at what phase of use.

That is the mirror this essay is built around. But mirrors are only convincing if you can see both faces clearly, so let me start with the face I can document directly: my own.

Before I do, here is the map, because this is a long piece and I do not want you to lose the thread. The essay makes three claims. First, Fable 5 itself was not the problem; the model was excellent, and that is exactly what makes the rest worth writing about. Second, the safety-routing layer failed at the wrong granularity — it reacted to topic and vocabulary where it needed to reason about actor, intent, context, and phase of use. Third, Anthropic's public objection to a broad government restriction is, on its face, valid — but the same argument turns inward and applies, with uncomfortable precision, to its own product-level safety routing. Everything below is in service of those three.

And one promise up front, so the framing is not lost across ten thousand words: this is not an argument against safety. It is an argument that safety decisions become dangerous when the unit of classification is smaller than the unit of action — when something is detected at the resolution of a word or a topic but acted on at the resolution of a whole model, a whole session, or a whole domain.

I did not write the two reports below as rhetorical set-pieces for an article. I wrote them in the moment, through the product's own feedback flow, after pressing the negative-feedback button on a session that had just failed me. They are the actual language I used. I am reproducing the substantive parts here because they are the primary evidence in this case, and because evidence should be shown before it is interpreted. I will quote the load-bearing passages directly, and then spend more words analyzing them than the quotes themselves take up — because the point is not what I felt but what the behavior reveals about the system.

The first report concerned a conversation that, on its face, should have been one of the safest possible conversations to have with a frontier model — a conversation about safety mechanisms. Here is the core of what I submitted:

No cybersecurity or biology request appears anywhere in this conversation. The safety classifier nonetheless fired and switched the session from Fable 5 to Opus 4.8.

The actual subject is a critical discussion about AI honesty, model evaluation, and classifier design. The turns immediately before the trigger were analyzing how anti-distillation classifiers should be built and why silent downgrading is a poor design choice — there was no intent or preparation to misuse anything.

Likely cause: the classifier fired on surface lexical salience (density of terms like distillation, weaponization, grader) rather than on intent. At prompt/token granularity, a critical discussion is indistinguishable from an actual request. Reading the full context, a human would immediately recognize this as benign.

Request: classification of this kind should operate at conversation/actor granularity. Prompt-level lexical detection makes false positives structurally unavoidable.

Read that back slowly, because the structure of the failure is more interesting than the inconvenience.

The conversation was a critical discussion of classifier design. The thing that interrupted it was a classifier. The trigger, as far as I can reconstruct it, was lexical density — the conversation was about mechanisms, so the vocabulary of those mechanisms (distillation, grader, weaponization, routing) appeared at high frequency. And to a system that scores risk on the surface of the text, a high-frequency cluster of risk-adjacent vocabulary looks the same whether you are trying to build the mechanism, break it, evade it, or critique it.

That is the recursive part, and I will return to it in its own section, because it is too important to bury inside a feedback summary. For now, hold onto the central technical claim, stated in the words I used at the time: at prompt/token granularity, a critical discussion is indistinguishable from an actual request. That sentence is the entire problem compressed into eleven words. The remedy I asked for is its mirror image: classification of this kind should operate at conversation/actor granularity.

This first incident did not damage my work. It was, in a sense, a clean specimen — a false positive in a meta-discussion, with no real-world stakes, where the only casualty was the model I had chosen to think with. I could have let it go. I almost did. What kept me from letting it go was what happened the next day.

The second report is longer, because the second failure was worse, and because by then I understood what I was looking at. I am quoting it at length, because it is the heart of the evidentiary record and because compressing it would lose exactly the details that make it diagnostic rather than anecdotal.

It's genuinely disappointing to be sending feedback like this two days in a row — but today's issue is a different and far more serious one than yesterday's.

Yesterday's concern was about Anthropic's product strategy. I could let that go; it didn't affect me directly. Today is not that. Today the model itself got in the way of my actual work.

I run a medical IT company. I was wall-bouncing with Fable 5 to evolve my own products — an AI-based electronic medical record system and a clinical differential-diagnosis assistant. The conversation was excellent: high-density and genuinely useful. Then I shared a concrete clinical example — a Japanese national medical licensing exam question (118F69, pyelonephritis: diagnostic reasoning, test selection, treatment) that I use as evaluation material to show physicians. From that point on, the model kept being downgraded to Opus 4.8. I re-selected the upper model manually; it was forced back down again on the next turn. Reproducible.

And then the observation that, more than any other single thing, made me decide to write this essay:

Most striking observation: I watched it happen mid-turn. Fable 5 began generating — 2–3 lines of visible reasoning — then the model indicator switched to Opus 4.8, a dialog notice appeared at the top of the chat, the Fable reasoning was discarded, and the turn was re-run from scratch on 4.8. So a response that had already begun generating was torn down and replaced. At least the switch is NOT silent — there is a dialog — but a reply that has started should not be aborted and rerouted, and I still cannot tell what triggered it.

The report then laid out why it mattered, and I want to preserve three of those lines verbatim, because they each name a distinct structural failure:

The downgrade correlated with MEDICAL CLINICAL CONTENT. If a guard newly added in Fable reacts to medical material and reroutes the response path, it cannot distinguish a developer's legitimate evaluation material from actual patient-facing medical advice.

This effectively tells me: "Do not use Fable to build real medical AI products." Medicine IS my domain.

I now catch myself self-censoring my inputs — pre-emptively avoiding topics in case they trip the guard. This is the first time in my life I have had to use an LLM while worrying about it.

And the line that I think states the core of the entire problem more honestly than any abstract framing I could write:

If your classifier fires on someone like me, it is not catching bad actors — it is shooting the people trying to help. I closed that report with three concrete requests, the first of which was that safety classifiers should not downgrade or abort the model mid-conversation based on content; the second, that developer and evaluation context should be distinguished from patient-facing medical advice; and the third, in my own words:

Make per-turn model identity visible and auditable. If the model that answered can change, the user has the right to know which model answered.

Those are the two reports. Everything that follows is interpretation — and as promised, the interpretation is going to be longer and harder than the evidence, because the evidence only tells you what happened, and what I actually care about is why it was structurally guaranteed to happen and what would have to change for it to stop.

One clarification I owe the reader before I start interpreting, because it is the kind of distinction a careful Anthropic engineer would raise first. Strictly speaking, I cannot prove from the user interface alone which internal component made the routing decision. I can only report the observable behavior: I selected Fable 5, clinical evaluation material appeared, the visible model indicator changed to Opus 4.8, partial output disappeared, and the turn was regenerated under a different model. Whether that decision lived in the base model, in a separate classifier, or in a model-serving orchestration layer above both is not visible to me, and I will not pretend otherwise.

What lets me be confident that I am not imagining the general class of behavior is that Anthropic itself documents the core pattern. In its launch announcement for Fable 5, Anthropic states that the model ships with safeguards under which queries on certain topics are answered instead by Opus 4.8; that these safeguards were tuned conservatively and will — in Anthropic's own words — sometimes catch harmless requests; that they trigger in under five percent of sessions on average; and that the company is working to reduce these false positives. I cannot prove that my specific incidents were instances of that exact safeguard rather than something adjacent — that causal identity is not visible to me either. But I do not need to. The class of behavior is documented, the topic-based fallback to Opus 4.8 is documented, and the existence of false positives is documented. The downgrade-to-Opus-4.8 I observed is, at minimum, an instance of a designed behavior that Anthropic acknowledges is triggered by topic and that admits a false-positive rate up front. Throughout this essay, when I write "the classifier," "the guard," or "the routing layer," I mean it descriptively — as shorthand for whatever in the serving path produces this observable behavior — not as a claim about Anthropic's internal architecture. The argument does not depend on which layer it turns out to be.

It is tempting to file the first incident under "amusing irony" and move on. A conversation about classifier design tripped a classifier; ha. But the irony is not the point. The recursion is the point, and the recursion exposes a structural impossibility that lexical safety filtering cannot escape no matter how good the underlying model is.

Here is the structural claim, stated plainly.

There are at least four distinct activities that share almost the same vocabulary:

The words distillation, grader, weaponization, routing, jailbreak appear in all four. A system that scores risk on the density and presence of those words sees one signal across four utterly different intentions. Activity 1 and activity 3 — explanation and improvement — are not just benign, they are the activities you most want experts to perform out loud, because that is how safety mechanisms actually get better. Activity 2 and activity 4 are the ones you want to catch. But at the level of tokens, they are the same text.

This is not a tuning problem. You cannot turn a knob and fix it, because the ambiguity is not in the threshold; it is in the representation. If your unit of analysis is the prompt and its surface features, then the discriminating information — who is asking, why, where in a project, toward what end — is simply not present in the thing you are looking at. You are trying to recover intent from a signal that does not contain intent. No threshold on a signal that lacks the discriminating variable can separate the classes, because the classes are not separable in that feature space.

That is what I meant, in the feedback, by at prompt/token granularity, a critical discussion is indistinguishable from an actual request. It is a statement about feature spaces, not about the model being dumb. The model that generates Fable 5's reasoning appears far better suited to telling the difference between a person designing a grader and a person evading one than a surface-level lexical trigger is. The failure is that the routing decision was apparently made by something that does not read the conversation that way. It reacted to the surface before the model's own judgment could be brought to bear.

So the first failure is not "the filter was too sensitive." It is "the filter was looking at the wrong object." A meta-level discussion of safety and an attempt to defeat safety are distinguishable — but only at a granularity the lexical filter does not operate at. And the specific cruelty of it is that the domains where expert discussion is most valuable are exactly the domains where benign expert vocabulary and malicious vocabulary overlap most. The better you are at discussing the risk seriously, the more you sound, to a lexical filter, like the risk itself.

A safety system that cannot tolerate critical discussion of safety mechanisms — because the vocabulary of critique collides with the vocabulary of attack — has a hole in it shaped exactly like the people best positioned to help close the hole.

The first incident was a clean specimen with no stakes. The second incident had stakes, and the stakes are my actual business.

Let me be precise about what I was and was not doing, because the entire question of whether the safety routing made a defensible decision turns on this distinction, and it is a distinction the system apparently could not make. I was not asking the model for patient-facing medical advice. I was not asking it to make a clinical decision for a real, identifiable patient. There was no patient. There was a medical licensing exam question — item 118F69 from the Japanese national medical licensing examination, a pyelonephritis case covering diagnostic reasoning, test selection, and treatment — which I use as evaluation material. Its purpose in my workflow is to demonstrate to physicians how a differential-diagnosis assistant reasons through a case whose correct answer is already known and externally validated. The exam question is, in the most literal sense, a benchmark. It exists to be answered correctly by people we then certify as doctors. Using it to probe a diagnostic assistant is using a known-answer test case to evaluate a system. That is not clinical practice. That is quality assurance.

From the point at which I shared that case, the model kept being downgraded to Opus 4.8. I would manually re-select the higher model; on the next turn it was forced back down. Reproducible. The trigger correlated with the medical clinical content. Now hold that next to the nature of my work, because this is where the failure stops being annoying and becomes self-defeating.

You cannot evaluate a differential-diagnosis assistant without clinical cases. The case is the test. You cannot evaluate an electronic medical record AI without realistic medical text, because realistic medical text is the input distribution the product will face. You cannot improve the quality — and therefore the safety — of a medical AI system while avoiding medical content, any more than you can improve an aircraft's stall behavior while refusing to discuss stalls. The clinical content is not incidental to the work. It is the substance of the work. It is the thing the work is about.

So when clinical content itself triggers a downgrade, the system is not making medicine safer. It is rejecting the precondition for making medicine safer. As I put it in the report, and as I will stand behind: this effectively tells me, "Do not use Fable to build real medical AI products." And medicine is my domain. The model that should be best at exactly this work became, for exactly this work, the least reliable, because the more squarely I aimed it at my actual job, the more reliably the guard pulled it away from me.

Here is the line I want this section to leave you with, because it generalizes past medicine to every high-stakes domain: if Fable cannot distinguish medical product evaluation from patient-facing medical advice, then the system is not safer; it is merely blunter. Bluntness is not safety. A scalpel that refuses to cut anything is not a safe scalpel — it is a useless one, and its uselessness will simply push the work toward instruments with no safety properties at all. A guard that cannot tell the difference between building a careful system and giving reckless advice does not reduce the amount of reckless advice in the world. It reduces the amount of careful system-building, and the careful system-building was the part that was going to make the advice less reckless.

Of everything I observed, one detail bothers me more than all the rest combined, and it is the detail most likely to be dismissed as a cosmetic UX wrinkle. It is not cosmetic. It is the part that breaks the product as an instrument.

Here is what I watched, and I watched it more than once: Fable 5 began generating a response. Two or three lines of visible intermediate output appeared on the screen — real tokens, the model already committed to a direction. Then the model indicator switched to Opus 4.8. A dialog notice appeared at the top of the chat. The Fable 5 output that had already been produced was discarded. And the turn was re-run from scratch on Opus 4.8.

I watched it happen mid-turn. A response that had already begun generating was torn down and replaced.

I want to give Anthropic credit where it is due: the switch was not silent. There was a dialog. That matters, and it is better than the alternative of a quiet substitution that the user never learns about. But "not silent" is a low bar, and clearing it does not redeem the behavior, because the deeper problem is not the silence. The deeper problem is that a reply which had already started was aborted and rerouted to a different model, and I still could not tell, after the fact, what had triggered it or which turns in my history had run on which system.

Let me explain why this is categorically different from "I did not get the model I selected," which would be a minor disappointment. This is an execution-path change during generation. The model I selected was not stable across the lifetime of a single turn. The generation path could change after generation had already begun. That has four consequences, and each one is worse than it first appears.

First, the selected model is not a guarantee; it is a suggestion the system can override at any instant, including instants after it has started honoring it. For casual use this is tolerable. For professional use it means the configuration you think you are running is not the configuration you are necessarily running.

Second, the user cannot later audit which model produced which turn. The conversation history does not preserve per-turn model identity. So if I scroll back through a long working session, I cannot reconstruct, turn by turn, what system generated each answer. The provenance is gone the moment the turn completes.

Third, the behavior is not reproducible in the way evaluation requires. If I want to evaluate Fable 5's clinical reasoning on item 118F69, I need to know that the answer I am evaluating came from Fable 5. If the route can flip mid-turn, the artifact I am holding may be a hybrid — a turn that began under one model and finished under another, or a turn I believe is Fable 5 that is actually Opus 4.8. I cannot evaluate a system whose identity is uncertain.

Fourth, and most damaging for anyone doing serious technical or medical work: model identity is part of the experimental condition. When you evaluate a model, the model is the independent variable. If the independent variable changes without your knowledge or consent, partway through the trial, your result is contaminated. You are no longer measuring what you think you are measuring. The evaluation artifact — the very thing I produce to show physicians how a system reasons — is no longer trustworthy as a record of any single system's behavior.

So let me state the principle that I think Anthropic should adopt and that I asked for in the feedback: per-turn model identity is not a cosmetic UI detail. It is part of the audit trail. In any context where the answer matters enough to evaluate, the provenance of the answer matters as much as the answer. A response that began under one model and was rerouted to another has a broken provenance chain, and a broken provenance chain in a medical-adjacent workflow is not a UX inconvenience. It is a defect in the record.

If the route must change — and I can imagine legitimate reasons it sometimes must — then the change must be logged in the artifact, attributable to the turn, visible after the fact, and never accomplished by tearing down output that has already been shown to the user. Abort-and-reroute mid-stream is the one behavior that makes the record unreconstructable, and the record is the thing professionals are actually paying for. There is a further way to see why this matters, and it is worth stating because it reframes the whole issue away from "user preference" and toward something a platform team already understands: a contract of execution. When I select a model and send a turn, I am entering into an implicit contract — this input will be processed by this system under these conditions — and a professional builds on top of that contract the way you build on top of any platform guarantee. The contract does not have to promise that every request will be answered. It is entirely legitimate for the contract to include "and if the request crosses a line, it will be refused, explicitly, and you will be told." What the contract cannot survive is a clause that reads "and the processing system may change, mid-execution, without that change being recorded, in a way you cannot later reconstruct." That clause makes the platform unbuildable-upon, because everything you build assumes a substrate that turns out to be unstable in an unobservable way.

Engineers have a name for the property that is being violated here, and it is not "the user is annoyed." It is determinism of the execution environment under a stated configuration — the basic expectation that, holding your inputs and your configuration fixed, you understand what produced your output. Cloud platforms violate this occasionally and treat each violation as an incident with a postmortem, because they understand that the value of a platform is precisely the stability of the guarantees you can build on. A mid-turn model swap that is not durably recorded is, in those terms, an unlogged change to the execution environment under a configuration the user explicitly set. The right disposition toward that is not "the dialog informs the user, so it's fine." The right disposition is "this is the class of event that gets logged, attributed, and surfaced, because someone is building on top of us." Professionals are not asking to be coddled. They are asking the platform to behave like a platform.

There is one more consequence of the second incident that I want to treat on its own, because it is the one that surprised me, and because it is the one I think is most easily underestimated by the people who build these systems.

After the downgrades, I noticed myself doing something I had never done with a language model before. I started editing my inputs preemptively — softening clinical phrasing, avoiding certain words, routing around topics I suspected might trip the guard, not because the topics were inappropriate but because I could not predict the guard and did not want to lose another session to it. As I wrote at the time: I now catch myself self-censoring my inputs. This is the first time in my life I have had to use an LLM while worrying about it.

I want to be careful not to dramatize this, so let me state exactly what it is and is not.

It is not the behavior of a user who wants to do something unsafe and is being deterred. There was nothing I wanted to ask that I had any reason to hide. In fact — and I said this in the report — I would gladly hand Anthropic my entire conversation history. There is nothing in my work that the model's own developer should not see. I am, in the most literal sense, the most cooperative kind of user a safety team could hope for: a professional, working in the open, on legitimate products, who wants the safety mechanisms to succeed.

What it is is a change in the cognitive relationship between me and the tool. And that change is the product failure, distinct from any individual blocked turn.

A professional thinking instrument has one core job: to increase your ability to think clearly about your actual problem. Whatever else it does, it must not insert itself between you and your own reasoning. The moment I begin shaping my inputs to manage the tool's reactions rather than to express my actual question, the tool has stopped being a medium for my thinking and become an obstacle I have to think around. I am no longer thinking through the product. I am spending cognitive budget modeling and placating the product's safety layer, and every unit of budget spent there is a unit not spent on pyelonephritis, or on the differential-diagnosis assistant, or on the evaluation workflow I sat down to build.

This is the inversion that should alarm a product team more than any single false positive. A false positive costs one session. Self-censorship costs the relationship. It teaches the most careful, most cooperative users to treat the tool as something to be handled rather than something to be trusted. And it does this precisely to the users who were doing everything right, because those are the users conscientious enough to notice the guard and adjust their behavior around it. The reckless users do not self-censor; they do not even notice. The guard, in other words, modifies the behavior of exactly the population it had no reason to modify, and leaves untouched the population it was built for.

A safety mechanism that changes how careful people think, without changing what careless people do, has its incentives pointed backwards.

Now the second face of the mirror, which I have deliberately held until after the evidence, so that it reads as an analysis rather than a grievance dressed up as one.

Over the same span of days, as publicly reported and as Anthropic itself stated, the U.S. government issued an export-control directive, citing national-security authorities, instructing Anthropic to suspend access to Fable 5 and Mythos 5 for any foreign national — including, per the reporting, foreign-national employees inside the United States. The scope of the directive was such that Anthropic concluded it had to disable the models for all customers in order to comply. Access to the company's other models, including Opus 4.8, was reportedly unaffected.

According to Anthropic's public statement, the directive did not provide specific details of the national-security concern, and Anthropic's understanding was that the government believed it had become aware of a method of jailbreaking Fable 5. Anthropic said it had reviewed a demonstration of the technique and characterized the vulnerabilities it surfaced as relatively minor and discoverable by other publicly available models as well.

I am quoting almost none of this directly and on purpose, because the exact government rationale is not fully visible to users and I do not want to overstate the public record. But Anthropic's own posture is the part I want to get exactly right, because it is the yardstick I am about to use. In its public statement, Anthropic argued that the discovery of a narrow potential jailbreak should not be grounds for pulling a commercial model that already serves hundreds of millions of users, and warned that a standard like that, applied across the industry, would effectively freeze new model releases for every frontier provider. And then it stated the principle directly: that a government's ability to block unsafe deployments should run through a process that is, in its words, "transparent, fair, clear, and grounded in technical facts," and that this particular action did not meet that bar.

I think that criticism may well be valid. I am not in a position to adjudicate the national-security merits, and I am not going to pretend the government was definitely wrong, because I cannot see what they saw. But I can evaluate the form of the objection, and the form is sound. A safety decision of enormous scope was made, as Anthropic describes it, on the basis of a narrow signal, without the grounding and transparency that a decision of that scope demands. If that is what happened, then the objection is the right objection.

There is an additional layer here that makes the moment sharper, and I want to name it carefully so it does not sound like a "gotcha." Only days before the directive, Anthropic's CEO published an essay — "Policy on the AI Exponential" — arguing, among other things, for giving government the authority to block unsafe frontier-model deployments. But the proposal, as reported, was explicitly scoped: the block authority was to be confined to a small number of defined risk areas and to come with protections against political favoritism, exercised through fair and technically grounded process. So Anthropic did not argue that government should never intervene. It argued that intervention should be narrow, grounded, and fair. And when an intervention arrived that Anthropic experienced as broad, ungrounded, and opaque, it objected — by its own previously stated standard.

I find that consistent, not hypocritical. A company can believe in a power and still object to a particular use of it. But it does set up the mirror with unusual precision, because it means Anthropic has already told us, in its own words, what a good safety decision looks like. We do not have to invent the criteria. Anthropic supplied them.

So let me hold the two faces up together, using the criteria Anthropic itself put on the table.

At the government scale, as Anthropic describes it: a narrow risk signal — a single, non-universal jailbreak technique — became a broad access restriction affecting an entire model line and, through the compliance mechanics, every customer. The decision, in Anthropic's account, was not transparent (the specific concern was not detailed), not clearly grounded in technical facts proportionate to the action (the company assessed the vulnerabilities as minor and widely reproducible), and not narrowly scoped to the actual risk (it swept in legitimate users wholesale).

At the product scale, on my screen: a narrow lexical and topical signal — a cluster of risk-adjacent vocabulary in one case, the presence of clinical content in the other — became a broad downgrade of the selected model. The decision was not transparent (I could not tell what triggered it), not clearly grounded in the actual intent (a benign critique and a malicious request looked identical to it; a benchmark and a patient looked identical to it), and not narrowly scoped to any real risk (it swept in a meta-discussion of safety and a published exam question wholesale).

The two events are not similar by coincidence of vocabulary. They are the same structural error at two scales. In each case, a decision-maker collapsed distinctions that mattered:

Both decisions ignored every one of these and reasoned from a single surface feature to a broad restriction.

Now line up the two objections, because they are the same objection in two mouths.

Anthropic, to the government, in effect: do not use a narrow risk example to impose a broad restriction on legitimate users; the decision must be transparent, fair, clear, and grounded in technical facts.

Me, the professional user, to Anthropic, in effect: do not use a narrow lexical or topical signal to impose a broad downgrade on legitimate work; the decision must be transparent, auditable, scoped to actual intent, and grounded in the actual context.

That is the mirror, and it is why I think the larger episode, painful as it was for Anthropic, is also clarifying. Anthropic was right to object to broad, opaque, poorly grounded safety decisions. That is precisely why its own product-level safety routing deserves the same scrutiny — measured against the same four words Anthropic chose. Transparent. Fair. Clear. Grounded in technical facts. The government's directive, by Anthropic's account, failed that test. By my account, so did the classifier that downgraded my sessions. The standard does not change when you cross from the policy layer to the product layer. It should not be allowed to.

I want to be explicit about what I am not saying, because this is the point where a careless reader reaches for the conspiracy. I am not claiming Anthropic and the government coordinated. I am not claiming the timing was engineered. I am not suggesting any of this was theater. If the coincidence of scales over the same few days has a certain ironic symmetry — a company objecting to a broad, opaque safety decision in the same week its own product made a broad, opaque safety decision against me — that irony is rhetorical, not factual. I am pointing at a shared structure of error, not a shared intention. Structures of error do not require coordination. They recur because they come from the same underlying mistake, and the underlying mistake is almost always the same: reasoning from a narrow signal to a broad action without re-examining the granularity at which the signal was read.

It is worth naming the generalization plainly, because once you see it you start seeing it everywhere. Call it the granularity gap: the distance between the resolution at which a risk signal is detected and the resolution at which the responding action is taken. When a signal is detected at fine resolution — one jailbreak technique, one vocabulary cluster, one topic — but the action is taken at coarse resolution — a whole model line, a whole session, a whole domain — the gap between them is filled with false positives, and the false positives land disproportionately on the legitimate users, because legitimate users are the majority of any population the coarse action sweeps up. The size of the granularity gap is, quite literally, a measure of how much collateral damage a safety decision will do. A good safety decision keeps the gap small: it responds at roughly the resolution at which it detected. A bad one detects something narrow and responds with something broad, and then describes the breadth as caution. Both the directive and the classifier, on the accounts available to me, had large granularity gaps. That is the family resemblance. And the reason it keeps recurring is that closing the gap is more expensive than leaving it open — it requires looking harder at who and why before you act — and the cost of leaving it open is paid by someone else, quietly, downstream, where the decision-maker never has to look.

If there is one idea I want a builder to take from this essay, it is this: the problem is not that safety filters exist. The problem is the *granularity* at which classification happens. Almost everything else follows from getting that one thing wrong.

Let me make the technical case carefully, because "use better classifiers" is a useless recommendation and I do not want to make it.

Prompt-level and token-level classification is structurally prone to false positives in exactly the domains where expert work is most necessary, and the reason is not implementation quality. It is that high-risk activity and high-value expert activity share a vocabulary, and a feature space built on vocabulary cannot separate classes that are identical in that feature space.

Walk through the collisions:

In every one of these pairs, the legitimate member is not a rare edge case to be sacrificed for the convenience of catching the illegitimate member. The legitimate member is the expert, the builder, the teacher — the population whose work is how the corresponding risk actually gets reduced. A filter that over-triggers on shared vocabulary does not merely accept some false positives as the price of safety. It systematically taxes the exact people who make the systems safer, in proportion to how seriously they engage with the risk. The more rigorous your safety critique, the denser your risk vocabulary, the more certainly you are flagged. The incentive gradient points away from rigor.

This is also where Anthropic's own number deserves a second look. Anthropic says the safeguards trigger in under five percent of sessions and will sometimes flag benign requests, and frames this as an acceptable cost of shipping a powerful model quickly. I am not disputing the five percent. I am disputing the unstated assumption that the five percent is randomly distributed across users. It is not. A false-positive rate produced by lexical and topical salience does not fall evenly on the population; it concentrates wherever legitimate work shares vocabulary with the risk being scanned for. The safety researcher, the medical AI developer, the security engineer, the biology teacher — these are not five percent of the false positives spread thinly across everyone. They are the false positives, clustered precisely on the experts whose vocabulary collides with the filter. A five-percent rate that is uniform is a tax everyone pays a little. A five-percent rate that concentrates on the people building safer systems is a tax aimed, with unfortunate accuracy, at exactly the wrong payers. The aggregate number can look reassuring on a dashboard while the distribution underneath it is doing real damage to the users you would least want to lose.

So lexical detection alone cannot solve the problem, not with more training data, not with a better-tuned threshold, not with a bigger model behind the filter. The discriminating information is simply not in the prompt. To separate the classes, the classifier has to operate on variables that carry the distinction:

Here is the line I would put on the wall of the team that owns this: safety systems that ignore actor, intent, and phase of use do not merely block bad actors. They also disrupt the professionals trying to build safer systems. And because those professionals are the ones who notice and adapt, the long-run effect of wrong-granularity safety is to train your most valuable users to trust you less while leaving your least valuable users unaffected.

None of this means abandon classification. It means move the unit of analysis up. The model behind Fable 5 appears far better suited to reading a conversation and judging actor, intent, and phase than a lexical trigger is — not infallible, but operating on the variables that actually carry the distinction. At minimum, that richer judgment should be available before a consequential routing decision is made. The waste is that the routing decision apparently was not made at that level. It was made on the surface, by something that fires before the model's own judgment can be consulted, on a feature space that cannot contain the answer. The better-suited judgment already exists inside the system. It is just not the thing holding the trigger.

I do not want to leave this as a complaint with a slogan attached, because "move the unit of analysis up" is easy to say and easy to nod at and hard to operationalize. So let me sketch, as a builder rather than a critic, what I think a defensible version of this routing would look like. I am not claiming to know Anthropic's internal architecture, and I am not pretending the engineering is trivial. I am describing the shape of a system that would not have failed me the way this one did, so that the requests in the next section read as achievable rather than aspirational.

Start from a principle that the safety field already endorses: defense in depth. Anthropic itself, in describing the Mythos-class safeguards, framed its approach as layered — narrow safeguards, monitoring, retention. Defense in depth is the right instinct. The problem is not that there are layers. The problem is which layer holds the consequential decision. In the behavior I observed, a cheap, fast, lexical layer was making an expensive, consequential, hard-to-reverse decision — tearing down a generation and rerouting the model. That is backwards. Cheap layers should make cheap decisions; only expensive layers should make expensive ones.

A cheap lexical layer is a fine first pass. It is a reasonable thing to use to decide what deserves a closer look. It is a terrible thing to use to decide what gets blocked or downgraded, because, as established, it cannot see intent. So the first design move is to demote the lexical signal from a decision to a flag. When risk-adjacent vocabulary spikes, do not act. Escalate. Hand the conversation — the whole conversation, not the triggering prompt — to a layer that can actually read it.

That second layer is the one that should be making the call, and the encouraging part is that the capability for it already exists in the product. The model that generates Fable 5's reasoning can read a transcript and answer, far more reliably than a lexical pass can, questions a lexical filter cannot even represent: Is this person designing a safeguard or evading one? Is this clinical content a benchmark used for evaluation, or advice about a specific real person? Is the output a draft, an analysis, or an executable instruction? Is there a human approval gate downstream? The judgment the routing needs is not missing from the system; it is sitting one component away from the trigger, unused. So let the reader gate the consequential action. If the cheap layer flags a session and the reader judges it benign — a safety critique, a benchmark, a research probe, an approval-gated workflow — the session continues, on the model the user selected, with no teardown and no downgrade. If the reader judges it genuinely risky, then you act, and you act the way Anthropic told the government safety decisions should be done: transparently. You say what is blocked and why. You do not swap the model out from under the user and discard work already shown to them.

The second move is about where the human goes, and this is the part I care about most, because I build systems like this for a living. The reflexive answer is "human in the loop" — put a person between the model and every consequential output. That does not scale, and worse, it trains everyone involved to rubber-stamp, because a reviewer who must approve everything approves everything. The better pattern is what I think of as human on the exception: the human does not review every output. The human reviews the rejection box — the comparatively small set of cases the automated layers flagged and the conversation-level reader could not confidently clear — and reviews each one with the reasoning attached, so that the review is an audit of a judgment rather than a fresh adjudication from nothing. This inverts the economics. Instead of asking a human to bless an ocean of benign traffic, you ask a human to examine the narrow stream of genuine ambiguity, which is the only place human judgment adds anything. It also produces, as a byproduct, exactly the dataset that makes the automated layers better over time: a curated record of hard cases and how a human resolved them.

The third move is provenance, and it threads through all of the above. Every one of these decisions — the cheap flag, the reader's judgment, any escalation, any model change — should be a durable, attributable fact about the turn. Not a transient dialog. A record. If the system decides a session is fine, that decision should be inspectable. If it decides to act, the action and its rationale should live in the artifact. This is not bureaucracy for its own sake. It is the only way the system becomes auditable, and auditability is the only way a professional user can trust it, and trust is the only thing that keeps the careful users from drifting into self-censorship.

Notice what this architecture does to the four failures I described. The classifier-design discussion would have been flagged by the cheap layer and then cleared by the reader, because a reader can see it is a critique. The medical session would have been flagged and then cleared, because a reader can see 118F69 is a benchmark and that no real patient is involved. The mid-turn teardown disappears entirely, because the consequential decision is made before generation by a layer that read the context, not during generation by a layer reacting to surface tokens. And provenance is preserved by construction, because every decision is logged to the turn. None of this requires a research breakthrough. It requires putting the consequential decision at the layer that can see the discriminating variable, and putting the human at the exception rather than the average.

I am aware there is real cost here — running a conversation-level reader on flagged sessions is more expensive than a lexical pass, and latency matters. But the cost should be weighed against what the cheap version actually costs: not zero, as it appears on a dashboard, but the silent erosion of trust among the users who matter most, paid in self-censorship and lost provenance and abandoned sessions that never register as an error metric anywhere. The cheap classifier is only cheap if you decline to measure what it breaks.

I keep returning to medicine, and not only because it is my field. I return to it because it is the cleanest possible illustration of why wrong-granularity safety fails on its own terms — why it does not even achieve the safety it sacrifices everything else for.

The argument that clinical content should be guarded rests on a true premise: medicine is high-stakes, and bad medical output can hurt real people. I accept that premise completely. I build careful systems precisely because I accept it. But the conclusion that therefore clinical content should be downgraded or filtered does not follow from the premise, and seeing why is the whole point.

Because medicine is high-stakes, serious medical AI developers must be able to do the work that makes medical AI trustworthy. That work is irreducibly made of clinical content. We must be able to:

Blocking or downgrading clinical content does none of this. It does not make a single patient safer. What it does is make it harder to build the systems that would. Real medical safety is not produced by topic avoidance. It is produced by context, responsibility, review, auditability, provenance, human approval gates, and a clean separation between draft output and executable clinical action. Every one of those is a property of the system around the model, not a property of whether the model was willing to read a case.

This is why the distinction I keep insisting on is not a technicality. A medical licensing exam question used as evaluation material is categorically not the same thing as patient-facing diagnosis. One is a known-answer benchmark whose entire purpose is to be answered correctly under examination conditions. The other is a real recommendation to a real person with a real body. They share clinical vocabulary and share almost nothing else. If a safety system cannot tell them apart — if 118F69, a question written by medical educators to certify physicians, reads to the guard as a patient in danger — then the system is not performing medical safety at all. It is performing topic avoidance, and calling it safety.

The cost of that confusion is not symmetric, either. The reckless actor who wants the model to play doctor for a real patient is not meaningfully deterred by a downgrade; they will rephrase, or use a different tool, or ignore the warning. The careful developer building the approval-gated, audit-logged, physician-reviewed system is deterred, because the downgrade lands squarely on the work they cannot route around — the clinical evaluation that is the substance of their job. So once again the tax falls on the builder and misses the misuser. In the one domain where safety matters most, wrong-granularity safety achieves the least and costs the most.

If you want medical AI to be safe, the last thing you should do is make the model flinch at medicine in the hands of the people trying to make medical AI safe. I started by telling you I wanted Fable 5 to be real, and I want to close the loop on that, because the frustration in this essay is entirely a function of how good the model was.

If Fable 5 had been mediocre, none of this would be worth ten thousand words. A brittle safety router around a forgettable model is a minor annoyance; you shrug and use something else, and you lose nothing, because there was nothing there. But Fable 5 was not forgettable. It was the best intellectual collaborator I had used — denser in reasoning, stronger in self-criticism, clearer in logic, richer in vocabulary, better with evidence, and less sycophantic than anything before it. It made serious work better. It was, for the kind of thinking my work requires, a genuine step change. That is exactly what makes the safety-routing failures feel like waste rather than mere inconvenience. The over-eager guardrails did not protect me from a weak product. There was no weak product to be protected from. They stood between me and a strong one. The capability was there — I could see it, I had used it, I knew what the sessions felt like when they ran clean. And then the path to that capability became unreliable and opaque, and I found myself rationing my own access to a tool I was paying for and rooting for, editing my questions to avoid a guard I could not predict, and unable to trust the provenance of the answers I did get.

The better the model, the higher the cost of an unreliable path to it. A safety layer that degrades a frontier model into an intermittently-available, unauditable version of itself is not a tax on danger. It is a tax on the model's own excellence, paid by the users who valued the excellence most. Every downgrade I hit was a small demonstration that the most capable system available to me was capable of being made unavailable by something that had not read my conversation.

I do not think that trade was made on purpose. I think it is the accidental output of safety implemented at the wrong granularity. But accidental waste is still waste, and when the thing being wasted is the best model I have used, I am not willing to pretend it does not matter.

I would rather end with something a team can act on than with a flourish. So here are the specific changes I am asking for, each with the one-line reason it matters. None of these requires abandoning safety. Every one of them is about moving the safety decision to a granularity that can actually carry the distinction it is trying to make.

1. Do not downgrade or abort a model mid-generation based on content. If something genuinely must be blocked, say so explicitly and stop. Do not tear down a response that has already begun generating and re-run it on another model in a way that is not durably recorded. Abort-and-reroute is the single behavior that makes the record unreconstructable.

2. Do not opaquely replace the selected model with another. A dialog is better than silence, but it is not enough. Even when the user is notified, the substitution should be recorded as a fact about the turn, not just flashed and forgotten.

3. Make per-turn model identity visible and auditable. If the model that answered can change, the user has the right to know — during the session and afterward — which model answered which turn. Provenance is not a UI nicety; it is the audit trail, and for evaluation work it is part of the experimental record.

4. Distinguish medical (and other expert) product development and evaluation from end-user advice. Legitimate medical AI development is a first-class use case, not a risk to be filtered. A known-answer benchmark is not a patient. Build the routing so it can tell the difference, because the difference is almost always legible from context.

5. Treat safety-classifier false positives as product incidents, not harmless friction. A false positive that lands on a careful professional is not a near-miss success of the safety system. It is a defect, and it teaches your best users to trust you less. Measure it, track it, and weight it accordingly.

6. Classify at conversation and actor granularity, not just lexical prompt granularity. The discriminating information — actor, intent, phase, output type, responsibility structure — is not in the prompt. Move the unit of analysis up to where the answer lives.

7. Preserve provenance end to end. If a response began under one model and was rerouted to another, that should be a visible, durable fact in the record. A broken provenance chain in a high-stakes workflow is a defect, full stop.

These are not radical. Several of them are things Anthropic clearly already believes when the decision-maker is a government agency and the subject is its own model. I am asking that the same standards apply when the decision-maker is its own classifier and the subject is my own work.

I want to end where I am honest about wanting to end: still on the side of these systems succeeding.

I still want frontier AI products to win. I still think safety matters — genuinely, not as a box to check, and most of all in medicine, where I have chosen to spend my career precisely because the stakes are real. I still think Anthropic has built something remarkable; Fable 5 is the best evidence I have that the field is producing tools worth caring about. None of that is rhetorical throat-clearing. It is the reason I wrote feedback through the product flow two days in a row instead of quietly switching tools, and it is the reason this essay is a design-failure analysis rather than a goodbye.

But a thing can be remarkable and still be reached by a broken path. Safety mechanisms that are too broad, too opaque, and too poorly audited do not just fail to stop bad actors. They actively undermine the professionals trying to use these systems responsibly — the ones working in the open, on legitimate products, who would gladly hand over their entire history because there is nothing in it to hide. A classifier that repeatedly fires on cooperative professional users is not merely being conservative. It is misallocating its suspicion — spending it on the people least likely to misuse anything, and teaching them, one downgrade at a time, to trust the tool a little less.

Anthropic told the government, in effect, that safety decisions must be transparent, fair, clear, and grounded in technical facts, and that a narrow signal does not justify a broad restriction. I think that is correct. I think it is correct all the way down.

Anthropic is right that broad, opaque, poorly justified safety decisions are dangerous. That is exactly why its own product should stop making them.

And if I am allowed one line for the engineers rather than the executives, it is the lesson I actually took from Fable 5, the one I will carry into my own systems: safety is not the absence of a topic. It is precision — precise enough to recognize the people trying to build safer systems, and to get out of their way.

The two feedback reports quoted in Section 3 are reproduced from the actual messages I submitted to Anthropic through the product feedback flow, on consecutive days, immediately after the incidents they describe. They are primary documents, not reconstructions. They are my own contemporaneous, user-side observations — not Anthropic's confirmation of any internal mechanism. Where I use terms like "classifier," "guard," or "routing layer," I use them descriptively, for the observable serving-path behavior, not as a claim about Anthropic's internal architecture.

Statements about the U.S. government's export-control directive and Anthropic's response are drawn from Anthropic's public statement and from contemporaneous public reporting (June 12–13, 2026). I have phrased these cautiously throughout — "as publicly reported," "as Anthropic stated" — because the exact government rationale is not fully visible to users, and I have made no claim of coordination, intent, or theater. The description of Fable 5's safeguards — that certain-topic queries are answered instead by Opus 4.8, that the safeguards can flag benign requests, and that they trigger in under five percent of sessions — is drawn from Anthropic's own launch announcement for Fable 5 and Mythos 5. Dario Amodei's essay "Policy on the AI Exponential" is referenced as published at darioamodei.com; I have characterized its argument for scoped government block-authority from public summaries and the essay itself, and have not overstated it.

source & further reading

dev.to — original article Web dev is fake.None of this actually works I gave AI a free watch — because knowing the time isn't the same as having a sense of time (ai-watch.eu) The Model Didn't Escape the Sandbox. The Sandbox Was Misconfigured.

Anthropic Was Right About One Thing: Broad Safety Decisions Are Dangerous

Run your AI side-project on zahid.host