Tamper-Resistance is a Moving Target We Might Not Hit

Crossposted with permission from Substack by the EA Forum Team.

Subtitle: Open-Weight Model Safeguards and Biorisk

The AI safety and biosecurity communities are abuzz with concerns about open-weight models. Increasingly powerful models threaten both to democratize the requisite knowledge for causing harm with biology, and to raise the ceiling of harm that a bad actor can cause, which has led to efforts to develop safeguards to prevent misuse. When models are open-weight, though, they are fundamentally more challenging to safeguard than closed models.

A safeguard on an open-weight model needs to be embedded within the weights themselves, not in a software wrapper around the weights, as a bad actor can simply extract the weights directly. They can then further train the model to erode or bypass many of the safeguards that would otherwise prevent them from misusing a model's more dangerous capabilities.

To complicate things, the success of generative AI in domains like language and image generation has spurred some developers to build generative AI models for biology. One such model, Evo2, is essentially an LLM that has been trained on genomic sequences from over 128,000 species across all domains of life, rather than on a corpus of digitized text – a "genomic language model." Fundamentally, it does the same thing as an LLM: it takes tokens as input, then predicts the next token. For Evo2, though, this means predicting nucleotides rather than readable text.

In September 2025, that model was used to design a bacteriophage that was then made in the lab, and it successfully infected E. coli bacteria, demonstrating higher fitness than an existing E. coli-infecting bacteriophage. That is to say, we have now seen the first instance of an AI-generated pathogen that can outperform a natural pathogen.

One might hope that these models are closed-access, heavily regulated, and closely monitored, and that they have safeguards preventing misuse just in case the weights leak. Instead, most biological AI models are open-source, few have safeguards, and it is unclear whether safeguards, where they exist, are effective at preventing misuse. The field of AI-Bio risks is still determining how much risk existing models actually pose, and the field of model safeguards is still largely a cat-and-mouse game where new, clever attacks are developed as soon as new, clever defenses are built. Consequently, AI-Bio risk mitigation efforts are swimming in murky water where prioritization can be difficult, threat models are often unclear, and progress is hard to measure.

This post is an effort to take a step back and evaluate the theory of change for open-weight safeguards as a tool for mitigating risks of catastrophic harm from the misuse of biotechnology.

Note: I'm going to assume that the audience of this post is already coming in with some context on open-weight models and the risks they pose in biosecurity. If you lack this context, check out my colleague Sophie's primer on the topic! If you are already mostly familiar with this topic, skip the first section.

Can we make a model bad at helping with harm in a way that is not easy for an adversary with weight access to undo?

This is the central question of the quest for tamper-resistant safeguards for open-weight models. The point of a safeguard is to prevent access to some dangerous capability, while "tamper-resistance" is this additional challenge of making it difficult for someone trying to misuse a model to evade or remove these safeguards, even though they have full access to modify any part of the model they want. The challenge here is analogous to designing a clock's gears such that if you handed someone the clock and a toolkit, they wouldn't be able to make the clock hit a certain number. This has turned out to be about as difficult for AI models as it would be for clocks.

For a clock, part of what might make this difficult is the asymmetry between the expertise of the clockmaker and the tamperer. For AI models, this may be increasingly inverted. Models themselves are improving in the very capabilities that enable someone to tamper with them. The ML engineering expertise required to fine-tune a model is a fragile barrier when the AI industry is explicitly aiming at building models that are able to automate AI research. This is increasingly true as models are put into harnesses that allow them to act as agents, autonomously interacting and assisting users. The more capable the agents that a bad actor has access to, the more expertise barriers are eroded. If a bad actor has a model that can assist them in the engineering tasks necessary for circumventing a safeguard, tamper-resistance largely collapses to a problem of increasing the cost for them to do this. Safeguards for open-weight models are improving in tamper-resistance, but the barriers to circumvent them are still frighteningly low.

Tamper-resistance is generally measured as the number of adversarial fine-tuning steps a safeguard withstands before the dangerous capability returns. Refusal training falls in a few steps. TAR ("Tampering Attack Resistance") trains models to resist being fine-tuned back to dangerous behavior, and it resists hundreds of steps on an 8B-parameter model. The biggest leap in progress came from "Deep Ignorance" in 2025: not training on dangerous data in the first place (or, more precisely, building a classifier for what's dangerous and what isn't, then aggressively filtering out the dangerous data before training). This provided resistance up to 10,000 steps and 300 million tokens of adversarial fine-tuning on ~7 billion parameter models.

This arc sounds promising, but it looks less so once you consider the raw dollar cost of undoing these safeguards. The Deep Ignorance paper's 10,000 steps of fine-tuning likely cost ~$200.

Granted, these costs are for relatively small models. The genomic language model that designed a novel bacteriophage, Evo2, has 40 billion parameters, and the largest frontier LLMs are in the trillions. It's unclear how these methods scale, and as far as I know at the time of writing, there have been no rigorous studies of the scaling laws for the tamper-resistance of different safeguards. The cost of undoing a safeguard is primarily determined by (1) the cost per fine-tuning step and (2) the number of steps needed. It is anyone's guess how the number of steps needed will change as a function of parameter count for different safeguards. As I will argue later, this question is a strategic crux for safeguards in general.
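
As a rough illustration of that two-factor view, the sketch below backs out the per-step cost implied by the ~$200 for 10,000 steps figure and shows how the total would move with model size. The linear scaling of cost per step with parameter count and the `step_multiplier` knob are assumptions made purely for illustration; neither comes from the Deep Ignorance paper or from any measured scaling law.

```python
DEEP_IGNORANCE_COST_USD = 200   # ~$200 estimate quoted in the text
DEEP_IGNORANCE_STEPS = 10_000   # adversarial fine-tuning steps, per the text
BASE_PARAMS = 7e9               # ~7B-parameter model


def circumvention_cost(steps, cost_per_step):
    """Total cost = number of fine-tuning steps x cost per step."""
    return steps * cost_per_step


# Per-step cost implied by the two figures above.
implied_cost_per_step = DEEP_IGNORANCE_COST_USD / DEEP_IGNORANCE_STEPS


def scaled_cost(n_params, step_multiplier=1.0):
    """Hypothetical cost at a different model size.

    ASSUMPTION: cost per step grows linearly with parameter count.
    `step_multiplier` is a hypothetical knob for how the number of
    steps needed changes with size (the unknown discussed above).
    """
    cost_per_step = implied_cost_per_step * (n_params / BASE_PARAMS)
    steps = DEEP_IGNORANCE_STEPS * step_multiplier
    return circumvention_cost(steps, cost_per_step)


if __name__ == "__main__":
    print(f"implied cost per step: ${implied_cost_per_step:.2f}")
    for multiplier in (0.5, 1.0, 2.0):
        cost = scaled_cost(40e9, step_multiplier=multiplier)
        print(f"40B params, steps x{multiplier}: ${cost:,.0f}")
```

The point of the `step_multiplier` argument is that the same per-step cost can yield very different totals depending on whether larger models need fewer or more steps to recover a capability.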

Although smashing successes for the field of open-weight safeguards have been few and far between, it is not entirely outside the realm of possibility that something might be developed which is way better than current safeguards. Since Deep Ignorance was multiple orders of magnitude better than what came before it, maybe there is something orders of magnitude better that could be discovered soon.

But there is arguably a ceiling to how effective a safeguard can be. A safeguard's tamper resistance only matters up to the point where it is cheaper to train the same model from scratch than to fine-tune to remove the safeguard. An adversary is not going to waste their time or money fine-tuning if they can just train the model themselves from randomly initialized weights.

How close could a safeguard push fine-tuning costs toward that ceiling?

We don't know, but the theoretical maximum tamper-resistance might cap out lower than the full training cost. Since training involves ingesting large volumes of data that are likely irrelevant for the intended misuse while fine-tuning may only require the misuse-relevant data, it's plausible that this cap is much lower. Still, we can use the heuristic of training cost as a ceiling to understand the constraints on the impactfulness of tamper-resistance more broadly.

One of several elephants in the room in discussing efforts to build tamper-resistant safeguards is compute cost: both the cost to fine-tune models and the falling cost of training a model from scratch.

Epoch estimates that algorithmic progress has reduced the compute needed to reach a given performance level by roughly 3× per year, while hardware performance per dollar has improved at about 1.4× per year. These rates come with a lot of caveats and the algorithmic progress number in particular may not be amenable to clean extrapolation, but if we take them at face value anyways, this gives approximately 4.2× per year in effective cost decline.

To ground the arguments here in numbers, I'll present some back-of-the-envelope calculations under the conservative, hardware-only progress number of 1.4×/year and the naively extrapolated 4.2×/year. The former is a floor; the latter reflects Epoch's central estimate but is uncertain, since algorithmic progress is dominated by discontinuous jumps rather than smooth trends. It's also possible that another step-change in algorithmic progress could result in much faster progress, but this analysis neglects that scenario to avoid skewing results toward pessimism.

Consider Evo2 again. Evo2 was trained at an estimated cost of ~$10M.

Here is how the training cost for Evo2 drops under each rate of cost decline:

Under hardware-only progress, training costs stay above $100K for over a decade. Under combined progress at 4.2×/yr, the cost of training an Evo2-sized model drops below $100K in about 4 years and below $10K in about 5. A miracle safeguard that is so resistant to fine-tuning that it costs the same to recover capability as it does to train the model would therefore pose only a modest cost barrier to an adversary after ~5 years under a naive extrapolation of current trends.
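
The extrapolation above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes smooth exponential decline from the ~$10M estimate, which the caveats above suggest is optimistic about how cleanly these trends extrapolate; the function names are mine. The crossing times it reports are continuous-time values, so they land slightly before the whole-year marks.

```python
import math

EVO2_TRAINING_COST_USD = 10_000_000               # ~$10M estimate from the text
RATES = {"hardware-only": 1.4, "combined": 4.2}   # cost decline factor per year


def cost_after(years, initial_cost, rate):
    """Cost after `years` of compounding decline at `rate`x per year."""
    return initial_cost / rate ** years


def years_until_below(threshold, initial_cost, rate):
    """Continuous-time years until the cost falls below `threshold`."""
    return math.log(initial_cost / threshold) / math.log(rate)


if __name__ == "__main__":
    for name, rate in RATES.items():
        print(name)
        for year in range(0, 11, 2):
            cost = cost_after(year, EVO2_TRAINING_COST_USD, rate)
            print(f"  year {year:2d}: ${cost:,.0f}")
        for threshold in (100_000, 10_000):
            t = years_until_below(threshold, EVO2_TRAINING_COST_USD, rate)
            print(f"  below ${threshold:,}: ~{t:.1f} years")
```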

Ultimately, with costs falling and hardware efficiency nowhere near the theoretical limit, safeguards are best framed as a play to buy time to deploy other defensive measures. How much time is enough, and how much better do techniques need to get to buy that amount of time? The first question is out of scope here, but suppose the goal is ~3 years of protection to deploy defenses. Over three years, costs decline by ~2.7× at 1.4×/yr, and ~74× at 4.2×/yr.

For a safeguard to remain above an actor's budget for 3 years, the cost to circumvent [1] that safeguard today must exceed the budget multiplied by the total cost decline over that period:

*C(today) > B · rᵗ*

Here, B is the actor's budget, r is the annual cost-decline factor (1.4× or 4.2×), and t is the number of years of protection. To clarify, this means that if the effective compute cost continues to decline at the 4.2×/yr rate, keeping the circumvention cost above $10K for a full three years requires a safeguard that would take ~$700K to break today just in compute costs.
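
The requirement above is a one-line calculation. The sketch below uses the $10K budget and 3-year window from the text; the function name is hypothetical.

```python
def required_cost_today(budget, annual_decline, years):
    """Circumvention cost needed today so that, after `years` of cost
    decline at `annual_decline`x per year, it still exceeds `budget`."""
    return budget * annual_decline ** years


if __name__ == "__main__":
    for label, rate in (("hardware-only (1.4x/yr)", 1.4), ("combined (4.2x/yr)", 4.2)):
        needed = required_cost_today(budget=10_000, annual_decline=rate, years=3)
        print(f"{label}: ${needed:,.0f}")
```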

It's also important to note that an actor is not only spending their budget on compute. The costs of other aspects of bioweapon production may filter out low-budget actors more than compute does, while the class of actors able to spend money on making a biological construct in the lab and weaponizing it may already not be dissuaded by a $10K price tag on breaking in silico safeguards.

The current SOTA, Deep Ignorance, resisted ~$200 of fine-tuning on a 7B-parameter language model. Granted, the benchmarks used to measure capability recovery in these experiments only measure a proxy for dangerous capabilities; the relationship between performance on proxy benchmarks and actual dangerous capability is muddy, and this should be kept in mind as a caveat for interpreting tamper-resistance.

That said, if we take this number and run with it as the circumvention cost, keeping the circumvention cost above $10K USD under hardware-only progress for 3 years still requires ~100× higher tamper-resistance than this.
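
To make the size of that gap concrete, here is a minimal sketch of the multiplier over the ~$200 figure under both rates of cost decline. It treats $200 as the true circumvention cost, which, per the caveat above, is itself uncertain.

```python
import math

CURRENT_BEST_COST_USD = 200  # Deep Ignorance estimate quoted in the text


def improvement_needed(budget, annual_decline, years,
                       current_cost=CURRENT_BEST_COST_USD):
    """Multiplier on today's circumvention cost needed for the cost to
    stay above `budget` after `years` of compounding cost decline."""
    required_today = budget * annual_decline ** years
    return required_today / current_cost


if __name__ == "__main__":
    for label, rate in (("hardware-only", 1.4), ("combined", 4.2)):
        factor = improvement_needed(10_000, rate, 3)
        orders = math.log10(factor)
        print(f"{label}: ~{factor:,.0f}x (~{orders:.1f} orders of magnitude)")
```

Under hardware-only progress the multiplier is on the order of 100×; under the combined rate it is more than an order of magnitude larger.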

On the bright side, the $200 data point is at 7B parameters. The models most relevant to biorisk are potentially much bigger. We don't currently know how the costs will scale with parameter size, or how biological capabilities themselves will scale with size. If tamper-resistance scales favorably with parameter count, this cost could rise enough for large models that safeguards might pose a more meaningful barrier.

At the time of this writing, there appear to be no published measurements of how tamper-resistance scales with model size for any safeguard method or type of model. [2] Still, we can roughly determine what scaling would need to look like to hit the targets from the previous section.

For simplicity, let's assume a power law for circumvention cost *C*:

*C = C₀ · (N/N₀)ᵏ*

Here, N₀ is some number of parameters for which we know the circumvention cost is C₀. To anchor our estimates, we can set N₀ to 7B and vary the starting circumvention cost, using Deep Ignorance's ~$200 as the lower bound.

We can then ask "What does k need to be to protect a model for 3 years against a $10K-budget actor, under different assumptions about size, actual circumvention cost, and the rate of cost decline?" At 70B (about double the size of Evo2), sublinear scaling is viable only when C₀ ≥ $10K under hardware-only progress. This requires the circumvention cost for the best safeguard at 7B to already be 50× above Deep Ignorance and neglects any algorithmic progress. Under 4.2×/yr cost decline at 70B, even starting with a circumvention cost of $10K requires nearly quadratic scaling.

At 700B, sublinear scaling is sufficient if C₀ ≥ $1K under hardware-only progress, and if C₀ ≥ $10K under combined progress, while superlinear scaling is necessary if $200 is closer to the actual circumvention cost today.
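
The required exponent follows from solving the power law for k, with the target cost set to the budget multiplied by the total cost decline. The sketch below is one way to reproduce the cases discussed above; the $1K starting cost is an intermediate value chosen for illustration, and the function name is mine.

```python
import math

N0 = 7e9  # anchor model size (7B parameters)


def required_exponent(c0, n_params, annual_decline,
                      budget=10_000, years=3, n0=N0):
    """Smallest k such that C0 * (N/N0)**k >= budget * annual_decline**years."""
    target = budget * annual_decline ** years
    return math.log(target / c0) / math.log(n_params / n0)


if __name__ == "__main__":
    for n_params in (70e9, 700e9):
        for rate in (1.4, 4.2):
            for c0 in (200, 1_000, 10_000):
                k = required_exponent(c0, n_params, rate)
                regime = "sublinear" if k < 1 else "superlinear"
                print(f"N={n_params / 1e9:.0f}B  rate={rate}x/yr  "
                      f"C0=${c0:,}: k >= {k:.2f} ({regime})")
```

A value of k below 1 means sublinear scaling would suffice for that scenario, while a value above 1 means the safeguard's circumvention cost would have to grow faster than model size.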

These BOTECs suggest that under reasonable assumptions about cost decline, a safeguard likely needs to either scale superlinearly with model size, or be an order of magnitude or two better than the current best ones for circumvention cost to remain over $10K for 3 years for models with <1T parameters.

To make matters worse, there are reasons to expect sublinear scaling for many safeguards. Larger models are more sample-efficient, so an adversary may need fewer fine-tuning steps to recover a given capability. They also generalize more effectively from adjacent knowledge, which could allow them to reconstruct filtered capabilities from related data that wasn't removed.

It's unclear how large models need to be before they are "risky enough," so although these arguments point toward pessimism for models closer to Evo2-size, it's possible that models with powerful biological capabilities are many times larger. This would make safeguards with modest but favorable scaling trends a potentially viable strategy. It's therefore important to track both how safeguard tamper-resistance scales with size and how the biological capabilities themselves scale to determine mitigation priorities.

Tamper-resistance's power as a cost barrier hinges on algorithmic progress, tamper-resistance scaling, and the potential for step-change improvements in tamper-resistance over current safeguards. To price out bad actors with fine-tuning compute cost, we likely either need safeguards multiple orders of magnitude more tamper-resistant than the current best, or superlinear scaling of tamper-resistance with model size.

If scaling tamper-resistance shows promise, we should focus on safeguarding big models, not small ones where the safeguards will be trivial to undo in the future *unless* we have reason to believe an adversary would not *try* to undo safeguards on the small models. Scaling properties are a crux for any theory of change that involves making a cost barrier out of fine-tuning compute. This points toward deprioritizing safeguards that do not appear to exhibit superlinear scaling in tamper-resistance with model size, unless those safeguards have a different theory of change altogether.

If scaling does *not* show promise, we're in a much tougher situation. With linear scaling, for instance, we would need major breakthroughs in tamper-resistant safeguards that are on par with the size of capabilities-focused algorithmic progress breakthroughs in years past. We'd need another two orders of magnitude to go from the current estimated circumvention cost to a circumvention cost of ~$10K, which entails multiple leaps of progress as large as pretraining data filtering was over the safeguards that preceded it.

It's not clear that such safeguards exist, and even if they do, how much time will it take to discover them?

If they take 2 years to develop, there is a real chance that they will be made irrelevant as soon as they are made. This is further complicated by timelines to actual deployment, which requires either adoption by developers or mandates from above by governments, both of which suffer from single-defector problems. If the silver-bullet safeguard is developed, what forcing function exists to ensure open-weight model developers globally adopt it? How long does this take? We also shouldn't underestimate the dark horse that is algorithmic progress. Putting numbers down on something as ghostly and intangible as algorithmic progress is difficult, and this is a caveat to the aggressive 4.2× cost decline estimates above.

However, there is a lot of weight pulling the trends toward aggressive timelines. On the one side, we have the entire AI industry pushing costs down through algorithmic and hardware progress; on the other, a handful of AI safety teams, academics, and nonprofits trying to push the tamper-resistance ball uphill against this. There is currently far more funding, talent, and time being poured into the exact drivers of progress that cut against tamper-resistance than into the whole field of open-weight model safeguards.

Overall, the picture looks quite grim for improving open-weight safeguards as a means to reduce biorisk. Train-from-scratch costs and expertise barriers are declining, and the safeguards that exist are insufficient to pose a meaningful barrier at current levels of tamper-resistance. The ceiling is falling, and we haven't even reached the floor. There is still a silver lining if we can show that safety scales, or if we can develop safeguards with a different theory of change than making fine-tuning costs or expertise into barriers. If this doesn't pan out, though, it may be time for a strategic recalibration on AIxBio risk mitigation for open-weight models. I'll lay out specific ideas for this in future posts.

Acknowledgements: Special thanks to my ERA mentor Aaron Maiwald for the discussions that initially spawned this blogpost and shaped many of the points made here, to my ERA RM Lennart Justen for the push to prioritize this, and to members of the Substack Mafia (Annabella Wheatley, Jack Douglass, and Sam Smith), who helped to push this over the finish line! And thanks to Zuzanna Matuszewska plus all those already mentioned for helpful feedback!

[1] One note on language: "circumvention cost" here means the cost to fine-tune a safeguarded model until it recovers enough dangerous capability to meaningfully uplift an attacker. In practice, there's no clean moment where a safeguard "breaks," and there's an entire subfield around determining whether a model even has the dangerous bio-capability, so this is much fuzzier than this analysis makes it seem.

[2] There is a group working on this for protein language models; reach out if you'd like to know more!

Caveat: I haven't formally estimated this difference.

Source: forum.effectivealtruism.org (original article)