{"slug": "tamper-resistance-is-a-moving-target-we-might-not-hit", "title": "Tamper-Resistance is a Moving Target We Might Not Hit", "summary": "Open-weight AI models are fundamentally harder to safeguard than closed models because adversaries can extract weights directly and retrain them to bypass embedded protections. In September 2025, the open-source genomic language model Evo2 successfully designed a bacteriophage that outperformed a natural pathogen in infecting E. coli, marking the first instance of an AI-generated pathogen surpassing a natural one. Most biological AI models remain open-source with few effective safeguards, leaving the field of AI-bio risk mitigation struggling with unclear threat models and a cat-and-mouse dynamic between new attacks and defenses.", "body_md": "Crossposted with permission from [Substack](https://leewall.substack.com/p/tamper-resistance-is-a-moving-target) by the EA Forum Team.\n\n*Subtitle: Open-Weight Model Safeguards and Biorisk*\n\nThe AI safety and biosecurity communities are abuzz with concerns about open-weight models. Increasingly powerful models threaten [both to democratize the requisite knowledge for causing harm with biology, and to raise the ceiling of harm that a bad actor can cause](https://www.nti.org/analysis/articles/statement-on-biosecurity-risks-at-the-convergence-of-ai-and-the-life-sciences/), which has led to [efforts to develop safeguards](https://www.science.org/content/article/built-safeguards-might-stop-ai-designing-bioweapons) to prevent misuse. When models are open-weight, though, they are fundamentally more challenging to safeguard than closed models.\n\nA safeguard on an open-weight model needs to be *embedded within the weights themselves*, not in a software wrapper around the weights, as a bad actor can simply extract the weights directly. 
They can then further train the model to erode or bypass many of the safeguards that would otherwise prevent them from misusing a model's more dangerous capabilities.\n\nTo complicate things, the success of generative AI in domains like language and image generation has spurred some developers to build generative AI models for biology. One such model, [Evo2](http://evo), is essentially an LLM trained on genomes from over 128,000 species across all domains of life instead of a corpus of digitized text, making it a \"genomic language model.\" Fundamentally, it does the same thing as an LLM: it takes tokens as input, then predicts the next token. For Evo2, though, this means predicting nucleotides rather than readable text.\n\nIn September 2025, that model was used to [design a bacteriophage](https://www.biorxiv.org/content/10.1101/2025.09.12.675911v1) that was then made in the lab, and it *successfully infected E. coli bacteria, demonstrating higher fitness than an existing E. coli-infecting bacteriophage*. That is to say, we have now seen the first instance of an AI-generated pathogen that can outperform a natural pathogen.\n\nOne might hope that these models are closed-access, heavily regulated, and closely monitored, and that they have safeguards preventing misuse just in case the weights leak. Instead, most biological AI models are open-source, [few have safeguards](https://epoch.ai/blog/expanding-our-analysis-of-biological-ai-models), and it is unclear whether safeguards, where they exist, are effective at preventing misuse. The field of AI-Bio risks is still determining how much risk existing models actually pose, and the field of model safeguards is still largely a [cat-and-mouse game](https://arxiv.org/abs/2408.00761) where [new, clever attacks](https://arxiv.org/abs/2412.07097) are developed as soon as new, clever defenses are built. 
Consequently, AI-Bio risk mitigation efforts are swimming in murky water where prioritization can be difficult, threat models are often unclear, and progress is hard to measure.\n\nThis post is an effort to take a step back and evaluate the theory of change for open-weight safeguards as a tool for mitigating risks of catastrophic harm from the misuse of biotechnology.\n\n*Note: I'm going to assume that the audience of this post is already coming in with some context on open-weight models and the risks they pose in biosecurity. If you lack this context, check out my colleague Sophie's **primer** on the topic! If you are already mostly familiar with this topic, skip the first section.*\n\nCan we make a model bad at helping with harm in a way that is not easy for an adversary with weight access to undo?\n\nThis is the central question of the quest for *tamper-resistant safeguards* for open-weight models. The point of a safeguard is to prevent access to some dangerous capability, while \"tamper-resistance\" is this additional challenge of making it difficult for someone trying to misuse a model to evade or remove these safeguards, even though they have full access to modify any part of the model they want. The challenge here is analogous to designing a clock's gears such that if you handed someone the clock and a toolkit, they wouldn't be able to make the clock hit a certain number. This has turned out to be about as difficult for AI models as it would be for clocks.\n\nFor a clock, part of what might make this difficult is the asymmetry between the expertise of the clockmaker and the tamperer. For AI models, this may be increasingly inverted. Models themselves are improving in the very capabilities that enable someone to tamper with them. The ML engineering expertise required to fine-tune a model is a fragile barrier when the AI industry is explicitly aiming at building models that are able to automate AI research. 
This is increasingly true as models are put into harnesses that allow them to act as agents, autonomously interacting with and assisting users. The more capable the agents that a bad actor has access to, the more expertise barriers are eroded. If a bad actor has a model that can assist them in the engineering tasks necessary for circumventing a safeguard, tamper-resistance largely collapses to a problem of increasing the *cost* for them to do this.\n\nSafeguards for open-weight models are improving in tamper-resistance, but the barriers to circumvent them are still frighteningly low.\n\nTamper-resistance is generally measured as the number of adversarial fine-tuning steps a safeguard withstands before the dangerous capability returns. Refusal training falls in [a few steps](https://arxiv.org/abs/2310.03693). TAR (\"Tampering Attack Resistance\") trains models to resist being fine-tuned back to dangerous behavior, and it resists [hundreds of steps](https://arxiv.org/abs/2412.07097) on an 8B-parameter model. The biggest leap in progress came from \"Deep Ignorance\" in 2025: not training on dangerous data in the first place (or, more precisely, building a classifier for what's dangerous and what isn't, then aggressively filtering out the dangerous data before training). This provided resistance up to 10,000 steps and 300 million tokens of adversarial fine-tuning on ~7 billion parameter models.\n\nThis arc sounds promising, but it looks less promising once you consider the raw dollar cost of undoing these safeguards. The Deep Ignorance paper's 10,000 steps of fine-tuning likely cost ~$200.\n\nGranted, these costs are for relatively small models. The genomic language model that designed a novel bacteriophage, Evo2, has 40 billion parameters, and the largest frontier LLMs are in the trillions. 
It's unclear how these methods scale, and at the time of this writing (and as far as I know), there have thus far been no rigorous studies of the scaling laws for the tamper-resistance of different safeguards. The variables determining this are primarily (1) the cost per fine-tuning step, and (2) the number of steps needed. It is anyone's guess how the number of steps needed will change as a function of parameter count for different safeguards. As I will argue later, this question is a strategic crux for safeguards in general.\n\nAlthough smashing successes for the field of open-weight safeguards have been few and far between, it is not entirely outside the realm of possibility that something might be developed which is way better than current safeguards. Since Deep Ignorance was multiple orders of magnitude better than what came before it, maybe there is something orders of magnitude better that could be discovered soon.\n\nBut there is arguably a ceiling to how effective a safeguard can be. A safeguard's tamper resistance only matters up to the point where it is cheaper to train the same model from scratch than to fine-tune to remove the safeguard. An adversary is not going to waste their time or money fine-tuning if they can just train the model themselves from randomly initialized weights.\n\nHow close could a safeguard push fine-tuning costs toward that ceiling?\n\nWe don't know, but the theoretical maximum tamper-resistance might cap out lower than the full training cost. Since training involves ingesting large volumes of data that are likely irrelevant for the intended misuse while fine-tuning may only require the misuse-relevant data, it's plausible that this cap is *much* lower. Still, we can use the heuristic of training cost as a ceiling to understand the constraints on the impactfulness of tamper-resistance more broadly.\n\nOne of several elephants in the room in discussing efforts to build tamper-resistant safeguards is compute cost. 
This includes both the cost to fine-tune models and the falling cost of training a model from scratch.\n\nEpoch estimates that algorithmic progress has reduced the compute needed to reach a given performance level by [roughly 3× per year](https://epoch.ai/trends), while hardware performance per dollar has improved at about 1.4× per year. These rates come with a lot of caveats, and the algorithmic progress number in particular may not be amenable to clean extrapolation, but if we take them at face value anyway, multiplying them gives approximately 4.2× per year in effective cost decline.\n\nTo ground the arguments here in numbers, I'll present some back-of-the-envelope calculations under the conservative, hardware-only progress number of **1.4×/year** and the naively extrapolated **4.2×/year**. The former is a floor; the latter reflects Epoch's central estimate but is uncertain, since algorithmic progress is dominated by discontinuous jumps rather than smooth trends. It's also possible that another step-change in algorithmic progress could result in much faster progress, but this analysis neglects that scenario to avoid skewing results toward pessimism.\n\nConsider Evo2 again. Evo2 was trained at an [estimated cost of ~$10M](https://ratankaliani.com/posts/evo2/).\n\nHere is how the training cost for Evo2 drops under each rate of cost decline:\n\nUnder hardware-only progress, training costs stay above $100K for over a decade. Under combined progress at 4.2×/yr, training an Evo2-sized model drops below $100K in about 4 years and below $10K in about 5. A miracle safeguard that is so resistant to fine-tuning that it costs the same to recover capability as it does to train the model would therefore pose only a modest cost barrier to an adversary after ~5 years under a naive extrapolation of current trends.\n\nUltimately, with costs falling and hardware efficiency nowhere near the theoretical limit, safeguards are best framed as a play to buy time to deploy other defensive measures. 
How much time is enough, and how much better do techniques need to get to buy that amount of time? The first question is out of scope here, but suppose the goal is ~3 years of protection to [deploy defenses](https://80000hours.org/podcast/episodes/andrew-snyder-beattie-four-pillars-biosecurity-pandemic/). Over three years, costs decline by ~2.7× at 1.4×/yr, and ~74× at 4.2×/yr.\n\nFor a safeguard to remain above an actor's budget for 3 years, the cost to circumvent [1] that safeguard today must exceed the budget multiplied by the total cost decline:\n\n*C_today > B · r³*\n\nHere, *B* is the actor's budget and *r* is the annual cost-decline factor, so *r³* is the ~2.7× or ~74× from above.\n\nTo clarify, this means that if the effective compute cost continues to decline at the rate it has so far, keeping the circumvention cost above $10K for a full three years requires a safeguard that would take ~$700K to break today *just in compute costs.*\n\nIt's also important to note that an actor is not only spending their budget on compute. [The costs of other aspects of bioweapon production](https://www.rand.org/pubs/perspectives/PEA4102-1.html) may filter out low-budget actors more than compute does, while the class of actors able to spend money on making a biological construct in the lab and weaponizing it may not be dissuaded by a $10K price tag on breaking in silico safeguards.\n\nThe current SOTA, [Deep Ignorance](https://deepignorance.ai/), resisted ~$200 of fine-tuning on a 7B-parameter language model. Granted, the benchmarks used to measure capability recovery in these experiments only measure a proxy for dangerous capabilities; the relationship between performance on proxy benchmarks and actual dangerous capability is muddy, and this should be kept in mind as a caveat for interpreting tamper-resistance.\n\nThat said, if we take this number and run with it as the circumvention cost, keeping the circumvention cost above $10K under hardware-only progress for 3 years still requires more than 100× higher tamper-resistance than this.\n\nOn the bright side, the $200 data point is at 7B parameters. 
The models most relevant to biorisk are potentially much bigger. We don't currently know how the costs will scale with parameter count, or how biological capabilities themselves will scale with size. If tamper-resistance scales favorably with parameter count, this cost could rise enough for large models that safeguards might pose a more meaningful barrier.\n\nAt the time of this writing, there appear to be no published measurements of how tamper-resistance scales with model size for any safeguard method or type of model. [2] Still, we can roughly determine what scaling would need to look like to hit the targets from the previous section.\n\nFor simplicity, let's assume a power law for circumvention cost *C*:\n\n*C = C₀ · (N/N₀)ᵏ*\n\nHere, *N₀* is some number of parameters for which we know the circumvention cost is *C₀*. To anchor our estimates, we can set *N₀* to 7B and vary the starting circumvention cost, using Deep Ignorance's ~$200 as the lower bound.\n\nWe can then ask \"What does *k* need to be to protect a model for 3 years against a $10K-budget actor, under different assumptions about size, actual circumvention cost, and the rate of cost decline?\" Solving the power law for the exponent gives *k = log(C/C₀) / log(N/N₀)*, where *C* is the circumvention cost required today.\n\nAt 70B (about double the size of Evo2), sublinear scaling is viable only when *C₀* ≥ $10K under hardware-only progress. This requires the circumvention cost for the best safeguard at 7B to already be 50× above Deep Ignorance and neglects any algorithmic progress. 
Under 4.2×/yr cost decline at 70B, even starting with a circumvention cost of $10K requires nearly quadratic scaling.\n\nAt 700B, sublinear scaling is sufficient if *C₀* ≥ $1K under hardware-only progress, and if *C₀* ≥ $10K under combined progress, while superlinear scaling is necessary if $200 is closer to the actual circumvention cost today.\n\nThese BOTECs suggest that **under reasonable assumptions about cost decline, a safeguard likely needs to either scale superlinearly with model size, or be an order of magnitude or two better than the current best ones for circumvention cost to remain over $10K for 3 years for models with <1T parameters.**\n\nTo make matters worse, we might also expect sublinear scaling instead for many safeguards. Larger models are more sample-efficient, so an adversary may need fewer fine-tuning steps to recover a given capability. They also generalize more effectively from adjacent knowledge, which could allow them to reconstruct filtered capabilities from related data that wasn't removed.\n\nIt's unclear how large models need to be before they are \"risky enough,\" so although these arguments point toward pessimism for models closer to Evo2-size, it's possible that models with powerful biological capabilities are many times larger. This would make safeguards with modest but favorable scaling trends a potentially viable strategy. It's therefore important to track both how safeguard tamper-resistance scales with size and how the biological capabilities themselves scale to determine mitigation priorities.\n\nTamper-resistance's power as a cost barrier hinges on algorithmic progress, tamper-resistance scaling, and the potential for step-change improvements in tamper-resistance over current safeguards. 
To price out bad actors with fine-tuning compute cost, we likely either need safeguards multiple orders of magnitude more tamper-resistant than the current best, or superlinear scaling for tamper-resistance with model size.\n\nIf scaling tamper-resistance shows promise, we should focus on safeguarding big models, not small ones where the safeguards will be trivial to undo in the future *unless* we have reason to believe an adversary would not *try* to undo safeguards on the small models. Scaling properties are a crux for any theory of change that involves making a cost barrier out of fine-tuning compute. This points toward deprioritizing safeguards that do not appear to exhibit superlinear scaling in tamper-resistance with model size, unless those safeguards have a different theory of change altogether.\n\nIf scaling does *not* show promise, we're in a much tougher situation. With linear scaling, for instance, we would need major breakthroughs in tamper-resistant safeguards that are on par with the size of capabilities-focused algorithmic progress breakthroughs in years past. We'd need another two orders of magnitude to go from the current estimated circumvention cost to a circumvention cost of ~$10K, which entails multiple leaps of progress as large as pretraining data filtering was over the safeguards that preceded it.\n\nIt's not clear that such safeguards exist, and even if they do, how much time will it take to discover them?\n\nIf they take 2 years to develop, there is a real chance that they will be made irrelevant as soon as they are made. This is further complicated by timelines to actual deployment, which requires either adoption by developers or mandates from above by governments, both of which suffer from single-defector problems. If the silver-bullet safeguard is developed, what forcing function exists to ensure open-weight model developers globally adopt it? 
How long does *this* take?\n\nWe also shouldn't underestimate the dark horse that is algorithmic progress. Putting numbers down on something as ghostly and intangible as algorithmic progress is [difficult](https://epoch.ai/gradient-updates/the-least-understood-driver-of-ai-progress), and this is a caveat to the aggressive 4.2× cost decline estimates above.\n\nHowever, there is a lot of weight pulling the trends toward aggressive timelines. On one side, we have the *entire AI industry* pushing costs down through algorithmic and hardware progress; on the other, a handful of AI safety teams, academics, and nonprofits are trying to push the tamper-resistance ball uphill against this. There is currently far more funding, talent, and time being poured into the exact drivers of progress that cut against tamper-resistance than into the whole field of open-weight model safeguards! [3]\n\nOverall, the picture looks quite grim for improving open-weight safeguards as a means to reduce biorisk. Train-from-scratch costs and expertise barriers are declining, and the safeguards that exist are insufficient to pose a meaningful barrier at current levels of tamper-resistance. The ceiling is falling, and we haven't even reached the floor. There is still a silver lining if we can show that safety scales, or if we can develop safeguards with a different theory of change than making fine-tuning costs or expertise into barriers. If this doesn't pan out, though, it may be time for a strategic recalibration on AIxBio risk mitigation for open-weight models. 
I'll lay out specific ideas for this in future posts.\n\n**Acknowledgements:** Special thanks to my ERA mentor Aaron Maiwald for the discussions that initially spawned this blogpost and shaped many of the points made here, to my ERA RM Lennart Justen for the push to prioritize this, and to members of the Substack Mafia (Annabella Wheatley, Jack Douglass, and Sam Smith), who helped to push this over the finish line! And thanks to Zuzanna Matuszewska plus all those already mentioned for helpful feedback!\n\n[1] One note on language: \"circumvention cost\" here means the cost to fine-tune a safeguarded model until it recovers enough dangerous capability to meaningfully uplift an attacker. There's no clean moment where a safeguard \"breaks,\" and there's an entire subfield around determining whether a model even has the dangerous bio-capability, so this is much fuzzier in reality than this analysis makes it.\n\n[2] There is a group working on this for protein language models; reach out if you'd like to know more!\n\n[3] Caveat: I haven't formally estimated this difference.", "url": "https://wpnews.pro/news/tamper-resistance-is-a-moving-target-we-might-not-hit", "canonical_source": "https://forum.effectivealtruism.org/posts/cGm8cyhGx3ZKzQxbz/tamper-resistance-is-a-moving-target-we-might-not-hit", "published_at": "2026-06-04 13:56:14+00:00", "updated_at": "2026-06-04 14:09:04.915592+00:00", "lang": "en", "topics": ["artificial-intelligence", "large-language-models", "ai-safety", "ai-policy", "generative-ai"], "entities": ["Evo2", "Substack", "EA Forum Team", "NTI"], "alternates": {"html": "https://wpnews.pro/news/tamper-resistance-is-a-moving-target-we-might-not-hit", "markdown": "https://wpnews.pro/news/tamper-resistance-is-a-moving-target-we-might-not-hit.md", "text": "https://wpnews.pro/news/tamper-resistance-is-a-moving-target-we-might-not-hit.txt", "jsonld": "https://wpnews.pro/news/tamper-resistance-is-a-moving-target-we-might-not-hit.jsonld"}}