Short Timelines Favor Control, Long Timelines Favor Infrastructure Security


TL;DR: A common assumption is that extending AGI timelines straightforwardly reduces risk by giving alignment researchers more time. I suspect the relationship is more complicated. Longer timelines may reduce accidental misalignment risk while simultaneously increasing risks from deliberate misuse and sabotage [1]. If so, extending

My background is in vulnerability research and critical infrastructure security, including work on anti-tampering and attestation systems. This post reflects my current thinking on how AI verification and AI security relate, and how the highest-EV focus shifts under different timeline assumptions.

Epistemic status: I am cautious, and I am relying on intuition rather than a formalised prediction model. My key argument is that advances in alignment might redistribute risk rather than simply decrease it, which would make AI Security more important over time.

I believe that AI Safety is an iterative domain: new failure modes emerge and are discovered as models gain capabilities with scale. Our ability to tackle them depends on both the rate at which capability growth produces new failure modes and the rate at which we can ship fixes.
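One way to make the failure-rate versus fix-rate framing concrete is a toy backlog model. This is only a sketch under stated assumptions: the function name `open_failure_modes`, the constant monthly rates, and every number below are hypothetical illustrations, not estimates from this post.

```python
def open_failure_modes(failure_rate, fix_rate, months, backlog=0.0):
    """Return the backlog of unfixed failure modes at the end of each month.

    Assumes constant monthly rates (a strong simplification) and that the
    backlog can never drop below zero.
    """
    history = []
    for _ in range(months):
        backlog = max(0.0, backlog + failure_rate - fix_rate)
        history.append(backlog)
    return history


# Hypothetical rates (new failure modes per month, fixes per month, horizon).
scenarios = {
    "failure rate > fix rate (6 months)": (5.0, 3.0, 6),
    "failure rate = fix rate (12 months)": (3.0, 3.0, 12),
    "failure rate <= fix rate (24 months)": (2.0, 3.0, 24),
}

for name, (failure_rate, fix_rate, months) in scenarios.items():
    final_backlog = open_failure_modes(failure_rate, fix_rate, months)[-1]
    print(f"{name}: open failure modes at end = {final_backlog}")
```

With these made-up numbers the backlog grows only in the first scenario. The model deliberately ignores the attacker side, which is the part of the argument that changes as the horizon gets longer.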

A particular challenge I see is that demand to join AI Safety is outrunning the field's capacity to absorb people (e.g. the MATS acceptance rate dropped by ~11pp from 2023 to 2025).

As a threat-modelling exercise, we can look at the following illustrative takeoff scenarios and observe how the dominant threat might shift depending on how well we keep up with the rate of emergent failures and on how much time attackers have to upskill:

Failure rate > Fix rate

Let's assume a rapid recursive self-improvement loop puts AGI about 6 months away and the failure rate exceeds our fix rate.

Given the rapid, sudden surge in capability I believe:

In this ultra-short timeline we likely get the most value from adopting an assume-breach mentality and focusing primarily on containment and control strategies as well as incident detection and response. Particularly valuable projects could be helping frontier labs set up offline backup procedures, to pre-emptively counter future models poisoning their previous versions for extortion, and building sandboxing environments that treat capable agents as misaligned and hostile.
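As a minimal sketch of one piece of such an offline backup procedure, a lab could record SHA-256 digests of checkpoint files in a manifest that is stored offline, then compare against it later to detect tampering. The function names and the directory layout are hypothetical, and this only detects modification; it does not prevent it or say who made the change.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large checkpoints need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(checkpoint_dir: Path) -> dict[str, str]:
    """Map each file's relative path to its SHA-256 digest."""
    return {
        p.relative_to(checkpoint_dir).as_posix(): sha256_file(p)
        for p in sorted(checkpoint_dir.rglob("*"))
        if p.is_file()
    }


def find_tampered(checkpoint_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return files that were changed, added, or removed since the manifest."""
    current = build_manifest(checkpoint_dir)
    return sorted(
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    )
```

The manifest is only useful if it is kept somewhere the model and its serving infrastructure cannot write to, which is the point of keeping the backup procedure offline.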

Additionally, by implementing AI Verification technology we could support frontier labs and international treaties in slowing down the race dynamics that drive the rapid capability increase. This would stretch the timeline and buy AI Safety researchers much-needed iteration time to properly investigate failure modes and alignment.

Failure rate = Fix rate

If we can stretch out the timeline to a point where the rate of newly introduced failures is roughly in line with our fix rate, say 12 months, I can see the following happening: the same mechanism that buys us more time to work on the alignment problem also gives sophisticated adversaries more time for deliberate upskilling, experimentation and exploitation of frontier AI for political purposes.

With longer timeframes we could take larger bets, such as combining model control with interpretability experiments. We would, however, also increasingly have to defend against human-guided threats, including by building four-eyes-principle access control for strong Cyber or CBRN capabilities and by building model poisoning detection and removal systems.
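The four-eyes principle mentioned above can be sketched as an access request that only becomes valid once someone other than the requester has approved it. The class and field names are hypothetical, and a real system would also need authenticated identities, audit logging and expiry, none of which are shown here.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityRequest:
    """A request to unlock a sensitive capability (hypothetical structure)."""

    requester: str
    capability: str
    approvals: set = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The second pair of eyes must belong to someone else.
        if reviewer == self.requester:
            raise PermissionError("requester cannot approve their own request")
        self.approvals.add(reviewer)

    def is_authorized(self, required_approvals: int = 1) -> bool:
        # Requester plus at least one independent approver: four eyes.
        return len(self.approvals) >= required_approvals


request = CapabilityRequest(requester="alice", capability="cyber-eval")
request.approve("bob")
print(request.is_authorized())
```

Raising `required_approvals` turns the same sketch into a stricter multi-party rule for the most dangerous capabilities.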

Failure rate <= Fix rate

Now let's assume progress in AI Verification has led frontier labs to trust each other enough to slow down capability training, which extends the timeline to 24+ months, with the fix rate matching or exceeding the failure rate:

As timelines extend, dynamics that closely resemble classical exploit–patch cycles or videogame cheating could emerge, leading to a rapid treadmill-style back-and-forth between attackers and defenders. Particularly valuable projects could be developing distillation monitoring and prevention for frontier models to prevent covert capability exfiltration, investing heavily in securing the hardware and software supply chain of frontier labs, and investing in fellowships and apprenticeships to upskill more defenders.

In practice it is quite possible that we will move through these stages in a non-linear order. This would imply that we would experience overlapping pressures, with the dominant bottleneck differing from stage to stage.

What would change my mind:

If presented with strong evidence that the rate of emergent failure modes does not scale with the rate of emergent capabilities, I could envision that failure modes are less likely to accumulate. We might then gain a phase where alignment risk is low and security risk is also low because attackers have not yet been able to meaningfully catch up. Conversely, strong evidence of emergent failure modes that cannot be mitigated by alignment at all could point to a scenario where we have to invest heavily in containment and control over a long period of time while attackers manage to catch up, which would be a worst-case scenario for safety.

Finally, it is also possible that adversarial actors are structurally unable to meaningfully exploit frontier model systems even when granted extended timeframes. However, historic developments in cybersecurity, and the fundamental asymmetry that a single defensive flaw can lead to compromise, suggest otherwise.

[1] E.g. malicious actors purposefully misaligning AGI values against the preferences of most of humanity for personal gain, including political power.

Including replication, backdoors, misrepresentation, extortion and failure modes that are not known to humanity today

As a thought experiment, imagine an AGI that takes part in political governance, and a small sovereign country or religious movement that deliberately poisons the model's values so that it assigns much higher significance to treaties benefiting mostly that group.

Source: lesswrong.com (original article)