TLDR: We may capture much or most of the available AI safety benefit by reserving expensive, specialized agents for the <1% of tasks that carry catastrophic risk. This would mean that AI safety work on high-cost but highly safe systems could be very useful.
The standard objection to compute-heavy AI safety measures is competitive: any lab paying a large alignment tax gets undercut by one that doesn't, so expensive safety doesn't survive the market.
This objection typically assumes the tax is paid uniformly: levied on every action regardless of what that action is doing. Drop that assumption and the objection loses most of its force. If the expensive treatment can be applied selectively, to the small fraction of actions where catastrophic consequences live, the blended overhead is small even when the per-action multiplier is enormous. A 100× tax on 0.1% of actions is roughly a 10% tax on the system (0.999 × 1 + 0.001 × 100 ≈ 1.10).
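The arithmetic can be checked with a few lines of Python. This is only a sketch under simplifying assumptions that are mine, not claims from the argument above: every action has the same baseline cost, and the function names `blended_cost` and `max_fraction_for_overhead` are made up for illustration.

```python
def blended_cost(fraction: float, multiplier: float) -> float:
    """Average cost per action, relative to a baseline of 1.0.

    `fraction` of actions cost `multiplier` times the baseline;
    the rest cost the baseline. Assumes uniform baseline cost.
    """
    return (1.0 - fraction) * 1.0 + fraction * multiplier


def max_fraction_for_overhead(overhead: float, multiplier: float) -> float:
    """Largest fraction of actions that can get the expensive treatment
    while keeping total overhead at or below `overhead` (0.10 means 10%).

    Solves (1 - f) + f * m = 1 + overhead for f.
    """
    return overhead / (multiplier - 1.0)


if __name__ == "__main__":
    cost = blended_cost(fraction=0.001, multiplier=100.0)
    print(f"blended cost: {cost:.3f}x baseline")
    share = max_fraction_for_overhead(overhead=0.10, multiplier=100.0)
    print(f"share of actions affordable at 10% overhead: {share:.4%}")
```

The exact overhead comes out slightly under 10% because the expensive actions would have cost 1× anyway. The second function runs the same relationship in reverse: given an overhead budget, it returns how large the critical slice can be.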
How would one spend 100× on additional safety? The easiest answer is inference-time safety solutions. Redwood’s control research has studied several. Certain situations could also call for different LLMs, perhaps ones optimized for transparency, robustness, or neutrality, or ones that have simply been more heavily vetted.
Could we identify and isolate the top n% of tasks? There are two clear solutions. The first is to get good at understanding which AI tasks are most critical. I'd assume there are some fairly obvious moves to start with. Frontier LLM development and deployment tasks seem critical to get right; a major corporate database migration is clearly more critical than a local one. If you simply charged 100× for more robust LLM agents today, I'd expect them to get used for some of the most important cases by default.
The second is to actively design processes so the critical work is confined to a small amount of computation. Companies already do this in many settings: it's risky to hand direct bank-account access to every employee, so access is restricted to a few trusted ones and everyone else goes through a request process. The same shape applies to high-consequence AI actions.
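As a rough illustration of that shape, here is a minimal routing sketch. Everything in it is hypothetical: the `Action` type, the `restricted` flag, and the two stand-in agents are invented for the example, and it assumes the process design has already labeled which actions touch restricted resources, which is the hard part.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Action:
    description: str
    # Hypothetical flag: set by process design, not inferred by the agent.
    restricted: bool


Handler = Callable[[Action], str]


def cheap_agent(action: Action) -> str:
    # Stand-in for the ordinary, low-cost agent.
    return f"cheap:{action.description}"


def vetted_agent(action: Action) -> str:
    # Stand-in for the expensive, heavily vetted agent.
    return f"vetted:{action.description}"


def route(
    action: Action,
    cheap: Handler = cheap_agent,
    vetted: Handler = vetted_agent,
) -> str:
    """Send restricted actions to the vetted agent and the rest to the cheap one."""
    handler = vetted if action.restricted else cheap
    return handler(action)


if __name__ == "__main__":
    print(route(Action("summarize meeting notes", restricted=False)))
    print(route(Action("approve wire transfer", restricted=True)))
```

The point of the sketch is that the cheap agent never holds the restricted capability at all; it can only produce a request that the gate routes to the vetted path, mirroring the bank-account example.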
Of course, a different solution is to use signals from the ~100× costlier agents as training data in post-training. This probably calls for agents with very different properties than ones optimized for general direct consumer use, but the big-picture economic justifications might be similar.
In principle, the use of highly costly but reliable systems could be formalized. Certain AI development decisions might be deemed sensitive enough that they can only be carried out by a specific set of expensive, vetted (perhaps government-approved) AI agents. There are ways this goes poorly, but also versions that look like a reasonable extension of current practice.
Given all of this, I think that:
Objections
*This post was improved with Claude Opus. Opus provided high-level feedback, helped find the links, and made a bunch of wording adjustments.*