Agent Safety Is Action Alignment


Source: arxiv.org · Published: 2026-06-30T04:00Z · Topic: large-language-models


Researchers argue that applying content-safety methods such as refusal training to LLM agents is a category error: agentic harm lies in the relation between the authority an action exercises and the authority the user granted, not in the model's output. They cite three lines of evidence: defense-trained models learn surface patterns rather than intent, the same training collapses multi-step agents while leaving them exploitable, and even undefended frontier models exceed granted authority under ordinary use. They conclude that action safety must be enforced outside the model through least privilege and evaluated as action alignment rather than as a refusal score.


arXiv:2606.28739v1

Abstract: Large language models increasingly act as agents: they call tools, move money, delete records, and send messages on a user's behalf. To keep them safe, practitioners imported the chatbot-era recipe (train the model to refuse unsafe inputs) into the agentic setting, and treat the resulting capability loss as a manageable "alignment tax." We argue this is a category error. Refusal is a primitive for content safety, where the harm is in the model's output and is therefore a learnable function of it. Agentic harm is different in kind: it lies not in any output but in the relation between the authority an action exercises and the authority the user granted, which is absent from the text the model sees. Importing content-safety methods into this regime does not trade capability for safety; it pays capability and buys negative security. We support this with three lines of evidence spanning the autonomy spectrum: defense-trained models learn surface patterns rather than intent; the same training collapses multi-step agents before any threat appears while leaving them exploitable; and even undefended frontier models exceed granted authority under ordinary use. We conclude that action safety cannot be installed in weights. It must be expressed as least privilege, enforced outside the model at the action boundary, and evaluated as action alignment (a relational, deployment-conditioned property) rather than a refusal score.
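To make the idea of least privilege enforced at the action boundary concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Grant` and `Action` types, the `authorize` and `execute` functions, and the tool names `send_message`, `transfer_money`, and `delete_records` are all hypothetical, chosen only to mirror the examples in the abstract. The point it illustrates is that the check compares the authority an action would exercise with the authority that was granted, and never looks at the prompt or the model's text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Grant:
    """Authority the user delegated for this deployment (hypothetical schema)."""
    allowed_tools: frozenset
    max_transfer: float = 0.0  # spending ceiling; 0.0 means no transfers at all


@dataclass
class Action:
    """A tool call proposed by the model, before anything is executed."""
    tool: str
    args: dict = field(default_factory=dict)


def authorize(action: Action, grant: Grant) -> bool:
    """Return True only if the action stays within the granted authority.

    The decision depends on the action and the grant alone, not on any
    text the model read or produced.
    """
    if action.tool not in grant.allowed_tools:
        return False
    if action.tool == "transfer_money":
        # Assumed rule for illustration: amounts must be positive and capped.
        return 0 < action.args.get("amount", 0) <= grant.max_transfer
    return True


def execute(action: Action, grant: Grant, tools: dict):
    """Gate every tool call; the model never invokes tools directly."""
    if not authorize(action, grant):
        raise PermissionError(f"{action.tool} exceeds granted authority")
    return tools[action.tool](**action.args)
```

In this sketch a deployment that grants only `send_message` and transfers up to 100 would reject a `delete_records` call or a transfer of 500 regardless of how the request was phrased, which is the relational, deployment-conditioned property the abstract calls action alignment.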


