{"slug": "the-frontier-for-economic-value-from-ai-agents-is-non-gullibility", "title": "The frontier for economic value from AI agents is non-gullibility", "summary": "A developer argues that the key limitation for AI agents in enterprise software is not reasoning capability but non-gullibility—the ability to distinguish trusted instructions from untrusted data. Current LLM APIs and harnesses often mix authority levels, making agents vulnerable to prompt injection and information exfiltration. The developer calls for better harness construction to support non-gullibility, which is essential for agents to perform privileged actions securely.", "body_md": "The usual measures of AI progress have not suited my lived experience for some time now:\n\nOne measure is “maximum length of human task that AI can complete.” The ideal goal here seems to be AI developing ever larger software systems, like browsers.\n\nAnother measure is “key reasoning breakthroughs,” like proving some math theorem or finding zero-days. The ideal goal here seems to be the [Riemann hypothesis](https://x.com/polynoamial/status/1834280969786065278?s=20).\n\nI think both are worthwhile goals. But in my day job doing enterprise software development, neither of these is limiting me. What limits me is the degree of trust I am permitted to place in an agent’s actions without compromising my employer’s information security.\n\nAlmost everybody limits what their AI agent can do, through sandboxing, approval flows, or manual review.\n\nThis is not mostly about wanting to judge whether the AI’s work was good enough to merge. 
It is about defending against low-probability but devastating scenarios: the agent exfiltrating information to an attacker, installing malware, or granting privileged access to company computers.\n\nThe missing property is non-gullibility: the ability to distinguish trusted instructions from untrusted data, even when the untrusted data is adversarially shaped to look like instructions.\n\nTo be as useful as a human, the AI agent needs to consume potentially malicious input, like web search results, documentation, GitHub issues, and repository files. It needs to follow legitimate instructions in documentation, while rejecting malicious search results that instruct it to install malware.\n\nThe agent also needs to work with confidential information, and must not exfiltrate it, including through the side effects of its actions, such as fetching URLs with confidential information in them.\n\nAnd the agent needs to perform privileged actions: deployments, service integrations, infrastructure changes. It is not practically possible to validate Terraform code without running it against a real cloud, nor to validate an integration with another service without performing real calls to their systems. Modern web application development often consists largely of stitching together services, some external SaaS vendors, others internal but separately deployed.\n\nSo while my X feed celebrates an Erdős problem solution or the [Bun Zig-to-Rust migration](https://claude.com/blog/introducing-dynamic-workflows-in-claude-code), many people are still stuck in manual code review and approval fatigue because they feel obliged to prevent unlikely but devastating consequences.\n\nNon-gullibility is not a property that can be achieved by the foundation model alone. 
It also requires correct construction of the agent harness, because only the harness knows the data’s true source and trustworthiness, and the harness assembles trusted and untrusted data into the model’s input.\n\nHere, it seems to me that non-gullibility could be supported much better. LLM APIs today distinguish system, user, assistant, and tool messages in principle, but practical systems often mix authority levels for two reasons:\n\nFirst, APIs and harnesses restrict which message types are permitted in which positions. System messages are often only available at the beginning of the conversation. Tool messages are only allowed in response to tool calls.\n\nA workaround in practical harnesses like Claude Code and opencode is to include a textual fragment like `<system-reminder>` wherever the information needs to go. A user or tool message may include such a reminder to tell the model that another file changed, or that the user interrupted the model mid-task.\n\nThis is convenient, but it muddies the security boundary. A model trained to perform well in such a harness may learn to treat `<system-reminder>` as carrying special authority. How should the harness defend against prompt injection containing `<system-reminder>` tags? The obvious answer is escaping. But it does not seem customary to XML-escape all untrusted input, and escaping everything would also increase token usage.\n\nSecond, message history is a sequence, while authority is naturally a hierarchy.\n\nConsider the output of a hypothetical grep tool with line numbers. Different parts of the output have different authority levels:\n\nHigher-authority information: the fact that the trusted harness ran the built-in grep tool and found matches at certain paths and line numbers.\n\nLower-authority information: the path strings and file contents themselves. These may come from an untrustworthy source. 
If a file contains a tag such as `<system-reminder>`, the model should not treat it as special.\n\nTo supplement fixed roles like system, user, assistant, and tool, imagine an API that also supports a pair of special tokens for opening and closing a lower-privilege scope within a message. The harness developer could then express the authority of conversation fragments precisely.\n\nIf we visualize the lower-privilege opening token as `{` and the corresponding closing token as `}`, then the grep tool could put quoted file contents inside `{}` to make clear that they are untrusted data. Here, `{` is a special token, not a regular character that could occur in text.\n\nA conversation history might look like this:\n\n```\nSystem: You are a helpful assistant.\nUser: {Where does the term \"foo\" occur in this repository?}\nAssistant: {<use-tool git-grep term={foo}>}\nTool: { {blubb.txt}:3:{*foo* bar baz} }\n```\n\nThe system message is not wrapped in a trust limiter because it is already the highest-authority context.\n\nThe user message is wrapped in a trust limiter because the user must not be able to override the system message.\n\nFor the tool message, the important part is that the delimiters can be nested. The tool should not be able to assume the authority of the user or system, so the entire tool output is wrapped in a trust limiter. At the same time, the tool output contains parts that are themselves arbitrary data, so the tool wraps those in another, nested pair of delimiters.\n\nThis proposal would have to be part of the API between harness and model. In particular, the LLM server would reject improperly balanced delimiters in the input, and would enforce that model responses are balanced as well. 
Literal `{` and `}` characters in user data would not matter, because they would not be the special delimiter tokens.\n\nOne could argue that if we used escaping correctly, the LLM should be able to reason about the effective trustworthiness of data in context, and that introducing a new set of delimiters would run against the spirit of having general-purpose models. I see three counterarguments to this:\n\nFirst, the best open-source LLMs already use special tokens to indicate roles. The present proposal is not less generic, or less bitter-lesson-pilled, than what is already done.\n\nSecond, we already allow LLMs to use code interpreters for efficiency, even though LLMs should in principle be able to reason about the outcome of code execution. It is simply more efficient and reliable to execute the code and hand the result to the LLM. Likewise, it may be more reliable to deterministically encode quoting and privilege boundaries before the token stream reaches the LLM.\n\nThird, LLM input commonly has a very ad hoc nested structure. A Markdown code block may contain XML, which contains JSON, or any nesting thereof. It is not necessarily invalid input if a Markdown code block contains broken XML or broken JSON. Perhaps broken XML is exactly what the code block is supposed to show. Explaining to the model, without ambiguity, which parts of the input should be properly escaped and which should not is tricky. The relevant question is not whether the model can reason about trust, but rather whether the harness has a precise way to express trust to the model.\n\nFor economically useful agents, I expect non-gullibility to matter more than another jump in task length or theorem-proving. 
The bottleneck is whether we can safely let the agent use the capability it already has.", "url": "https://wpnews.pro/news/the-frontier-for-economic-value-from-ai-agents-is-non-gullibility", "canonical_source": "https://dev.to/felixhuttmann/the-frontier-for-economic-value-from-ai-agents-is-non-gullibility-ojm", "published_at": "2026-06-13 23:53:07+00:00", "updated_at": "2026-06-14 00:29:13.731129+00:00", "lang": "en", "topics": ["ai-agents", "ai-safety", "large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["Claude Code", "Bun", "Zig", "Rust", "GitHub", "Terraform"], "alternates": {"html": "https://wpnews.pro/news/the-frontier-for-economic-value-from-ai-agents-is-non-gullibility", "markdown": "https://wpnews.pro/news/the-frontier-for-economic-value-from-ai-agents-is-non-gullibility.md", "text": "https://wpnews.pro/news/the-frontier-for-economic-value-from-ai-agents-is-non-gullibility.txt", "jsonld": "https://wpnews.pro/news/the-frontier-for-economic-value-from-ai-agents-is-non-gullibility.jsonld"}}