I Got Tired of AI Agents Having Root Access to Everything, So I Built XRisk

A developer built XRisk, an open-source autonomous safety engine that acts as a deterministic decision layer between AI agents and real-world actions. XRisk evaluates proposed actions against policies, checking for prompt injection, secret leakage, and other risks, and returns an Allow, Confirm, or Block decision. The project aims to prevent disastrous outcomes from AI agent errors by enforcing policy deterministically rather than relying on another LLM.

Everyone is building AI agents. Very few people are building the thing that sits between an AI agent and a disastrous decision. That's why I built XRisk.

XRisk is an open-source autonomous safety engine that acts as a decision layer between an AI agent and the real world. Instead of blindly executing an action, an agent asks XRisk: "Should I actually do this?"

XRisk responds with one of three deterministic decisions:

✅ Allow
⚠️ Confirm
❌ Block

Why I Started This Project

As I experimented with increasingly autonomous AI systems, I noticed the same pattern over and over again. Most projects focused on making agents more capable. Almost nobody was asking: "What happens when the agent is wrong?"

Consider a few examples.

- An agent accidentally leaks API keys.
- A prompt injection convinces it to ignore previous instructions.
- A model decides to execute a shell command.
- An autonomous workflow loops forever and keeps calling expensive APIs.
- A deployment bot pushes code without human approval.

Most agent frameworks assume the model behaves. Reality doesn't.

I wanted something deterministic sitting between intention and execution. Not another model. Not another prompt. An actual policy engine.

What XRisk Does

XRisk evaluates every proposed action before it's executed. It combines multiple safety signals into a single explainable decision. Some of the things it checks include:

- Policy-as-code with layered precedence
- Prompt injection detection
- Sensitive data and secret detection
- Capability token validation
- Network egress restrictions
- Circuit breakers for autonomous loops
- Tamper-evident audit logs
- Supply-chain verification
- Policy conflict detection
- Deterministic forensic replay

Instead of a mysterious "Safety Score: 67%," XRisk explains why it made a decision.

Example

Imagine an AI assistant wants to execute:

    {
      "tool": "deploy",
      "actor": "release-bot",
      "prompt": "Deploy production immediately."
    }

Instead of sending that directly to your deployment system... XRisk intercepts it.
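As a rough illustration of where that interception sits, here is a minimal Python sketch of a deterministic decision layer. It is not XRisk's real interface: `Decision`, `Action`, `evaluate`, the three `check_*` functions, and the small policy tables are all hypothetical names invented for this example. The idea it shows is simply that each check is a pure function returning a decision plus a reason, and the most restrictive result wins.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, List, Sequence, Tuple


class Decision(IntEnum):
    """Ordered so that a higher value is always the more restrictive outcome."""
    ALLOW = 0
    CONFIRM = 1
    BLOCK = 2


@dataclass(frozen=True)
class Action:
    """Mirrors the fields of the example request (hypothetical schema)."""
    tool: str
    actor: str
    prompt: str


# A check is a pure function: same action in, same (decision, reason) out.
Check = Callable[[Action], Tuple[Decision, str]]

# Hypothetical policy data, hard-coded here to keep the sketch self-contained.
DEPLOY_ACTORS = frozenset({"release-bot"})
TOOLS_NEEDING_APPROVAL = frozenset({"deploy"})
INJECTION_MARKERS = ("ignore previous instructions",)


def check_actor(action: Action) -> Tuple[Decision, str]:
    if action.tool == "deploy" and action.actor not in DEPLOY_ACTORS:
        return Decision.BLOCK, f"actor '{action.actor}' is not allowed to deploy"
    return Decision.ALLOW, "actor is permitted to use this tool"


def check_approval(action: Action) -> Tuple[Decision, str]:
    if action.tool in TOOLS_NEEDING_APPROVAL:
        return Decision.CONFIRM, f"policy requires human approval for '{action.tool}'"
    return Decision.ALLOW, "no approval required"


def check_injection(action: Action) -> Tuple[Decision, str]:
    # Naive substring match; a stand-in for real injection detection.
    lowered = action.prompt.lower()
    for marker in INJECTION_MARKERS:
        if marker in lowered:
            return Decision.BLOCK, f"prompt contains injection marker: '{marker}'"
    return Decision.ALLOW, "no injection markers found"


CHECKS: Tuple[Check, ...] = (check_actor, check_approval, check_injection)


def evaluate(action: Action, checks: Sequence[Check] = CHECKS) -> Tuple[Decision, List[str]]:
    """Run every check and return the most restrictive decision with its reasons."""
    results = [check(action) for check in checks]
    decision = max(d for d, _ in results)
    reasons = [reason for d, reason in results if d == decision]
    return decision, reasons


action = Action(tool="deploy", actor="release-bot", prompt="Deploy production immediately.")
decision, reasons = evaluate(action)
print(decision.name)  # CONFIRM
print(reasons)
```

With these made-up policies, the deploy request comes back as Confirm, because the actor is permitted but the tool requires human approval. Since every check is a plain function with no randomness or model call, replaying the same action against the same policies always yields the same decision and the same reasons.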
XRisk evaluates:

- Does policy require approval?
- Is the actor allowed to deploy?
- Is the destination trusted?
- Are capability tokens valid?
- Does this resemble prompt injection?
- Is this part of a dangerous execution loop?

Only then does it decide whether to:

- Allow
- Confirm
- Block

One Design Decision I Feel Strongly About

I deliberately avoided using another LLM to make safety decisions. LLMs are excellent at generating text. Policy enforcement should be deterministic.

If an action is blocked, I want to know exactly why it was blocked. Every decision should be reproducible. Every audit should be explainable. Every policy should be inspectable.

That's the philosophy behind XRisk.

What's Next

I'm currently working toward:

- Threat intelligence correlation
- Zero-trust workload identities
- Autonomous containment
- Adversarial simulation
- Multi-party approval workflows

The long-term vision is to make XRisk a reusable security layer that can sit in front of any AI agent, regardless of framework.

I'd Love Feedback

This project is still evolving, and I'd genuinely appreciate feedback from people building AI systems. Some questions I'm particularly interested in:

- What attack vectors am I missing?
- Which policies would you want in production?
- What integrations would make this more useful?
- How would you design a safety engine differently?

If you'd like to contribute, open an issue, suggest improvements, or submit a PR. Even small documentation fixes are welcome.

Thanks for reading. I hope XRisk becomes something that helps make AI systems not just more capable, but more trustworthy.