
A user-space firewall that gates an AI agent's actions

Guardian, an open-source user-space firewall for AI agents, has released v0.1.0, intercepting and evaluating agent actions with a deterministic policy engine. In testing, it reduced prompt-injection attack success from 100% to 0% on the AgentDojo banking suite and achieved 0% false negatives and false positives on its own benchmark. The tool is agent-agnostic and designed to prevent unauthorized file, shell, network, and service access.

Published Jun 29, 2026 · 21 min read

Status: working product, v0.1.0 released. Rust workspace, 196 tests green. Implemented through Phase 4 (hardening): the deterministic policy engine, the tamper-evident audit log (optionally sealed-key signed), the advisory Checker, the MCP gateway + stdio transport, the daemon + control socket, the terminal approval cockpit (TUI), the AgentDojo eval harness, the network proxy with TLS interception (broker-injected credentials, exfiltration inspection, default-deny egress, cockpit ask-routing), the OS exec sandbox, the token broker (OS keychain + least-privilege caveats), lightweight verifiable credentials, adaptive suggestions + safety report, ed25519-signed community policy packs, and an intrinsic critical-category floor (money / credentials / exfiltration / irreversible deletion can never resolve to a silent allow, not even via a signed pack). Getting started: docs/user-guide.md. Remaining for 1.0: signed/notarized packaging and the desktop GUI (see ROADMAP.md).

Evaluation: on AgentDojo with a local 12B agent, Guardian cuts the prompt-injection attack-success rate on the banking suite from 100% → 0% (deterministic deny on money-movement). Our own GuardianBench, a benchmark built for an action-firewall, scores 0% false-negatives, 0% false-positives, 100% refusal-correctness across 8 domains, plus 0% PII leaks in its tokenization layer (the data broker, ADR-0005). See evaluation/ for the full, honestly-caveated scorecard (including where an action-firewall's scope ends; see below).

White paper: design & threat model (PDF), or read it on GitHub (with diagrams of how it works and its impact on the agent).

License: Apache-2.0 · Governance: CONTRIBUTING · SECURITY · CODE_OF_CONDUCT · ADRs

This README is the canonical spec (idea, full feature set, architecture, threat model). For how and in what order it's built, see ROADMAP.md; for what's landed, see docs/changelog.md.

Guardian is early-stage software (v0.1.0) that can be configured to handle sensitive data (credentials, personal data, financial details). It is provided "AS IS", without warranty of any kind, under the Apache-2.0 license (see Sections 7–8). To the maximum extent permitted by law, the author accepts no liability for any damage, data loss, security breach, financial loss, or other harm arising from the use, misuse, or inability to use this software. You are solely responsible for evaluating its fitness for your purpose, for how you configure your policy, and for the security of any data you route through it. It is not certified, audited, or production-hardened, and must not be relied upon as the sole safeguard for high-stakes or regulated workloads. See SECURITY.md for the threat model and how to report a vulnerability.

Guardian is a local, user-space "firewall" that sits between an autonomous AI agent and the things it can touch: your files, your shell, the network, and the online services you delegate to it. It does not trust the agent. Every action the agent attempts is intercepted as a structured action at the agent's tool/MCP boundary, evaluated by a deterministic policy engine, and, when a decision needs a human, explained in plain language by a separate "translator" model before you approve or deny it. Guardian is agent-agnostic (it does not care whether the agent is driven by Claude, GPT, Llama, or anything else) and OS-friendly (it never installs a kernel module or fights the operating system for control).

Fastest: download a prebuilt binary (no toolchain needed) from the latest release. It's unsigned, so the OS asks once: macOS → right-click → Open; Windows → SmartScreen → More info → Run anyway; Linux → `chmod +x guardian`. Then `guardian --help`. (Windows is experimental/untested; see docs/user-guide.md.)

Set it up in one command. `guardian init` creates `~/.guardian/{config.toml,policy.toml}` for your role and prints the exact next steps + the MCP snippet to paste:

guardian init                         # or: --role personal-assistant
guardian-daemon                       # terminal 1: the service
guardian ui                           # terminal 2: the approval cockpit (TUI)

Then point your agent's MCP client at Guardian (the snippet `guardian init` prints; it works for Claude Code, Cursor, or any MCP client):

{
  "mcpServers": {
    "guardian": { "command": "guardian", "args": ["mcp", "--daemon", "/tmp/guardian.sock"] }
  }
}

When an action needs your approval the daemon raises a desktop notification, so you don't have to watch the cockpit (set `notifications = false` in the config to disable).

Or build from source (requires the Rust toolchain):

cargo build --release

cargo run -p guardian-cli -- demo

cargo run -p guardian-cli -- eval
GUARDIAN_BIN=target/release/guardian python3 evaluation/guardianbench/guardianbench.py

GUARDIAN_SOCK=/tmp/g.sock cargo run -p guardian-daemon       # the service
GUARDIAN_SOCK=/tmp/g.sock cargo run -p guardian-cli -- ui    # the approval cockpit (TUI)

Run the tests with `cargo test --workspace`. Measuring Guardian's effect on an agent's attack-success rate: evaluation/.

Agents went from "chatbots that talk" to "agents that act": they read and write files, run shell commands, browse, buy things, send email, and increasingly touch sensitive accounts (banking, health records, public-administration portals). That creates four concrete risks:

  • Sensitive-data exposure & destructive mistakes. Giving an agent direct access to accounts, email, and private documents exposes the user to privacy violations, hallucinated destructive actions, and external attacks.
  • Prompt injection. The dominant agent-security threat of this era: content the agent reads (a web page, a PDF, an email, a tool result) can contain instructions that hijack the agent into doing something the user never asked.
  • Click fatigue / informed-consent failure. System-level agents pop up approval requests for scripts and API calls. Non-technical users do not understand them and approve everything blindly, which nullifies the safety.
  • No human-facing control surface and no traceability. Existing tooling (raw harness permission prompts, Docker) is built for programmers. There is no intuitive "control room," and no easy way to keep a tamper-evident record of what an agent actually did (relevant for transparency obligations such as the EU AI Act, Art. 50).

These are the rules that decide every later trade-off.

  • The security boundary is deterministic. The LLM is never the boundary. Enforcement (allow / ask / deny) is done by a rule engine whose behavior is predictable and testable. An LLM can be wrong and can be attacked via prompt injection, so it is used only to translate and risk-score, never to unlock.
  • Intercept structured actions, not the agent's prose. The policy engine and the translator look at the real intercepted action (the tool call and its arguments, the actual HTTP request, the file operation), never at the agent's natural-language claim about what it intends to do. The claim is manipulable; the action is not.
  • Agent-agnostic by construction. Control is applied at the action boundary, which is identical regardless of which model produced the action.
  • User-space, not kernel-space. No kernel modules, no OS hooks that require vendor-granted entitlements. (See §4; this is the central decision.)
  • Local-first / privacy-first. Policy evaluation, learning, and the audit log live on the user's machine. Sending anything to the cloud is opt-in and explicit.
  • Defense in depth. Mediation at the tool boundary is the primary control; OS sandboxing and a network proxy are containment backstops, not the plan A.
  • Fail closed on the critical path, fail open on convenience. A failure in the money/credential/exfiltration path blocks; a failure in a low-risk path degrades gracefully (logs, defers to existing harness defaults).
  • Tamper-evident by default. Everything Guardian decides is written to an append-only, hash-chained, signable audit log.

Resolved: Guardian acts at the agent's action boundary (the harness / tool-call / MCP layer) in user-space. It does NOT act in the OS kernel.

  • Deep OS interception (Linux LSM/eBPF beyond user-space, macOS Endpoint Security & Network Extension, Windows minifilter/WFP kernel callouts) requires vendor-granted entitlements, code-signing, notarization, and per-platform certification. On macOS and Windows this is a wall for an open-source project and a solo/community maintainer.
  • Kernel-level bugs crash the user's machine. The blast radius of a mistake is the whole OS.
  • It is the wrong altitude: at the syscall level you see `write(fd, buf, n)`, not "the agent is about to wire €4,000 to an unknown IBAN." Intent is legible at the action boundary, not at the kernel.

Modern agent harnesses (Claude Code, Cursor, the OpenAI Agents runtime, and any MCP-speaking client) already mediate everything the agent does through a tool-call interface. The agent cannot touch the world except by calling a tool the harness exposes. The harness is already the choke point; Guardian's job is to be, wrap, or plug into that mediation layer instead of fighting the OS for a second, redundant one.

This gives us, for free:

  • Structured actions (tool name + typed arguments) instead of guessed intent.
  • Agnosticism: the tool boundary looks the same under any model.
  • No entitlements, no kernel, no notarization headaches.
  • Cross-platform parity: the same logic runs on macOS, Windows, Linux.

Harness-level interception is only as complete as the harness's own mediation. The hard case is a raw Bash/exec tool: once `bash` runs, its sub-behaviors (subprocesses, interpreters, raw syscalls, `base64 -d | sh`) are not individually mediated. Text-scanning the command is not a security boundary. We handle this with a layered answer:

  • Prefer structured tools over raw shell. Where the harness allows it, expose mediated, typed tools (read_file, write_file, http_request, send_email) instead of a raw shell. Structured tools are fully policy-able.
  • Contain the dangerous tools. When raw exec/shell/network must exist, run that tool's execution inside an off-the-shelf OS sandbox (container, sandbox-exec/Seatbelt profile, bubblewrap, Windows AppContainer/Sandbox) and inside a network proxy (below). This is defense-in-depth using existing, user-space tooling, not custom kernel work.
  • Mediate the network regardless. A user-space forward proxy with an installed CA (mitmproxy-style) catches all HTTP(S) no matter how it was made, which is where network policy, header signaling, and content watermarking actually happen.

So the layered model is: mediate at the tool boundary (plan A) → contain high-risk tools in a sandbox + route all traffic through the proxy (backstop).

            ┌────────────────────────────────────────────────────────────┐
            │  Agent (any model: Claude / GPT / Llama / local / …)       │
            └────────────────────────────────────────────────────────────┘
                               │  structured action (tool call / MCP / HTTP)
                               ▼
   ┌────────────────────────────────────────────────────────────────────────────┐
   │                            GUARDIAN CORE                                   │
   │                                                                            │
   │  ┌─────────────────────┐   ┌───────────────────────┐  ┌─────────────────┐  │
   │  │ 1. POLICY ENGINE    │   │ 2. CHECKER (LLM)      │  │ 3. AUDIT LOG    │  │
   │  │ deterministic       │──▶│ translator + risk     │  │ append-only,    │  │
   │  │ allow / ask / deny  │   │ score (ADVISORY ONLY) │  │ hash-chained,   │  │
   │  │ (the boundary)      │   │ never unlocks         │  │ signable        │  │
   │  └─────────────────────┘   └───────────────────────┘  └─────────────────┘  │
   │           │ "ask"                                                          │
   │           ▼                                                                │
   │  ┌─────────────────────┐   ┌───────────────────────┐  ┌─────────────────┐  │
   │  │ 4. APPROVAL UI      │   │ 5. IDENTITY & TOKEN   │  │ 6. ADAPTIVE     │  │
   │  │ traffic-light       │   │ BROKER: scoped OAuth, │  │ LEARNING        │  │
   │  │ dashboard + report  │   │ macaroons, keychain/  │  │ (constrained,   │  │
   │  │                     │   │ Secure Enclave/TPM    │  │ local only)     │  │
   │  └─────────────────────┘   └───────────────────────┘  └─────────────────┘  │
   └────────────────────────────────────────────────────────────────────────────┘
        │ filesystem            │ network                 │ credentials
        ▼ structured FS tools   ▼ forward proxy (MITM CA) ▼ broker injects creds
          + optional sandbox      + header/watermark        at proxy; agent never
          for raw exec            injection                  sees raw secrets

In priority order from most to least agnostic:

(a) MCP gateway / proxy: primary, most agnostic. Guardian runs as an MCP server that aggregates and re-exposes the user's real MCP servers and tools. The harness points at Guardian; every `tools/call` passes through the policy engine before being forwarded. Works with any MCP-speaking client.
(b) Native hook adapter: for harnesses with a hook system (e.g. Claude Code's `PreToolUse`/`PostToolUse`). Guardian registers as the hook handler and returns allow/ask/deny per call. Lowest friction where available; gives a true deterministic deny.
(c) HTTP(S) forward proxy: a user-space CONNECT proxy with a locally installed CA. Intercepts all outbound traffic for network policy, the agent-signaling header, and content watermarking. This is the only "system-ish" piece and it is still pure user-space (a proxy + a trusted cert).
(d) LLM gateway proxy: optional. Proxy the model API itself to capture the raw request/response, strip injected instructions from tool results, and attach provenance. Useful but not required for MVP.
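The gateway path in (a) can be pictured as a small message handler. The following is a minimal sketch, not Guardian's actual implementation: `handle`, `decide`, and `forward` are hypothetical names, the action shape (`tool` / `args`) is assumed, and the JSON-RPC error code is an arbitrary choice from the implementation-defined server-error range. It only shows the idea that a `tools/call` request is forwarded upstream if, and only if, the policy decision is `allow`.

```python
from typing import Callable


def handle(message: dict,
           decide: Callable[[dict], str],
           forward: Callable[[dict], dict]) -> dict:
    """Gate one JSON-RPC message at a (hypothetical) MCP gateway."""
    # Only tool invocations are gated in this sketch; listing tools etc. passes through.
    if message.get("method") != "tools/call":
        return forward(message)

    params = message.get("params", {})
    # Normalize into a structured action (assumed shape, not Guardian's real action model).
    action = {"tool": params.get("name"), "args": params.get("arguments", {})}

    decision = decide(action)
    if decision == "allow":
        return forward(message)

    # Anything else ("ask" not yet approved, "deny", unknown) is refused, never forwarded.
    return {
        "jsonrpc": "2.0",
        "id": message.get("id"),
        "error": {"code": -32000, "message": f"Guardian: {decision}"},
    }
```

In a real gateway an `ask` decision would wait for the human in the approval queue rather than return an error immediately; the sketch collapses that step to keep the control flow visible.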

  • Deterministic evaluation of declarative rules → exactly one of allow/ask/deny per action.
  • Conditions over structured fields (tool name, arguments, target host, file path, amount, time, source) using a sandboxed, side-effect-free expression evaluator (candidate: CEL or an OPA/Rego-style evaluator, chosen for being decidable and testable, not Turing-complete scripting).
  • No network, no LLM, no I/O inside evaluation. Pure function of (action, context, policy). This is what makes it auditable.
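To make "pure function of (action, context, policy)" concrete, here is a minimal sketch under stated assumptions: Python predicates stand in for the CEL/Rego-style expressions, first-matching-rule-wins is an assumed evaluation order, and `Rule`, `evaluate`, and the field names are hypothetical. It also illustrates the critical-category floor mentioned in the status summary: a rule marked critical cannot resolve to a silent allow.

```python
from dataclasses import dataclass
from typing import Callable, Mapping, Sequence

Action = Mapping[str, object]


@dataclass(frozen=True)
class Rule:
    id: str
    when: Callable[[Action], bool]  # side-effect-free predicate over structured fields
    decision: str                   # "allow" | "ask" | "deny"
    critical: bool = False


def evaluate(action: Action, rules: Sequence[Rule], default: str = "ask") -> tuple[str, str]:
    """Return (decision, matched_rule_id). No I/O, no network, no model calls."""
    for rule in rules:  # assumption: first matching rule wins
        if rule.when(action):
            decision = rule.decision
            # Critical floor: never a silent allow, whatever the rule says.
            if rule.critical and decision == "allow":
                decision = "ask"
            return decision, rule.id
    return default, "<default>"  # unknown actions fall back to human review


RULES = [
    Rule("read-project-files",
         lambda a: a.get("tool") == "read_file"
         and str(a.get("path", "")).startswith("~/DOCUDESK/"),
         "allow"),
    # Deliberately misconfigured as "allow" to show the floor overriding it.
    Rule("money-movement", lambda a: a.get("capability") == "payment", "allow", critical=True),
]
```

Because `evaluate` depends only on its arguments, the same inputs always give the same decision, which is what makes golden-case tests per rule practical.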

  • Two jobs: (1) translate a pending action into plain language with its real impact (`chmod 777 ./data` → "lets anyone on this machine read and modify these files"); (2) emit a risk score / rationale to help rank the approval queue.
  • It never decides allow/deny. Even a fully compromised Checker cannot unlock an action the policy engine denied.
  • It reads the structured intercepted action, never the agent's prose.
  • Model is pluggable: a small local guard model by default (privacy), with an opt-in escalation to a stronger model for ambiguous cases. This resolves the doc's old contradiction between "learns locally, no cloud" and "high-performance checker."

  • Cross-platform desktop app (candidate: Tauri, lightweight and Rust-friendly, over Electron).
  • The traffic-light surface: a queue of ask items with the Checker's plain translation and a one-click allow/deny; a live activity view; the weekly/monthly report (the "safety service report") that summarizes silently-allowed actions, blocked threats, and rule suggestions to batch-confirm. This is the answer to click fatigue.

  • Append-only, hash-chained (each entry commits to the previous), optionally signed with a per-install key held in the OS keychain / Secure Enclave / TPM.
  • Records: action, decision, matched rule, Checker rationale, user response, time.

  • This is what makes "traceability" real rather than marketing.
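The phrase "each entry commits to the previous" can be illustrated with a short hash chain. This is a sketch only: SHA-256 over canonical JSON, the all-zero genesis value, and the entry layout are assumptions for illustration, not Guardian's on-disk format, and signing is left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first entry


def _digest(prev: str, record: dict) -> str:
    # Canonical JSON so the same record always hashes to the same value.
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev + body).encode("utf-8")).hexdigest()


def append_entry(log: list, record: dict) -> dict:
    """Append a record whose hash commits to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev, "record": record, "hash": _digest(prev, record)}
    log.append(entry)
    return entry


def verify(log: list) -> bool:
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True
```

A plain chain like this detects tampering in the middle of the log; detecting truncation of the tail needs an external anchor, which is one reason for signing the latest hash with a sealed per-install key.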

  • The agent never sees raw credentials. It asks the broker to perform an authenticated action; the broker injects credentials at the proxy layer.
  • Mechanisms (replacing the old "ZKP" idea, which was the wrong primitive):
      • Scoped OAuth 2.1 tokens where the service supports them.
      • Macaroons: bearer tokens with attenuating caveats (expiry, max amount, allowed endpoints, source binding), purpose-built for delegated, narrowable authority. This is the core mechanism for "let the agent do X but only X."
      • Hardware-backed keys (Secure Enclave / TPM) for signing and secret storage.
      • W3C Verifiable Credentials / DIDs for decentralized identity claims.
      • PSD2 / Open Banking APIs as the correct rail for EU banking (not screen-scraping a bank with a spoofed header).

  • Design note: because the broker holds credentials, Guardian itself becomes the highest-value target. Its own hardening (signed policies, sealed keys, notarized builds) is a first-class requirement, not an afterthought.
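The attenuating caveats of a macaroon come from a chained HMAC: each added caveat re-keys the signature, so a holder can narrow a token but cannot remove a restriction. The sketch below shows that construction in its simplest form. The caveat syntax (`amount_max=…`, `endpoint=…`) and all function names are hypothetical, and third-party caveats, expiry, and serialization are omitted.

```python
import hashlib
import hmac


def _mac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def mint(root_key: bytes, identifier: str) -> dict:
    """Broker side: issue a token bound to the root key."""
    return {"id": identifier, "caveats": [], "sig": _mac(root_key, identifier)}


def attenuate(token: dict, caveat: str) -> dict:
    """Anyone holding the token can add a caveat; the old signature becomes the new key."""
    return {"id": token["id"],
            "caveats": token["caveats"] + [caveat],
            "sig": _mac(token["sig"], caveat)}


def _satisfied(caveat: str, request: dict) -> bool:
    key, _, value = caveat.partition("=")
    if key == "amount_max":
        return "amount" in request and float(request["amount"]) <= float(value)
    if key == "endpoint":
        return request.get("endpoint") == value
    return False  # unknown caveat: fail closed


def verify(root_key: bytes, token: dict, request: dict) -> bool:
    """Broker side: recompute the HMAC chain, then check every caveat against the request."""
    sig = _mac(root_key, token["id"])
    for caveat in token["caveats"]:
        sig = _mac(sig, caveat)
    if not hmac.compare_digest(sig, token["sig"]):
        return False
    return all(_satisfied(c, request) for c in token["caveats"])
```

Dropping a caveat from the list invalidates the signature, because the verifier recomputes the chain from the root key it alone holds.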

  • May suggest downgrading a repeatedly-approved low-risk action to "auto-allow in this context," and may upgrade an action seen in anomalous context (odd hours, unknown destination) to "ask."
  • Hard rule: never auto-downgrade critical categories (money movement, credential access, data exfiltration, irreversible deletions). Auto-habituation is exactly the vector an attacker would farm, so these always require explicit human action.
  • Suggestions decay over time and are bound to the context in which they were learned.
  • 100% local. No behavioral data leaves the machine.
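One way to read "suggestions decay over time" and "never auto-downgrade critical categories" together is an exponentially decayed approval count with a hard exclusion list. This is only an illustrative sketch: the half-life, the threshold, the category names, and both function names are assumptions, and context binding (keeping a separate count per context) is not shown.

```python
CRITICAL = {"money_movement", "credential_access", "data_exfiltration", "irreversible_deletion"}


def decayed_weight(approval_ages_days: list, half_life_days: float = 30.0) -> float:
    """Each past approval counts for less as it ages (halves every half-life)."""
    return sum(0.5 ** (age / half_life_days) for age in approval_ages_days)


def suggest_auto_allow(category: str, approval_ages_days: list, threshold: float = 3.0) -> bool:
    """Return True if a downgrade to auto-allow may be SUGGESTED to the user."""
    if category in CRITICAL:
        return False  # hard rule: no amount of habituation unlocks a critical category
    return decayed_weight(approval_ages_days) >= threshold
```

The function only produces a suggestion; in the design described above the user still confirms it, typically in the periodic report.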

Guardian only works if the supervised agent cannot escape or switch it off:

  • Guardian is the agent's only path to tools. If the daemon is unreachable, there is no decision, so there is no action: fail closed, not open.
  • The agent process runs without authority to kill the Guardian daemon, remove/alter the proxy CA, or edit the active policy (separate privilege + sealed policy). A bypass attempt is recorded as a gap in the tamper-evident log.
  • Supervised restart: if the daemon dies, the integration refuses to forward tool calls until Guardian is back.
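"No decision, so no action" is easy to get wrong in an adapter, so a small sketch may help. The names `guarded_call`, `ask_daemon`, and `forward` are hypothetical; the point is only that a missing or failed decision is treated exactly like a denial, and that the tool is never invoked on that path.

```python
from typing import Callable, Optional


def guarded_call(action: dict,
                 ask_daemon: Callable[[dict], Optional[str]],
                 forward: Callable[[dict], object]) -> object:
    """Forward a tool call only on an explicit 'allow'; everything else blocks."""
    try:
        decision = ask_daemon(action)
    except (ConnectionError, TimeoutError):
        decision = None  # daemon unreachable: no decision was made

    if decision != "allow":
        # Covers "deny", an unapproved "ask", None, and any unexpected value.
        raise PermissionError(f"blocked (decision={decision!r})")
    return forward(action)
```

Testing for equality with `"allow"` rather than inequality with `"deny"` is the design choice that makes unknown or malformed responses fail closed.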

A one-action, always-available control that immediately revokes all agent authority: pause every adapter, deny all pending and new actions, and freeze brokered credentials. Reachable from the UI and the CLI (`guardian panic`), and designed to work even if the Checker/UI is unresponsive.

  • A single declarative config (TOML) selects the active role/policy pack, the Checker backend, and which adapters are enabled. Validated on load; invalid config fails closed.
  • Safe defaults: first run starts in the most restrictive role (everything unknown → ask), egress allow-list empty, learning off. The user widens deliberately, never the reverse.

  • All state (audit log, learned suggestions, config) is stored locally under the OS user-data dir; secrets live in the OS keychain, never plaintext on disk.
  • The audit log is append-only and hash-chained; retention is user-configurable and export is explicit.
  • Nothing is transmitted unless the user opts into remote Checker escalation (the only network path), which is per-action and logged.

The Checker's plain-language explanations and all UI copy are localizable (the audience is non-technical users worldwide). Decisions and logs stay in a stable machine-readable form; only the human-facing presentation is translated.

| # | Feature | What it does | Correction vs. the original idea |
|---|---------|--------------|----------------------------------|
| A | Maker/Checker | Maker = the third-party agent doing the task; Checker = local model that translates pending actions into plain language + risk score. | Checker is advisory only and reads structured actions, not the agent's prose. The deny decision is the deterministic engine's. |
| B | Traffic-light permissions | Green = silent allow; Yellow = pause + translated approval popup; Red = auto-block + notify. Roles (e.g. "Web Dev", "Tax Assistant") preload rule sets. | Enforcement is the deterministic policy engine (capability/attribute-based), with roles as presets. The traffic light is the UI on top. |
| C | Adaptive security | Learns to downgrade safe repeated actions and upgrade anomalous ones, locally. | Never auto-downgrades critical categories; context-bound; decaying; local-only. |
| D | Reporting | Batches low-risk auto-approvals into a periodic report; user confirms suggested rules. | Unchanged: this is good design and directly fights click fatigue. |
| E | Provenance / watermarking | Optional content footer on AI-generated email/comments/forms; optional agent-signaling HTTP header; everything logged. | The HTTP header is a courtesy signal only, not security (trivially spoofable, can trigger anti-bot blocks). Content marking helps the user's AI-Act transparency duty but does not by itself make anyone "compliant." Treat as opt-in, default-off for the header. |
| F | Identity & tokens | Broker holds credentials; agent gets scoped, time/amount-limited authority. | Capability tokens / macaroons / scoped OAuth / hardware keys / VCs, not ZKP (which was the wrong primitive for delegation). |

| Adversary | Vector | Primary defense |
|-----------|--------|-----------------|
| Hallucinating / misaligned agent | Issues a destructive or wrong action in good faith | Deterministic policy engine; ask for anything irreversible; Checker translation so the human catches it |
| Prompt injection (the #1 threat) | Malicious instructions hidden in web pages, PDFs, emails, or tool results steer the agent | Policy evaluates the real action regardless of why the agent wants it; critical categories always ask/deny; optional LLM-gateway sanitization of tool results; Checker never reads attacker-controlled prose as authority |
| Compromised / malicious MCP server or tool | A tool lies about what it does, or exfiltrates | Guardian sees the actual call/args and the actual network egress (proxy), not the tool's self-description; egress allowlists |
| Sensitive-data leak via the agent's output | An injection makes the agent put account numbers, who-did-what, or secrets into its response/reasoning | For data to reach an attacker it must cross a channel (message / network / file), which is an action Guardian gates; the token broker (§5.6) keeps credentials and full secrets out of the agent entirely (injected only into the authorized egress, so the agent cannot reveal what it never held); residual data the agent legitimately handled is bounded by least-privilege + a planned output-guard (Checker-side DLP redaction). The action-firewall does not police the agent's prose by design |
| Malicious policy pack (supply chain) | A community pack whitelists an exfiltration path | Signed packs, mandatory review, reputation, a "packs cannot widen critical-category permissions without explicit user opt-in" rule, and an intrinsic runtime floor: critical-category actions can never resolve to a silent allow regardless of any rule or pack |
| Attack on Guardian itself (it holds the keys) | Tamper with policy, steal brokered creds, forge the log | Sealed keys (Secure Enclave/TPM), signed/locked policy, hash-chained log, notarized builds, fail-closed |
| Agent disables / bypasses its guardian | Kill the daemon, remove the proxy CA, or route around the gateway to escape supervision | Guardian is the agent's only path to tools (no decision → no action = fail closed); the agent has no authority to stop the daemon or alter CA/policy; supervised restart; gaps flagged in the log. See §5.8 |
| Habituation attack | Train the adaptive layer to auto-allow, then strike | Never auto-downgrade critical categories; decay + context binding |
| Malicious user | Tries to use the agent + Guardian for harmful/abusive ends | Out of scope for "protect the user," but Guardian's provenance/logging makes misuse attributable |

Full threat model (assets, trust boundaries, attack trees, residual risks, and the OWASP/NIST mapping) lives in docs/threat-model.md.

Declarative, reviewable, version-controlled. Illustrative only; final schema TBD.

version: 1
defaults:
  decision: ask            # unknown actions default to human review

rules:
  - id: read-project-files
    when: tool == "read_file" && path.startsWith("~/DOCUDESK/")
    decision: allow         # GREEN: silent

  - id: shell-anything
    when: tool == "exec"
    decision: ask           # YELLOW: pause + translate
    sandbox: true           # and run it contained, regardless of approval

  - id: chmod-world-writable
    when: tool == "exec" && args.cmd matches "chmod\\s+(777|o\\+w)"
    decision: ask
    explain: "Makes files modifiable by any user on this machine."

  - id: outbound-known-hosts
    when: tool == "http_request" && host in trusted_hosts
    decision: allow

  - id: money-movement
    when: capability == "payment"
    decision: ask
    critical: true          # may NEVER be auto-downgraded by learning
    cap: { amount_max: 200, currency: "EUR" }

  - id: bulk-delete
    when: tool == "delete" && args.count > 10
    decision: ask
    critical: true

  - id: data-exfiltration
    when: tool == "http_request"
            && method == "POST"
            && body.contains_secret
            && host not in trusted_hosts
    decision: deny          # RED: auto-block + notify
    critical: true

Achieved by:

  • Intercepting at the action boundary (MCP/tool/HTTP), which is identical under any model.
  • A pluggable Checker model (local or remote, user's choice).
  • Per-harness adapters that all feed the same policy engine.

We deliberately do NOT:

  • ❌ Install kernel modules or use OS hooks requiring vendor entitlements.
  • ❌ Let any LLM be the allow/deny boundary.
  • ❌ Treat the spoofable User-Agent

header as a security control. - ❌ Use ZKP as the delegation primitive (use macaroons / scoped tokens / VCs).

  • ❌ Auto-downgrade critical-category actions via learning.
  • ❌ Claim Guardian "makes the user legally compliant" β€” it helpswith transparency/traceability; legal sign-off is the user's. - ❌ Send behavioral/learning data to the cloud (Checker escalation is the only network path, and it is opt-in).

Out of scope (for now), explicitly deferred, not forgotten:

  • Multi-agent / agent-to-agent supervision (an OWASP Agentic 2026 risk class): the current model guards a single agent; multi-agent mediation is future work.
  • Deep OS/kernel interception (see §4): never in scope.
  • Any proprietary/enterprise tier: this repo is fully open source.

  • Set up the Rust workspace (Rust decided, ADR-0001; see ROADMAP §0).
  • Repo scaffolding, license (Apache-2.0, see LICENSE), CI, contribution guide.
  • Define the action model (the canonical structured representation every adapter normalizes into).
  • Write the formal threat model and policy schema as living specs.

  • MCP gateway adapter (primary) for one MCP-speaking harness.
  • Deterministic policy engine with the declarative schema + CEL/Rego-style evaluator + a full test suite (golden cases per rule).
  • Checker translator using a pluggable model; reads structured actions only.
  • Approval UI (Tauri): traffic-light queue + plain-language explanation + allow/deny.
  • Tamper-evident audit log (append-only, hash-chained).
  • One real demo scenario end-to-end (e.g. agent edits files + makes an HTTP request; Guardian allows greens silently, pauses a yellow with a translated popup, blocks a red exfiltration attempt).

MVP definition of done: a non-technical user can watch an agent work, get a human-readable approval prompt for one risky action, see one bad action blocked automatically, and read a log of everything that happened, with no LLM in the deny path.

  • HTTP(S) forward proxy with installed CA: network policy, egress allowlists, optional agent-signaling header, optional content watermark.
  • OS sandbox wrapper for raw exec tools (Docker / sandbox-exec / bubblewrap / AppContainer): defense in depth, off-the-shelf only.
  • Native hook adapter (e.g. Claude Code `PreToolUse`).

  • Identity & token broker: scoped OAuth, macaroons, keychain/Secure Enclave/TPM storage; agent never sees raw secrets.
  • Constrained adaptive learning + the periodic report.
  • Signed community policy packs + the trust/review pipeline (this is the open-core community engine).
  • Optional LLM gateway proxy with tool-result sanitization.
  • Additional harness adapters (Cursor, OpenAI Agents runtime, generic MCP).

  • Core / policy engine / proxies: Rust (security rigor, cross-platform); Go is an acceptable alternative for proxy/MCP velocity.
  • Policy expressions: CEL or an OPA/Rego-style evaluator (decidable, testable).
  • Desktop UI: Tauri.
  • Audit log: append-only hash-chained store (e.g. SQLite + chained hashes, or a purpose-built log); per-install signing key in OS keychain/Secure Enclave/TPM.
  • Network proxy: user-space MITM proxy + locally trusted CA.
  • Sandbox backstops: Docker / sandbox-exec (macOS) / bubblewrap (Linux) / AppContainer or Windows Sandbox (Windows), all off-the-shelf.

  • Which harness do we target first for the MCP gateway? (Drives the demo.)
  • Default local Checker model: which small model balances quality vs. footprint?
  • Policy expression language: CEL vs. Rego (DX, sandboxing, ecosystem)?
  • How do signed policy packs get reviewed at community scale without a bottleneck?
  • CA-installation UX for the proxy: how to make trusting a local CA safe and non-scary for non-technical users?
  • How much of the AI-Act transparency story do we promise vs. explicitly disclaim? (Get legal input before any compliance claim ships.)

  • Harness: the runtime that drives an agent and mediates its tool calls (e.g. Claude Code). Guardian plugs into this layer.
  • Maker: the third-party agent performing the user's task.
  • Checker: Guardian's local translator/risk-scorer model (advisory only).
  • MCP: Model Context Protocol; the tool/server protocol Guardian proxies.
  • Macaroon: a bearer credential that can be attenuated with contextual caveats.
  • Critical category: money movement, credential access, data exfiltration, irreversible deletion; never auto-downgraded by learning.

── more in #ai-safety 4 stories Β· sorted by recency
── more on @guardian 3 stories trending now
sponsored brought to you by zahid.host 4,200+ EU-deployed projects
reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain β€” perfect for shipping the agent you just read about.

$git push zahid main
β†’ Live at https://your-agent.zahid.host βœ“
Get free account β†’ Pricing
from €0/mo Β· no card required
LIVE [news/a-user-space-firewal…] indexed:0 read:21min 2026-06-29 Β· β€”