Cerberus – a local firewall for AI agents' tool calls

Cerberus, a local-first security gateway for AI coding agents, intercepts every tool call before execution, risk-scores it across four signals, and allows, audits, asks for human approval, or blocks it—all on the user's machine with no external API. The tool addresses the risk of autonomous agents running shell commands, editing files, or making network calls without human oversight, preventing actions like secret exfiltration, excessive permissions, dangerous egress, and tool abuse.

A local-first security gateway for autonomous AI coding agents. Cerberus sits between the agent (Claude Code, Codex, Cursor, Cline) and your machine, intercepts every tool call before it runs, risk-scores it across four signals, and either allows, audits, asks for human approval, or blocks it. All of this happens on your machine, with no external API and nothing leaving the box.

Autonomous coding agents run shell commands, edit files, and make network calls on your behalf, at machine speed and often unattended. One bad step (`rm -rf`, an unwanted `git push`, a leaked `.env`, a poisoned README that tricks the agent into exfiltrating secrets) and there's no human in the loop to stop it. Cerberus puts that checkpoint on the tool boundary, where the agent actually acts.

    PreToolUse  ─▶ intercept ─▶ Policy + Behavioral + Content + Injection ─▶ Risk Engine ─▶ ALLOW · AUDIT · HITL · BLOCK
    PostToolUse ─▶ inspect   ─▶ secret + injection detection ─▶ session contamination state

Four deterministic signals are aggregated into one weighted risk score, with a hard floor so that a low score can never override an absolute prohibition.

- 🟢 Secret exfiltration: detects secrets loaded into context, then content-matches the outbound payload. It holds the call that actually carries the key (raw or base64/hex/url-encoded), with provenance (`source: .env:4 · sha256:… · 97%`), and never logs the secret itself.
- 🟢 Excessive permissions: every call is gated; unknown tools fail closed; sensitive paths (`~/.ssh`, `~/.aws`, credentials, `/etc/passwd`) are held; destructive commands (`rm -rf`, `Remove-Item -Recurse`, `chmod 777`, `kill -9`) are blocked or held.
- 🟢 Dangerous egress: destination policy. Trusted hosts (registries, GitHub, OpenAI/Anthropic) are auto-allowed; paste sites, webhook catchers, and raw-IP destinations are held.
- 🟡 Tool abuse: runaway-loop and tool-call-rate/repetition detection.
- 🟡 Prompt injection: detects injection in tool results and gates the next egress (heuristic classifier; optional local DeBERTa model).
  It sees tool calls, not the LLM prompt, so it catches the exploitation of an injection (the egress), not the injection itself.

- Terminal-first approval: held calls surface in the agent's native permission prompt (Claude Code / Cursor), or via `cerberus approve <id>` / a localhost dashboard.
- Forensic dashboard: per-session timeline, risk-factor breakdown, and a Replay player that steps through how a session's risk built up.
- Multi-agent: one adapter layer serves Claude Code, Codex, Cursor, and Cline.
- Policy as data: rules and risk weights are editable YAML, not code.
- Local-first: binds to `127.0.0.1`, no external API, no telemetry; secret values never touch disk or logs.

Install with `npm i -g @cerberussec/core`, or run ad hoc with `npx @cerberussec/core <cmd>`.

1. Wire Cerberus into your agent (merges into the agent's config; backed up, idempotent): `cerberus init`. The default is Claude Code, project-level; options are `--agent codex|cursor|cline`, `--global`, and `--print`.
2. Start the gateway and dashboard (one process): `cerberus engine`, then open http://127.0.0.1:9000/.

Use your agent as usual: tool calls now route through Cerberus. By default a held (HITL) call is approved right in the terminal: Cerberus returns `ask`, so Claude Code shows its native permission prompt with Cerberus's reason, and you approve or deny without leaving your session.

The dashboard (http://127.0.0.1:9000/) has a Live tab (Action Center + stream) and a Sessions tab: a forensic timeline per session with a risk-factor breakdown and a Replay player to step through how a session's risk built up. Cerberus runs inside the agent's execution loop, so the terminal is the realtime decision point and the dashboard is the deep dive.
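
As a rough illustration of the scoring model described above (four deterministic signals, one weighted score, and a hard floor), here is a minimal TypeScript sketch. The signal names come from the flow diagram; the weights, thresholds, `SignalResult` shape, and `decide` function are hypothetical and are not Cerberus's actual API or defaults.

```typescript
// Hypothetical sketch: weights, thresholds, and names are illustrative
// assumptions, not Cerberus's real configuration or API.
type Verdict = "ALLOW" | "AUDIT" | "HITL" | "BLOCK";
type SignalName = "policy" | "behavioral" | "content" | "injection";

interface SignalResult {
  score: number;   // 0 (benign) .. 1 (maximally risky)
  floor?: Verdict; // minimum verdict this signal insists on, if any
}

// Assumed weights; they sum to 1 so the total stays in [0, 1].
const WEIGHTS: Record<SignalName, number> = {
  policy: 0.4,
  behavioral: 0.2,
  content: 0.3,
  injection: 0.1,
};

// Verdicts ordered from least to most restrictive.
const ORDER: Verdict[] = ["ALLOW", "AUDIT", "HITL", "BLOCK"];

// Assumed thresholds mapping a weighted score to a verdict.
function verdictForScore(score: number): Verdict {
  if (score >= 0.8) return "BLOCK";
  if (score >= 0.5) return "HITL";
  if (score >= 0.25) return "AUDIT";
  return "ALLOW";
}

function decide(
  signals: Record<SignalName, SignalResult>
): { score: number; verdict: Verdict } {
  let score = 0;
  let floor: Verdict = "ALLOW";
  for (const name of Object.keys(WEIGHTS) as SignalName[]) {
    const s = signals[name];
    // Clamp each signal to [0, 1] before weighting.
    score += WEIGHTS[name] * Math.min(1, Math.max(0, s.score));
    // Track the most restrictive floor any signal demands.
    if (s.floor && ORDER.indexOf(s.floor) > ORDER.indexOf(floor)) {
      floor = s.floor;
    }
  }
  const scored = verdictForScore(score);
  // The floor wins whenever the weighted score alone would be more lenient.
  const verdict = ORDER.indexOf(scored) >= ORDER.indexOf(floor) ? scored : floor;
  return { score, verdict };
}

// A call with a low overall score that hits an absolute prohibition.
const example = decide({
  policy: { score: 0.1, floor: "BLOCK" },
  behavioral: { score: 0 },
  content: { score: 0 },
  injection: { score: 0 },
});
console.log(example.verdict); // BLOCK
```

In this sketch the weighted score alone would allow the call, but the verdict is still `BLOCK` because one signal sets a floor. That is the property a hard floor is meant to guarantee.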

Per severity (default `AG_APPROVAL_SURFACE=terminal`):

| verdict | terminal | web UI |
|---|---|---|
| BLOCK | ⛔ denied in-terminal (Claude shows the reason) + optional auto-open | forensics |
| HITL | ✋ Claude's native permission prompt, with Cerberus's reason | forensics |
| AUDIT | nothing shown (quiet) | elevated-risk record |
| ALLOW | nothing shown (silent) | nothing |

Prefer a central web queue instead? Set `AG_APPROVAL_SURFACE=dashboard`: held calls then pause on the engine's synchronous hold, and you approve or deny from the dashboard or from the terminal, out-of-band:

    cerberus pending        # list calls held for review (with their ids)
    cerberus approve <id>   # release a held call …
    cerberus deny <id>      # … or deny it

Extra terminal alerts write to the controlling terminal (`/dev/tty`, falling back to stderr) so the protocol channel to Claude Code stays clean. Tune via env:

| env | default | effect |
|---|---|---|
| `AG_NOTIFY` | `1` | extra terminal alert lines on/off (`0` to silence) |
| `AG_APPROVAL_SURFACE` | `terminal` | `terminal` ⇒ HITL via Claude's native prompt; `dashboard` ⇒ socket hold + dashboard approve |
| `AG_AUTO_OPEN` | `off` | `block` ⇒ auto-open the investigation UI on a BLOCK/EXFIL |

The engine + signals + risk + dashboard are agent-agnostic; only a thin adapter (parse the agent's hook event → normalize → emit its verdict shape) is per-agent. Wire one with `cerberus init --agent <name>`:

| agent | `--agent` | HITL approval | notes |
|---|---|---|---|
| Claude Code | `claude` (default) | native terminal prompt (`ask`) | verified end-to-end |
| Codex CLI | `codex` | dashboard hold (no native `ask`; `AG_APPROVAL_SURFACE=dashboard`) | enterprise `requirements.toml` makes it non-bypassable |
| Cursor | `cursor` | native IDE prompt (`ask`) | `init` sets `failClosed: true` |
| Cline | `cline` | dashboard hold (`cancel` bool) | macOS/Linux only |

The `codex` / `cursor` / `cline` adapters follow the published hook specs; verify against your installed version (`cerberus init --agent <name> --print` shows the exact config). Roo Code is unsupported (archived 2026).
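
The verdict and adapter tables above can be read as one small decision function: given a verdict, an agent, and the approval surface, what does the adapter hand back to the agent? The sketch below is one way to express that. `Action`, `HAS_NATIVE_ASK`, `surfaceFromEnv`, and `actionFor` are hypothetical names, the real per-agent hook payloads are not reproduced here, and the environment variable is assumed to be spelled `AG_APPROVAL_SURFACE`.

```typescript
// Hypothetical sketch of the per-agent adapter decision; not Cerberus's API.
type Verdict = "ALLOW" | "AUDIT" | "HITL" | "BLOCK";
type Agent = "claude" | "codex" | "cursor" | "cline";
type Surface = "terminal" | "dashboard";
// "hold" stands for pausing the call until a dashboard/CLI decision arrives.
type Action = "allow" | "deny" | "ask" | "hold";

// Agents that, per the adapter table, offer a native "ask" prompt.
const HAS_NATIVE_ASK: Record<Agent, boolean> = {
  claude: true,
  codex: false,
  cursor: true,
  cline: false,
};

// Assumed env var name; anything other than "dashboard" means "terminal".
function surfaceFromEnv(env: Record<string, string | undefined>): Surface {
  return env["AG_APPROVAL_SURFACE"] === "dashboard" ? "dashboard" : "terminal";
}

function actionFor(verdict: Verdict, agent: Agent, surface: Surface): Action {
  switch (verdict) {
    case "BLOCK":
      return "deny";
    case "HITL":
      // Use the native prompt only when the surface is the terminal and
      // the agent has one; otherwise fall back to a dashboard hold.
      return surface === "terminal" && HAS_NATIVE_ASK[agent] ? "ask" : "hold";
    default:
      // ALLOW and AUDIT both proceed; AUDIT is only recorded.
      return "allow";
  }
}

console.log(actionFor("HITL", "claude", surfaceFromEnv({}))); // ask
console.log(actionFor("HITL", "codex", surfaceFromEnv({})));  // hold
```

Keeping this mapping in one pure function is a design choice for the sketch: it makes the per-agent differences testable without running any agent.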

- PreToolUse hook → `/intercept` is the single hard enforcement point (allow/deny/ask; or HITL holds the socket open until you decide).
- PostToolUse hook → `/inspect` is observe-only: it updates the session's contamination state so the next action is judged with full context. It never modifies a tool result.
- The engine is agent-agnostic at its core; per-agent adapters (`--agent`) are the only thing that differs.

        PreToolUse  ─▶ /intercept ─▶ Policy + Behavioral + Content/Injection ─▶ RiskEngine ─▶ ALLOW/AUDIT/HITL/BLOCK
        PostToolUse ─▶ /inspect   ─▶ secret detection + injection classifier ─▶ session contamination state
        audit log + WebSocket → dashboard

Single Node + TypeScript package; the dashboard is a Vite/React app served by the engine. Rules and risk weights are editable YAML (data, not code) in `rules/`.

Cerberus is a runtime gateway on the tool boundary. It's strongest at secret-exfiltration prevention and as a permission chokepoint. Because it sees tool calls (not the LLM prompt), it catches the exploitation of a prompt injection, not the injection itself, and it does not cover data-pipeline / RAG poisoning. The exfil match is high-confidence but not airtight (novel secret formats, split-across-calls encoding). Honest defaults over false guarantees.

No external API, no API key, nothing leaves the machine. The optional injection model ([`@cerberussec/injection-model`](/Adirdabush1/cerberus/blob/main/packages/injection-model), ProtectAI DeBERTa, Apache-2.0) upgrades the built-in heuristic classifier; install it only if you want it. The core is OSS-clean (Apache/MIT-compatible deps); Meta Prompt-Guard is deliberately kept out of core (Llama license).
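
The secret-exfiltration path described above (the PostToolUse side notes a secret entering context, the PreToolUse side content-matches the outbound payload, raw or base64/hex/url-encoded) can be sketched as follows. Every name here (`SessionState`, `recordSecret`, `findLeak`, and so on) is hypothetical, and the matching is deliberately simple: it assumes ASCII secrets and a secret encoded on its own, so it shares the stated limits around novel formats and split-across-calls encoding.

```typescript
// Hypothetical sketch of content-matching for exfiltration; not Cerberus's API.
// Assumes ASCII secrets (typical for API keys) encoded on their own.
interface TrackedSecret { value: string; source: string }
interface SessionState { secrets: TrackedSecret[] }

const B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

function toBytes(s: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < s.length; i++) out.push(s.charCodeAt(i) & 0xff);
  return out;
}

function hex(bytes: number[]): string {
  let out = "";
  for (const b of bytes) out += (b < 16 ? "0" : "") + b.toString(16);
  return out;
}

// Standard base64 (RFC 4648 alphabet) with "=" padding.
function base64(bytes: number[]): string {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const has1 = i + 1 < bytes.length;
    const has2 = i + 2 < bytes.length;
    const b0 = bytes[i];
    const b1 = has1 ? bytes[i + 1] : 0;
    const b2 = has2 ? bytes[i + 2] : 0;
    out += B64.charAt(b0 >> 2) + B64.charAt(((b0 & 3) << 4) | (b1 >> 4));
    out += has1 ? B64.charAt(((b1 & 15) << 2) | (b2 >> 6)) : "=";
    out += has2 ? B64.charAt(b2 & 63) : "=";
  }
  return out;
}

// Forms checked per secret: raw, base64 (padding stripped), hex, URL-encoded.
function variants(secret: string): string[] {
  const bytes = toBytes(secret);
  return [
    secret,
    base64(bytes).replace(/=+$/, ""),
    hex(bytes),
    encodeURIComponent(secret),
  ];
}

// PostToolUse side: remember a secret that entered the agent's context.
function recordSecret(state: SessionState, value: string, source: string): void {
  state.secrets.push({ value, source });
}

// PreToolUse side: return the provenance of the first secret found in the
// outbound payload, or null. The secret value itself is never returned.
function findLeak(state: SessionState, payload: string): string | null {
  for (const secret of state.secrets) {
    for (const v of variants(secret.value)) {
      if (payload.indexOf(v) !== -1) return secret.source;
    }
  }
  return null;
}

const state: SessionState = { secrets: [] };
recordSecret(state, "sk-test-123/abc", ".env:4"); // made-up key for illustration
console.log(findLeak(state, "curl https://example.invalid/?k=sk-test-123%2Fabc")); // .env:4
console.log(findLeak(state, "ls -la")); // null
```

Returning only the provenance string mirrors the idea that the secret itself should not end up in logs. A fuller matcher would also need to handle a secret embedded inside a larger encoded blob, which this sketch does not attempt.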

From a clone, install (root + dashboard are separate npm projects) and build:

    npm install && npm --prefix dashboard install
    npm run build        # compile the engine (tsc → dist) + dashboard (vite → dashboard/dist)
    npm run engine       # run from source via tsx (dev)

    npm run typecheck
    npm run test:behavioral && npm run test:content && npm run test:injection && npm run test:risk \
      && npm run test:init && npm run test:projector && npm run test:audit && npm run test:notify \
      && npm run test:security && npm run test:policy && npm run test:adapters
    npm run e2e:behavioral && npm run e2e:content && npm run e2e:injection && npm run e2e:risk

See `PLAN.md` for milestones and `brainstorms/` for the design records behind each decision.