# Local firewall for AI agents that cuts output token usage by 40–70%

Guardian Runtime, a local-first security middleware and FinOps firewall for AI agents, intercepts every prompt and response locally to stop data leaks and runaway token costs. The tool tracks token usage, enforces daily budget limits, and, according to the project's own benchmarks, reduces output tokens by 40–70% through context optimization and output brevity enforcement, while scanning for secrets and PII before they reach cloud LLM providers.

A Zero-Latency FinOps & Security Firewall for AI Applications.

- 🌐 Website & Docs: https://ashp15205.github.io/guardian-runtime/
- 📦 Available on PyPI: https://pypi.org/project/guardian-runtime/

Contents:

- 🛑 The Core Problem: Why You Need Guardian
- 🟢 The Solution: What is Guardian Runtime?
- 🏗 Architecture
- 🚀 Quickstart & Installation
- 🎯 Comprehensive Use Cases (Where & How to Use)
- 💻 Complete CLI Command Reference
- ⚙️ Advanced Configuration (Policy YAML)
- 📜 License

## 🛑 The Core Problem: Why You Need Guardian

As AI coding agents (Claude Code, Cursor, Aider) become standard developer tools, they introduce two massive, hidden risks and one regulatory headache.

**Cost Runaways.** Autonomous agents operate in loops. If an agent gets stuck retrying a bug fix or accidentally dumps a massive 1 GB log file into its context window, you can wake up to a $100 API bill overnight. The Problem: You have zero visibility or control over session costs until the provider's bill arrives at the end of the month.

**Data Exfiltration.** Coding agents require full local codebase access to be useful.
However, if you accidentally leave an `AWS_SECRET_KEY` or a database password in a `.env` file, the agent will silently upload it to a third-party LLM provider (OpenAI, Anthropic). The Problem: Current observability tools like Langfuse only log the leak after the credentials have already reached the cloud.

**Compliance.** Sending unauthorized PII (such as SSNs or emails in a test database) to foreign LLM APIs can violate GDPR and DPDP regulations.

## 🟢 The Solution: What is Guardian Runtime?

Guardian Runtime is a local-first security middleware and FinOps firewall. It runs entirely on your local machine and intercepts LLM traffic before it leaves your infrastructure.

| The Problem | How Guardian Solves It |
|---|---|
| Cost Runaways | **Hard FinOps Budgets & Optimization:** Tracks every token you spend locally. You can set a strict "$5.00 per day" limit. Advanced Terse Mode aggressively optimizes input context and enforces output brevity via system prompt injection. In benchmarks across real developer prompts, it reduces output tokens by 40–70% while maintaining full technical accuracy. |
| Data Exfiltration | **Zero-Latency Secret Scanners:** Scans every prompt for API keys, AWS credentials, and secrets locally. If it detects a secret, it drops the request before it reaches the internet. |
| Compliance | **Local PII Blocking:** Regex and ML scanners prevent PII from leaving your machine. |

## 🏗 Architecture

Guardian intercepts traffic at the network layer or via SDK, passing it through a strict verification pipeline before it ever reaches the cloud.

```
Agent / Dev               Guardian Runtime                  Cloud LLM
    │                            │                               │
    │ 1. Prompt + Context        │                               │
    │ ─────────────────────────▶ │                               │
    │                            │                               │
    │                            │ Security Firewall             │
    │                            │ ├─ Scan AWS Keys / Secrets    │
    │                            │ └─ Block if Threat Detected ──┼─ Drops Request
    │                            │                               │
    │                            │ Token Optimizer               │
    │                            │ ├─ Compress Whitespace        │
    │                            │ └─ Terse Mode Output Trim     │
    │                            │                               │
    │                            │ FinOps Budget                 │
    │                            │ ├─ Check Daily Spend Limit    │
    │                            │ └─ Block if $5 Limit Hit ─────┼─ Drops Request
    │                            │                               │
    │                            │ 2. Sanitized Prompt           │
    │                            │ ────────────────────────────▶ │
    │                            │                               │
    │                            │ 3. LLM Response               │
    │                            │ ◀──────────────────────────── │
    │                            │                               │
    │                            │ Output Guard                  │
    │                            │ Audit for Leaked PII/Secrets  │
    │                            │                               │
    │ 4. Safe Response           │                               │
    │ ◀───────────────────────── │                               │
    │                            │                               │
```

Guardian Runtime acts as an HTTP proxy or a native Python SDK, so it integrates with almost any modern AI tool without modifying the tool's internal code.

- Visual IDEs: Cursor, Windsurf, VS Code (via Cline/RooCode)
- Terminal Agents: Claude Code, Aider, GitHub Copilot CLI
- Frameworks: LangChain, AutoGen, LlamaIndex, CrewAI
- LLM Providers: OpenAI, Anthropic, Google Gemini (via OpenAI compatibility layer)
- Supported Models: Claude Fable 5, Claude Opus 4.8, GPT-4o, Gemini

## 🚀 Quickstart & Installation

Core framework only:

```bash
pip install guardian_runtime
```

Or install with specific LLM providers:

```bash
pip install "guardian_runtime[openai]"
pip install "guardian_runtime[anthropic]"
pip install "guardian_runtime[gemini]"
```

Or install everything (Providers, ML Scanner, Document Converter):

```bash
pip install "guardian_runtime[all]"
```

Done. No signup, no keys, zero configuration required. All monitoring data stays on your local machine in `~/.guardian_runtime/`.

## 🎯 Comprehensive Use Cases (Where & How to Use)

Guardian is designed to be universal. Here are the ways to deploy it, depending on your workflow.

**Why use it here?** CLI agents operate autonomously. They can accidentally read a `.env` file containing your production AWS keys and send it to Anthropic/OpenAI as context. Guardian prevents this and ensures the agent doesn't blow your budget.

**How to use:**

- Start the proxy in a background terminal:
  ```bash
  guardian_runtime proxy --port 8080
  ```
- Tell your agent to route traffic through the proxy using environment variables.

  In PowerShell:
  ```powershell
  $env:ANTHROPIC_BASE_URL="http://localhost:8080"
  claude
  ```

  In Mac/Linux/Git Bash:
  ```bash
  export ANTHROPIC_BASE_URL=http://localhost:8080
  claude
  ```

**Why use it here?** Modern GUI editors like Cursor have deep codebase access. While coding, you might highlight a file containing a secret and ask "explain this file". Guardian stops Cursor from sending that secret to the cloud.

**How to use (Cursor example):**

- Start the proxy in your terminal:
  ```bash
  guardian_runtime proxy --port 8080
  ```
- Open Cursor Settings (Cmd/Ctrl + ,)
- Navigate to Models > Override Base URL
- Set the Base URL to: `http://localhost:8080`

Now all of Cursor's traffic is protected and tracked locally.

**Why use it here?** If you are building a production chatbot or RAG pipeline, you must ensure your users cannot perform "jailbreak" prompt injections or trick the LLM into leaking internal system prompts.

**How to use:** Use Guardian as a drop-in replacement for the OpenAI/Anthropic SDK.

```python
import os

from guardian_runtime import GuardianRuntime, GuardianRuntimeBlockedError

os.environ["OPENAI_API_KEY"] = "sk-proj-..."

gr = GuardianRuntime()  # Zero-config initialization

try:
    # Protects user input before sending to OpenAI
    response = gr.complete(
        messages=[{"role": "user", "content": "My AWS Key is AKIAIOSFODNN7EXAMPLE"}],
        raise_on_block=True,
    )
    print(response.content)
except GuardianRuntimeBlockedError as e:
    # Fails cleanly in your app instead of leaking the secret
    print(f"Blocked Locally: {e.response.violations[0].detail}")
```

**Why use it here?** Frameworks that spawn multiple communicating agents can rapidly consume tokens. Guardian acts as a central cost-tracking hub for all agent nodes.

**How to use:** Point your framework's `base_url` to the local proxy.

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="gpt-4o",
    base_url="http://localhost:8080",  # Traffic routes through Guardian
    api_key="sk-proj-...",
)
response = llm.invoke("Hello, Guardian!")
```

**Why use it here?** If you use the standard ChatGPT or Claude Web UI, uploading large PDFs eats up your context window quickly because PDFs contain large amounts of hidden formatting bloat.

**How to use:** Use the built-in CLI to strip out formatting bloat and compress documents into pure Markdown before manually uploading them.

```bash
guardian_runtime convert
```
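
To make the request path from the Architecture section more concrete, here is a minimal sketch of the three pre-flight stages (Security Firewall, Token Optimizer, FinOps Budget) in plain Python. It is an illustration under stated assumptions, not Guardian's actual implementation: `AWS_KEY_RE`, `Verdict`, `compress_whitespace`, and `preflight` are hypothetical names, the single regex stands in for a much larger set of secret patterns, and the cost estimate is simply passed in by the caller.

```python
import re
from dataclasses import dataclass

# Assumption for illustration: AWS access key IDs look like "AKIA" followed by
# 16 uppercase letters or digits. A real scanner would cover many more formats.
AWS_KEY_RE = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


@dataclass
class Verdict:
    allowed: bool   # True if the request may leave the machine
    reason: str     # "ok", "secret_detected", or "budget_exceeded"
    prompt: str     # sanitized prompt, empty when the request is dropped


def compress_whitespace(text: str) -> str:
    """Collapse runs of spaces/tabs and squeeze 3+ newlines into one blank line."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def preflight(prompt: str, spent_usd: float, est_cost_usd: float,
              daily_limit_usd: float = 5.00) -> Verdict:
    """Run the checks in the order shown in the architecture diagram."""
    # 1. Security firewall: drop the request if a secret is present.
    if AWS_KEY_RE.search(prompt):
        return Verdict(False, "secret_detected", "")
    # 2. Token optimizer: shrink the prompt before it is sent and billed.
    cleaned = compress_whitespace(prompt)
    # 3. FinOps budget: drop the request if it would exceed the daily limit.
    if spent_usd + est_cost_usd > daily_limit_usd:
        return Verdict(False, "budget_exceeded", "")
    return Verdict(True, "ok", cleaned)


if __name__ == "__main__":
    print(preflight("My AWS Key is AKIAIOSFODNN7EXAMPLE", 0.0, 0.01).reason)
    print(preflight("summarize   this\n\n\n\nfile", 1.25, 0.02))
```

Running the secret scan first means a leaked credential is rejected before any other work is done. Note that collapsing whitespace this aggressively would damage indentation-sensitive content such as Python source, so a production optimizer would need to be more selective.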