Headroom – The context compression layer for AI agents

Headroom launches a context compression layer for AI agents that reduces token usage by 60–95% while preserving answer quality. The open-source tool offers multiple integration modes including a library, proxy, agent wrapper, and MCP server, and supports reversible compression with local caching.

60–95% fewer tokens · library · proxy · MCP · 6 algorithms · local-first · reversible

[Docs](https://headroom-docs.vercel.app/docs) · [Install](#get-started-60-seconds) · [Proof](#proof) · [Agents](#agent-compatibility-matrix) · [Discord](https://discord.gg/yRmaUNpsPJ) · [llms.txt](/headroomlabs-ai/headroom/blob/main/llms.txt) · [Enterprise](/headroomlabs-ai/headroom/blob/main/ENTERPRISE.md)

AI agents / LLMs: read /llms.txt here, or fetch the live index / full docs blob.

Headroom compresses everything your AI agent reads (tool outputs, logs, RAG chunks, files, and conversation history) before it reaches the LLM. Same answers, fraction of the tokens.

![HeadroomDemo-Fast](/headroomlabs-ai/headroom/blob/main/HeadroomDemo-Fast.gif)

Live: 10,144 → 1,260 tokens, same FATAL found.

- **Library**: compress messages in Python or TypeScript, inline in any app
- **Proxy**: `headroom proxy --port 8787`, zero code changes, any language
- **Agent wrap**: `headroom wrap claude|codex|cursor|aider|copilot` in one command
- **MCP server**: `headroom_compress`, `headroom_retrieve`, `headroom_stats` for any MCP client
- **Cross-agent memory**: shared store across Claude, Codex, Gemini, auto-dedup
- **`headroom learn`**: mines failed sessions, writes corrections to `CLAUDE.md` / `AGENTS.md`
- **Output token reduction**: trims what the model writes back (not just what you send): drops ceremony/restated code and skips deep "thinking" on routine steps. See [Output token reduction](#output-token-reduction-cut-what-the-model-writes-back).
- **Reversible (CCR)**: originals are cached for retrieval on demand

```
 Your agent / app
   (Claude Code, Cursor, Codex, LangChain, Agno, Strands, your own code…)
        │   prompts · tool outputs · logs · RAG results · files
        ▼
   ┌────────────────────────────────────────────────────┐
   │  Headroom (runs locally, your data stays here)     │
   │  ────────────────────────────────────────────────  │
   │  CacheAligner → ContentRouter → CCR                │
   │                 ├─ SmartCrusher   (JSON)           │
   │                 ├─ CodeCompressor (AST)            │
   │                 └─ Kompress-base  (text, HF)       │
   │                                                    │
   │  Cross-agent memory · headroom learn · MCP         │
   └────────────────────────────────────────────────────┘
        │   compressed prompt + retrieval tool
        ▼
 LLM provider (Anthropic · OpenAI · Bedrock · …)
```

- **ContentRouter**: detects content type, selects the right compressor
- **SmartCrusher / CodeCompressor / Kompress-base**: compress JSON, AST, or prose
- **CacheAligner**: stabilizes prefixes so provider KV caches actually hit
- **CCR**: stores originals locally; the LLM calls `headroom_retrieve` if it needs them

→ [Architecture](https://headroom-docs.vercel.app/docs/architecture) · [CCR reversible compression](https://headroom-docs.vercel.app/docs/ccr) · [Kompress-v2-base model card](https://huggingface.co/chopratejas/kompress-v2-base)

## Get started (60 seconds)

```bash
# 1. Install
pip install "headroom-ai[all]"      # Python
npm install headroom-ai             # Node / TypeScript

# 2. Pick your mode
headroom wrap claude                # wrap a coding agent
headroom proxy --port 8787          # drop-in proxy, zero code changes
# or: from headroom import compress (inline library)

# 3. See the savings
headroom perf
```

Granular extras: `proxy`, `mcp`, `ml`, `code`, `memory`, `relevance`, `image`, `agno`, `langchain`, `evals`, `pytorch-mps` (Apple-GPU memory-embedder offload; set `HEADROOM_EMBEDDER_RUNTIME=pytorch_mps`). Requires **Python 3.10+**.
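To make the ContentRouter step in the diagram concrete, here is a minimal sketch of content-type routing. The mapping of JSON, code, and prose to SmartCrusher, CodeCompressor, and Kompress-base comes from the overview above; the detection heuristic, the function names, and the restriction to Python source are assumptions for illustration, not Headroom's actual implementation.

```python
import ast
import json

# Compressor names are taken from the README; everything else here is hypothetical.
ROUTES = {"json": "SmartCrusher", "code": "CodeCompressor", "text": "Kompress-base"}

# Node types that strongly suggest the payload is Python source rather than prose.
_CODE_NODES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef, ast.Import, ast.ImportFrom)


def detect_content_type(payload: str) -> str:
    """Classify a payload as 'json', 'code', or 'text' with a simple heuristic."""
    stripped = payload.strip()
    # Try JSON first: objects and arrays are the common tool-output shapes.
    if stripped.startswith(("{", "[")):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    # Next, see whether the payload parses as Python source.
    try:
        tree = ast.parse(stripped)
    except (SyntaxError, ValueError):
        return "text"
    # Short prose can parse as a bare expression, so require a def/class/import.
    if any(isinstance(node, _CODE_NODES) for node in ast.walk(tree)):
        return "code"
    return "text"


def route(payload: str) -> str:
    """Return the name of the compressor this sketch would pick for the payload."""
    return ROUTES[detect_content_type(payload)]
```

A real router would need detection for every language CodeCompressor supports and for logs or mixed content; this sketch only shows the shape of the decision.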
## Proof

Savings on real agent workloads:

| Workload | Before | After | Savings |
|---|---|---|---|
| Code search (100 results) | 17,765 | 1,408 | 92% |
| SRE incident debugging | 65,694 | 5,118 | 92% |
| GitHub issue triage | 54,174 | 14,761 | 73% |
| Codebase exploration | 78,502 | 41,254 | 47% |

Accuracy preserved on standard benchmarks:

| Benchmark | Category | N | Baseline | Headroom | Delta |
|---|---|---|---|---|---|
| GSM8K | Math | 100 | 0.870 | 0.870 | ±0.000 |
| TruthfulQA | Factual | 100 | 0.530 | 0.560 | +0.030 |
| SQuAD v2 | QA | 100 | n/a | 97% | 19% compression |
| BFCL | Tools | 100 | n/a | 97% | 32% compression |

Reproduce: `python -m headroom.evals suite --tier 1` · [Full benchmarks & methodology](https://headroom-docs.vercel.app/docs/benchmarks)

## Output token reduction: cut what the model writes back

Everything above shrinks the prompt you send. But you also pay for every token the model writes back, and on Opus-class models output costs 5× input. A lot of that output is waste: "Great, let me…" preambles, re-printing code you just showed it, and deep "thinking" on routine steps like reading a file. Headroom can trim that too, from the proxy, without you changing any code:

- **Verbosity steering**: appends a short "be terse, don't restate context" note to the end of the system prompt (so your prompt cache still hits).
- **Effort routing**: when a turn is just the model resuming after a tool result (a file read, a passing test), it dials the model's thinking effort down. New questions and errors keep full effort.

Turn it on:

```bash
export HEADROOM_OUTPUT_SHAPER=1     # off by default
headroom proxy --port 8787
```

**Already running a proxy?** These switches are read live on every request, so a proxy that `headroom wrap` reused rather than started would not see a value you export afterwards: its environment was snapshotted at launch. `headroom wrap` now hot-syncs your current settings to the running proxy via a loopback `POST /admin/runtime-env`, so they take effect immediately with no restart (no cold start, no dropped requests, no lost caches).
Set them before you wrap. On a shared proxy these overrides are global: the last explicit setting wins.

**Learn the right terseness for you.** People don't say how terse they want answers; they show it (they interrupt long replies, or move on before they could have read them). `headroom learn --verbosity` reads your past sessions and picks the level automatically:

```bash
headroom learn --verbosity            # preview what it found (dry run)
headroom learn --verbosity --apply    # save it; the proxy uses it from now on
```

**See how many output tokens you saved.** Output savings are counterfactual (we never see what the model would have written), so Headroom reports an honest estimate with a confidence range, never a made-up number:

```bash
headroom output-savings
# Reduction: 31.7%   95% CI 27.7% … 35.7%   estimated
```

Want a measured number instead of an estimate? Leave 10% of conversations unshaped as a control group: `export HEADROOM_OUTPUT_HOLDOUT=0.1`. The dashboard shows an **Output Tokens Saved** card next to input compression, labelled measured or estimated with the confidence band.

→ Full write-up (incl. the measurement methodology): `docs/proposals/output-token-reduction.md`

## Agent compatibility matrix

| Agent | `headroom wrap` | Notes |
|---|---|---|
| Claude Code | ✅ | `--memory` · `--code-graph` |
| Codex | ✅ | shares memory with Claude |
| Cursor | ✅ | prints config, paste once |
| Aider | ✅ | starts proxy + launches |
| Copilot CLI | ✅ | starts proxy + launches |
| OpenClaw | ✅ | installs as ContextEngine plugin |
| Cortex Code | ✅ | 60–65% savings · library mode |

Any OpenAI-compatible client works via `headroom proxy`. MCP-native: `headroom mcp install`.

Headroom can route GitHub Copilot CLI subscription traffic through the local proxy:

```bash
headroom copilot-auth login
headroom wrap copilot --subscription -- --model gpt-4o
```

This lets Headroom intercept OpenAI-compatible Copilot CLI requests and apply the same proxy compression pipeline before forwarding to GitHub Copilot's hosted API.
The wrapper exchanges Headroom's reusable GitHub OAuth token for Copilot's short-lived API token and prints the upstream endpoint as `COPILOT_PROVIDER_API_URL=...` during launch.

`headroom copilot-auth login` stores a Headroom-specific Copilot OAuth token. This avoids relying on generic GitHub or Copilot CLI tokens that can read Copilot account metadata but may still be rejected by Copilot's token-exchange endpoint.

For GitHub Enterprise Server or custom-domain Copilot deployments, set the deployment domain before launching:

```bash
export GITHUB_COPILOT_ENTERPRISE_DOMAIN=ghe.example.com
```

For GitHub.com Enterprise Cloud URLs such as `github.com/enterprises/your-enterprise`, do not set an enterprise-domain override. Headroom uses GitHub's normal token-exchange endpoint and the Copilot API endpoint advertised for the signed-in account.

**Platform support note:** macOS auth reuse via Copilot CLI Keychain storage has been smoke-tested. Windows Credential Manager, Linux Secret Service / `secret-tool`, and Docker/CI token-injection paths are implemented or planned as auth-discovery paths, but still need real OS validation before they should be considered fully vetted. For Docker and CI, prefer passing an explicit `GITHUB_COPILOT_TOKEN` or `GITHUB_COPILOT_GITHUB_TOKEN` rather than relying on host keychain access.
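For Docker and CI jobs, the two rules above (pass an explicit token, and do not set a domain override for GitHub.com Enterprise Cloud) can be checked before launching the wrapper. The following is a hypothetical pre-flight helper, not part of Headroom: the function name and warning texts are invented for illustration, and the environment variable names are assumed to be the underscore-joined forms of the names in this section.

```python
import os
from typing import List, Mapping, Optional

# Variable names as described in this README (assumed spelling with underscores).
EXPLICIT_TOKEN_VARS = ("GITHUB_COPILOT_TOKEN", "GITHUB_COPILOT_GITHUB_TOKEN")
DOMAIN_VAR = "GITHUB_COPILOT_ENTERPRISE_DOMAIN"


def copilot_env_warnings(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return human-readable warnings for a CI/Docker environment (hypothetical helper)."""
    env = os.environ if env is None else env
    warnings: List[str] = []

    # Rule 1: in CI, an explicit token avoids depending on host keychain access.
    if not any(env.get(name, "").strip() for name in EXPLICIT_TOKEN_VARS):
        warnings.append("no explicit Copilot token set; keychain discovery would be required")

    # Rule 2: github.com Enterprise Cloud URLs should not use the domain override.
    domain = env.get(DOMAIN_VAR, "").strip().lower()
    host = domain.removeprefix("https://").removeprefix("http://").split("/", 1)[0]
    if host == "github.com" or host.endswith(".github.com"):
        warnings.append(f"{DOMAIN_VAR} points at github.com; unset it for Enterprise Cloud")

    return warnings
```

A CI step could call `copilot_env_warnings()` and fail the job when the list is non-empty, before running `headroom wrap copilot --subscription`.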
**Great fit if you…**

- run AI coding agents daily and want savings without changing your code
- work across multiple agents and want shared memory
- need reversible compression: originals are retrievable via CCR within the configured TTL

**Skip it if you…**

- only use a single provider's native compaction and don't need cross-agent memory
- work in a sandboxed environment where local processes can't run

## Integrations: drop Headroom into any stack

| Your setup | Hook in with |
|---|---|
| Any Python app | `compress(messages, model=…)` |
| Any TypeScript app | `await compress(messages, { model })` |
| Anthropic / OpenAI SDK | `withHeadroom(new Anthropic())` · `withHeadroom(new OpenAI())` |
| Vercel AI SDK | `wrapLanguageModel({ model, middleware: headroomMiddleware() })` |
| LiteLLM | `litellm.callbacks = [HeadroomCallback()]` |
| LangChain | `HeadroomChatModel(your_llm)` |
| Agno | `HeadroomAgnoModel(your_model)` |
| Strands | |

Also available: `app.add_middleware(CompressionMiddleware)`, `SharedContext().put` / `.get`, and `headroom mcp install`.

## What's inside

- **SmartCrusher**: universal JSON: arrays of dicts, nested objects, mixed types.
- **CodeCompressor**: AST-aware for Python, JS, Go, Rust, Java, C++.
- **Kompress-base**: our HuggingFace model, trained on agentic traces.
- **Image compression**: 40–90% reduction via trained ML router.
- **CacheAligner**: stabilizes prefixes so Anthropic/OpenAI KV caches actually hit.
- **IntelligentContext**: score-based context fitting with learned importance.
- **CCR**: reversible compression; the LLM retrieves originals on demand.
- **Cross-agent memory**: shared store, agent provenance, auto-dedup.
- **SharedContext**: compressed context passing across multi-agent workflows.
- **`headroom learn`**: plugin-based failure mining for Claude, Codex, Gemini.

## Pipeline internals

Headroom exposes one stable request lifecycle across `compress()`, the SDK, and the proxy:

```
Setup → Pre-Start → Post-Start → Input Received → Input Cached → Input Routed
      → Input Compressed → Input Remembered → Pre-Send → Post-Send → Response Received
```

- **Transforms** do the work: CacheAligner, ContentRouter, SmartCrusher, CodeCompressor, Kompress-base, IntelligentContext / RollingWindow.
- **Pipeline extensions** observe or customize lifecycle stages via `on_pipeline_event(...)`.
- **Compression hooks** sit alongside the canonical lifecycle as an additional extension seam.
- **Proxy extensions** remain the server/app integration seam for ASGI middleware, routes, and startup policy.

Provider and tool-specific behavior lives under `headroom/providers/` so core orchestration stays focused on lifecycle, sequencing, and policy.

- **CLI/tool slices**: `headroom/providers/claude`, `copilot`, `codex`, `openclaw`
- **Provider runtime slices**: `headroom/providers/claude`, `gemini`, plus shared backend/runtime dispatch in `headroom/providers/registry.py`
- **Core files stay orchestration-first**: `wrap.py`, `client.py`, `cli/proxy.py`, and `proxy/server.py` delegate provider-specific env shaping, API target normalization, backend selection, and transport dispatch.

```bash
pip install "headroom-ai[all]"                    # Python, everything
npm install headroom-ai                           # TypeScript / Node
docker pull ghcr.io/chopratejas/headroom:latest
```

Granular extras: `proxy`, `mcp`, `ml` (Kompress-base), `code`, `memory`, `relevance`, `image`, `agno`, `langchain`, `evals`, `pytorch-mps` (Apple-GPU memory-embedder offload; set `HEADROOM_EMBEDDER_RUNTIME=pytorch_mps`). Requires **Python 3.10+**. Using `pipx`? Choose a supported interpreter explicitly: `pipx install --python python3.13 "headroom-ai[all]"`.

→ [Installation guide](https://headroom-docs.vercel.app/docs/installation): Docker tags, persistent service, PowerShell, devcontainers.
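As an illustration of the request lifecycle listed under Pipeline internals, the sketch below models the eleven stages as an ordered enum and lets observers watch each one. The stage names come from this README; the `Stage`, `Lifecycle`, and `subscribe` names are hypothetical and are not Headroom's `on_pipeline_event` API, whose real signature is not shown here.

```python
from enum import Enum
from typing import Callable, Dict, List


class Stage(Enum):
    """Lifecycle stages, in the order the README lists them."""
    SETUP = "Setup"
    PRE_START = "Pre-Start"
    POST_START = "Post-Start"
    INPUT_RECEIVED = "Input Received"
    INPUT_CACHED = "Input Cached"
    INPUT_ROUTED = "Input Routed"
    INPUT_COMPRESSED = "Input Compressed"
    INPUT_REMEMBERED = "Input Remembered"
    PRE_SEND = "Pre-Send"
    POST_SEND = "Post-Send"
    RESPONSE_RECEIVED = "Response Received"


# An observer receives the current stage and a mutable per-request context.
Observer = Callable[[Stage, Dict], None]


class Lifecycle:
    """Hypothetical dispatcher: walks the stages and notifies observers."""

    def __init__(self) -> None:
        self._observers: List[Observer] = []

    def subscribe(self, observer: Observer) -> None:
        self._observers.append(observer)

    def run(self, context: Dict) -> Dict:
        # Enum members iterate in definition order, which preserves the sequence.
        for stage in Stage:
            for observer in self._observers:
                observer(stage, context)
        return context
```

An extension that only cares about one stage can filter inside its observer, for example by returning early unless `stage is Stage.INPUT_COMPRESSED`.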
```bash
headroom update            # detects pip / pipx / uv tool and upgrades in place
headroom update --check    # report the latest release without upgrading
headroom update --pre      # include pre-releases
```

`headroom update` figures out how Headroom was installed (pip/venv, `pip --user`, pipx, uv tool) and runs the matching upgrade across macOS, Linux, and Windows. For git checkouts, editable installs, Docker images, and externally-managed system Pythons (PEP 668) it prints the correct manual step instead of guessing.

The proxy also shows a one-line "update available" notice on startup. It checks PyPI at most once a day, in the background, and never blocks. Opt out with `HEADROOM_UPDATE_CHECK=off` (also skipped in `--stateless` mode and CI).

If `pip install "headroom-ai[all]"` fails with `CERTIFICATE_VERIFY_FAILED` (unable to get local issuer certificate), your network uses SSL inspection: a MITM proxy presenting a company-issued CA. The build backend (maturin) downloads rustup over a connection your TLS stack doesn't trust. Install Rust first so the build doesn't fetch it:

```bash
# macOS / Linux
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh && rustup default stable

# Windows
winget install Rustlang.Rustup && rustup default stable
```

Restart your shell, then `pip install "headroom-ai[all]"`. A prebuilt wheel avoids the Rust build entirely where available: `pip install --only-binary headroom-ai headroom-ai`.

Two runtime assets are fetched over TLS; if they are blocked, trust your corporate CA via `REQUESTS_CA_BUNDLE` / `SSL_CERT_FILE` / `CURL_CA_BUNDLE`:

- `cdn.pyke.io`: the ONNX Runtime for the Rust core. Alternatively pre-provide it with `ORT_STRATEGY=system` and `ORT_LIB_LOCATION=/path/to/onnxruntime`.
- `huggingface.co`: the kompress-base compression model. Pre-download it and run with `HF_HUB_OFFLINE=1`, or set `HF_ENDPOINT` to a trusted mirror.

Running with compression disabled (pure gateway) requires neither asset.

`headroom learn`: mines failed sessions, writes corrections to `CLAUDE.md` / `AGENTS.md` / `GEMINI.md`.
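For the SSL-inspection case described above, one way to apply the CA advice is to launch Headroom from a small script that points all three CA variables at the same bundle. This is a sketch under assumptions: the helper name and the bundle path in the usage note are hypothetical, and it relies only on the standard meaning of the three environment variables, not on any Headroom-specific behavior.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# The three variables named in the troubleshooting notes above.
CA_ENV_VARS = ("REQUESTS_CA_BUNDLE", "SSL_CERT_FILE", "CURL_CA_BUNDLE")


def env_with_corporate_ca(ca_bundle: str, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with every CA variable set to one PEM bundle."""
    path = Path(ca_bundle)
    if not path.is_file():
        # Fail early: a wrong path would otherwise surface as an opaque TLS error.
        raise FileNotFoundError(f"CA bundle not found: {ca_bundle}")
    env = dict(os.environ if base_env is None else base_env)
    for name in CA_ENV_VARS:
        env[name] = str(path)
    return env
```

The result can be passed to a child process, for example `subprocess.run(["headroom", "proxy", "--port", "8787"], env=env_with_corporate_ca("/etc/ssl/certs/corp-ca.pem"))`, where the bundle path is a placeholder for your organization's certificate.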
Start here, then go deeper:

- [Architecture](https://headroom-docs.vercel.app/docs/architecture)
- [Proxy](https://headroom-docs.vercel.app/docs/proxy)
- [How compression works](https://headroom-docs.vercel.app/docs/how-compression-works)
- [MCP tools](https://headroom-docs.vercel.app/docs/mcp)
- [CCR: reversible compression](https://headroom-docs.vercel.app/docs/ccr)
- [Memory](https://headroom-docs.vercel.app/docs/memory)
- [Cache optimization](https://headroom-docs.vercel.app/docs/cache-optimization)
- [Failure learning](https://headroom-docs.vercel.app/docs/failure-learning)
- [Benchmarks](https://headroom-docs.vercel.app/docs/benchmarks)
- [Configuration](https://headroom-docs.vercel.app/docs/configuration)
- [Limitations](https://headroom-docs.vercel.app/docs/limitations)

Headroom runs locally, covers every content type, works with every major framework, and is reversible.

| | Scope | Deploy | Local | Reversible |
|---|---|---|---|---|
| Headroom | All context: tools, RAG, logs, files, history | Proxy · library · middleware · MCP | Yes | Yes |

Related tools: [lean-ctx](https://github.com/yvgude/lean-ctx), [Compresr](https://compresr.ai), [Token Co.](https://thetokencompany.ai)

**Attribution.** Headroom ships with the excellent RTK binary for shell-output rewriting (`git show --short`, scoped `ls`, summarized installers). Huge thanks to the RTK team; their tool is a first-class part of our stack, and Headroom compresses everything downstream of it. Headroom can also use `lean-ctx` as the selected CLI context tool; set `HEADROOM_CONTEXT_TOOL=lean-ctx` before running `headroom wrap ...`.

```bash
git clone https://github.com/chopratejas/headroom.git && cd headroom
uv sync --extra dev && uv run pytest
```

Devcontainers in `.devcontainer/` (default + `memory-stack` with Qdrant & Neo4j). See [CONTRIBUTING.md](/headroomlabs-ai/headroom/blob/main/CONTRIBUTING.md).

- [Discord](https://discord.gg/yRmaUNpsPJ): questions, feedback, war stories.
- [Kompress-v2-base on HuggingFace](https://huggingface.co/chopratejas/kompress-v2-base): the model behind our text compression.

Apache 2.0. See [LICENSE](/headroomlabs-ai/headroom/blob/main/LICENSE).