{"slug": "i-built-a-prompt-compressor-that-saves-65-on-llm-costs-here-s-the-story", "title": "I Built a Prompt Compressor That Saves 65% on LLM Costs — Here's the Story", "summary": "Developer Arjun Shah built SuperCompress, an intelligent prompt compression system for LLMs that saves 65% on token costs while achieving 100% oracle recall, outperforming standard truncation. The system uses a tiny CPU model to score context lines for relevance before GPU processing, potentially saving 24K GPU hours and 1,526 tons of CO₂ daily at industry scale. SuperCompress is available on PyPI and GitHub.", "body_md": "I've been working on a side project called **SuperCompress** — an intelligent prompt compression system for LLMs. The idea is simple: most tokens you send to an LLM never need to be processed. They're padding, boilerplate, irrelevant context. But they still burn GPU cycles.\n\nI wanted to fix that.\n\nWorking with LLM agents, I noticed something: every agent loop was sending massive context through the GPU. 10K tokens. 50K tokens. Sometimes more. Most of it was irrelevant to the specific task.\n\nTruncation (keeping head + tail) was the standard approach, but it regularly dropped critical information from the middle of the context.\n\nI thought: what if we could score each line of context for relevance BEFORE sending it to the GPU? A tiny CPU model that decides what matters.\n\nThe technical challenge was:\n\nAfter a lot of iteration, the results surprised even me:\n\n| Policy | KV Saved | Oracle Recall |\n|---|---|---|\n| Truncation | 65% | 25% |\n| H2O | 65% | 98% |\n| SuperCompress | 65% | 100% |\n\n100% oracle recall at the same token savings. The policy never dropped a line the answer depended on.\n\nHere's what hit me hardest: at 50M agent turns per day (a conservative estimate for the industry), we're wasting 100B tokens daily. That's 24K GPU hours, 1,526 tons of CO₂, 6.5M liters of cooling water. Every day.\n\nPer 1 million compressions, SuperCompress saves:\n\nIt's tiny per call. It's enormous at scale.\n\nCurrently looking for:\n\nLive demo: [https://supercompress.vercel.app](https://supercompress.vercel.app)\n\nGitHub: [https://github.com/arjunkshah/supercompress](https://github.com/arjunkshah/supercompress)\n\nDocs: [https://arjunkshah-supercompress-55.mintlify.app](https://arjunkshah-supercompress-55.mintlify.app)\n\n**The ask:** If you're building with LLMs, try compressing your next prompt. See if the answers stay the same. I'd love to hear what you think.\n\n**Now available on PyPI!** `pip install supercompress`", "url": "https://wpnews.pro/news/i-built-a-prompt-compressor-that-saves-65-on-llm-costs-here-s-the-story", "canonical_source": "https://dev.to/arjunkshah/i-built-a-prompt-compressor-that-saves-65-on-llm-costs-heres-the-story-2bdp", "published_at": "2026-06-26 19:45:49+00:00", "updated_at": "2026-06-26 20:03:54.746803+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools"], "entities": ["Arjun Shah", "SuperCompress", "PyPI", "GitHub"], "alternates": {"html": "https://wpnews.pro/news/i-built-a-prompt-compressor-that-saves-65-on-llm-costs-here-s-the-story", "markdown": "https://wpnews.pro/news/i-built-a-prompt-compressor-that-saves-65-on-llm-costs-here-s-the-story.md", "text": "https://wpnews.pro/news/i-built-a-prompt-compressor-that-saves-65-on-llm-costs-here-s-the-story.txt", "jsonld": "https://wpnews.pro/news/i-built-a-prompt-compressor-that-saves-65-on-llm-costs-here-s-the-story.jsonld"}}