Every LLM call burns GPU cycles on tokens that never needed to run.
Padding. Boilerplate. Irrelevant context.
I built SuperCompress, a tiny CPU policy that cuts 65% of tokens before inference.
Open source. MIT. Free tier.
supercompress.vercel.app
The problem is worse than most people realize.
At ~50M agent turns/day:
→ 100B tokens wasted daily
→ 24K GPU hours
→ 1,526 tons CO₂
→ 6.5M L cooling water
We're burning through resources on tokens that don't matter.
How it works:
1️⃣ Context + question → CPU policy (5K params)
2️⃣ Every line scored for relevance to the question
3️⃣ Low-scoring lines evicted
4️⃣ Only essential tokens reach the GPU
CPU first. GPU for what matters.
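To make the four steps above concrete, here is a minimal sketch of score-and-evict. It is not SuperCompress's actual policy: the learned 5K-param scorer is swapped for a simple word-overlap score, and the names `tokenize`, `score_line`, and `compress` are made up for illustration.

```python
import re


def tokenize(text):
    """Lowercase word/number tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_line(line, question_terms):
    """Fraction of the line's tokens that also appear in the question.

    Assumption: a real policy would use a learned scorer here.
    """
    terms = tokenize(line)
    if not terms:
        return 0.0
    return sum(1 for t in terms if t in question_terms) / len(terms)


def compress(context, question, budget=0.35):
    """Keep the highest-scoring lines that fit in `budget` of the tokens."""
    lines = context.splitlines()
    question_terms = set(tokenize(question))
    costs = [len(tokenize(line)) for line in lines]
    limit = budget * sum(costs)

    # Rank line indices by relevance, best first.
    ranked = sorted(
        range(len(lines)),
        key=lambda i: score_line(lines[i], question_terms),
        reverse=True,
    )

    # Greedily keep lines while they fit; everything else is evicted.
    kept, used = set(), 0
    for i in ranked:
        if used + costs[i] <= limit:
            kept.add(i)
            used += costs[i]

    # Emit survivors in their original order so the context still reads naturally.
    return "\n".join(lines[i] for i in sorted(kept))
```

Everything here runs on the CPU, and only the string returned by `compress` would be sent on to the model.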
The numbers at 35% budget:
• 65% KV cache saved
• 100% oracle recall (vs 25% for truncation)
• ~60ms CPU latency
Same answers. Roughly a third of the compute.
Per 1 million compressions:
→ 800M tokens avoided
→ 29 kWh saved
→ 12 kg CO₂ avoided
→ 52 L cooling water saved
Scale that across the industry and it's enormous.
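For anyone who wants to plug in their own volume, the per-million figures above scale linearly. A small sketch that uses only the numbers quoted in this thread; the function names are illustrative, and linear scaling is an assumption.

```python
# Figures quoted above, per 1 million compressions.
COMPRESSIONS = 1_000_000
TOTALS = {
    "tokens": 800_000_000,  # tokens avoided
    "kwh": 29,              # energy saved
    "co2_kg": 12,           # CO2 avoided
    "water_l": 52,          # cooling water saved
}


def per_compression():
    """Average savings for a single compression."""
    return {name: value / COMPRESSIONS for name, value in TOTALS.items()}


def scale(n_compressions):
    """Savings for any number of compressions, assuming linear scaling."""
    return {name: value * n_compressions / COMPRESSIONS for name, value in TOTALS.items()}


if __name__ == "__main__":
    print(per_compression())
    print(scale(10_000_000))
```

Per compression that works out to about 800 tokens, 0.029 Wh, 0.012 g of CO₂, and 0.052 mL of water.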
SuperCompress is:
✅ Open source (MIT)
✅ Free API tier
✅ Python library
✅ Browser demo (no install)
✅ Integration guides for OpenAI/LangChain
Try it: supercompress.vercel.app
GitHub: github.com/arjunkshah/supercompress
Built this because I believe we can't scale AI by burning through what we have left.
Smarter compute means more AI for everyone, without the environmental cost.
Would love feedback from the community.