{"slug": "show-hn-kv-psi-using-linux-psi-to-to-trim-an-llm-kv-cache", "title": "Show HN: KV-psi, using Linux PSI to trim an LLM KV cache", "summary": "A developer released KV-psi, a reference implementation that uses Linux Pressure Stall Information (PSI) to trim an LLM KV cache under memory pressure. In benchmarks on an NVIDIA Jetson, PSI-based trimming reduced the final KV cache size by up to 35% (from 1547 to 1004) relative to a fixed cache policy, with throughput within about 7% of the fixed policy in either run order.", "body_md": "PSI KV Governor is a small reference implementation for using Linux Pressure Stall Information to trim an LLM KV cache when the system is under memory pressure.\n\nRequirements:\n\n- Linux with PSI enabled: cgroup `memory.pressure` or `/proc/pressure/memory`\n- Python 3.10+\n- llama.cpp build dependencies for the runner\n- a GGUF model, for example `models/SmolLM2-135M-Instruct-Q2_K.gguf`\n\nCheck PSI:\n\n```\ncat /proc/pressure/memory\nPYTHONPATH=src python benchmarks/pressure_bench.py --preflight-only\n```\n\nRun the reference simulator:\n\n```\nPYTHONPATH=src python -m psi_kv_governor.cli simulate\n```\n\nBuild the llama.cpp runner:\n\n```\nscripts/build_llama_runner.sh\n```\n\nDownload the small benchmark model if needed:\n\n```\npython scripts/download_demo_model.py\n```\n\nRun both variant orders. 
This matters because PSI `avg10`, cache, and zram/swap state can carry over from the first pressure run into the second.\n\n```\nPYTHONPATH=src python benchmarks/pressure_bench.py \\\n  -c 2048 \\\n  -n 1536 \\\n  --keep 64 \\\n  --tail 256 \\\n  --min-prune 64 \\\n  --pressure-mib 6000 \\\n  --pressure-step-mib 1024 \\\n  --pressure-warmup-s 10 \\\n  --variant-cooldown-s 45 \\\n  --out-dir data/bench-pressure/fixed-first\n\nPYTHONPATH=src python benchmarks/pressure_bench.py \\\n  --variant-order psi-first \\\n  -c 2048 \\\n  -n 1536 \\\n  --keep 64 \\\n  --tail 256 \\\n  --min-prune 64 \\\n  --pressure-mib 6000 \\\n  --pressure-step-mib 1024 \\\n  --pressure-warmup-s 10 \\\n  --variant-cooldown-s 45 \\\n  --out-dir data/bench-pressure/psi-first\n```\n\nRecent Jetson result:\n\n| run | variant | decoded | tok/s | prunes | final KV | external PSI some/full |\n|---|---|---|---|---|---|---|\n| fixed-first | fixed | 1536 | 94.00 | 0 | 1547 | 1.61/1.61 |\n| fixed-first | PSI | 1536 | 88.80 | 4 | 1291 | 4.14/3.94 |\n| psi-first | PSI | 1536 | 96.16 | 2 | 1004 | 2.46/2.33 |\n| psi-first | fixed | 1536 | 89.76 | 0 | 1547 | 5.56/5.56 |\n\nResult directories:\n\n`data/bench-pressure/real-psi-6000m-1536tok-cooldown`\n\n`data/bench-pressure/real-psi-6000m-1536tok-cooldown-psi-first`", "url": "https://wpnews.pro/news/show-hn-kv-psi-using-linux-psi-to-to-trim-an-llm-kv-cache", "canonical_source": "https://github.com/infiniteregrets/kv-psi", "published_at": "2026-06-27 22:50:54+00:00", "updated_at": "2026-06-27 23:04:56.884704+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "ai-tools"], "entities": ["KV-psi", "PSI KV Governor", "Linux", "llama.cpp", "SmolLM2-135M-Instruct-Q2_K.gguf", "NVIDIA Jetson"], "alternates": {"html": "https://wpnews.pro/news/show-hn-kv-psi-using-linux-psi-to-to-trim-an-llm-kv-cache", "markdown": "https://wpnews.pro/news/show-hn-kv-psi-using-linux-psi-to-to-trim-an-llm-kv-cache.md", "text": 
"https://wpnews.pro/news/show-hn-kv-psi-using-linux-psi-to-to-trim-an-llm-kv-cache.txt", "jsonld": "https://wpnews.pro/news/show-hn-kv-psi-using-linux-psi-to-to-trim-an-llm-kv-cache.jsonld"}}