[ARTICLE · art-28751] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=↑ positive

Prompt caching vs the long LLM conversation: where your input bill actually hides

A developer built PromptCrunch, a drop-in proxy that reduces input token costs in long multi-turn LLM conversations by deduplicating code, compacting stale tool output, and summarizing old turns. In tests with Claude Code, input tokens dropped about 75% when caching was not covering the request and 7-10% when it was. The tool is designed for any long multi-turn application, including agents and chatbots.

2 min read · 1 view · Published Jun 15, 2026

I kept watching my Claude Code bill climb through long sessions, and most of it was not new work. It was the same conversation getting re-sent every turn. A multi-turn call is stateless, so your client ships the whole history each time: file reads, tool output, old diffs, all of it, and you pay input tokens on that pile again and again.
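
To put a rough number on that pile, here is a minimal sketch of the arithmetic. It assumes every turn adds a fixed number of tokens to the history and that the whole history is sent as input on every request; the 50 turns and 2,000 tokens per turn are made-up figures, not measurements from my sessions.

```python
def cumulative_input_tokens(tokens_per_turn: list[int]) -> int:
    """Total input tokens billed when every request re-sends the full history."""
    total = 0
    history = 0
    for new_tokens in tokens_per_turn:
        history += new_tokens  # the history grows by this turn's content
        total += history       # the whole history goes out as input again
    return total


# Hypothetical session: 50 turns, each adding 2,000 tokens of new content.
turns = [2_000] * 50
resent = cumulative_input_tokens(turns)
unique = sum(turns)
print(f"unique content: {unique} tokens")
print(f"billed as input: {resent} tokens ({resent / unique:.1f}x)")
```

The billed total grows with the square of the turn count while the unique content grows linearly, which is why the bill tracks turns rather than work.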

So I built PromptCrunch. It is a drop-in proxy that optimizes the conversation before it reaches the model. Under the hood, it deduplicates superseded code, compacts stale tool output, and summarizes old turns, reusing those summaries across requests. It preserves your recent turns and structured data verbatim, and it rewrites a request only when the result is cheaper to send. Setup is two lines: swap your base_url, add one header. Your provider key still goes straight to your provider.
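
As a toy illustration of two of those ideas, compacting stale tool output and rewriting only when it pays off, here is a minimal sketch. It is not PromptCrunch's implementation: the message shape, the `KEEP_RECENT` and `MAX_STALE_CHARS` thresholds, the truncation marker, and the four-characters-per-token estimate are all assumptions made for the example.

```python
from typing import TypedDict


class Message(TypedDict):
    role: str
    content: str


KEEP_RECENT = 4        # hypothetical: how many trailing messages stay verbatim
MAX_STALE_CHARS = 200  # hypothetical: how much of an old tool output to keep
MARKER = " [truncated stale tool output]"


def estimate_tokens(messages: list[Message]) -> int:
    """Crude size estimate: roughly four characters per token (an assumption)."""
    return sum(len(m["content"]) // 4 for m in messages)


def compact(messages: list[Message]) -> list[Message]:
    """Truncate long tool outputs that fall outside the recent window."""
    cutoff = max(len(messages) - KEEP_RECENT, 0)
    out: list[Message] = []
    for i, m in enumerate(messages):
        stale = i < cutoff and m["role"] == "tool"
        if stale and len(m["content"]) > MAX_STALE_CHARS:
            out.append({"role": "tool",
                        "content": m["content"][:MAX_STALE_CHARS] + MARKER})
        else:
            out.append(m)  # recent turns and non-tool messages pass through
    return out


def maybe_rewrite(messages: list[Message]) -> list[Message]:
    """Return the compacted history only if it is estimated to be smaller."""
    candidate = compact(messages)
    if estimate_tokens(candidate) < estimate_tokens(messages):
        return candidate
    return messages
```

The guard in `maybe_rewrite` matters: truncating a tool output that is only slightly over the threshold can make the message longer once the marker is appended, and in that case the original request goes through untouched.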

It is not just a Claude Code thing. The same re-sent history piles up in any long multi-turn app: agents, customer chatbots, conversational products. Claude Code is just where I run into it most.

Caching discounts the repeated prefix, and inside a hot window it does most of the work. But a cached prefix expires after about 5 minutes without a hit, and caching only ever covers the prefix. Real sessions are bursty: you work, you read, you step away, you come back. The cache goes cold in the gaps, and as the session grows the history runs past what caching holds.
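
A small sketch makes the gaps visible. It assumes a 300-second cache lifetime that is refreshed by every request, and it labels each request in a session as arriving on a hot or cold cache based only on the time since the previous request. The timestamps are invented, and the model ignores the second limit above, that caching only covers the prefix.

```python
CACHE_TTL_SECONDS = 300.0  # assumed lifetime, matching the ~5 minutes above


def classify_requests(timestamps: list[float],
                      ttl: float = CACHE_TTL_SECONDS) -> list[str]:
    """Label each request 'hot' or 'cold' by the gap since the previous one."""
    labels: list[str] = []
    previous = None
    for t in timestamps:
        if previous is not None and t - previous <= ttl:
            labels.append("hot")   # the cached prefix is still alive
        else:
            labels.append("cold")  # first request, or the cache expired
        previous = t               # assume each request refreshes the lifetime
    return labels


# Hypothetical bursty session, in seconds: work, a 13-minute break, work, a long break.
session = [0, 60, 120, 900, 930, 2000]
for t, label in zip(session, classify_requests(session)):
    print(f"t={t:>5}s  {label}")
```

Every request that comes back after a break is billed against a cold cache, and in a bursty session those are often the requests carrying the longest history.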

That cold-cache window is where PromptCrunch pays for itself. On my own Claude Code runs, input tokens dropped about 75% when caching was not covering the request, and 7 to 10% when it was. The two stack: caching takes the hot window, PromptCrunch takes the long session.
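
How much that adds up to over a whole session depends on how much of your traffic lands on a cold cache. A rough way to estimate it, using the figures above: the sketch below blends the two reductions by the share of input tokens sent on uncovered requests. The `cold_share` parameter is hypothetical, the 8.5% default is simply the midpoint of the 7 to 10% range, and the result is a reduction in input tokens, not in dollars, since cached tokens are priced differently.

```python
def blended_token_reduction(cold_share: float,
                            cold_reduction: float = 0.75,
                            hot_reduction: float = 0.085) -> float:
    """Token-weighted average reduction across cold and hot requests.

    cold_share is the fraction of original input tokens sent on requests
    that caching did not cover (a hypothetical input you would measure).
    """
    if not 0.0 <= cold_share <= 1.0:
        raise ValueError("cold_share must be between 0 and 1")
    return cold_share * cold_reduction + (1.0 - cold_share) * hot_reduction


for share in (0.1, 0.4, 0.8):
    reduction = blended_token_reduction(share)
    print(f"cold share {share:.0%}: about {reduction:.1%} fewer input tokens")
```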

On a long session you stop re-paying for history the model already processed, so the bill starts tracking the work instead of the turn count. You only pay on requests that came out smaller, so trying it costs you nothing. Your keys are never stored, and a zero-retention mode means we hold nothing at all.

It earns the most on long, multi-turn work, and not much on short prompts or one-shot calls, so point it where the sessions run long.

Point it at one real session and watch the per-request savings on your dashboard. You start with $5 of free credit, no card needed. Try it here.
