Prompt caching vs the long LLM conversation: where your input bill actually hides

wpnews.pro

cd /news/large-language-models/prompt-caching-vs-the-long-llm-conve… · home › topics › large-language-models › article

[ARTICLE · art-28751] src=dev.to ↗ pub=2026-06-15T23:48Z topic=large-language-models verified=true sentiment=↑ positive

Prompt caching vs the long LLM conversation: where your input bill actually hides

A developer built PromptCrunch, a drop-in proxy that reduces input token costs in long multi-turn LLM conversations by deduplicating code, compacting stale tool output, and summarizing old turns. In tests with Claude Code, input tokens dropped about 75% when caching was not covering the request and 7-10% when it was. The tool is designed for any long multi-turn application, including agents and chatbots.

read2 min views21 publishedJun 15, 2026

I kept watching my Claude Code bill climb through long sessions, and most of it was not new work. It was the same conversation getting re-sent every turn. A multi-turn call is stateless, so your client ships the whole history each time: file reads, tool output, old diffs, all of it, and you pay input tokens on that pile again and again.

So I built PromptCrunch. It is a drop-in proxy that optimizes the conversation before it reaches the model. Under the hood: it deduplicates superseded code, compacts stale tool output, summarizes old turns and reuses those summaries across requests, and preserves your recent turns and structured data verbatim, rewriting a request only when it nets out cheaper to send. Setup is two lines: swap your base_url, add one header. Your provider key still goes straight to your provider.

It is not just a Claude Code thing. The same re-sent history piles up in any long multi-turn app: agents, customer chatbots, conversational products. Claude Code is just where I run into it most.

Caching discounts the repeated prefix, and inside a hot window it does most of the work. But it expires after about 5 minutes and only covers the prefix. Real sessions are bursty. You work, you read, you step away, you come back. The cache goes cold in the gaps, and as the session grows the history runs past what caching holds.

That cold-cache window is where PromptCrunch pays for itself. On my own Claude Code runs, input tokens dropped about 75% when caching was not covering the request, and 7 to 10% when it was. The two stack: caching takes the hot window, PromptCrunch takes the long session.

On a long session you stop re-paying for history the model already processed, so the bill starts tracking the work instead of the turn count. You only pay on requests that came out smaller, so trying it costs you nothing. Your keys are never stored, and a zero-retention mode means we hold nothing at all.

It earns the most on long, multi-turn work, and not much on short prompts or one-shot calls, so point it where the sessions run long.

Point it at one real session and watch the per-request savings on your dashboard. You start with $5 of free credit, no card needed. Try it here.

source & further reading

dev.to — original article Hardening an AI coding agent: the failures, and the code that fixed them Gemini Robotics 2 Has Not Been Announced: What Google DeepMind Actually Offers A user spent four days designing a feature for my project. The right answer was zero lines of code.

~/api · this article 200

$curl api.wpnews.pro/v1/news/prompt-caching-vs-the-lo…

Read original on dev.to → dev.to/promptcrunch/prompt-caching-vs-the-long-l…

mentioned entities

PromptCrunch

Claude Code

Anthropic

metadata

slugprompt-caching-vs-the-long-llm-conversation-where-your-input-bill-actually-hides

topic#large-language-models

secondary2 topics

sentimentpositive

canonicaldev.to

navigation

← prevCommunication-Efficient Vertical…

next →Cybersecurity Leaders Ask US to …

── more in #large-language-models 4 stories · sorted by recency

mlq.ai · 31 Jul · #large-language-models

Banks Weigh $15 Billion Loan for Anthropic-Leased Texas Data Center

dev.to · 31 Jul · #large-language-models

How BrowserAct Fixed the Stale-Selector Failures Breaking My Browser Tasks

dev.to · 31 Jul · #large-language-models

The Human Is Not the Bottleneck. The Human Is the Missing Oracle

startupfortune.com · 31 Jul · #large-language-models

AMD surges 13% as Microsoft Azure's $100 billion milestone resets the AI spending debate

── more on @promptcrunch 3 stories trending now

wpnews · 30 Jul · #artificial-intelligence

Microsoft and Meta Earnings Show Different AI Spending Pressures

wpnews · 31 Jul · #artificial-intelligence

Rewriting a Six-Year-Old Personal Project with AI

wpnews · 31 Jul · #artificial-intelligence

Microsoft doubles down on multi-model AI as it builds a Copilot super app

sponsored brought to you by zahid.host 4,200+ EU-deployed projects

reading about agents? ship yours in a single git push.

Run your AI side-project on zahid.host

EU-based hosting, git-push deploys, automatic HTTPS, no cold starts. Free tier with a custom domain — perfect for shipping the agent you just read about.

$git push zahid main

→ Live at https://your-agent.zahid.host ✓

Get free account → Pricing

from €0/mo · no card required