[ARTICLE · art-40374] src=dev.to ↗ pub= topic=large-language-models verified=true sentiment=· neutral

Claude Code Costs, Act III — The ecosystem of options for spending less

A developer analyzed the open-source ecosystem for reducing LLM costs, focusing on three cost lines: cached input, uncached/written input, and output. The key insight is that tools must preserve the model-scoped prompt cache to achieve real savings, as rewriting the request prefix per turn forfeits the cache. Headroom, a compression library and proxy, was highlighted for its ability to shrink tool output upstream of the cache without breaking it, though independent benchmarks show modest savings of around 10% on-wire.

22 min read · published Jun 26, 2026

There is a whole open-source ecosystem aimed at cutting LLM cost. The trick to evaluating any of it is to ask which of the three cost lines it attacks — cached input (0.1×), uncached/written input (1×/2×), or output (the priciest, never cached) — and then whether it does so without forfeiting the model-scoped prompt cache.

That last clause is the recurring catch, and it's the throughline of this entire guide:

Anything that rewrites the request prefix per turn forfeits the model-scoped prompt cache. A tool nets real savings only if it (a) preserves the cached prefix byte-for-byte, or (b) attacks a cost line the cache doesn't cover, namely output.

We'll walk the ecosystem in cost-line order: input/cache compressors, then the output compressor, then the model routers, then what routing actually costs inside Claude Code.

Headroom (headroomlabs-ai/headroom, formerly chopratejas/headroom; Apache-2.0) is a local-first compression library + proxy + MCP for coding agents, and the most deeply engineered tool in this category. (It's also brand-new and fast-moving, created 2026-01 with dozens of commits a day, so treat specifics as a moving target.) Three things make it notable:

CacheAligner

Self-reported results: 47–92% token savings with accuracy held (GSM8K ±0, SQuAD 97% at 19% compression, BFCL tools 97% at 32%) [vendor self-reported]. Independent measurement is far lower: the one rigorous third-party benchmark found ~10% on-wire savings, matching Headroom's own published fleet median of 4.8% — an order of magnitude below the headline. [independent benchmark, 2026-06]

Headroom's content-type routing [med]:

| Content | Compressor | Savings | Preserves |
|---|---|---|---|
| JSON arrays | SmartCrusher | 70–90% | keys, UUIDs, booleans |
| Search results | SearchCompressor | 80–95% | matched lines |
| Build/test logs | LogCompressor | 85–95% | errors, stack traces |
| Source code | CodeCompressor (AST) | 40–70% | signatures, imports |
| Diffs / HTML / text | Diff/HTML/Text | 50–80% | high-entropy tokens |
| Images | ML router | 40–90% | |

How Headroom actually saves tokens — two surfaces. Read that table with one distinction or it misleads: those 70–95% ratios are achieved on tool output, not on the conversation the proxy forwards. Headroom operates on two surfaces, with opposite cache rules:

/v1/messages body Claude Code sends.

So the mental model is: Headroom shrinks tool output upstream of the cache (free and safe) and refuses to rewrite the conversation at the cache (which would be a cache-buster). The cache-killer checklist below is why that second surface has to be passthrough.

What "compose" means here. Compression and caching are two separate discounts on the token bill, and the question is whether they stack or cancel. Caching is a 90%-off coupon on bytes the server has already seen — a cached prefix is billed at 0.1× (a "cache read") instead of full price. Compression just means send fewer bytes. Whether they compose (both apply) or fight (one voids the other) depends entirely on which bytes you compress:

| | What you do | Cost of a 100K-token cached prefix |
|---|---|---|
| Fight | compress the whole prefix 100K → 20K, rewriting the cached bytes | those 20K are now new bytes → cache miss → billed as a 2× write ≈ 40K-equiv. You shrank the text but voided the coupon, and the bill went up (it was 10K-equiv as a read). |
| Compose | leave the cached span byte-identical, compress only the fresh tail | the 100K stays a 0.1× read ≈ 10K-equiv and the new tail is smaller: both discounts apply. |
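The two rows can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the multipliers quoted in this guide (0.1× for a cache read, 2× for a cache write); the function name and the "token-equivalent" unit are illustrative and not part of any tool.

```python
READ_MULT = 0.1   # cached prefix billed as a cache read
WRITE_MULT = 2.0  # new bytes billed as a cache write (the rate used in this guide)


def prefix_cost(tokens: int, cached: bool) -> float:
    """Token-equivalents billed for a prefix of `tokens` tokens."""
    return tokens * (READ_MULT if cached else WRITE_MULT)


# Fight: compress the 100K prefix to 20K, which rewrites the cached bytes.
fight = prefix_cost(20_000, cached=False)

# Compose: leave the 100K prefix byte-identical so it stays a cache read.
compose = prefix_cost(100_000, cached=True)

print(f"fight:   {fight:,.0f} token-equivalents")
print(f"compose: {compose:,.0f} token-equivalents")
```

The smaller prefix costs roughly four times more than the untouched one, which is the whole argument for compressing only the fresh tail.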

So "composing compression with the cache" means arranging things so compression shrinks tokens and the cache discount survives. It takes two separate things, and only one of them is the CacheAligner:

CacheAligner's entire job: it moves volatile content to the message tail. CacheAligner adds little behind CC. It earns its keep for

CacheAligner

actually ships CacheAligner, and behind Claude Code it's the half that actually matters: get it wrong and the compressor costs more than it saves, usually more than running no compressor at all, since a multi-turn CC session's cached prefix dominates the bill. It's exactly the bug list below.

The cache is a byte-exact longest-prefix match (Act I): the API extends a cached entry only while the new request's leading bytes are identical, and re-processes everything from the first differing byte onward. Headroom runs as a local proxy in that path, between Claude Code and the API, and the hazard of sitting there is that it can change those bytes without changing the meaning, which is enough to miss.

To see how, follow one request through it. There are three nodes:

[A] Claude Code  ──raw bytes──▶  [B] Headroom (proxy)  ──re-emitted bytes──▶  [C] Anthropic API
   (the client)                  (logs / compresses)                          (server + cache)

A: Claude Code builds the request. It serializes the whole request to JSON bytes (model, tools, system, messages), writing compact JSON (no space after : or ,) and raw UTF-8 for non-ASCII. A content field reaches Headroom as these literal bytes:

{"role":"user","content":"café"}

B: Headroom re-emits the request. To compress a tool output (or merely to log the body), the natural code parses the incoming bytes into an object, edits one field, and serializes the object back. But Headroom's encoder is not Claude Code's. Python's default json.dumps() re-emits the same object as:

{"role": "user", "content": "caf\u00e9"}

Two unrequested changes: a space after every : and ,, and é becomes the 6-character escape \u00e9. Same object, two valid byte encodings; Headroom just picked a different one than Claude Code did. Headroom calls this serialization drift, and its REALIGNMENT/01-bug-list.md is a 72-item P0–P6 audit of every way it occurs, each entry matching a real Anthropic prefix-cache rule.
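The drift in steps A and B is easy to reproduce with the standard library alone. The sketch below round-trips the example bytes through `json.loads` and `json.dumps`, once with defaults and once with compact separators and `ensure_ascii=False`. It only illustrates the encoding difference; matching on this one message does not prove a re-encode is safe in general, since key order and number formatting can still drift.

```python
import json

# The literal bytes from step A (compact separators, raw UTF-8).
original = '{"role":"user","content":"café"}'.encode("utf-8")
obj = json.loads(original)

# Default encoder: adds spaces after ':' and ',' and escapes non-ASCII.
default_bytes = json.dumps(obj).encode("utf-8")

# Compact encoder: same separators and escaping policy as the original.
compact_bytes = json.dumps(
    obj, separators=(",", ":"), ensure_ascii=False
).encode("utf-8")

print(default_bytes)               # b'{"role": "user", "content": "caf\\u00e9"}'
print(default_bytes == original)   # False
print(compact_bytes == original)   # True
```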

C: Anthropic looks up the cache. This is the step a proxy author gets wrong. The cache keys on the bytes the server receives (Headroom's output) and never sees Claude Code's originals at all. So each turn the server matches this request's prefix against the bytes Headroom itself sent, and the server cached, on the previous turn, not against anything Claude Code produced. If Headroom re-serializes byte-identically every turn, those match and you still hit; the cache simply keys on Headroom's encoding instead of Claude Code's. The bill blows up only when Headroom's output is not byte-stable from one turn to the next, and naive re-serialization makes that the default: json.dumps() without sort_keys, set/dict iteration order, drifting float formatting, or byte-copying some messages while re-serializing others. Any of those changes a byte near the top of the request, which re-keys the whole prefix after it → a cold write at 2× instead of a cache_read at 0.1×. (You also pay a one-time cold rebuild the moment Headroom is dropped in front of a cache Claude Code had already warmed with compact bytes: its spaced output matches none of those entries.)

What that costs:

cache_read stuck near 0 and cache_creation high every turn. You find it in the bill, not a stack trace.

Each row below is another way Headroom's output stops being byte-stable across turns, and each is a real entry in that bug list, so you can look it up. The Headroom bug column gives the ID (P0 = "cache-killer smoking guns — every customer affected"):

| Drift cause | What the proxy did | What changed in the bytes | Headroom bug |
|---|---|---|---|
| Separator / escaping drift | re-encoded with default json.dumps() | "," → ", ", ":" → ": ", raw UTF-8 → \uXXXX | P0-2 (also P0-1: system prompt .strip() + memory append) |
| Key-order drift | json.dumps() without sort_keys=True, or iterated a set/dict | object keys come out in a different order request-to-request | P3-28 (tool array), P3-29 (schema keys) |
| Numeric-precision drift | round-tripped a number through a generic JSON value | 1.0 → 1; integers above 2⁵³ lose low digits | P0-5 (also P4-46: missing serde_json arbitrary_precision/raw_value) |
| Re-emitting unchanged history | re-serialized retained messages during compression instead of byte-copying them | even "untouched" history shifts byte-for-byte | P1-13 (also P0-3 ignored cache_control; P0-4 ICM dropped hot-zone messages) |

All four are the silent invalidators from Act I, now seen from the proxy author's chair. They collapse to one rule: forward the original request bytes verbatim. If you must mutate, do surgical byte-fragment replacement on the messages tail only (never parse-and-re-emit the whole envelope) and confirm cache_read is unchanged on the next request. A compressor that ignores this loses far more to cold rewrites than it ever saves in tokens.
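One way to picture "surgical byte-fragment replacement" is a function that never decodes the body: it swaps one byte fragment and proves the leading bytes are untouched. This is a sketch under stated assumptions, not Headroom's implementation. `replace_in_tail` and `tail_start` are hypothetical names, locating the tail offset is left to the caller, and the replacement fragment must already be valid, correctly escaped JSON content.

```python
def replace_in_tail(raw: bytes, old: bytes, new: bytes, tail_start: int) -> bytes:
    """Replace the last occurrence of `old`, but only if it starts at or
    after `tail_start`. Bytes before `tail_start` are never touched."""
    idx = raw.rfind(old)
    if idx < tail_start:  # not found (-1) or inside the protected prefix
        return raw
    patched = raw[:idx] + new + raw[idx + len(old):]
    # The protected prefix must survive byte-for-byte.
    assert patched[:tail_start] == raw[:tail_start]
    return patched


raw = b'{"system":"LOG","messages":[{"content":"LOG LOG LOG"}]}'
tail_start = raw.index(b'"messages"')  # assumption: caller knows where the tail begins

shrunk = replace_in_tail(raw, b"LOG LOG LOG", b"LOG x3", tail_start)
untouched = replace_in_tail(raw, b'"system"', b'"sys"', tail_start)

print(shrunk)            # tail fragment replaced, prefix identical
print(untouched == raw)  # True: edits inside the prefix are refused
```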

⚠️ Mistake: adding a "compression" or "logging" proxy that re-serializes JSON. A re-encode stays cache-safe only if it reproduces the exact same bytes on every turn, and naive code never does (separators, escaping, key order, float formatting, or mixing byte-copied and re-serialized messages all drift). In practice the hit rate falls to ~0 for all proxied traffic. This is the most expensive proxy mistake there is.

✅ Fix: Forward the original request bytes verbatim. Mutate only by surgical byte-fragment replacement on the messages tail, never by parse-and-re-emit of the whole envelope. Verify with cache_read after the proxy is inserted.
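The verification step can be automated by watching the usage counters on each response. The sketch below assumes the counters arrive as `cache_read_input_tokens` and `cache_creation_input_tokens` in the response's `usage` object; the helper names and the 0.5 threshold are arbitrary choices for illustration. The healthy sample uses turns 1 and 2 of Experiment B later in this guide.

```python
def cache_hit_ratio(usage: dict) -> float:
    """Share of cached-path tokens that were reads rather than writes."""
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    total = read + created
    return read / total if total else 0.0


def looks_busted(usages: list[dict], threshold: float = 0.5) -> bool:
    """True if every turn after the first is mostly cache writes."""
    later = usages[1:]
    return bool(later) and all(cache_hit_ratio(u) < threshold for u in later)


healthy = [
    {"cache_read_input_tokens": 29_874, "cache_creation_input_tokens": 11_301},
    {"cache_read_input_tokens": 41_175, "cache_creation_input_tokens": 1_544},
]
# What a drifting proxy looks like: reads near 0, the whole prefix re-written.
busted = [
    {"cache_read_input_tokens": 0, "cache_creation_input_tokens": 41_175},
    {"cache_read_input_tokens": 0, "cache_creation_input_tokens": 42_719},
]

print(looks_busted(healthy))  # False
print(looks_busted(busted))   # True
```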

What's off on the Claude Code path, and why it matters for your bill. Headroom is the most-engineered compressor in this category, and its own history is the cleanest proof of the rule above. In PR #349 (2026-05) the maintainers made the Anthropic /v1/messages endpoint a byte-faithful passthrough, disabling conversation compression on the Claude Code path entirely, to eliminate four cache-killer bugs. The CacheAligner is off in the same spirit: it ships disabled and is slated for removal, because its rewrite was found to mutate the very cache hot zone it was meant to protect. Two consequences for your bill: (1) pointing ANTHROPIC_BASE_URL at Headroom does almost nothing to a Claude Code conversation by itself, because the proxy forwards it verbatim, so savings materialize only if you also route verbose tool output through Headroom's MCP server / hooks (the compression surface above); and (2) the things that

Two more ways a proxy in front of Claude Code quietly costs you, independent of compression, just from pointing ANTHROPIC_BASE_URL at a custom host:

ENABLE_TOOL_SEARCH=true (or headroom wrap claude, which sets it). [Headroom troubleshooting docs + issue #753]

context-1m-2025-08-07 beta and [1m] model suffix (e.g. claude-opus-4-8[1m]). [issue #1158]

Same cost line as Headroom (input / cache-write), shipped as point tools:

LLMLingua (microsoft/LLMLingua): prompt compression up to 20× at ~1.5pt GSM8K loss.

RTK (rtk-ai/rtk): rewrites verbose command output (git show, ls, installers) into terse forms; the productized version of the preprocessing-hook technique. Good when tool outputs are the bloat.

lean-ctx (yvgude/lean-ctx): a CLI/MCP context trimmer.

Hosted & provider-native (text leaves your box; non-reversible):

⚠️ Mistake: running an aggressive input compressor over a cached conversational prefix. Past ~20× compression accuracy degrades, and any rewrite of already-cached bytes flips reads to writes. You can spend more on rebuilds than you save on tokens.

✅ Fix: Use input compressors where there is no cache to lose (one-shot calls, RAG context assembly) or only on the fresh tail. On a cached prefix, prefer a cache-aware tool (Headroom's CacheAligner).

Output is the most expensive class and is never cached at generation, so compressing it never busts the prefix cache — and it actually helps it: because each response is cache-written as input on the next turn (see What the cache stores), shrinking the output also shrinks every downstream cache-write and cache-read it would have become. So the win compounds — 5× saved on the output now, plus smaller writes/reads on every later turn — with no risk to the cached prefix. This makes it, alongside sticky routing, one of the two safest levers in the entire ecosystem.

Caveman (JuliusBrussee/caveman, dlepold/caveman-distillate, and forks) is a Claude Code skill that instructs the model to answer in telegraphic, low-grammar prose ("why use many token when few token do trick"): short sentences, no filler, dropped articles. Community-reported ~65% output-token reduction, corroborated across multiple independent repos (repo-benchmarked at ~65%, range 22–87% across 10 prompts). Best for internal/agent-facing output where terseness is fine; the trade-off is human readability. [community-reported]

Caveman's savings come from prompt injection — the same context-shaping that governs caching — and its central engineering problem (keeping an instruction alive as context compression prunes it) is the flip side of the prefix-stability story from Acts I–II. Because the tool is a master class in how a Claude Code plugin actually works, the full anatomy is worth studying.

Caveman is not a model or a service — it's a prompt-injection + distribution system. One idea: inject a compression ruleset into a coding agent's context so it writes terse "caveman-style" prose (drop articles, filler, hedging; keep every piece of technical substance), cutting ~65% of output tokens at full accuracy.

The repo is two codebases:

The npm package is caveman-installer; the shipped artifact is the installer, not a model. Maintainers edit only the source-of-truth files (skills/*/SKILL.md, agents/cavecrew-*.md, src/rules/*.md); CI mirrors them into the plugin distribution and rebuilds the release ZIP.

skills/caveman/SKILL.md is the single source of truth for behavior:

lite, full (default), ultra, plus three classical-Chinese wenyan-* variants, each with worked examples.

cavecrew is a separate play: three subagent presets (cavecrew-investigator/builder/reviewer) that emit caveman output. The win here is main-context longevity, not output cost: a subagent's result is injected verbatim into the main thread, so a 2k-token prose result costs 2k of main-context budget; the caveman version returns ~700. Each has a strict, greppable output contract (e.g. investigator: path:line — symbol — note).

Claude Code spawns a fresh process per hook event and keeps no plugin code resident, so the hooks share no memory. They coordinate through one ~12-byte file, $CLAUDE_CONFIG_DIR/.caveman-active (contents = the active mode string):

 SessionStart            UserPromptSubmit          Statusline
 caveman-activate.js     caveman-mode-tracker.js   caveman-statusline.sh/.ps1
       │                        │                         │
       └──── writes mode ───►  .caveman-active  ◄── reads ┘ (read-only)

caveman-activate.js

SKILL.md at runtime and filters the 6-level table down to the active level only (saves injected tokens, avoids confusing the model with inactive levels); and appends a statusline nudge if none is configured (how a plugin, which can't write settings.json, bootstraps its badge).

caveman-mode-tracker.js

/caveman-stats interception (blocks the prompt and returns stats via {decision:"block"}, a zero-token slash command); slash-command parsing; NL deactivation (unlink the flag); and

caveman-statusline.sh/.ps1

[a-z0-9-], whitelist-matches a known mode, and renders [CAVEMAN:MODE] plus a savings suffix.

The security spine: caveman-config.js. All hook filesystem I/O funnels through one module because the flag path is predictable and user-writable, which is the canonical local-attacker setup.

safeWriteFlag uses O_NOFOLLOW + atomic temp-write + rename + 0600, and refuses if the flag itself is a symlink (the clobber vector) while still tolerating a legitimately symlinked parent dir (resolve, then verify uid ownership). readFlag is symmetric (symlink-refuse, 64-byte cap, VALID_MODES whitelist → returns null on any anomaly). Without this, an attacker could symlink the flag to ~/.ssh/id_rsa, or fill it with ANSI/OSC escapes the statusline would render on every keystroke; the whitelist guarantees the only renderable outputs are the ~11 known mode strings. src/hooks/package.json pins {"type":"commonjs"} so the hooks load even when an ancestor package.json declares "type":"module".

"Off" is represented by file absence: deactivation unlinks the flag, and readers treat "missing" and "unreadable" identically to inactive, failing safe toward off, never toward leaking content.
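The same defensive pattern can be sketched in Python to show how the pieces fit together. This is an illustrative analog, not the plugin's code (the real module is JavaScript): the function names are Python-style stand-ins, `VALID_MODES` is an assumed three-entry subset of the real whitelist, and the sketch is POSIX-only because it relies on `O_NOFOLLOW`.

```python
import os
import tempfile

VALID_MODES = {"lite", "full", "ultra"}  # assumption: subset of the real whitelist
MAX_FLAG_BYTES = 64


def safe_write_flag(path: str, mode: str) -> bool:
    """Atomically write `mode` to `path`; refuse anything suspicious."""
    if mode not in VALID_MODES or os.path.islink(path):
        return False  # unknown mode, or the flag itself is a symlink
    # A symlinked parent dir is tolerated: resolve it, then check ownership.
    directory = os.path.realpath(os.path.dirname(path) or ".")
    if os.stat(directory).st_uid != os.getuid():
        return False
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".flag-")  # created 0600
    try:
        os.write(fd, mode.encode("ascii"))
    finally:
        os.close(fd)
    os.replace(tmp, os.path.join(directory, os.path.basename(path)))  # atomic rename
    return True


def read_flag(path: str):
    """Return the active mode, or None on any anomaly (fail safe toward off)."""
    try:
        fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)  # refuses symlinks
    except OSError:
        return None  # missing, unreadable, or a symlink
    try:
        data = os.read(fd, MAX_FLAG_BYTES + 1)
    finally:
        os.close(fd)
    if len(data) > MAX_FLAG_BYTES:
        return None
    try:
        text = data.decode("ascii").strip()
    except UnicodeDecodeError:
        return None
    return text if text in VALID_MODES else None
```

Because the reader returns only whitelisted strings, a file full of escape sequences reads as "off" instead of reaching the statusline.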

The matcher in caveman-mode-tracker.js has three OR'd branches: (A) verb-before-noun (activate|enable|turn on|start|talk like … caveman), (B) noun-before-keyword (caveman … mode|activate|enable|…), and (C) brevity requests with no "caveman" at all (less tokens|fewer tokens|be brief|be terse|shorter answers). A negative guard (!/stop|disable|turn off|deactivate/) is load-bearing: "turn off caveman mode" matches Branch B, but the guard suppresses the write so a deactivation request can't accidentally re-arm the mode.
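The three branches and the guard can be approximated with four regular expressions. The patterns below are a reconstruction from the keyword lists quoted above, written in Python for illustration; they are not the hook's actual expressions, the word-boundary handling is an assumption, and Branch B includes only the keywords the text names.

```python
import re

# Branch A: verb before noun, e.g. "talk like ... caveman".
VERB_BEFORE_NOUN = re.compile(
    r"\b(activate|enable|turn on|start|talk like)\b.*\bcaveman\b", re.I
)
# Branch B: noun before keyword, e.g. "caveman ... mode".
NOUN_BEFORE_KEYWORD = re.compile(r"\bcaveman\b.*\b(mode|activate|enable)\b", re.I)
# Branch C: brevity requests that never mention "caveman".
BREVITY = re.compile(
    r"\b(less tokens|fewer tokens|be brief|be terse|shorter answers)\b", re.I
)
# Negative guard: any deactivation verb vetoes activation.
NEGATIVE_GUARD = re.compile(r"\b(stop|disable|turn off|deactivate)\b", re.I)


def wants_activation(prompt: str) -> bool:
    """True if the prompt should arm the mode under the rules sketched above."""
    if NEGATIVE_GUARD.search(prompt):
        return False
    return bool(
        VERB_BEFORE_NOUN.search(prompt)
        or NOUN_BEFORE_KEYWORD.search(prompt)
        or BREVITY.search(prompt)
    )


print(wants_activation("talk like a caveman"))    # True  (Branch A)
print(wants_activation("please be brief"))        # True  (Branch C)
print(wants_activation("turn off caveman mode"))  # False (matches B, vetoed by guard)
```

Without the guard, the last prompt would match Branch B and re-arm the mode it was asked to switch off.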

Two deliberate asymmetries: activation never carries a level (it always resolves the default — "talk like caveman ultra" does not give ultra; in an ultra-pinned repo, "be brief" silently activates ultra), and it respects off as a configured default (even "talk like caveman" writes nothing if the configured default is off — configured intent beats an ad-hoc phrase). Firing order is fixed — NL activation → stats → slash → NL deactivation → reinforcement — and because deactivation runs last and unconditionally deletes, it

Verified gap: the opencode port (src/plugins/opencode/plugin.js) reimplements the logic but has no Branch C: "less tokens" / "be brief" do not activate caveman under opencode, a genuine inconsistency with the Claude Code hook and the README promise. A clear fix-or-document candidate.

--append-system-prompt? This is a useful design lesson about Claude Code's injection paths. --append-system-prompt fails Caveman's constraints; a SessionStart hook satisfies them:

plugin.json has no systemPrompt/appendSystemPrompt field (the keys are name/description/author/hooks). SessionStart stdout-as-context is the only plugin-declarable always-on injection path.

caveman-activate.js computes the injection each session (mode resolution, SKILL.md read, level filtering, conditional statusline nudge); a static launch string can't.

--append-system-prompt is per-invocation; a SessionStart hook fires forever after one install.

SKILL.md live; a static append would drift.

The honest counterpoint: --append-system-prompt sits higher in precedence (it's appended to the real system prompt; SessionStart stdout is a lower-precedence system reminder), so a static append would be stickier in principle. Caveman gives that up and compensates with per-turn reinforcement, which arguably survives context compression better because it's continuously re-injected. Net: the SessionStart hook wins not as a better injection primitive in the abstract, but as the only one that is plugin-distributable, dynamic, persistent, and switchable.

full mode.

A model router looks at each turn and sends it to a different model based on how hard the turn looks (easy turns to a cheap model, hard turns to a strong one) to lower the bill.

RouteLLM (lm-sys/RouteLLM): a trained router; reports ~85% cost reduction at ~95% of GPT-4 quality on MT Bench (45% on MMLU, 35% on GSM8K). The figures are GPT-4-era and benchmark-specific.

vLLM Semantic Router (vllm-project/semantic-router): routes across local/private/frontier models by cost, latency, privacy, and safety.

The idea is sound for stateless, one-shot calls. Inside a stateful coding agent like Claude Code, switching models mid-conversation ("dynamic" or per-request routing) has three costs that don't show up on the per-token price tag; you only see them when you measure:

Because of #2, judge routing by outcomes, not by the token price. A router can make every turn cheaper and still be a net loss if the weaker reasoning makes you take more human turns to get the same result — three cheap turns plus your time to re-steer the agent can cost more than one good expensive turn. So put evals in place (task success rate, turns-to-done, human-correction rate) and watch them the moment you enable routing. If the human experience gets worse, the token savings are a mirage.

⚠️ Mistake: routing per request inside one conversation. It forfeits the model-scoped cache, makes you carry-and-bill the original model's (encrypted, unverifiable) reasoning on the new model, and can miss the cache even on the return trip. Measured below: bouncing 20% of turns to Sonnet actually loses money.

✅ Fix: Route sticky (per conversation or per sub-agent) so each model reads its own warm cache, or use routers only for stateless single-shot traffic where there's no cache to lose. Either way, gate it behind evals so a cheaper bill can't hide a worse experience.

The routers above assume you can swap models freely. Inside Claude Code you can't: the request body is co-designed for one target model, and routing it elsewhere fails in escalating ways before you even reach the cache economics.

A naive proxy that rewrites only the model field hits a wall fast. req 0001 → Sonnet (200); req 0002 → rerouted to Haiku → 400: "adaptive thinking is not supported on this model" → Claude Code aborts. Claude Code sends thinking:{type:"adaptive"} (valid for Sonnet); Haiku 4.5 rejects it.

Three parameters must be normalized to route to Haiku, each surfacing as a distinct 400:

output_config.effort

xhigh is Opus-4.7+-only: replaying an Opus request to Sonnet 4.6 gives 400: "does not support effort level 'xhigh'. Supported: high, low, max, medium.")

thinking:{type:"adaptive"}

{type:"enabled","budget_tokens":N}. You {type:"disabled"}, because Claude Code also sends context_management:{clear_thinking_20251015}, which returns 400: "clear_thinking_20251015 strategy requires thinking to be enabled or adaptive."

budget_tokens

max_tokens.

With those fixed (a "smart" proxy), both models complete: feature built, all 87 tests pass, every request 200; rerouted-to-Haiku requests carrying Sonnet thinking blocks all succeeded. The lesson: the request body is interlocked. Model name, thinking parameter, effort, and context-management strategy are co-designed. Naive per-request rewriting fails loudly (400), and even when fixed, forfeits the model-scoped cache.
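To make the interlock concrete, here is a rough sketch of what such a normalization step might look like as a plain dictionary transform. It is not the proxy used in the experiment. The function name, the `supported_efforts` and `supports_adaptive` arguments, the fallback effort, and the default budget are all hypothetical, and the rule that `budget_tokens` must stay below `max_tokens` is an assumption about the constraint the third item refers to. It also says nothing about the cache penalty, which applies regardless.

```python
import copy


def normalize_for_target(body, target_model, supported_efforts, supports_adaptive,
                         fallback_effort="high", thinking_budget=4096):
    """Return a copy of `body` adjusted for a target model (hypothetical rules)."""
    out = copy.deepcopy(body)  # never mutate the caller's request
    out["model"] = target_model

    # 1. Effort: swap a level the target rejects for an assumed fallback.
    cfg = out.get("output_config")
    if isinstance(cfg, dict) and "effort" in cfg and cfg["effort"] not in supported_efforts:
        cfg["effort"] = fallback_effort

    # 2. Thinking: adaptive becomes enabled-with-budget, never disabled,
    #    because the context-management strategy requires thinking to stay on.
    thinking = out.get("thinking")
    if isinstance(thinking, dict) and thinking.get("type") == "adaptive" and not supports_adaptive:
        max_tokens = int(out.get("max_tokens", 0))
        if max_tokens <= 1:
            raise ValueError("max_tokens too small to fit a thinking budget")
        # 3. Budget: assumed constraint that budget_tokens < max_tokens.
        budget = min(thinking_budget, max_tokens - 1)
        out["thinking"] = {"type": "enabled", "budget_tokens": budget}
    return out


request = {
    "model": "original-model-id",  # placeholder, not a real model ID
    "max_tokens": 2048,
    "thinking": {"type": "adaptive"},
    "output_config": {"effort": "xhigh"},
}
rerouted = normalize_for_target(
    request, "target-model-id", {"high", "low", "max", "medium"}, supports_adaptive=False
)
print(rerouted["thinking"], rerouted["output_config"])
```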

Claude Code already routes correctly, and you should leave it alone. It calls Haiku itself for background/auxiliary work while the main loop stays on your primary model (observed as orig_model=haiku requests the proxy never touched). The right answer for almost everyone is to let the built-in routing do its job and add no router at all. Don't insert your own routing unless you really know what you're doing: you'd have to reproduce the interlocked parameter normalization above and still eat the model-scoped cache penalty, which usually costs more than it saves. If you have a genuine reason to route, route sticky (per task or per sub-agent), never mid-conversation, and treat that as an advanced move you measure yourself, not a default.

Per 1M tokens; read 0.1×, write 2× (1-hour TTL):

| component | Opus 4.8 | Sonnet 4.6 | Haiku 4.5 |
|---|---|---|---|
| cache read | $0.50 | $0.30 | $0.10 |
| cache write (2×) | $10.00 | $6.00 | $2.00 |
| output | $25.00 | $15.00 | $5.00 |
| (base input) | $5.00 | $3.00 | $1.00 |

The per-turn cost is:

turn $ = read × read_rate + write × write_rate + output × output_rate

With rates baked in (output ×0.775 tokenizer factor for Haiku/Sonnet):

Opus $   = read × 0.00000050 + write × 0.00001000 + output × 0.000025
Sonnet $ = read × 0.00000030 + write × 0.00000600 + (output × 0.775) × 0.000015
Haiku $  = read × 0.00000010 + write × 0.00000200 + (output × 0.775) × 0.000005
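The three formulas collapse into one small function. The sketch below uses the per-token rates and the 0.775 output factor exactly as given above; the dictionary layout and function name are just one way to write it. The check uses turn 1 of Experiment B (read 29,874, write 11,301, output 168).

```python
# Dollars per token: (cache read, cache write, output), from the rate table.
RATES = {
    "opus":   (0.50e-6, 10.00e-6, 25.00e-6),
    "sonnet": (0.30e-6,  6.00e-6, 15.00e-6),
    "haiku":  (0.10e-6,  2.00e-6,  5.00e-6),
}
# Output-token adjustment applied to Haiku/Sonnet in the formulas above.
TOKENIZER_FACTOR = {"opus": 1.0, "sonnet": 0.775, "haiku": 0.775}


def turn_cost(model: str, read: int, write: int, output: int) -> float:
    """Dollar cost of one turn for the given token counts."""
    read_rate, write_rate, output_rate = RATES[model]
    return (
        read * read_rate
        + write * write_rate
        + output * TOKENIZER_FACTOR[model] * output_rate
    )


print(round(turn_cost("opus", 29_874, 11_301, 168), 4))  # 0.1321
```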

Experiment A — tiny task (linked-list fix, 6 turns, Opus 4.8). Prefix ~41K, all warm. All-Opus: $0.199 (~$0.033/turn). A 10%-Haiku per-request bounce: +16% (the cold write dwarfs the tiny outputs). A 10%-Haiku sticky: −8%. Break-even output ≈ 2,500 tokens. (Claude Code self-reported $0.182 — its 1.25× write estimate.)

Experiment B — feature build (16 Opus turns, MCP stabilized). Usage: cache_read 1,067,378 · cache_write 53,449 · output 35,579 · input 2,502. Acceptance 27/27, 24 unit tests green.

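| Turn | read | write | output | Opus $ |
|---|---|---|---|---|
| 1 | 29,874 | 11,301 | 168 | 0.1321 |
| 2 | 41,175 | 1,544 | 13,466 | 0.3850 ✱ |
| 3 | 42,719 | 16,024 | 127 | 0.1848 |
| 4 | 58,743 | 517 | 5,802 | 0.1796 |
| 5 | 59,260 | 5,858 | 8,347 | 0.2969 |
| 6 | 65,118 | 8,611 | 446 | 0.1298 |
| 7 | 73,729 | 561 | 194 | 0.0473 |
| 8 | 74,290 | 283 | 500 | 0.0525 |
| 9 | 74,573 | 591 | 316 | 0.0511 |
| 10 | 75,164 | 405 | 466 | 0.0533 |
| 11 | 75,569 | 562 | 96 | 0.0458 |
| 12 | 76,131 | 477 | 3,232 | 0.1236 |
| 13 | 76,608 | 3,468 | 123 | 0.0761 |
| 14 | 80,076 | 1,269 | 1,574 | 0.0921 |
| 15 | 81,345 | 1,659 | 227 | 0.0629 |
| 16 | 83,004 | 319 | 495 | 0.0571 |
| Total | | | | 1.970 |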

✱ Turn 2 includes a one-off uncached-input term (2,472 × $5/M = $0.0124). The cross-check is instructive: the documented 2× rate totals $1.970, while Claude Code displayed $1.77 (its 1.25× under-count). The $0.20 gap = 53,449 × ($10−$6.25)/M ≈ $0.20 — exactly the extra write cost. This is the displayed-cost gotcha made concrete.
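The cross-check is reproducible from the session totals alone. This sketch plugs the Experiment B usage figures into the Opus rates, once with the documented 2× write rate ($10/M) and once with a 1.25× write rate ($6.25/M); the variable names are illustrative.

```python
# Opus rates in dollars per token.
READ_RATE, OUTPUT_RATE, INPUT_RATE = 0.50e-6, 25.00e-6, 5.00e-6

# Session totals from Experiment B, plus the one-off uncached input on turn 2.
usage = {"read": 1_067_378, "write": 53_449, "output": 35_579, "uncached_input": 2_472}


def session_cost(write_rate: float) -> float:
    """Total session cost for a given cache-write rate (dollars per token)."""
    return (
        usage["read"] * READ_RATE
        + usage["write"] * write_rate
        + usage["output"] * OUTPUT_RATE
        + usage["uncached_input"] * INPUT_RATE
    )


documented = session_cost(10.00e-6)  # 2x base input
displayed = session_cost(6.25e-6)    # 1.25x base input

print(f"{documented:.3f} {displayed:.2f} {documented - displayed:.2f}")  # 1.970 1.77 0.20
```

The only term that differs between the two totals is the cache write, which is why the gap equals 53,449 tokens times the rate difference.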

Experiment D — interleaved 25-turn routing (Opus / Haiku / Sonnet). 20% of turns (3, 9, 14, 19, 23) routed to a cheaper model in a single session; turn 3 is the first bounce (cold), the rest are catch-up writes. One table answers the question directly — what a 25-turn session costs when 20% of its turns go to a different model, versus staying put or going fully sticky:

| Routing strategy | Session cost | vs all-Opus | Verdict |
|---|---|---|---|
| All-Opus (no routing) | $2.574 | baseline | |
| 20% of turns → Haiku, per-request | $2.278 | −11.5% | modest win |
| 20% of turns → Sonnet, per-request | $2.628 | +2.1% | ❌ loses money |
| All-Haiku, sticky | ~$0.41 | −85% | biggest win |
| All-Sonnet, sticky | ~$1.21 | −53% | big win |

Read it slowly — it's the punchline of the whole guide. Routing 20% of turns to Sonnet per-request actually loses money (+2.1%): Sonnet's write rate is 3× Haiku's, so on the cold and catch-up turns the parallel-prefix write costs more than the cheaper output saves (turn 3 alone runs +$0.09 versus just staying on Opus). Yet the same model run sticky saves 53%. That's the lesson: where you switch matters more than whether — bouncing per-request can cost you; keeping one cheaper model for the whole conversation is the real win.

| Tool / class | Attacks | Net win when… | Watch-out |
|---|---|---|---|
| RouteLLM, vLLM Semantic Router | model tier | routing is sticky per conversation/sub-agent | per-request bouncing forfeits the model-scoped cache |
| Headroom | input + cache | the prefix is stabilized so cache reads survive | retrieval round-trips cost some tokens |
| LLMLingua | input | one-shot / RAG context, not a cached prefix | busts the cache; degrades past ~20× |
| RTK, lean-ctx | tool-output input | tool outputs are verbose (logs, installers, git) | |
| Caveman | output | output volume dominates and terse prose is acceptable | human readability |
| OpenAI Compaction, Compresr, Token Co. | history / input | provider or hosted flows | hosted ones send data off-box; non-reversible |

The single most reliable lever is protecting the cached prefix (Act I). Compression and routing help only insofar as they don't undo that. Output compression (Caveman) and sticky routing are the two that never touch the prefix, which is why they're the safest wins in the entire ecosystem.
