Cache-Aware Spawning: What Changed in llm-cli-gateway, a Week On


If your multi-LLM workload sends the same long system prompt or file dump to Claude / Codex / Gemini ten times an hour, you are paying for the same input tokens ten times. Each provider has a cache for exactly this case, and each one expresses the cache differently. This post is about how llm-cli-gateway now uses those caches for you, across all five providers, without you having to re-implement the per-provider cache APIs yourself. I covered the previous round of changes last week, and I closed that piece with a teaser, that Mistral Vibe was next on the list. A week later, Mistral is in, and a much larger change has landed alongside it, which is what most of this follow-up is about.

The new shape of the gateway: it now understands prompt caching as a first-class concern, across all five providers. That is claude, codex, gemini, grok, and mistral (Vibe). v1.6.0 shipped today and contains the lot.

Short version: every *_request and *_request_async tool now accepts a structured promptParts shape; the gateway concatenates the parts in a canonical order so the stable bytes precede the volatile tail unchanged across calls; three new cache_state:// MCP resources expose hit-rate / hit-count / estimated-savings aggregates back to the orchestrating agent; session_get projects a compact cacheState view at read time; and a cache_ttl_expiring_soon warning fires on Claude resumes when the Anthropic cache breakpoint is within 30 seconds of expiry. All of it is opt-in (every flag defaults off in 1.x), all of it observes the per-provider cache mechanism rather than fighting it, and none of it adds conversation content to gateway storage.

Long version is below, organised the same way I organised last week's post, problem - what changed - what it now does, with the caveats named up front rather than buried.

Mistral shipped Vibe, their open-source CLI coding agent powered by Devstral 2. The gateway now wires mistral_request and mistral_request_async alongside the other four providers. Same shape as the rest: sessions through --resume / --continue (which requires [session_logging] enabled = true in ~/.vibe/config.toml; the doctor surfaces this so you do not get an opaque failure), model registry entries, self-update via the vibe binary itself, and the same circuit-breaker, approval-gate, flight recorder, metrics, dedup, and durable-job-store plumbing as the others.

The model alias resolution is slightly different. Vibe has no --model flag, so the gateway injects the resolved alias via VIBE_ACTIVE_MODEL instead. That is the only material divergence from the Claude / Codex / Gemini / Grok pattern, and it is documented inline at the call site.
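The divergence is small enough to sketch. What follows is an illustration only, not the gateway's call site: the function name planModelSelection, the ModelSelection shape, and the assumption that the other four CLIs take the alias as a --model argument are all hypothetical. Only the VIBE_ACTIVE_MODEL variable name comes from the text above.

```typescript
// Illustrative only: how a per-provider model selection could be planned.
type Provider = "claude" | "codex" | "gemini" | "grok" | "mistral";

interface ModelSelection {
  extraArgs: string[];         // appended to the CLI argument list
  env: Record<string, string>; // environment for the spawned process
}

function planModelSelection(
  provider: Provider,
  resolvedAlias: string,
  baseEnv: Record<string, string>,
): ModelSelection {
  if (provider === "mistral") {
    // Vibe has no --model flag, so the alias travels in the environment.
    return { extraArgs: [], env: { ...baseEnv, VIBE_ACTIVE_MODEL: resolvedAlias } };
  }
  // Assumption: the other CLIs accept the alias as a --model argument.
  return { extraArgs: ["--model", resolvedAlias], env: { ...baseEnv } };
}
```

Keeping the decision in one pure function means the spawn code itself stays identical across providers; it only has to merge extraArgs and env into whatever it was going to run anyway.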

Now five providers, five model families, five vendor lineages (Anthropic, OpenAI, Google, xAI, Mistral). What I noticed running parallel reviews these past few weeks is that agreement within the OpenAI / Anthropic / Google triangle is not as informative as it looks, because the three model lineages share a lot of training data and a lot of post-training tendencies. I am not pretending this is statistics; it is just how I use these tools in review work. But adding an xAI voice and a Mistral voice means a five-way agreement is sampled from a meaningfully wider distribution than a three-way agreement, and a one-out-of-five dissent (especially from a vendor outside the triangle) is a data point I read rather than a vote I discard.

The change that took most of the engineering is promptParts. The shape is small:

{
  "promptParts": {
    "system": "You are a careful reviewer of TypeScript diffs.",
    "tools":  "<long, stable description of the tools you can call>",
    "context": "<long, stable file dump or repo summary>",
    "task":    "What did the last patch change?"
  }
}

prompt and promptParts are mutually exclusive: you pass exactly one. The runtime check at the top of every handler returns the exact error message provide exactly one of `prompt` or `promptParts` if you pass both (the backticks belong to the error string itself; the messages are part of the public contract and the tests assert them verbatim). The gateway then concatenates the parts in canonical order, system → tools → context → task, with a stable separator, and hands the resulting string to the CLI's positional -p (or equivalent) argument. The stable prefix bytes precede the volatile task tail unchanged across calls, which is enough for each provider's automatic prompt-caching to land on the same content hash each time.
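A minimal sketch of that composition step, under stated assumptions: the separator value, the optionality of system / tools / context, and the names composePrompt and RequestInput are mine, not the gateway's. The error string is the one quoted above, and the ordering is the canonical one.

```typescript
// Hypothetical input shape: exactly one of prompt / promptParts is allowed.
interface PromptParts {
  system?: string;
  tools?: string;
  context?: string;
  task: string;
}

interface RequestInput {
  prompt?: string;
  promptParts?: PromptParts;
}

const CANONICAL_ORDER = ["system", "tools", "context", "task"] as const;
const SEPARATOR = "\n\n"; // assumption: any fixed separator gives a stable prefix

function composePrompt(input: RequestInput): string {
  const { prompt, promptParts } = input;
  if (prompt !== undefined && promptParts === undefined) {
    return prompt;
  }
  if (prompt === undefined && promptParts !== undefined) {
    const parts: PromptParts = promptParts;
    // Stable parts first, volatile task last, always in the same order.
    return CANONICAL_ORDER
      .map((key) => parts[key])
      .filter((part): part is string => part !== undefined && part.length > 0)
      .join(SEPARATOR);
  }
  // Both or neither supplied.
  throw new Error("provide exactly one of `prompt` or `promptParts`");
}
```

Two calls that differ only in task produce strings that share every byte up to the start of the task, which is the property the providers' automatic caching depends on.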

Two specific points worth naming.

First, this is not a request-body translation layer. The gateway does not construct Anthropic / OpenAI / Mistral JSON request bodies; it spawns the CLI binary the same way it always has. The "cache awareness" sits one layer above, in how the input string is composed before the CLI sees it. That keeps the architectural thesis intact (CLI wrapping, not API proxying) while still giving you cache hygiene for free.

Second, for Claude specifically, the gateway does not yet emit explicit cache_control JSON breakpoints. The Claude Code CLI documents --exclude-dynamic-system-prompt-sections and several ENABLE_PROMPT_CACHING_* / DISABLE_PROMPT_CACHING_* environment variables (all listed in PROVIDER_CACHE_SURFACES.md with citations to the upstream env-vars page), but the path for injecting per-block cache_control markers via stream-json input is probable rather than verified. The [cache_awareness].emit_anthropic_cache_control flag is reserved in config for the follow-up slice that lands a live smoke test, so the present 1.6.0 release ships "Branch B" (prefix discipline only). That is honest about what works and what is gated on verification.

Third (because I said two and meant three), per-model minimum cacheable token thresholds matter. Anthropic Sonnet 3.5–4.6 caches at 1024 tokens minimum; Opus 4.5+ and Haiku 4.5 require 4096; Haiku 3.5 on Vertex needs 2048. The gateway has a [cache_awareness.min_stable_tokens_for_cache_control] per-family table populated from the Anthropic prompt-caching docs and surfaces the lookup via a minStableTokensForModel(config, modelName) helper. The in-code alias table is conservative (it collapses all Haiku variants to 4096 rather than exposing the Vertex-only 2048 distinction); a single-family override can be added when a workload needs it. Slice 1 does not yet act on this (we are not emitting cache_control), but the data is in place for the slice that will.
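To make the per-family lookup concrete, here is a rough stand-in. It is not the gateway's minStableTokensForModel: the names lookupMinStableTokens, isPrefixCacheable and EXAMPLE_THRESHOLDS, the substring-based family matching, and the model names used to call it are all hypothetical. The numbers are the thresholds quoted above, with every Opus and Haiku variant collapsed to 4096 as a conservative simplification.

```typescript
type ModelFamily = "sonnet" | "opus" | "haiku";

// Per-family minimum stable-prefix size, in tokens (conservative collapse).
type ThresholdTable = Record<ModelFamily, number>;

const EXAMPLE_THRESHOLDS: ThresholdTable = {
  sonnet: 1024,
  opus: 4096,
  haiku: 4096, // ignores the Vertex-only 2048 case on purpose
};

const FAMILIES: ModelFamily[] = ["sonnet", "opus", "haiku"];

// Returns the threshold for a model name, or undefined for unknown families.
function lookupMinStableTokens(table: ThresholdTable, modelName: string): number | undefined {
  const lowered = modelName.toLowerCase();
  for (const family of FAMILIES) {
    if (lowered.indexOf(family) !== -1) {
      return table[family];
    }
  }
  return undefined;
}

// True only when the family is known and the stable prefix is large enough.
function isPrefixCacheable(table: ThresholdTable, modelName: string, stableTokens: number): boolean {
  const minimum = lookupMinStableTokens(table, modelName);
  return minimum !== undefined && stableTokens >= minimum;
}
```

With this table, a 2000-token stable prefix would clear the bar on a Sonnet-family model but not on an Opus-family one, which is the kind of decision a later slice could make before emitting a breakpoint.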

The supporting piece, and frankly the one that makes the rest defensible, is the observability surface. Three new MCP resources sit alongside the existing sessions:// and models:// resources:

cache_state://global: total_requests, total_hits, hit_rate, total_cache_read_tokens, total_cache_creation_tokens, estimated_savings_usd (best-effort, using a per-model pricing table dated 2026-05-26), and a per-CLI breakdown.

cache_state://session/{sessionId}: includes ttlRemainingMs derived from the configured Anthropic TTL policy.

cache_state://prefix/{hash}

The structural guarantee: none of these shapes have a prompt / response / system / task field. The session-storage invariant from the project's CLAUDE.md ("no conversation content in session storage") holds, and the new bits add only hash + token-count metadata to the existing flight recorder (which already stored prompts and responses for audit, separate from the session manager). I would not have shipped the observability surface without that constraint, frankly.
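As a sketch of how such an aggregate can be derived from token-count metadata alone, consider the following. The row shape, the function names, and the definition of a hit (a request whose cache-read count is greater than zero) are assumptions for illustration; estimated_savings_usd is left out because it depends on a pricing table not reproduced here.

```typescript
// Hypothetical metadata-only row: no prompt, response, system or task field.
interface FlightRecorderRow {
  cli: string;
  cacheReadTokens: number | null;     // null when the CLI reported nothing
  cacheCreationTokens: number | null;
}

interface CacheAggregate {
  total_requests: number;
  total_hits: number;
  hit_rate: number;
  total_cache_read_tokens: number;
  total_cache_creation_tokens: number;
}

function aggregateCacheState(rows: FlightRecorderRow[]): CacheAggregate {
  let hits = 0;
  let readTokens = 0;
  let creationTokens = 0;
  for (const row of rows) {
    const read = row.cacheReadTokens ?? 0; // null counts as "no observed hit"
    if (read > 0) hits += 1;
    readTokens += read;
    creationTokens += row.cacheCreationTokens ?? 0;
  }
  return {
    total_requests: rows.length,
    total_hits: hits,
    hit_rate: rows.length === 0 ? 0 : hits / rows.length, // no division by zero
    total_cache_read_tokens: readTokens,
    total_cache_creation_tokens: creationTokens,
  };
}

function breakdownByCli(rows: FlightRecorderRow[]): Record<string, CacheAggregate> {
  const grouped: Record<string, FlightRecorderRow[]> = {};
  for (const row of rows) {
    const bucket = grouped[row.cli] ?? [];
    bucket.push(row);
    grouped[row.cli] = bucket;
  }
  const result: Record<string, CacheAggregate> = {};
  for (const cli of Object.keys(grouped)) {
    result[cli] = aggregateCacheState(grouped[cli] ?? []);
  }
  return result;
}
```

Because the input rows carry only counts, nothing in the output can leak conversation content, whatever the aggregation does.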

The session_get tool now includes a compact cacheState block when the session has prior requests, with cli, prefixDistinct, totalCacheReadTokens, totalCacheCreationTokens, requestCount, hitCount, hitRate, estimatedSavingsUsd, and ttlRemainingMs. The field is omitted entirely for fresh sessions (not null, not empty object), keeping the payload compact when there is nothing to report.
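The "omitted, not null" behaviour is easy to get subtly wrong, so here is a small sketch of one way to express it. The projectSession function and the trimmed-down CacheStateView (only four of the fields listed above) are hypothetical and exist purely to show the key being absent for a fresh session.

```typescript
// Trimmed-down, hypothetical view: a subset of the fields named above.
interface CacheStateView {
  cli: string;
  requestCount: number;
  hitCount: number;
  hitRate: number;
}

interface SessionView {
  sessionId: string;
  cacheState?: CacheStateView; // optional: absent for fresh sessions
}

function projectSession(
  sessionId: string,
  cli: string,
  requestCount: number,
  hitCount: number,
): SessionView {
  if (requestCount === 0) {
    // Fresh session: leave the key out entirely rather than sending null or {}.
    return { sessionId };
  }
  return {
    sessionId,
    cacheState: { cli, requestCount, hitCount, hitRate: hitCount / requestCount },
  };
}
```

A consumer can then test for the presence of the key instead of having to distinguish null, an empty object, and a populated block.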

Slice 3 is the bit that uses the observability data for actionable warnings. When claude_request (or claude_request_async) is invoked with a sessionId, and [cache_awareness].warn_on_ttl_expiry = true, and the prior session row's lastRequestAt is within 30 seconds of Anthropic's documented TTL (5 minutes by default, 1 hour when [cache_awareness].anthropic_ttl_seconds = 3600), the response payload carries a structured warning:

{
  "warnings": [{
    "code": "cache_ttl_expiring_soon",
    "ttlRemainingMs": 12000,
    "message": "Anthropic cache breakpoint for session ... expires in 12000ms (< 30000ms). Subsequent requests may miss the cache."
  }]
}

It is a warning, not a hard error. The request still runs. The flag defaults to false in 1.x; flip it on once you have observed your traffic for a few days. Two caveats. First, ttlRemainingMs is best-effort, computed locally from our flight recorder's lastRequestAt rather than from Anthropic's actual cache state, so a cache eviction inside Anthropic's window will not be visible to us and the warning may be optimistic. Second, it only fires for Claude. For the other four CLIs, we do not observe the provider's cache state (or, in some cases, the provider does not expose one at all), so the warning would be a guess.
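The arithmetic behind the warning can be sketched as follows. This is an approximation of the described behaviour rather than the gateway's code: the function name, the options object, the choice to stay silent once the computed TTL has already run out, and the way the session id is interpolated into the message are all assumptions.

```typescript
interface TtlWarning {
  code: "cache_ttl_expiring_soon";
  ttlRemainingMs: number;
  message: string;
}

interface TtlCheckOptions {
  sessionId: string;
  lastRequestAtMs: number;  // local timestamp of the previous request
  nowMs: number;
  ttlSeconds: number;       // 300 by default, 3600 with the 1 hour policy
  warnOnTtlExpiry: boolean; // mirrors the opt-in flag
}

const WARN_WINDOW_MS = 30000;

function checkCacheTtl(opts: TtlCheckOptions): TtlWarning | undefined {
  if (!opts.warnOnTtlExpiry) return undefined;
  // Best-effort: derived from local timestamps, not from provider state.
  const ttlRemainingMs = opts.ttlSeconds * 1000 - (opts.nowMs - opts.lastRequestAtMs);
  // Assumption: no warning once the TTL has already lapsed locally.
  if (ttlRemainingMs <= 0 || ttlRemainingMs >= WARN_WINDOW_MS) return undefined;
  return {
    code: "cache_ttl_expiring_soon",
    ttlRemainingMs,
    message:
      `Anthropic cache breakpoint for session ${opts.sessionId} expires in ` +
      `${ttlRemainingMs}ms (< ${WARN_WINDOW_MS}ms). Subsequent requests may miss the cache.`,
  };
}
```

With the default 5 minute TTL, a resume that arrives 288 seconds after the previous request leaves 12000 ms on the clock, which matches the figure in the example payload above.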

The Codex CLI, however, deserves a specific note. As of 0.133.0, Codex emits cached_input_tokens in its turn.completed.usage payload, verified by a live smoke test on 2026-05-26 (the test invocation, the raw JSONL response, and the field-name divergence from the Anthropic-style cache_read_input_tokens are all captured in docs/personal-mcp/PROVIDER_CACHE_SURFACES.md under the "Codex field name divergence" section; the gateway's src/codex-json-parser.ts was originally written against the Anthropic-style name). The parser's cache_read_tokens column therefore stays null for Codex rows until a follow-up updates the parser to accept the actual field. The observability surface tolerates this without dividing by zero, and the limitation is also documented.

v1.6.0 also brings a much larger contributor-facing change that does not show up in any tool surface, but is worth naming. The gateway now ships with the same security and validation posture as our agent-assurance spec repository. A new .github/workflows/security.yml runs actionlint, zizmor, shellcheck, typos, osv-scanner, gitleaks, ruff, bandit, and lychee on every push and pull request; eslint-plugin-security is wired into the existing eslint config and runs as part of the standard CI lint step. All third-party actions are SHA-pinned; the Python and Go tools are version-pinned (zizmor==1.25.2, ruff==0.14.5, bandit==1.9.4, actionlint@v1.7.12); the gitleaks binary is downloaded and SHA256-verified before execution. Workflows now use least-privilege permissions, defaulting to contents: read and escalating only on the publish jobs that need OIDC for npm provenance / PyPI trusted publishing or gh release upload; every actions/checkout sets persist-credentials: false except the single job that needs the token for the release upload; the release-installer.yml top-level write was narrowed to that one job. Dependabot expanded from github-actions only to also cover npm and pip, with non-security npm bumps grouped so security updates never get delayed behind a batch.

In flight, osv-scanner flagged 26 Go stdlib CVEs in installer/go.mod (pinned to Go 1.22, when the fixes were in 1.23–1.25.x); that has been bumped to 1.25 in lock-step with the release-installer.yml setup-go pin, and re-verified clean. Two test fixtures and one npmjs.com URL needed allowlisting (a deliberate fake bearer token, an npmjs page that Cloudflare bot-protects, and a similar OpenAI help-centre page), each annotated with the specific reason. There are no real findings outstanding.

This is not the kind of work that ships in a marketing line. It is the work that means the next contributor (or me, six months from now) does not accidentally land a workflow with contents: write and a published-to-cache setup-node step on a release-triggered workflow, which is precisely the kind of supply-chain footgun the Solorigate, Codecov, and xz class of incidents has trained the industry to take seriously. It is the work that means a Dependabot PR with a real CVE fix gets reviewed against an automated gate, not a human's best guess. It is the work that makes claims about supply-chain hygiene auditable rather than aspirational.

The cache-awareness story above frames the gateway as something claude-code or codex spawns when an MCP request lands, but that is only one of three inbound surfaces, and it is worth being explicit about the other two because they are how a lot of people actually use the gateway day to day. The gateway is itself an MCP server, so anything that speaks MCP can reach it, and the cache-awareness, observability, and TTL warnings described above apply identically regardless of which surface called in.

claude-code, codex, gemini, grok, and vibe each have their own MCP config (~/.claude.json, ~/.codex/config.toml, ~/.gemini/settings.json, and so on); the gateway gets a single entry that wires llm-cli-gateway as the command, and the inbound CLI then sees all of claude_request / codex_request / gemini_request / grok_request / mistral_request plus the session and cache_state:// resources as if they were its own tools.

Claude Desktop (setup/providers/claude-desktop.md): the client_config.claude_desktop_config_present field tells the install agent which path applies.

ChatGPT: llm-cli-gateway tunnel start and llm-cli-gateway chatgpt-url for the connector wiring; the doctor's endpoint_exposure.web_clients_supported field is the gating boolean. The wrinkle worth knowing about is that ChatGPT requires Authentication: No Authentication on the connector path, so the gateway's LLM_GATEWAY_NO_AUTH_PATHS env var carves out exactly that path while keeping /mcp bearer-token-gated. The walk-through is in setup/providers/chatgpt.md.

llm-cli-gateway doctor --json is the authoritative source for which of these surfaces are wired today, and the install-agent contract at setup/assistants/ASSISTANT_CONTRACT.md is the canonical walk-through, with per-target snippets under setup/providers/.

node, the gateway binary and an upstream CLI of your choice; the other four providers go in as and when you add them.

Nothing, again. The thesis from the original piece was that CLI wrapping gives you capabilities (real file access, real test execution, real session state) that API proxying cannot reach without re-implementing each provider's tool surface. Cache hygiene now joins that list. Each provider's CLI is the right surface to ask "what does this cost?", because each provider's CLI is the only surface that returns telemetry the same way the operator's billing console returns it. The gateway's job is to compose the stable bytes before the volatile bytes so the cache lands on the same content hash, then to read back the resulting cache_read_input_tokens (or cached_input_tokens, depending on the CLI version) from the flight recorder and surface it as an MCP resource the orchestrating agent can act on.
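Reading the counter back under either name might look like the sketch below. It is not the code in src/codex-json-parser.ts (which, as noted earlier, does not yet accept the Codex field name in 1.6.0); the function name and the loosely typed usage object are hypothetical, and only the two field names come from this article.

```typescript
// A usage payload as parsed from a CLI's JSON output; contents vary by CLI.
type UsagePayload = Record<string, unknown>;

// Field names reported by different CLIs for cache-read input tokens.
const CACHE_READ_FIELDS = ["cache_read_input_tokens", "cached_input_tokens"];

// Returns the cache-read token count under either name, or null if absent.
function readCacheReadTokens(usage: UsagePayload): number | null {
  for (const field of CACHE_READ_FIELDS) {
    const value = usage[field];
    // Accept only real, non-negative numbers; strings and NaN are ignored.
    if (typeof value === "number" && value >= 0) {
      return value;
    }
  }
  return null;
}
```

Returning null rather than 0 when neither field is present keeps "the CLI reported no cache reads" distinguishable from "the CLI reported nothing at all", which matters for any hit-rate computed downstream.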

What an API-proxy approach would have to do for the same outcome: construct provider-specific request bodies with per-block cache_control markers, then handle the per-provider divergence in cache field names (cache_read_input_tokens for Anthropic, prompt_tokens_details.cached_tokens for OpenAI, usageMetadata.cachedContentTokenCount for Gemini), then handle the per-provider divergence in TTL policy (5min/1h for Anthropic, implicit-only for OpenAI, separate cachedContents SDK for Gemini), and own the resulting compatibility surface forever. We instead let each CLI own its own provider integration and stand back, sampling the telemetry as it comes out.

If you are evaluating llm-cli-gateway against an API proxy and your workload is heavy on long stable context (file dumps, repo summaries, large system prompts), the question to ask now is not just "does this give me cache hits?", it is "does this give me cache hits I can measure, without me having to re-implement per-provider cache APIs?". That seemed worth writing down.

The Branch A live smoke test for explicit Claude cache_control injection via --input-format stream-json. The Codex parser fix to accept cached_input_tokens. Async-path flight-recorder integration, so the v3 stable_prefix_hash column gets populated on async jobs too (it does not today, by design, because src/async-job-manager.ts has zero flight-recorder integration, and that is a separate concern). And, once we have 24h of dogfooding data from cache_state://global, the cache-aware multi-LLM routing slice, which is the actual end goal: route a request to the provider whose session has the warmest cache for the requested prefix, rather than the round-robin default.

v1.6.0 is the feature release described above; a docs-only follow-up v1.6.1 went out the same day with the install-agent guidance for Mistral and the post-release doc audit fixes (no source changes). The current published artefacts are at v1.6.1 on npm (with sigstore provenance via the OIDC publish path) and PyPI; the GitHub release at v1.6.1 carries SHA256-verifiable installer artefacts for macOS / Linux / Windows.

Thanks for reading this far. As always, MIT licensed.

llm-cli-gateway is MIT licensed. npm: llm-cli-gateway | GitHub: verivus-oss/llm-cli-gateway

source & further reading

dev.to: original article
