# We Measured LLM Prompt Caching in Production — Same Prompt, 0% to 91% Hit Rates

> Source: <https://dev.to/sm1ck/we-measured-llm-prompt-caching-in-production-same-prompt-0-to-91-hit-rates-oio>
> Published: 2026-05-28 08:21:47+00:00

We run an AI companion bot. Every chat turn, the model sees the same ~5K-token prefix — character persona, content-tier rules, formatting guardrails, a memory blob — plus one new user line. Without caching, we pay for those 5K input tokens *every single turn*. So we turned on prompt caching across the providers we route through, measured it, and the spread was bigger than any of the marketing pages prepared us for.

Here's the table that survived four weeks in production, plus the one gotcha that ate two weeks before we figured it out.

| Provider / model | Hit rate | Latency Δ | Notes |
|---|---|---|---|
| Cydonia (via OpenRouter) | 91 % | −43 % | Just works, no marker needed |
| Gemini 3.1 Flash Lite | 75 % | −49 % | Requires `cache_control` marker |
| Grok (xAI) | 51 % | −40 % | "Sticky": best on active sessions |
| Same code, 600-token test prompt | 0 % | 0 % | Methodology bug, see below |

Same exact 5K-token system prefix across all rows. Same 10 follow-up turns. Wildly different cache behaviour.

Most OpenAI-compat examples skip any cache hint and assume the provider figures it out from prefix repetition. Some do. Anthropic-style routes, and anything going through OpenRouter that supports `cache_control`, don't:

```python
messages = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,          # the long, stable prefix
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
    {"role": "user", "content": user_msg},      # the only volatile part
]
```

Cydonia caches without it. Grok caches without it.

**Gemini 3.1 Flash Lite caches at exactly 0 % without it.** The same model jumps to 75 % with one extra field on the last cacheable content block.

We had Gemini 3.1 routed in production for a week showing zero cache reads in usage. Concluded the model "just didn't support caching." It does — we were calling the API the way every other model wanted to be called. Cost of including the marker on providers that ignore it: zero. Cost of skipping it on a provider that needs it: your entire spend on that route.
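One way to make "always send it" enforceable is a small guard that checks an outgoing payload for the marker before it reaches any route. The following is a minimal sketch, not our production code: `has_cache_marker` is a hypothetical helper name, and it assumes the message shape shown above (a list of content blocks, each a dict).

```python
from typing import Any


def has_cache_marker(messages: list[dict[str, Any]]) -> bool:
    """Return True if any content block in any message carries `cache_control`."""
    for msg in messages:
        content = msg.get("content")
        # Plain-string content cannot carry a marker; only block lists can.
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and "cache_control" in block:
                    return True
    return False
```

Asserting this in a test for every route's request builder would flag a code path that silently falls back to plain-string system prompts.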

Before we caught the marker thing, we'd already wrongly concluded a couple of models "don't cache" — because we'd tested with the wrong prompt.

The first probe was a ~600-token prompt repeated 10 times. Cache reads: zero, across every provider. Conclusion: this provider doesn't cache.

Conclusion: wrong. Most providers have a minimum prefix length before caching kicks in (≥ 1K tokens for some routes, closer to ≥ 4K for others). Below that floor, you pay full price even though the prompt repeats verbatim. The cache simply doesn't engage.
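A cheap pre-flight check can catch a probe prompt that sits under the floor before any conclusions get drawn from it. This sketch rests on two loud assumptions: the roughly-4-characters-per-token estimate is a common rule of thumb for English text rather than a real tokenizer, and the 4,000-token default is only a conservative stand-in for the "closer to ≥ 4K" routes mentioned above. Both function names are hypothetical.

```python
def rough_token_count(text: str) -> int:
    """Crude estimate: about 4 characters per token (assumption, not a tokenizer)."""
    return len(text) // 4


def clears_cache_floor(prefix: str, min_tokens: int = 4000) -> bool:
    """True if the prefix is probably long enough for caching to engage.

    `min_tokens` is an assumed floor; check the provider's documented minimum.
    """
    return rough_token_count(prefix) >= min_tokens
```

For anything borderline, count with the provider's actual tokenizer instead of the character heuristic.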

The corrected probe:

- Use the real ~5K-token prefix, well above any minimum, with 10 follow-up turns.
- Set the `cache_control` marker on the last cacheable content block.
- Read `usage.cache_creation_input_tokens` and `usage.cache_read_input_tokens` (or the provider's equivalent) back; don't trust round-trip latency alone.

Once we did that, every "broken" provider started reporting cache reads.
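The read-back step can be sketched as a pair of small functions over the `usage` object of each response. Assumptions to flag: the code treats `usage` as a plain dict, it falls back to the OpenAI-style `prompt_tokens_details.cached_tokens` field when the Anthropic-style field is absent, and it defines hit rate as the share of calls reporting any cache read, which is one reasonable definition rather than the only one. The function names are hypothetical.

```python
from typing import Any


def cached_tokens(usage: dict[str, Any]) -> int:
    """Tokens served from cache according to the usage block, 0 if not reported."""
    if "cache_read_input_tokens" in usage:            # Anthropic-style field
        return usage["cache_read_input_tokens"] or 0
    details = usage.get("prompt_tokens_details") or {}  # OpenAI-style field
    return details.get("cached_tokens") or 0


def hit_rate(usages: list[dict[str, Any]]) -> float:
    """Share of calls that reported at least one cached token."""
    if not usages:
        return 0.0
    hits = sum(1 for usage in usages if cached_tokens(usage) > 0)
    return hits / len(usages)
```

The first call in a session normally shows up as a cache write rather than a read, so a 10-turn probe tops out at 90 % under this definition.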

Grok was the weird one. Hit rate 51 % — lower than Cydonia and Gemini — but the cache *survived longer* between calls. Other providers behaved like a ~5-minute ephemeral cache; Grok looked more like a hot-window-then-slow-decay curve. Practical consequence: Grok did *better* than its hit rate suggested when the same user kept chatting actively, and *worse* when they came back hours later.

Lesson — a single hit-rate number per provider lies a little. The shape (how it decays, how it warms) matters as much as the headline percentage when your traffic is bursty.
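To see the shape rather than a single number, one option is to bucket calls by how long the session had been idle and compute a hit rate per bucket. This is an illustrative sketch: the bucket boundaries are arbitrary assumptions, the input format (idle gap in seconds plus cached-token count per call) is hypothetical, and it is not the analysis that produced the numbers above.

```python
from collections import defaultdict
from typing import Iterable

# Upper bounds in seconds; the cut points are illustrative assumptions.
BUCKETS = [(60, "<1m"), (300, "1-5m"), (3600, "5-60m"), (float("inf"), ">1h")]


def bucket_for(gap_seconds: float) -> str:
    """Label for the idle gap since the previous call in the same session."""
    for upper, label in BUCKETS:
        if gap_seconds < upper:
            return label
    return BUCKETS[-1][1]


def hit_rate_by_gap(calls: Iterable[tuple[float, int]]) -> dict[str, float]:
    """calls: (gap_seconds_since_previous_call, cached_token_count) pairs."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for gap, cached in calls:
        label = bucket_for(gap)
        totals[label] += 1
        if cached > 0:
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}
```

A provider with a hard expiry would show a cliff between two adjacent buckets; a slow-decay cache would show a gradual slope.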

We route turns through different model tiers depending on the user's plan. After caching landed and the marker was wired in everywhere it was needed:

The pleasant surprise was that latency mattered to retention more than cost mattered to the P&L. Cheaper turns are nice; faster replies are felt.

- `cache_control` is required on some routes (Gemini 3.1 line via OpenRouter, in our case) and ignored by others. Always send it.
- `cache_read_input_tokens` doesn't lie. End-to-end latency does: TTFB swings hide a lot of noise.

If you're running an AI app where the system prompt dwarfs the user input (companion bots, RAG with chunky retrieved context, agentic loops), you almost certainly leave 40 % of your bill and half a second of latency on the table by trusting the defaults. The marker is one line. The corrected methodology is one afternoon.
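For a back-of-envelope estimate on your own traffic, the input bill relative to "no caching" can be sketched as below. Everything here is an assumption to plug your own numbers into: `hit_rate` is taken as the fraction of prefix tokens served from cache, `cached_price_ratio` is whatever discount your provider bills cached reads at, and cache-write surcharges and output tokens are ignored. The function name is hypothetical.

```python
def relative_input_cost(
    prefix_tokens: int,
    fresh_tokens: int,
    hit_rate: float,
    cached_price_ratio: float,
) -> float:
    """Input cost with caching divided by input cost without it.

    hit_rate: fraction of prefix tokens read from cache (0.0 to 1.0).
    cached_price_ratio: price of a cached token relative to a normal one.
    """
    cached = prefix_tokens * hit_rate
    uncached = prefix_tokens * (1 - hit_rate) + fresh_tokens
    billed = uncached + cached * cached_price_ratio
    return billed / (prefix_tokens + fresh_tokens)
```

A result of 0.6 means the input side of the bill is 40 % lower than it would be with caching off.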

If you've got hit-rate numbers from a different routing setup (Bedrock, Fireworks, Together, direct Anthropic), drop them in the comments — curious how the marker situation compares outside the OpenRouter ecosystem.

This write-up is from production work at **HoneyChat**, a Telegram-native AI companion where the system prompt is the load-bearing wall (persona + content tier + memory blob = the whole 5K). The canonical version of this post lives at

— *HoneyChat Engineering*

