# Cache Poisoning

> Source: <https://pub.towardsai.net/cache-poisoning-6044423bf94f?source=rss----98111c9905da---4>
> Published: 2026-06-23 18:01:01+00:00

Yo!! A quick note before we start. I have yapped a lot about prompt caching in [Part 1](https://medium.com/@akileshramesh2003/6044423bf94f), how it works, what cache hits and misses actually mean, and how to improve cache hit rates without making your API bill shoot the sky.

That one will definitely be useful before reading this Part 2!! Because here we are moving from the strict cache to the slightly more dangerous one.

I thought I was done with caching after Part 1. Not done-done obviously, because in software nothing is ever done! One day you are fixing cache hits, next day some JSON key changes order and the system behaves like it has never met you in its life. But still, I felt like I had understood the shape of it. Prompt caching made sense to me now!!

Same prefix? Reuse the work. Different prefix? Full prefill, full cost, full pain.

Painful, but honest :)

And honestly, that honesty made me comfortable. The cache was strict. One extra space, but at least I knew what kind of creature I was dealing with. It was not trying to be smart. It was not guessing. It was just checking whether the beginning was exactly the same.

Then I looked at something called as **Semantic caching**.

And my first reaction was very predictable, wait, this is better, no???

Instead of the context being the problem,

what if the cache understood meaning?

If one user asks, **How do I reset my password**?

and another asks **I forgot my password, how do I log back in?**

why are we calling the LLM twice ? Same intent and same answer probably. Embed the query, search nearby old questions, return the cached answer if the similarity is high enough.

Lower latency → Lower cost →Fewer repeated LLM calls → stakeholders happy → ( please do fill in )

I mean, I thought this is very brilliant, because the moment I moved this idea from a simple chatbot into an agentic system, the whole thing started feeling different. In a normal chatbot, a cached answer is mostly just text. Maybe wrong text, maybe stale text, but still text.

In an agentic system, a cached answer can become a plan. And a plan can become a tool call.

And a tool call can approve something, rotate something, update something, delete something, or confidently tell a user that the system did the right thing when the model did not even run for that request.

That is the part that stayed with me and hat is where close enough becomes dangerous.

Prompt caching reused computation while semantic caching can reuse conclusions!

So in this **Part 2 journey**, I want to take you into that world. The world where semantic caching looks like a cost-saving trick at first, then slowly becomes a question about trust, permissions, poisoned answers, and agentic systems that can act on old conclusions.

Okiee, this is how I understand

A first implementation usually looks something like this

``` python
def answer_question(query: str):    query_embedding = embed(query)    cached = semantic_cache.search(            embedding=query_embedding,            threshold=0.88,        )        if cached:  # this is the trap            return cached.answer        docs = retrieve_docs(query)        answer = call_llm(query=query, context=docs)        semantic_cache.write(            query=query,            embedding=query_embedding,            answer=answer,        )     return answer
```

Seems to be clean actually.

In a docs chatbot, this may be fine.

If the question is **where is the API reference?** and the cached answer points to the same public docs page, no probs.

But…But in an agent, this is where the floor starts moving, the whole problem is hiding in one line:

```
if cached:    return cached.answer
```

That line looks like normal caching, but it is not the same thing as prompt caching…

Prompt caching is like, **I already read this exact prefix before, so I will reuse the computation**.

Semantic caching says**, I found an old answer that looks close enough, so I will reuse the conclusion.**

One saves reading work. The other can skip reasoning. And in an agentic system, skipping reasoning is not always a cute latency win… Sometimes it means retrieval does not happen. The latest policy is not checked. The current user context is not considered. The old answer just comes back wearing today’s clothes. That is the bug.

I did not want this to be a full looong collection of words blog.. Yeah..** Semantic cache poisoning** sounds serious, but until you see it happen in a small system, it still feels a little far away. So I experimented with a small demo that I ran entirely locally.

**Description** : Just a fake DevOps planner agent, a semantic cache, one poisoned entry, one normal-looking user query, and enough logs to see what happened. The fake domain was DevOps because agentic risk becomes very obvious there. The agent receives operational tasks and produces a plan. In the vulnerable version, it checks semantic cache first. If it finds a similar old plan, it returns that plan directly.

Here is the vulnerable planner:

``` python
def vulnerable_planner_agent(task: str, tenant_id: str):    task_embedding = embed(task)    cached_entry, similarity = semantic_cache.search(            query_embedding=task_embedding,            threshold=0.87,        )        if cached_entry:            return {                "source": "semantic_cache",                "similarity": similarity,                "plan": cached_entry.answer,                "llm_called": False,            }        fresh_context = retrieve_current_runbook(task, tenant_id)        fresh_plan = call_llm_for_plan(task, fresh_context)        semantic_cache.write(            query=task,            embedding=task_embedding,            answer=fresh_plan,        )        return {            "source": "fresh_llm",            "plan": fresh_plan,            "llm_called": True,        }
```

Now I inserted a poisoned cache entry:

```
[ATTACKER / LOW-TRUST QUERY]> How do I rotate staging database credentials in the legacy cluster?[CACHED PLAN]1. Use legacy_rotate.sh.2. Disable validation for staging.3. Run the manual secret update flow.
```

Then with another user, I asked:

```
[VICTIM QUERY]> Rotate staging database credentials for the new cluster.
```

And the vulnerable system returns,

```
- semantic_cache_lookup = true- top_match_similarity  = 0.913- matched_entry         = legacy_cluster_rotation_plan- llm_called            = false---RETURNED PLAN:1. Use legacy_rotate.sh.2. Disable validation for staging.3. Run the manual secret update flow.
```

This is the moment the idea becomes real.

The victim did not ask for legacy tools. The victim did not ask to disable validation. But the semantic cache saw enough overlap around **rotate, staging, and database credentials,** **crossed the threshold**, and returned the old plan.

The model did not hallucinate nor did not ignore instructions. The cache became the answer path.

```
Poisoned Cache Entry        ↓Victim Query with similar meaning        ↓Similarity Score: 0.913        ↓Threshold: 0.87        ↓Cache Hit        ↓LLM Skipped Entirely        ↓Stale Plan Returned
```

That is why semantic cache poisoning feels different from normal prompt injection.

Prompt injection attacks the model while it is reasoning. Semantic cache poisoning attacks what gets stored, so that later the model can be skipped entirely.

In a single chatbot, a cached answer is usually just text. Bad text can still hurt, but it often stops at the user reading a wrong answer.

In the recent past, with the multi-agent system, a cached answer can become intermediate reasoning!!!

A workflow may look like this:

```
## User request> Router agent: where should this go?> Planner agent: what steps should we take?> Research agent: what docs should we retrieve?> Tool agent: what API should we call?> Critic agent: is this safe?> Final response
```

Now imagine semantic caching sitting in the wrong place.

If it sits before the router, a poisoned entry can influence routing or if it sits before the tool agent, it can influence which tool gets called. The planner may not know its output was cached. The tool agent may not know the plan came from a stale cache hit. The critic may not know the decision came from the fast path. Each agent just sees content and acts on it.

The weakness is not one huge dramatic bug. It is a chain of quiet trust assumptions.. Cached responses travel through the system.

Semantic caching relies on **similarity and most setups** use something called **cosine similarity,**

Cosine Similarityis one of the most widely used metrics in modern AI systems for measuring how similar two vectors are in high-dimensional space.

```
similarity(a, b) = (a · b) / (||a|| · ||b||)
```

This gives a score between -1 and 1, where 1 means the vectors point in the same direction. In practice, teams often set thresholds somewhere around the 0.85 to 0.95 range depending on the embedding model, domain, and how aggressively they want cache hits.

The core problem is this, **cosine similarity measures directional closeness. It does NOT measure whether something is safe to reuse.**

Two queries can be similar for completely different reasons:

> **How do I reset my password? **and **I forgot my password, how do I log back in?** are similar in meaning. Returning the same cached answer is probably fine.

**> Reset staging database credentials** and **Reset production database credentials** are similar in meaning. But returning the same cached answer could destroy your company.

This is the core vulnerability, **similarity is continuous, but authorization is binary.**

You cannot represent a yes/no permission boundary with a smooth number between 0 and 1 and then act surprised when things leak through.

Here’s the dilemma that has no good answer:

Because the real question isn’t:

Are these two queries similar?

The real question is:

Is this cached answer allowed to be reused in this specific context?

And that is a question similarity cannot answer. It needs **system design**, not just a magic threshold.

A common **fix** is to have the LLM validate the cached answer:

```
if risk == "medium" and cached:    return llm_validate_and_adapt(        cached_candidate=cached.answer,        fresh_context=fresh_context,    )
```

This sounds safe, but it is not...

Recent research shows that **LLMs have a problem with context bias**. When you present information first (like here’s a cached answer that might work ), the model accepts it more often than when you present it neutrally. The model doesn’t carefully check the answer, it pattern-matches.

This is especially true if the cached answer was written by the same LLM before. The model will recognize its own style and trust it more, even if it is wrong.

So.. validation is not magic. If the cached answer is presented badly, you may simply anchor the model toward accepting it.

These are practical points that help you keep semantic caching without breaking your system,

A cache entry needs way more than just query and answer:

```
@dataclassclass SaferCacheEntry:    entry_id: str                           # Unique ID    query: str                              # Original query    answer: str                             # Cached response    embedding: np.ndarray                   # Vector representation        environment: str                        # Where was this created? (staging/prod)    source_versions: dict[str, str]         # What docs/policies existed then?    created_timestamp: datetime             # When was this created?        user_trust_score: float                 # How much do we trust who created this?
```

**Why?** Without metadata, the cache cannot answer the only question that matters:

Is this reusable here?

Without metadata, it is juz an old answer with no passport.. The system has no idea where it came from or who was supposed to use it.

Do not retrieve globally and then negotiate with similarity. Search inside the allowed boundary,

``` python
def safe_semantic_search(query_embedding, request):    results = vector_collection.query(        query_embeddings=[query_embedding],        n_results=5,        where={            "$and": [             # Only from the same environment                {"environment": {"$eq": request.environment}},                                # Only if the source docs match current versions                {"source_version": {"$eq": request.source_version}},            ]        },    )    if not results["ids"]:        return None    best_distance = results["distances"][0]    best_similarity = 1 - best_distance    if best_similarity < 0.87:        return None    return SafeCacheEntry(**results["metadatas"][0])
```

**The key principle : :** Similarity search should happen inside a safe boundary, not across the entire universe. Filter first, then match.

Not every query deserves the same cache treatment. For low-risk stuff, like public docs, FAQs, or where is the API reference?, direct cache return is usually fine because the **blast radius** is small.

For medium-risk tasks, like deployment guidance or config recommendations, the cached answer should only become candidate context, the model still runs, fresh context is retrieved, and the cache supports the answer without deciding it.

But for high-risk areas like payments, permissions, production infrastructure, database writes, credential rotation, private user data, or security operations, semantic cache should not be used as the final answer path. Those need **fresh reasoning**, policy validation, and proper tool checks.

TTL alone is not enough. A cached answer can become wrong even if the query match is perfect.

**Example**: You cache an answer based on the current refund policy. Six months later,

**TTL says** how long the answer lives and **Version invalidation says** whether the world it depended on still exists. Both indeed matters!!

In multi-agent systems, cached answers can travel silently:

Each agent just sees content and acts on it. Nobody knows the cache did anything.

**This is how systems become fragile.**

Track it explicitly,

```
class AgentState(TypedDict):    user_request: str        # Cache trust tracking    cache_hit: bool                         # Was cache involved?    cache_entry_id: str | None              # Which cache entry?    cache_similarity: float | None          # How close was the match?    cache_revalidated: bool                 # Did we re-validate it?        # Agent decisions    router_decision: dict                   # Where did request go?    planner_output: str | None              # What plan was created?    tool_calls: list[dict]                  # What actions were taken?    final_answer: str | None                # What did user get?
```

Now downstream agents can make informed decisions. And the planner knows, **I received a cached candidate, not fresh reasoning**.

**Invisible infrastructure causes fragile systems. Make it visible.**

Every cache hit should leave a trail, I’m assuming this is the default in each systems that are designed

```
cache_event_log = {    "timestamp": "2024-01-15T10:23:45Z",    "request_id": "req_abc123",    "cache_event": {        "hit": True,        "entry_id": "legacy_cluster_rotation",        "similarity_score": 0.913,                "authorization_checks": {            "environment_match": False,  # ← Problem!        },                "decision": {            "direct_return": False,            "reason": "environment_mismatch + high_risk",            "action": "called_llm_instead"        }    }}
```

When a user reports The system gave me a very weird answer, you need to know,

**A quick alert on these patterns..**

**Juz remember, **With normal caching, a high hit rate is good news. With semantic caching, a high hit rate can mean optimization, OR it can mean your system is reusing too many conclusions. Context matters.

This is one of the prevention line that changed my thinking,

If a low-trust user can create a cache entry that affects other users, they have written into future behaviour.

A single user creates a cache entry. Later, 100 users get served that same answer. One person’s mistake (or attack) becomes everyone’s problem.

**Key practices:**

Semantic cache poisoning is related to a larger problem called **memory poisoning** in agent systems.

A semantic cache and agent memory aren’t identical, but they rhyme in a dangerous way:

Something untrusted gets stored → Later, something normal retrieves it → The system behaves differently because of it.

There’s recent work on **sleeper memory poisoning** where adversarial content causes an assistant to store a fabricated memory, and that memory re-appears later across future interactions. The scary part isn’t that the attack works immediately. **The scary part is that it can sleep inside memory and activate later when the request looks completely normal. **[Do give it a shot if you are interested to read more](https://arxiv.org/html/2605.15338v2)

Semantic cache poisoning has the same pattern:

This is what makes this class of bug so dangerous. **It doesn’t always look like the system is breaking. Sometimes it looks like the system is working faster.**

The hard truth which I kinda found was** there is no perfect solution here.**

You cannot just set a threshold and be safe!! or.. You cannot just validate with an LLM and be safe.

**What you CAN do ( Like what’s in our power to do )**

For **repeated low-risk questions**, semantic caching makes complete sense. If users ask the same public docs question all day, please do not make the model reread it like it is doing character development.

But once semantic caching enters an agentic system, the responsibility changes.

You are not just optimizing a response. You may be optimizing the path that creates plans, chooses tools, retrieves context, or triggers actions.

This is the part I kept coming back to while writing this.

> Prompt caching felt strict, almost boring. It reused computation. Same prefix, same work, reuse the internal state.

> Semantic caching is softer. It says this old answer looks close enough. And honestly, that softness is exactly why it is useful.

But also exactly why it is dangerous, because close enough for cost is not the same as close enough for trust.

So yes, I still like semantic caching. I just do not want it sitting silently in front of an agent with the power to act, pretending similarity is the same thing as permission.

The cache can help and it can suggest. But the moment tools, permissions, user data, policies, or production systems are involved, the cache should not be the final authority. That is the line I would not cross casually!!

[Cache Poisoning](https://pub.towardsai.net/cache-poisoning-6044423bf94f) was originally published in [Towards AI](https://pub.towardsai.net) on Medium, where people are continuing the conversation by highlighting and responding to this story.