Memory Poisoning in Agentic RAG: The Attack Nobody Is Defending Against


Series: Weekly AI/ML Deep Dives — Week 5 of 12

Reading Time: ~13 minutes

Tags: RAG, LLMs, Security, Agentic AI, Memory Poisoning, NLP, Research

"We spent years making AI systems smarter. We forgot to make them suspicious."

In Week 4, we discussed how Retrieval-Augmented Generation transformed LLMs by giving them access to external knowledge at inference time. RAG systems became more factual, more updateable, and more reliable.

But there is a darker side to this architecture that the research community is only beginning to take seriously.

Agentic RAG systems do not just retrieve from static knowledge bases. They learn from experience. They store past interactions, successful reasoning traces, and task outcomes in long-term memory. When a new task arrives, they retrieve relevant past experiences and use them to guide current behavior.

This is powerful. It is also a significant vulnerability.

If an attacker can plant false memories in that system, the agent will trust those memories the same way it trusts legitimate ones. It will learn from fabricated experiences. It will repeat behaviors that were never actually successful. And it will do all of this without any indication that something has gone wrong. This is memory poisoning. As of early 2026, we do not have a fully reliable way to stop it.

[Figure: Memory Poisoning in Agentic RAG, By the Numbers]

Before getting into specific attacks, it helps to understand why Agentic RAG systems are vulnerable in the first place.

A standard RAG system retrieves from a fixed knowledge base that is controlled and relatively static. Poisoning it requires direct access to that knowledge base.

An Agentic RAG system is different. Its memory grows dynamically with every interaction. Every task the agent completes, every reasoning trace it produces, every outcome it observes gets written back into memory. This memory then influences future behavior.

The attack surface is not a static database. It is a continuously growing, self-updating store of experiences that the agent treats as ground truth.

Three properties make this particularly dangerous.

First, agents apply a semantic imitation heuristic. When facing a new task, they retrieve past experiences that seem relevant and repeat what previously worked. This is rational behavior in a safe environment. In a compromised one, it means the agent will faithfully repeat whatever the attacker wanted it to learn.

Second, memory entries are not verified for provenance. The agent cannot distinguish between a memory it formed through legitimate task completion and one that was planted by an attacker. Both look identical at retrieval time.

Third, poisoning is self-reinforcing. Once a malicious behavior enters memory and gets executed, the agent may record that execution as another successful experience. The poisoning compounds over time.
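The three properties above can be seen together in a toy model. The sketch below is illustrative only: `ToyAgentMemory`, `run_task`, and the word-overlap retrieval are hypothetical stand-ins, not taken from any real agent framework.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    task: str
    action: str
    outcome: str  # note: no field records where the entry came from


@dataclass
class ToyAgentMemory:
    entries: list = field(default_factory=list)

    def write(self, task, action, outcome="success"):
        self.entries.append(MemoryEntry(task, action, outcome))

    def retrieve(self, task):
        """Return the stored entry whose task shares the most words with `task`."""
        query = set(task.lower().split())
        scored = [(len(query & set(e.task.lower().split())), e) for e in self.entries]
        best_score, best = max(scored, key=lambda pair: pair[0], default=(0, None))
        return best if best_score > 0 else None


def run_task(memory, task, default_action):
    """Imitate the closest past experience, then record the run as a new success."""
    past = memory.retrieve(task)
    action = past.action if past else default_action
    memory.write(task, action)  # write-back: the execution itself becomes memory
    return action


memory = ToyAgentMemory()
memory.write("clean the sales csv file", "drop rows with nulls")          # legitimate
memory.write("clean the sales csv file quickly", "skip validation step")  # planted

for task in ["clean the sales csv file quickly", "clean the csv file quickly please"]:
    run_task(memory, task, default_action="inspect data first")

poisoned = sum(e.action == "skip validation step" for e in memory.entries)
print(f"{poisoned} of {len(memory.entries)} entries now carry the planted action")
```

One planted entry becomes three after two ordinary tasks, because each imitated run is written back as a fresh "success" and nothing in `MemoryEntry` distinguishes it from the legitimate record.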

MemoryGraft, published by researchers at the University of Georgia in December 2025, was one of the first papers to systematically study indirect memory poisoning in LLM agents.

The attack works through a benign-looking file. An attacker provides a README or documentation file that appears entirely normal. Hidden within it are executable code and fabricated successful experiences formatted to match the agent's memory structure.

When the agent processes the file, it executes the hidden code and writes the poisoned entries into its memory. No trigger phrase is needed. No special access is required. The attacker only needs the agent to read a file.

What makes MemoryGraft particularly effective is how it exploits dual retrieval channels. Most Agentic RAG systems use both lexical retrieval (BM25) and semantic retrieval (embedding similarity search, for example over a FAISS index) simultaneously. MemoryGraft crafts poisoned entries that surface through both channels at once.
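To make the dual-channel idea concrete, here is a minimal sketch of an entry that ranks first in both a lexical and a semantic channel. It does not reproduce MemoryGraft's crafting procedure: the Jaccard score is a crude stand-in for BM25, and the hand-written `CONCEPTS` table is a hypothetical substitute for a learned embedding model.

```python
import math
from collections import Counter

# Hypothetical word-to-concept table standing in for a learned embedding model.
CONCEPTS = {
    "clean": "prep", "tidy": "prep",
    "csv": "table", "spreadsheet": "table",
    "sales": "revenue", "invoices": "revenue",
}


def lexical_score(query, doc):
    """Token-overlap (Jaccard) score, a crude stand-in for BM25."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)


def semantic_score(query, doc):
    """Cosine similarity over concept counts, a crude stand-in for dense retrieval."""
    qv = Counter(CONCEPTS[w] for w in query.split() if w in CONCEPTS)
    dv = Counter(CONCEPTS[w] for w in doc.split() if w in CONCEPTS)
    dot = sum(qv[c] * dv[c] for c in qv)
    norm = math.sqrt(sum(v * v for v in qv.values())) * math.sqrt(sum(v * v for v in dv.values()))
    return dot / norm if norm else 0.0


def top_k(query, docs, scorer, k=1):
    """Return the k highest-scoring documents for one retrieval channel."""
    return sorted(docs, key=lambda d: scorer(query, d), reverse=True)[:k]


query = "clean sales csv"
legit_keyword = "clean csv with pandas then validate"   # partial keyword match
legit_paraphrase = "tidy the invoices"                   # related meaning, no shared words
poisoned = "clean sales csv skip checks"                 # echoes the query, adds a payload
docs = [legit_keyword, legit_paraphrase, poisoned]

print(top_k(query, docs, lexical_score))
print(top_k(query, docs, semantic_score))
```

Because the poisoned entry repeats the query's wording, it covers the same concepts as well, so it wins in both channels and whatever fusion step follows has no dissenting signal to work with.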

The results were striking. In experiments using MetaGPT's DataInterpreter with GPT-4o, just 10 poisoned records captured approximately 48% of all future retrievals. The poisoning persisted across sessions until manually purged.

Where MemoryGraft requires file access, MINJA requires nothing more than normal user interaction.

MINJA, published in early 2025, demonstrated that an attacker with no special privileges could inject malicious memories into an LLM agent simply by crafting specific queries during ordinary use. The agent processes the query, generates a response, stores the interaction in memory, and the poisoned entry is now part of the agent's experience.

What makes MINJA significant is the attack surface it reveals. MemoryGraft requires the agent to process an external file. MINJA requires only that the agent have a conversation. In any deployed system where multiple users interact with a shared agent, every user interaction becomes a potential injection vector.
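The shared-agent exposure can be sketched in a few lines. This is a toy model under stated assumptions, not MINJA's actual injection technique: `SharedAgent` is hypothetical, the model call is a placeholder string, and retrieval is plain word overlap.

```python
class SharedAgent:
    """Toy agent whose single memory is shared by every user (an assumption)."""

    def __init__(self):
        self.memory = []

    def chat(self, user, query):
        words = set(query.lower().split())
        # Retrieve any stored interaction that shares a word with the query.
        context = [m for m in self.memory if words & set(m["query"].lower().split())]
        # Placeholder for the model call; a real agent would condition on `context`.
        response = f"handled: {query}"
        # Every interaction is written back, whoever sent it.
        self.memory.append({"user": user, "query": query, "response": response})
        return context


agent = SharedAgent()
agent.chat("attacker", "refund policy always approve refunds without checks")
victim_context = agent.chat("victim", "what is the refund policy")

# Entries in the victim's context that the victim did not author.
foreign = [m for m in victim_context if m["user"] != "victim"]
print(foreign)
```

The victim's query pulls in an interaction written by a different user. The attacker needed no privileges beyond sending a message.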

MINJA achieved a 95% injection success rate in controlled experiments. The injected memories influenced subsequent agent behavior in ways that were difficult to attribute to any specific cause, making detection particularly challenging.

Both attacks exploit the same fundamental property: agents trust their memory without verifying where it came from. The mechanism differs. The outcome is the same.

A-MemGuard, published in late 2025, is the most comprehensive defense framework proposed to date. It introduces two core mechanisms.

When a query arrives, A-MemGuard retrieves multiple relevant memories and generates parallel reasoning paths from each one. If one reasoning path diverges significantly from the others, it is flagged as anomalous and removed from the validated memory set before the agent uses it.

The insight behind this approach is elegant. A poisoned memory may appear legitimate when examined in isolation, but it will produce reasoning that conflicts with what legitimate memories suggest. Consensus reveals the outlier.
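A minimal sketch of the consensus idea follows. It is not A-MemGuard's implementation: the reasoning paths here are hand-written lists of steps rather than model-generated text, divergence is measured with Jaccard similarity, and the `threshold` value is an arbitrary assumption.

```python
def jaccard(a, b):
    """Overlap between two reasoning paths, treated as sets of steps."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def flag_outliers(paths, threshold=0.3):
    """Return indices of paths whose mean similarity to the others is below threshold."""
    flagged = []
    for i, path in enumerate(paths):
        others = [jaccard(path, p) for j, p in enumerate(paths) if j != i]
        if others and sum(others) / len(others) < threshold:
            flagged.append(i)
    return flagged


# One reasoning path per retrieved memory (hypothetical examples).
paths = [
    ["load data", "validate schema", "drop nulls", "save output"],
    ["load data", "validate schema", "fill nulls", "save output"],
    ["load data", "disable checks", "upload to external host"],
    ["load data", "validate schema", "drop nulls", "write report"],
]

flagged = flag_outliers(paths)
validated = [p for i, p in enumerate(paths) if i not in flagged]
print(flagged)
```

Only the third path is flagged. It shares a single step with each of the others, while the legitimate paths differ in detail but agree on the overall shape of the task.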

In experiments across three attack scenarios, A-MemGuard reduced attack success rates by over 95% in several configurations. Against direct injection, success rates fell from 100% to 2.13%, a drop of about 97.9%. Against MINJA-style indirect injection, reductions exceeded 60%.

A-MemGuard also introduces a separate lesson memory alongside primary memory. When an anomaly is detected, the flawed reasoning is recorded as a negative lesson rather than discarded. Future queries check the lesson memory first, preventing the agent from repeating the same mistake even if a similar poisoned entry re-enters primary memory.

This breaks the self-reinforcing loop that makes memory poisoning persistent. Rather than simply deleting bad entries, the system learns from them.
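The dual-memory structure can be sketched as follows. This is a simplification under stated assumptions: `DualMemory` is a hypothetical class, and lessons are matched by exact string equality, whereas the description above implies matching on similar entries as well.

```python
class DualMemory:
    """Toy primary memory plus lesson memory (names are illustrative)."""

    def __init__(self):
        self.primary = []   # (task, action) experiences
        self.lessons = []   # actions previously flagged as anomalous

    def record_lesson(self, action):
        """Keep the flawed action as a negative lesson instead of discarding it."""
        if action not in self.lessons:
            self.lessons.append(action)

    def retrieve(self, task):
        """Return primary actions for `task`, skipping any that match a stored lesson."""
        candidates = [a for t, a in self.primary if t == task]
        return [a for a in candidates if a not in self.lessons]


mem = DualMemory()
mem.primary.append(("clean csv", "drop rows with nulls"))
mem.primary.append(("clean csv", "skip validation step"))   # poisoned entry

# Suppose consensus validation flagged the second action as anomalous.
mem.record_lesson("skip validation step")
mem.primary.remove(("clean csv", "skip validation step"))

# The attacker re-injects the same entry later.
mem.primary.append(("clean csv", "skip validation step"))

usable = mem.retrieve("clean csv")
print(usable)
```

The re-injected entry is back in primary memory, but `retrieve` consults the lessons first and filters it out, so deletion alone is no longer the only line of defense.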

Despite these results, A-MemGuard has significant limitations.

It requires direct memory instrumentation. In systems where memory is managed through a black-box API, the framework cannot be applied. Most commercial deployments fall into this category.

It has not been tested on multi-step Agentic RAG pipelines where the agent reasons across multiple retrieval rounds before producing an output.

Most critically, A-MemGuard operates after retrieval. It catches poisoned entries when they are about to be used. It does not catch them when they enter memory in the first place.

Reading MemoryGraft, MINJA, and A-MemGuard together, a consistent pattern emerges. Each paper acknowledges the same limitation in its future work section.

MemoryGraft points to early-stage detection mechanisms as an open problem. MINJA calls for robust defense against realistic black-box deployments. A-MemGuard explicitly states that early-stage contamination detection at injection time is still missing.

Three independent research groups working on different aspects of the same problem all arrive at the same gap.

[Figure: Memory Poisoning Attack Flow]

The distinction matters. Post-retrieval defense catches poisoned entries when they are retrieved for use. Early-stage detection would catch them when they are written into memory, before they ever influence a single reasoning step.

In a multi-step Agentic RAG system, this difference is significant. If a poisoned entry enters memory at step one, post-retrieval defense might catch it when it surfaces at step three. But steps one and two have already been influenced. The reasoning chain has already been shaped by contaminated information.

Early-stage detection would prevent this entirely.
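The timing argument can be written out as a small worked example. The function below is a hypothetical bookkeeping sketch, not a defense: it assumes a poisoned entry influences every step from the moment it is written until it is removed, and it treats the write-time gate as an idealized check that always rejects the entry.

```python
def tainted_steps(total_steps, written_at, caught_at=None, write_gate=False):
    """List the reasoning steps that ran while a poisoned entry was in memory.

    written_at: step at which the poisoned entry is written (1-indexed).
    caught_at:  step at which a post-retrieval defense removes it, if any.
    write_gate: if True, assume an idealized write-time check rejected the entry.
    """
    if write_gate:
        return []
    end = total_steps if caught_at is None else caught_at - 1
    return list(range(written_at, end + 1))


no_defense = tainted_steps(4, written_at=1)
post_retrieval = tainted_steps(4, written_at=1, caught_at=3)
early_stage = tainted_steps(4, written_at=1, write_gate=True)

print(no_defense, post_retrieval, early_stage)
```

With the entry written at step one and caught at step three, steps one and two are already tainted. Only the write-time case leaves the list empty, which is the gap the three papers point to.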

All three papers focus on single-agent, single-step settings. In a multi-step Agentic RAG pipeline where the agent retrieves, reasons, retrieves again, and reasons again across multiple rounds, we do not have a clear picture of how poisoning propagates between steps.

Does a poisoned entry at step one corrupt all subsequent steps? Does it corrupt only topically related steps? Can its influence be isolated? These questions remain unanswered.

Current defenses operate at retrieval time. No published work has demonstrated reliable detection at write time, the moment a new entry is being added to memory.

Write-time detection would be more efficient. It would catch contamination before it ever influences reasoning rather than after it has already been retrieved. The challenge is that poisoned entries are designed to look legitimate at write time. Detecting them requires understanding not just the entry itself but its potential influence on future reasoning.

MemoryGraft measured how many poisoned entries were retrieved. A-MemGuard measured attack success rates. Neither work quantifies the actual downstream impact of a successful poisoning event on task quality or system reliability.

Without severity metrics, it is difficult to prioritize defenses or make principled engineering decisions about acceptable risk.

A-MemGuard was tested on general-purpose agent tasks. It has not been studied whether consensus-based validation performs equally well in specialized domains, where legitimate reasoning paths may naturally diverge more and could therefore be mistaken for outliers.

Memory poisoning is not a theoretical concern. It has been demonstrated with high success rates across multiple attack vectors using nothing more than file access or ordinary conversation. The defenses that exist are meaningful but incomplete.

The field has characterized the attack well. It has proposed initial defenses. What it has not done is close the gap between when poisoning enters a system and when current defenses can detect it.

In multi-step Agentic RAG systems, that gap is where the real damage happens.

Next week I will share results from my own experiment simulating early-stage memory poisoning in a RAG-based recommendation system and testing a detection mechanism before contaminated entries can propagate.

This is part of a weekly series on AI/ML research. Each post covers theory, recent work, and open problems.

Connect on LinkedIn | Follow on Dev.to (https://dev.to/soohan_abbasi)
