# Designing Persistent LLM Agent Memory on Elasticsearch

> Source: <https://www.devclubhouse.com/a/designing-persistent-llm-agent-memory-on-elasticsearch>
> Published: 2026-06-18 17:04:07+00:00


How a three-index architecture with hybrid retrieval and document-level security achieves 0.89 recall without tenant leaks.

[Mariana Souza](https://www.devclubhouse.com/u/mariana_souza)

Context windows are excellent scratchpads. They provide the active reasoning space required for a single inference session. However, relying on massive context windows to store long-term user history is an architectural anti-pattern. Beyond the obvious issues of latency and API costs, models frequently suffer from the "lost in the middle" effect, ignoring critical facts buried deep within a bloated prompt.

To build truly autonomous assistants, engineers need a persistent, long-term memory layer that survives session boundaries, scales across years of interaction, and retrieves relevant context efficiently. The multi-tenant agent memory layer described here is built on [Elasticsearch](https://www.elastic.co) and reaches a Recall@10 (R@10) of 0.89 across evaluation datasets while enforcing strict tenant isolation.

Here is a look at the architectural patterns, retrieval pipelines, and state-management strategies that make this level of performance possible.

## The Cognitive Blueprint: Episodic, Semantic, and Procedural Memory

Storing every raw user interaction in a single database index quickly turns your memory layer into an unmanageable haystack. Instead, a robust agent memory system should mirror principles from cognitive psychology, specifically the CoALA (Cognitive Architectures for Language Agents) framework, by splitting memory into three distinct Elasticsearch indices, each with its own lifecycle, write rate, and aging rules.

### 1. Episodic Memory

This index captures time-stamped events, logging each user turn exactly as it lands before any extraction or interpretation occurs. While much of this transactional history is short-lived and eventually decays, select entries serve as the foundational evidence used to construct more durable memories later.

### 2. Semantic Memory

Semantic memory stores distilled, stable assertions about the user and their environment (e.g., "User owns a Lumio Hub v2" or "Hub was reset in March"). These facts survive across sessions and serve as the primary grounding truth for the agent. Unlike episodic memory, semantic memory must be actively curated, deduped, and updated.

### 3. Procedural Memory

Procedural memory houses multi-step playbooks and operational processes (e.g., "How to troubleshoot Zigbee disconnects"). Each playbook document tracks performance metrics, specifically `success_count` and `failure_count`. When a user confirms whether a suggested fix worked, a consolidation loop increments these counters, providing critical context for an LLM to refine or replace the playbook over time.
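As a rough illustration of that feedback step, the sketch below builds the body of a scripted partial update that bumps one of the two counters. Only `success_count` and `failure_count` come from the article; the helper names and the idea of deriving a success rate from the counters are assumptions made for the example.

```python
from typing import Optional


def feedback_update(worked: bool) -> dict:
    """Build an update body that increments one playbook counter.

    The body is meant for Elasticsearch's update API, which accepts a
    Painless script. The field is chosen from two fixed names, so no
    user input ever reaches the script source.
    """
    field = "success_count" if worked else "failure_count"
    return {"script": {"lang": "painless", "source": f"ctx._source.{field} += 1"}}


def success_rate(success_count: int, failure_count: int) -> Optional[float]:
    """Hypothetical summary signal: fraction of confirmed fixes that worked.

    Returns None when the playbook has no feedback yet.
    """
    total = success_count + failure_count
    return success_count / total if total else None
```

A consolidation job could pass the resulting rate to the LLM alongside the playbook text when deciding whether a playbook needs revising.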

By separating these concerns into three indices, you avoid coupling write-heavy event logging with highly curated semantic facts, allowing each index to scale independently.
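A minimal sketch of what the three index definitions might look like, expressed as plain Python dictionaries. The index names and every field other than `success_count` and `failure_count` are hypothetical; the article does not publish its mappings.

```python
# Hypothetical index names and fields, for illustration only.
MEMORY_INDICES = {
    "memory-episodic": {
        "mappings": {"properties": {
            "tenant_id": {"type": "keyword"},
            "timestamp": {"type": "date"},
            "raw_turn": {"type": "text"},       # user turn stored as received
        }}
    },
    "memory-semantic": {
        "mappings": {"properties": {
            "tenant_id": {"type": "keyword"},
            "fact": {"type": "text"},           # distilled assertion
            "superseded_by": {"type": "keyword"},
            "last_accessed": {"type": "date"},
        }}
    },
    "memory-procedural": {
        "mappings": {"properties": {
            "tenant_id": {"type": "keyword"},
            "title": {"type": "text"},
            "steps": {"type": "text"},
            "success_count": {"type": "integer"},
            "failure_count": {"type": "integer"},
        }}
    },
}


def index_requests() -> list:
    """Return (index_name, body) pairs ready to pass to an index-creation call."""
    return sorted(MEMORY_INDICES.items())
```

Keeping the definitions separate like this makes it straightforward to attach a different retention policy to each index.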


## The Two-Stage Hybrid Recall Pipeline

Retrieving the right memory at the right time requires handling both exact temporal queries (e.g., "What did we try last time?") and conceptual queries (e.g., "Why are my smart bulbs only showing white?"). To solve this, the architecture implements a two-stage hybrid retrieval pipeline.

```
[ User Query ] 
      │
      ├──► BM25 (Lexical Search) ──────────┐
      │                                    ▼
      └──► Jina v5 (Dense Vector) ───► [ RRF Merging ] ───► [ Cross-Encoder Reranker ] ───► [ Agent Context ]
```

### Stage 1: Hybrid Retrieval with RRF

First-stage retrieval runs a hybrid query combining lexical search (BM25) and dense vector search, merged via Reciprocal Rank Fusion (RRF).
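To make the fusion step concrete, here is a small standalone sketch of Reciprocal Rank Fusion over two ranked lists of document IDs. Elasticsearch performs this merge server-side; the function below only illustrates the arithmetic. The constant `k=60` is the value commonly used for RRF, and the document IDs are made up.

```python
from typing import Dict, List, Sequence


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 10) -> List[str]:
    """Fuse several ranked lists: each hit contributes 1 / (k + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_hits = ["a", "b", "c"]    # lexical leg
dense_hits = ["b", "d", "a"]   # vector leg
merged = rrf_merge([bm25_hits, dense_hits])
```

Because RRF uses only ranks, the two legs never need their raw scores normalized against each other, which is why it pairs well with BM25 and vector similarity.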

To keep the storage footprint flat and avoid complex synchronization, the system indexes incoming text twice from a single write. By using Elasticsearch's `copy_to` mapping directive, raw text lands in the standard BM25 inverted index while simultaneously routing to a `semantic_text` field. This field automatically generates dense vectors using [Jina AI](https://jina.ai) embeddings (specifically Jina v5), ensuring that a single source-of-truth write powers both retrieval legs.
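The mapping and query below sketch how the single-write, two-leg setup could be expressed. The field names `content` and `content_semantic` and the inference endpoint ID `jina-embeddings` are placeholders, not values from the article; the endpoint would have to be created separately before the mapping is usable.

```python
# Hypothetical field names and inference endpoint ID.
MAPPING = {
    "mappings": {"properties": {
        # Lexical leg: analyzed text, also copied into the semantic field.
        "content": {"type": "text", "copy_to": "content_semantic"},
        # Dense leg: embeddings are generated at index time by the endpoint.
        "content_semantic": {"type": "semantic_text", "inference_id": "jina-embeddings"},
    }}
}


def hybrid_query(user_query: str, window: int = 50) -> dict:
    """Build a search body that fuses a BM25 match and a semantic query with RRF."""
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    {"standard": {"query": {"match": {"content": user_query}}}},
                    {"standard": {"query": {
                        "semantic": {"field": "content_semantic", "query": user_query}
                    }}},
                ],
                "rank_window_size": window,
            }
        }
    }
```

Writers only ever populate `content`, so the two retrieval legs should not drift out of sync.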

### Stage 2: Cross-Encoder Reranking

Once RRF merges the candidates from the lexical and vector searches, a cross-encoder reranker evaluates the top results. This second-stage validation ensures that the most contextually relevant memories are prioritized before being injected into the LLM's prompt, filtering out near-misses that passed initial retrieval.
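The reranking step can be sketched independently of any particular model. In the example below, `score_pair` stands in for a cross-encoder that scores a (query, passage) pair jointly; the token-overlap scorer is a toy substitute used only so the example runs without a model, and is not what the article's pipeline uses.

```python
from typing import Callable, List, Sequence


def rerank(query: str, candidates: Sequence[str],
           score_pair: Callable[[str, str], float], top_k: int = 5) -> List[str]:
    """Score every (query, candidate) pair and keep the top_k candidates."""
    scored = [(score_pair(query, text), text) for text in candidates]
    # Stable sort: candidates with equal scores keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]


def toy_overlap_score(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: count of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(passage.lower().split())))


candidates = [
    "Hub was reset in March",
    "Bulbs only showing white after firmware update",
    "User owns a Lumio Hub v2",
]
top = rerank("bulbs showing white", candidates, toy_overlap_score, top_k=2)
```

Swapping in a real cross-encoder only requires providing a different `score_pair` callable.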

## Managing State: Supersession, Decay, and Security

An agent memory layer cannot simply be an append-only log. It must handle contradictions, account for the passage of time, and guarantee multi-tenant security.

### Supersession over Deletion

When a user contradicts a previously stored fact (e.g., upgrading from iOS 17.4 to iOS 18), the system should not delete the old record. Instead, it uses a *supersession* pattern. The old memory is marked as superseded by the new entry, preserving a clean audit trail of past states while ensuring only the active truth is surfaced during standard recall.
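A minimal in-memory sketch of the supersession pattern, assuming each fact carries a `key` identifying what it describes and a `superseded_by` pointer. Both field names and the `Fact` type are hypothetical; in the real system these would be fields on documents in the semantic index.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Fact:
    fact_id: str
    key: str                      # what the fact is about, e.g. "ios_version"
    value: str
    superseded_by: Optional[str] = None


def supersede(store: Dict[str, Fact], new_fact: Fact) -> Dict[str, Fact]:
    """Add new_fact and point any active fact with the same key at it."""
    updated: Dict[str, Fact] = {}
    for fact_id, fact in store.items():
        if fact.key == new_fact.key and fact.superseded_by is None:
            fact = replace(fact, superseded_by=new_fact.fact_id)
        updated[fact_id] = fact
    updated[new_fact.fact_id] = new_fact
    return updated


def active_facts(store: Dict[str, Fact]) -> List[Fact]:
    """Facts eligible for standard recall: those nothing has superseded."""
    return [f for f in store.values() if f.superseded_by is None]


store = {"f1": Fact("f1", "ios_version", "17.4")}
store = supersede(store, Fact("f2", "ios_version", "18"))
```

The old record stays queryable for audits, while standard recall filters on the absence of `superseded_by`.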

### Temporal Decay

To prevent old, irrelevant facts from outranking fresh information, the retrieval query applies a decay function. However, this decay must be balanced: facts that the user interacts with or references frequently should have their relevance boosted, preventing highly active, older memories from sinking to the bottom of the results.
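The article does not give its decay formula, so the following is one plausible shape rather than the author's method: an exponential half-life on age, multiplied by a logarithmic boost for how often the memory has been accessed. The parameter names and default values are assumptions.

```python
import math


def decayed_score(base_score: float, age_days: float, access_count: int,
                  half_life_days: float = 30.0, access_weight: float = 0.5) -> float:
    """Down-weight old memories, but boost ones that are referenced often."""
    # Recency halves every half_life_days.
    recency = 0.5 ** (age_days / half_life_days)
    # log1p keeps the boost from growing without bound for very hot memories.
    usage_boost = 1.0 + access_weight * math.log1p(access_count)
    return base_score * recency * usage_boost


stale = decayed_score(1.0, age_days=90, access_count=0)
stale_but_active = decayed_score(1.0, age_days=90, access_count=50)
```

With this shape, an old fact that the user keeps referencing scores higher than an equally old fact nobody has touched, which is the balance the paragraph above describes.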

### Document-Level Security (DLS)

In multi-tenant enterprise applications, preventing cross-tenant data leaks is paramount. Rather than relying on application-level filtering—which is prone to developer error—this architecture enforces isolation at the database layer using Elasticsearch’s Document-Level Security (DLS). Each memory document is tagged with a tenant identifier, and the database restricts query execution to the authenticated user's scope. During evaluations across 168 complex test queries, this strict boundary achieved zero cross-tenant leaks.
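As an illustration, a role with a templated DLS query might be defined along these lines. The index pattern, the `tenant_id` field, and the assumption that each user's tenant is stored in their user metadata are all hypothetical choices for this sketch, which covers only the read path.

```python
def tenant_read_role(index_pattern: str = "memory-*") -> dict:
    """Build a role body whose DLS query pins reads to the caller's tenant.

    The Mustache placeholder is resolved by Elasticsearch per request from
    the authenticated user's metadata, so application code never supplies
    the tenant filter itself.
    """
    return {
        "indices": [{
            "names": [index_pattern],
            "privileges": ["read"],
            "query": {
                "template": {
                    "source": {
                        "term": {"tenant_id": "{{_user.metadata.tenant_id}}"}
                    }
                }
            },
        }]
    }


role_body = tenant_read_role()
```

A role like this would be registered once through the security API and assigned to every tenant-scoped user or API key.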

## Future-Proofing with MCP

To ensure this memory layer remains highly modular, the entire system can be exposed via the Model Context Protocol (MCP). By decoupling the memory store from any single agent framework, any MCP-compliant client can seamlessly read and write to the episodic, semantic, and procedural indices. This establishes a clean, reusable infrastructure pattern ready to support the next generation of production-grade agentic workflows.

