Designing Persistent LLM Agent Memory on Elasticsearch

Mariana Souza describes a three-index Elasticsearch architecture for persistent LLM agent memory that achieves 0.89 recall without tenant leaks. The system separates episodic, semantic, and procedural memory into distinct indices and uses a two-stage hybrid retrieval pipeline with BM25, dense vectors, and cross-encoder reranking.

How a three-index architecture with hybrid retrieval and document-level security achieves 0.89 recall without tenant leaks.

Mariana Souza

Context windows are excellent scratchpads. They provide the active reasoning space required for a single inference session. However, relying on massive context windows to store long-term user history is an architectural anti-pattern. Beyond the obvious issues of latency and API costs, models frequently suffer from the "lost in the middle" effect, ignoring critical facts buried deep within a bloated prompt.

To build truly autonomous assistants, engineers need a persistent, long-term memory layer that survives session boundaries, scales across years of interaction, and retrieves relevant context efficiently. By using Elasticsearch (https://www.elastic.co) to build a multi-tenant agent memory layer, development teams can reach a Recall@10 (R@10) of 0.89 across evaluation datasets while enforcing strict tenant isolation. Here is a look at the architectural patterns, retrieval pipelines, and state-management strategies that make this level of performance possible.

The Cognitive Blueprint: Episodic, Semantic, and Procedural Memory

Storing every raw user interaction in a single index quickly turns your memory layer into an unmanageable haystack. Instead, a robust agent memory system should mirror cognitive psychology principles, specifically the CoALA framework, by splitting memory into three distinct Elasticsearch indices, each with its own lifecycle, write rate, and aging rules.

1. Episodic Memory

This index captures time-stamped events, logging each user turn exactly as it lands before any extraction or interpretation occurs.
While much of this transactional history is short-lived and eventually decays, select entries serve as the foundational evidence used to construct more durable memories later.

2. Semantic Memory

Semantic memory stores distilled, stable assertions about the user and their environment (e.g., "User owns a Lumio Hub v2" or "Hub was reset in March"). These facts survive across sessions and serve as the agent's primary ground truth. Unlike episodic memory, semantic memory must be actively curated, deduplicated, and updated.

3. Procedural Memory

Procedural memory houses multi-step playbooks and operational processes (e.g., "How to troubleshoot Zigbee disconnects"). Each playbook document tracks performance metrics, specifically a success count and a failure count. When a user confirms whether a suggested fix worked, a consolidation loop increments these counters, giving an LLM the evidence it needs to refine or replace the playbook over time.

By separating these concerns into three indices, you avoid coupling write-heavy event logging with highly curated semantic facts, allowing each index to scale independently.

The Two-Stage Hybrid Recall Pipeline

Retrieving the right memory at the right time requires handling both exact temporal queries (e.g., "What did we try last time?") and conceptual queries (e.g., "Why are my smart bulbs only showing white?"). To solve this, the architecture implements a two-stage hybrid retrieval pipeline.

    User Query
    │
    ├──► BM25 Lexical Search ──────────┐
    │                                  ▼
    └──► Jina v5 Dense Vector ───► RRF Merging ───► Cross-Encoder Reranker ───► Agent Context

Stage 1: Hybrid Retrieval with RRF

First-stage retrieval runs a hybrid query combining lexical search (BM25) and dense vector search, merged via Reciprocal Rank Fusion (RRF).
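As a rough illustration of the merging step, RRF scores each document by summing 1 / (k + rank) over every ranked list in which it appears. The sketch below is a standalone Python version of that formula, not the article's actual query: the function name rrf_merge, the memory IDs, and the constant k = 60 (a commonly used value) are all assumptions made for the example.

```python
from collections import defaultdict


def rrf_merge(rankings, k=60):
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1. The constant k is an assumed value that damps
    the influence of top-ranked items.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrieval legs.
bm25_hits = ["m3", "m1", "m7"]
vector_hits = ["m1", "m9", "m3"]

merged = rrf_merge([bm25_hits, vector_hits])
print(merged)  # ['m1', 'm3', 'm9', 'm7']
```

Because only ranks are used, the BM25 and vector scores never need to be normalized against each other: documents found by both legs (m1 and m3 here) rise above documents found by only one.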
To keep the storage footprint flat and avoid complex synchronization, the system indexes incoming text twice from a single write. Using Elasticsearch's copy_to mapping parameter, raw text lands in the standard BM25 inverted index while simultaneously routing to a semantic_text field. This field automatically generates dense vectors using Jina AI (https://jina.ai) embeddings, specifically Jina v5, so a single source-of-truth write powers both retrieval legs.

Stage 2: Cross-Encoder Reranking

Once RRF merges the candidates from the lexical and vector searches, a cross-encoder reranker rescores the top results. This second stage ensures that the most contextually relevant memories are prioritized before being injected into the LLM's prompt, filtering out near-misses that passed initial retrieval.

Managing State: Supersession, Decay, and Security

An agent memory layer cannot simply be an append-only log. It must handle contradictions, account for the passage of time, and guarantee multi-tenant security.

Supersession over Deletion

When a user contradicts a previously stored fact (e.g., upgrading from iOS 17.4 to iOS 18), the system should not delete the old record. Instead, it uses a supersession pattern: the old memory is marked as superseded by the new entry, preserving a clean audit trail of past states while ensuring only the active truth is surfaced during standard recall.

Temporal Decay

To prevent old, irrelevant facts from outranking fresh information, the retrieval query applies a decay function. However, this decay must be balanced: facts that the user interacts with or references frequently should have their relevance boosted, preventing highly active, older memories from sinking to the bottom of the results.

Document-Level Security (DLS)

In multi-tenant enterprise applications, preventing cross-tenant data leaks is paramount.
Rather than relying on application-level filtering, which is prone to developer error, this architecture enforces isolation at the database layer using Elasticsearch's Document-Level Security (DLS). Each memory document is tagged with a tenant identifier, and the database restricts query execution to the authenticated user's scope. During evaluations across 168 complex test queries, this strict boundary achieved zero cross-tenant leaks.

Future-Proofing with MCP

To keep this memory layer modular, the entire system can be exposed via the Model Context Protocol (MCP). Because the memory store is decoupled from any single agent framework, any MCP-compliant client can read and write to the episodic, semantic, and procedural indices. This establishes a clean, reusable infrastructure pattern for production agentic workflows.
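To make the supersession and temporal decay rules from the state-management section more concrete, here is a minimal in-memory sketch in Python. The article does not specify the decay function or the boost formula, so everything here is an assumption for illustration: the Memory class, the 30-day half-life, the logarithmic access boost, and the base relevance scores are hypothetical, and a real deployment would express this logic inside the Elasticsearch query rather than in application code.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class Memory:
    memory_id: str
    text: str
    age_days: float
    access_count: int = 0                # how often the user has referenced it
    superseded_by: Optional[str] = None  # set instead of deleting the record


def supersede(old: Memory, new: Memory) -> None:
    """Mark the old fact as replaced while keeping it for the audit trail."""
    old.superseded_by = new.memory_id


def recall_score(base_score: float, memory: Memory,
                 half_life_days: float = 30.0) -> float:
    """Apply an assumed exponential decay, offset by an assumed usage boost."""
    decay = 0.5 ** (memory.age_days / half_life_days)
    boost = 1.0 + math.log1p(memory.access_count)
    return base_score * decay * boost


def rank_active(candidates):
    """Drop superseded memories, then order the rest by adjusted score."""
    active = [(m, recall_score(s, m)) for m, s in candidates
              if m.superseded_by is None]
    active.sort(key=lambda pair: pair[1], reverse=True)
    return [m.memory_id for m, _ in active]


old_os = Memory("f1", "User runs iOS 17.4", age_days=90)
new_os = Memory("f2", "User runs iOS 18", age_days=1)
hub = Memory("f3", "User owns a Lumio Hub v2", age_days=60, access_count=5)
supersede(old_os, new_os)

# Base scores stand in for first-stage relevance; the values are made up.
ranked = rank_active([(old_os, 0.9), (new_os, 0.8), (hub, 0.8)])
print(ranked)  # ['f2', 'f3']
```

The superseded iOS 17.4 record never reaches the agent even though its base score is the highest, yet it stays in storage. The frequently referenced hub fact still ranks despite being two months old, which is the balance the decay rule is meant to strike.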