Building agent memory on Elasticsearch

Three indices, hybrid recall with a reranker, supersession, decay, and DLS: the architecture and the numbers behind a persistent memory layer for agents.

Sarah's smart bulbs are only showing white. Her smart-home assistant suggests resetting the hub. She did that in March, and again last week; neither reset fixed anything. The agent doesn't know that, and it doesn't know about the dog chewing through her sensor cables either. The history that mattered (what worked, what didn't, and who Sarah is) ended with each session.

The standard workaround is to stuff prior context into the context window. That breaks down on cost, on latency, and on the well-documented "lost in the middle" (https://arxiv.org/pdf/2307.03172) effect, where models ignore facts placed far from the prompt's edges. A 1M-token context window is a scratchpad. It is not a memory system.

The context window is short-term memory: the active reasoning space for a single inference (https://www.elastic.co/docs/explore-analyze/elastic-inference).
What is missing is long-term memory: a persistent store that survives session end, scales to years of interaction, and lets you retrieve facts by content, by time, and by user.

This post is about the architecture of a real agent memory system, built on Elasticsearch (https://elastic.co/elasticsearch) and structured around three categories from cognitive science (https://alicekim.ca/17.CanPsy85.pdf), one hybrid recall query with RRF (https://www.elastic.co/search-labs/blog/weighted-reciprocal-rank-fusion-rrf) and a cross-encoder reranker, supersession for contradictions, and per-user document-level security (DLS) isolation. On a QA-style eval over 168 questions, recall@10 (R@10) averages 0.89 with zero cross-tenant leaks. The full implementation is on GitHub (https://github.com/noamschwartz/atlas-memory-demo); this post is about why it is shaped the way it is.

What an agent memory store has to do

A user asks "what fix did we try last time?", a temporal query with an exact-match constraint. Or "Why are my smart bulbs only showing white?", which needs personal memory blended with a shared catalog. Memory itself doesn't behave uniformly: events the user lived, stable facts about them, and step-by-step playbooks all have different write rates and aging rules, so the store has to recognize the type and treat each accordingly. And in any multi-user deployment, each user's memory has to stay invisible to every other user.

Fresh events accumulate fast enough that they have to be consolidated into the durable kinds, or the index turns into a haystack. When a user contradicts a recalled fact, the old version has to be superseded rather than deleted, so the audit trail stays. Older facts shouldn't outrank fresh ones, and facts the user touches often shouldn't sink. And the whole memory layer should be reachable by any MCP-speaking client (https://www.elastic.co/search-labs/blog/mcp-current-state), not tied to one agent runtime.
Splitting these across a vector store, a keyword engine, an audit layer, and a separate auth service means four things that can break and extra round-trips on every recall. The requirements describe a search engine, so this implementation uses one. The rest of this post walks through each requirement.

Three types of agent memory: episodic, semantic, procedural

The first design decision is what categories of memory to store at all. Just saving everything builds a haystack with no signal. The cognitive-psychology split between episodic, semantic, and procedural memory (https://alicekim.ca/17.CanPsy85.pdf), surfaced for LLM (https://www.elastic.co/what-is/large-language-models) agents in the CoALA framing (https://arxiv.org/pdf/2309.02427), already has the right categories, and they map cleanly onto three Elasticsearch indices.

Episodic memory. Time-stamped events: each user turn as it lands, before any extraction or interpretation. Most of it is short-lived and not always worth keeping. A few entries become evidence for durable facts later.

Semantic memory. Distilled, stable assertions about the user. "Sarah owns a Lumio Hub v2." "Sarah is on iOS 17.4." "Sarah's hub was reset in March." These survive across sessions and are what the agent grounds its answers in.

Procedural memory. Multi-step playbooks, such as how to troubleshoot Zigbee disconnects. Processes, not facts. Each carries success_count and failure_count, incremented by consolidation when the user confirms a fix worked or didn't. The counters are surfaced to the consolidation LLM as context when it considers whether to refine or replace a playbook.

Each category has a different lifecycle. Episodic is written constantly and decays. Semantic is curated, deduped, and superseded as the user changes. Procedural accumulates outcome feedback (success_count, failure_count) that feeds consolidation. One bucket cannot model that.
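As a rough illustration of why the three categories want separate shapes, the sketch below models them as Python record types. Only success_count and failure_count come from the post; every other class and field name here (EpisodicEvent, trigger, steps, record_outcome, and so on) is a hypothetical stand-in, not the repository's actual mapping.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class EpisodicEvent:
    """One raw user turn, stored as it lands (field names are assumptions)."""
    user_id: str
    message: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass
class SemanticFact:
    """A distilled, stable assertion about the user."""
    user_id: str
    text: str
    supporting_episode_ids: list[str] = field(default_factory=list)


@dataclass
class ProceduralPlaybook:
    """A multi-step process that accumulates outcome feedback."""
    trigger: str
    steps: list[str]
    success_count: int = 0
    failure_count: int = 0

    def record_outcome(self, worked: bool) -> None:
        # Consolidation bumps exactly one counter per confirmed outcome.
        if worked:
            self.success_count += 1
        else:
            self.failure_count += 1
```

Only the procedural record carries counters, and only the episodic record is defined by its timestamp, which is one way to see that a single shared schema would leave most fields unused for most documents.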
Three indices, one per memory type, let each follow its own write rate, its own aging rules, and its own update rules without coupling them.

Alongside these three sits a fourth retrieval surface: world data already in Elasticsearch (catalog, knowledge base). It is not "memory" in the cognitive sense, but the agent reads it through the same hybrid-retrieval pipeline, covered in the next section, so it belongs in the same picture.

The recall pipeline: hybrid retrieval with RRF and a reranker

Memory is recalled with a two-stage hybrid search (https://www.elastic.co/search-labs/blog/hybrid-search-elasticsearch): RRF over BM25 + Jina v5 dense, then a cross-encoder reranker on the merged candidates.

Each document is indexed two ways from one write: the raw text lands in the BM25 inverted index, and copy_to routes the same value into a semantic_text field that auto-generates Jina v5 vectors. The content is indexed twice, but the application writes it only once: one source-of-truth write produces both retrieval legs.

Each leg solves a different problem. BM25 anchors literal-token matches that an agent paraphrase would dissolve: version numbers, error codes, proper nouns like "Lumio Hub v2." Dense vectors catch the semantic shape of a question whose answer uses different words. Either leg alone misses cases that the other handles, and RRF fuses their rankings without having to calibrate BM25 scores against cosine similarities. See the index mapping (https://github.com/noamschwartz/atlas-memory-demo/tree/main/backend/app/atlas/memory/mappings).

Over-fetch. A reranker can only re-order what it sees, so the candidate pool needs to be wide. The hybrid retriever fetches 80 candidates per leg and RRF-fuses with rank_constant=30 (tighter than the ES default of 60, so top-ranked items dominate more); see rrf_fetch (https://github.com/noamschwartz/atlas-memory-demo/blob/main/backend/app/atlas/memory/operations.py#L497).

Reranker. A Jina v2 cross-encoder scores the merged candidates against the user query.
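The fusion step in the over-fetch stage can be sketched in a few lines of plain Python. This is an illustration of the standard RRF formula (each leg contributes 1 / (rank_constant + rank) per document, with 1-based ranks), not the repository's rrf_fetch and not a call into Elasticsearch. The function name rrf_fuse and the document IDs are made up for the example.

```python
from collections import defaultdict


def rrf_fuse(
    rankings: list[list[str]],
    rank_constant: int = 30,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the best hit in a leg contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_k]


# Hypothetical top hits from the two legs (BM25 and dense).
bm25_leg = ["hub-v2", "reset-march", "ios-17"]
dense_leg = ["reset-march", "bulbs-white", "hub-v2"]
fused = rrf_fuse([bm25_leg, dense_leg])
```

Because only ranks enter the formula, BM25 scores and cosine similarities never need to be put on a common scale. With rank_constant=30, a rank-1 hit contributes 1/31 against 1/40 for rank 10 (a ratio of about 1.29); with 60 the same pair is 1/61 against 1/70 (about 1.15), which is the sense in which the smaller constant lets top-ranked items dominate more.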
Where BM25 and the bi-encoder dense leg both score query and document independently, a cross-encoder scores them jointly, with full attention across the pair, which is a stronger relevance signal at a higher per-pair cost. That's what motivates the two-stage pipeline: over-fetch cheaply with the hybrid retriever, then rerank a small candidate pool with the more expensive scorer; see rerank (https://github.com/noamschwartz/atlas-memory-demo/blob/main/backend/app/atlas/memory/operations.py#L381).

One subtlety, shown in the diagram above. The agent's toolkit includes recall_memory, defined in tools.py (https://github.com/noamschwartz/atlas-memory-demo/blob/main/backend/app/atlas/tools.py#L25), which the model calls during a turn. A single call fans out across all three memory indices and the catalog at once: the agent doesn't pick a memory type, because the retriever's ranking and per-index decay handle routing on its behalf.

The second subtlety is paraphrasing. Agents almost always rewrite the user's message before reaching for that tool, which strips literal version numbers, error codes, and proper nouns from the query before BM25 ever sees them. So every turn opens with an automatic pre-recall on the verbatim user message, injected into the conversation as if the agent had made the call itself; see agent.py (https://github.com/noamschwartz/atlas-memory-demo/blob/main/backend/app/atlas/agent.py#L128).

Writing and consolidating agent memory

Two operations move memory from "what just happened" into "what is durable about this user."

Write. Every user turn writes one episodic event (ID, exact message, timestamp, and more) before the LLM responds. The ID is assigned by Elasticsearch on write, the DLS query on Sarah's API key keeps the doc scoped to her on every subsequent recall, and the timestamp is what the time-decay function below reads to rank the event against newer ones. Agent replies aren't stored.
The conversation history already carries them into the next call, and their length drowns out the short, fact-rich things the user said.

Hot-path writing is a deliberate choice. Two alternatives sound plausible at first. Letting the context window carry the new fact forward works for the rest of an open session, but the moment the session ends or crashes, in-context state vanishes; cross-session memory was the whole point. Batching writes at session end preserves cross-session state, but it breaks two same-turn patterns this implementation depends on. A user mentioning a new device and asking for their device list inside one message needs the new fact to be visible to the recall that runs later in the same turn, because tool calls query the index, not the conversation history. And the supersession flow writes a corrected fact and recalls against it inside one tool-call batch. Either pattern would silently misbehave under deferred writes. The cost we pay instead is one Elasticsearch write per user message, which is sub-100ms at the volumes a single conversation produces.

Which advice worked is captured separately, by success_count / failure_count on the procedural index, not by storing the response prose. Recent episodes containing user confirmation ("thanks, that worked") trigger success_count++; explicit rejection ("that didn't help") triggers failure_count++. The conversation itself is the feedback signal, with the consolidation LLM as the classifier. No thumbs-up widget is required. Disagreement also surfaces a refined_steps field for the LLM to write back into the playbook.

Consolidate. Episodic logs accumulate fast. Consolidation promotes them into semantic facts and procedural playbooks that survive after the conversation history is gone. This implementation runs it every turn, so you can watch the inspector update live; in production the right cadence is a background job: every 24 hours, or when a user's episodic index crosses N new events.
Per-turn consolidation doubles the LLM calls per message.

In one call (prompt: https://github.com/noamschwartz/atlas-memory-demo/blob/main/backend/app/atlas/consolidate.py#L24), the consolidation LLM is handed recent episodes plus existing facts and playbooks, and asked for three things:

- New semantic facts, with supporting_episode_ids for provenance.
- New procedural playbooks, when a multi-step resolution doesn't match any existing trigger.
- Procedural updates: success_count++ / failure_count++ based on whether the user confirmed the fix, plus refined_steps when they disagreed.

The prompt requires supporting_episode_ids on every output, so a sparse turn returns an empty list and writes nothing.

In the production design, dedup uses the same hybrid retriever the agent uses for recall: for each candidate fact, a top-K hybrid search against the user's semantic index narrows the comparison set, and only those candidates go to the LLM for a meaning judgment. Two further guards bracket the output: candidates below a confidence floor are dropped, and an accepted fact whose top similarity hit scores ≥ 0.90 is treated as a duplicate.

In this implementation, dedup is simpler: the most recent ~50 facts are passed to the consolidation LLM with a "do not duplicate" instruction, and the post-LLM confidence and similarity guards aren't wired yet. The hybrid-recall path and the bracketing guards are the production architecture; this snapshot relies on the LLM doing the comparison directly because the corpus is small enough to fit in one prompt.

success_count and failure_count close a feedback loop on playbooks: across enough conversations, the same field that records "this worked" becomes the signal for "show this one first next time." Today, the counts are written but do not yet bias retrieval ranking: on a handful of resolved tickets, the boost would be statistical noise. It gets wired in for production once a deployment has the density to make the signal meaningful.
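The two bracketing guards described for the production dedup path (a confidence floor and the 0.90 similarity threshold) could look roughly like the sketch below. The 0.90 threshold is the figure quoted in the text; the floor value of 0.5, the CandidateFact type, its field names, and apply_guards are all assumptions made for illustration, since the post notes these guards are not wired into the current snapshot.

```python
from dataclasses import dataclass

CONFIDENCE_FLOOR = 0.5       # assumed value; the post does not state the floor
DUPLICATE_SIMILARITY = 0.90  # threshold quoted in the post


@dataclass
class CandidateFact:
    text: str
    confidence: float      # consolidation LLM's confidence (hypothetical field)
    top_similarity: float  # best similarity against stored facts (hypothetical field)


def apply_guards(candidates: list[CandidateFact]) -> list[CandidateFact]:
    """Drop low-confidence candidates and near-duplicates of stored facts."""
    kept = []
    for candidate in candidates:
        if candidate.confidence < CONFIDENCE_FLOOR:
            continue  # below the confidence floor: not worth writing
        if candidate.top_similarity >= DUPLICATE_SIMILARITY:
            continue  # too close to an existing fact: treat as a duplicate
        kept.append(candidate)
    return kept


candidates = [
    CandidateFact("Sarah owns a Lumio Hub v2", confidence=0.95, top_similarity=0.97),
    CandidateFact("Sarah has a dog", confidence=0.80, top_similarity=0.31),
    CandidateFact("Sarah may prefer warm light", confidence=0.20, top_similarity=0.10),
]
accepted = apply_guards(candidates)
```

In this example only "Sarah has a dog" survives: the first candidate duplicates a stored fact, and the third falls below the assumed floor.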
How agent memory handles contradictions and supersession

Memory that only ever adds, never removes, ends up wrong. A user says "I moved to Edinburgh"; the agent writes a new fact. Six months later, the old "lives in Bristol" fact is still in the index. Both surface on every recall, and the agent either picks the wrong one or hedges. Trust dies fast.

The fix is one rule in the system prompt (full prompt: https://github.com/noamschwartz/atlas-memory-demo/blob/main/backend/app/atlas/agent.py#L32), no new tool. Instead of deleting, the agent supersedes.

A worked example. Sarah's last visit recorded id=abc, "Sarah lives in Bristol", in the semantic index. Three months later, she opens a chat: "we left Bristol, in Edinburgh now."

1. Recall. The pre-recall on Sarah's message returns hits including { id: "abc", text: "Sarah lives in Bristol", memory_type: "semantic" }.
2. Detect. The agent sees the conflict between the recalled fact and the new message.
3. Classify. "We left Bristol, in Edinburgh now" is a natural update, not a denial. The agent picks contradiction="natural".
4. Write. The agent calls write_memory (https://github.com/noamschwartz/atlas-memory-demo/blob/main/backend/app/atlas/memory/operations.py#L55) with text="Sarah lives in Edinburgh", supersedes_id="abc", contradiction="natural". Two things happen in one shot:
   - A new doc id=xyz is written at full confidence (no penalty, because the contradiction was natural).
   - The old doc abc is updated with superseded_by=xyz, superseded_at=