How to Build a Hybrid AI Memory System for Claude Code: Storage, Injection, and Recall

A hybrid AI memory system combining MemSearch and Hermes gives Claude Code persistent memory by storing all session data, injecting relevant context at the right moments, and enabling semantic recall with source citations. The system solves the fundamental limitation of large language models — their fixed context window — by capturing decisions, patterns, and preferences across sessions. Hermes handles structured storage and selective injection while MemSearch provides meaning-based retrieval, allowing developers to access past information without exact keyword matching.

Learn how to combine MemSearch and Hermes to build a memory system that stores everything, injects smartly, and recalls by meaning with source citations.

Why Claude Code Forgets Everything (And How to Fix It)

Every developer who works with Claude Code runs into the same wall eventually. You’re mid-session on a complex project. Claude has context about your architecture, your naming conventions, your past decisions. Then the session ends, and the next time you open it, none of that exists. You’re starting from scratch.

This isn’t a flaw in Claude Code specifically. It’s a fundamental constraint of how large language models work: they have a fixed context window, and nothing outside that window is accessible. For short tasks, this is fine. For ongoing development work, the kind where history, preferences, and accumulated knowledge actually matter, it’s a serious limitation.

A hybrid AI memory system solves this. By combining a semantic recall layer (MemSearch) with a structured storage and injection layer (Hermes), you can give Claude Code something close to persistent, meaningful memory: storing everything that happens, injecting relevant context at the right moments, and retrieving past information by meaning rather than by exact keyword match.

This guide walks through how to build that system from scratch.

What a Hybrid Memory System Actually Does

Before getting into implementation, it’s worth being precise about what “memory” means in this context, because there are several distinct problems to solve.

The Three Memory Problems

Storage: By default, Claude Code doesn’t distill a session into durable, searchable memory. Once a session closes, its working context is gone unless you saved notes yourself (for example, in a CLAUDE.md file).
You need a system that captures important information (decisions made, patterns identified, code explained) and saves it somewhere durable.

Injection: Having stored memory doesn’t help if Claude never sees it. You need a mechanism that automatically retrieves relevant memory and injects it into Claude’s context at the start of a session or at key moments during a task. Injecting everything would overflow the context window, so the injection layer needs to be selective.

Recall: When you want to ask “what did we decide about authentication?” or “show me how we handled rate limiting before,” you need semantic search: retrieval that works on meaning, not just text matching. This is the hardest part to get right.

A hybrid system addresses all three separately and lets them work together.

Why “Hybrid”?

Pure vector search (embedding-based retrieval) is great at finding semantically similar content but loses structure and chronology. Pure key-value or relational storage is great at structure but can’t retrieve by meaning.

A hybrid system uses both:

- Hermes handles structured storage and smart injection: maintaining metadata, session context, source tracking, and deciding what gets injected when
- MemSearch handles the semantic layer: embedding content into vectors, enabling meaning-based retrieval, and surfacing results with source citations

Together, they cover the full memory lifecycle.

Setting Up the Storage Layer with Hermes

Hermes acts as the memory orchestration layer. Its job is to receive information from Claude Code sessions, store it with proper structure, and manage what gets injected into future sessions.

What Hermes Stores

Each memory entry should capture more than just the raw content.
Useful metadata includes:

- Session ID: which conversation or work session this came from
- Timestamp: when the memory was created
- Memory type: decision, code snippet, explanation, user preference, error resolution
- Tags: project name, file names, technologies involved
- Source: what file, conversation turn, or command produced this memory
- Importance: how important or reliable this memory is

Without this structure, you end up with a flat pile of text blobs that’s hard to manage or filter.

Structuring Memory Entries

A practical schema for a memory entry looks something like this:

    {
      "id": "mem_20240712_001",
      "session_id": "sess_abc123",
      "timestamp": "2024-07-12T14:23:00Z",
      "type": "decision",
      "content": "We're using JWT tokens with 15-minute expiry for access and 7-day refresh tokens stored in httpOnly cookies.",
      "tags": ["auth", "security", "tokens"],
      "source": "src/auth/middleware.ts",
      "project": "ecommerce-api",
      "importance": 0.9
    }

The importance score is useful for filtering during injection: high-importance memories (architectural decisions, security choices, key patterns) get injected more aggressively than low-importance ones (minor style notes, one-off fixes).

Capturing Memories During Sessions

There are two approaches to capture: automatic and manual.

Automatic capture hooks into Claude Code’s output stream and uses a secondary model to identify what’s worth saving. After each significant response, a lightweight classification step decides whether to store the content and how to categorize it.

Manual capture gives you explicit control. A simple command, something like /remember <content>, triggers immediate storage. This is more reliable but requires discipline.

In practice, both work best together. Automatic capture catches things you’d forget to save; manual capture lets you flag the decisions that really matter.
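As a rough illustration of the manual capture path, the sketch below turns a /remember command into an entry shaped like the schema above. It assumes snake_case field names such as session_id. The helper names createMemoryEntry and parseRememberCommand, the type identifiers, the per-day sequence counter, and the 0.5 default importance are hypothetical choices for this example and are not part of Hermes.

```javascript
// Memory types named in the article; the exact identifiers are assumptions.
const MEMORY_TYPES = new Set([
  "decision",
  "code_snippet",
  "explanation",
  "user_preference",
  "error_resolution",
]);

// Build a memory entry in the shape of the schema above.
// `seq` is a per-day counter supplied by the caller (hypothetical).
function createMemoryEntry(content, meta, seq, now = new Date()) {
  if (typeof content !== "string" || content.trim() === "") {
    throw new Error("content must be a non-empty string");
  }
  if (!MEMORY_TYPES.has(meta.type)) {
    throw new Error(`unknown memory type: ${meta.type}`);
  }
  // "2024-07-12T..." becomes "20240712" for the id.
  const day = now.toISOString().slice(0, 10).replace(/-/g, "");
  return {
    id: `mem_${day}_${String(seq).padStart(3, "0")}`,
    session_id: meta.sessionId,
    timestamp: now.toISOString(),
    type: meta.type,
    content: content.trim(),
    tags: meta.tags ?? [],
    source: meta.source ?? null,
    project: meta.project,
    // Clamp importance into [0, 1]; default to a middling score.
    importance: Math.min(1, Math.max(0, meta.importance ?? 0.5)),
  };
}

// Parse a manual capture command such as "/remember We use JWT ...".
// Returns the text to store, or null if the line is not a capture command.
function parseRememberCommand(line) {
  const match = /^\/remember\s+(.+)$/s.exec(line.trim());
  return match ? match[1].trim() : null;
}
```

A capture layer could call parseRememberCommand on each user message and pass any non-null result to createMemoryEntry before handing the entry to storage.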

The Injection Strategy

When a new session starts, Hermes queries stored memories and selects a subset to inject into Claude’s system prompt or first user message. The selection logic should consider:

- Project match: only inject memories tagged to the current project
- Recency: newer memories are generally more relevant
- Importance: high-importance entries are selected first
- Token budget: never inject more than a set percentage of the available context window (a good default is 20-30%)

The injected memories appear as structured context, clearly labeled so Claude knows they’re retrieved history rather than live information:

    --- MEMORY CONTEXT ---
    2024-07-10 DECISION: Authentication uses JWT with 15-min access tokens and 7-day refresh tokens in httpOnly cookies. Source: auth/middleware.ts
    2024-07-11 PATTERN: All API errors follow the format {error: string, code: string, details?: object}. Source: types/errors.ts
    --- END MEMORY CONTEXT ---

This framing helps Claude treat these as reliable background knowledge rather than conversation content.

Building Semantic Recall with MemSearch

Hermes handles structured storage and injection. MemSearch handles the other half: finding memories by meaning when you explicitly ask for them.

How Semantic Search Works Here

Every time a memory is stored by Hermes, MemSearch generates a vector embedding of the content using an embedding model. That embedding is stored alongside the memory entry in a vector database.

When you ask a recall question (“how did we handle pagination?”), MemSearch:

- Embeds the query using the same embedding model
- Performs a similarity search in the vector database
- Returns the top-N most semantically similar memories
- Includes source citations for each result

The result isn’t a keyword match. It finds memories that are about pagination even if they use different words: “cursor-based navigation,” “offset/limit patterns,” “scroll handling.” This is what makes semantic recall genuinely useful.
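The retrieval steps above can be sketched without a vector database. This minimal example assumes each stored memory already carries its embedding as a plain array of numbers, and it ranks memories by cosine similarity to the query embedding. The names cosineSimilarity and topNBySimilarity are illustrative, and a real vector store would replace the linear scan with an index.

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns 0 when either vector has zero length (no direction to compare).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error("embedding dimensions differ");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored entry against the query embedding and return the
// n best matches, each with its source so the result can be cited.
function topNBySimilarity(queryEmbedding, entries, n = 5) {
  return entries
    .map((entry) => ({
      content: entry.content,
      source: entry.source,
      score: cosineSimilarity(queryEmbedding, entry.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, n);
}
```

Because both the query and the stored memories must come from the same embedding model, the dimension check doubles as a cheap guard against accidentally mixing models.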

Choosing an Embedding Model

For a local or low-latency setup, models like text-embedding-3-small (OpenAI) or nomic-embed-text (open source, runs locally) work well. The key requirements are:

- Consistent model use across storage and retrieval: if you embed with one model, you must query with the same one
- Reasonable embedding dimensions (768–1536 works well for most use cases)
- Fast inference: memory injection shouldn’t add more than 200-300ms to session startup

Setting Up the Vector Database

Popular options for the vector store include:

- Chroma: open source, runs locally, easy to set up
- Qdrant: open source, production-ready, good filtering support
- Pinecone: managed service, minimal ops overhead
- pgvector: if you’re already using PostgreSQL, this avoids adding a new system

For a Claude Code memory system, Chroma or Qdrant running locally is usually the right call. You get low latency, full control, and no data sent to external services (provided the embedding model also runs locally).

Source Citations in Recall Results

One of the practical requirements for a useful memory system is knowing where a memory came from. When Claude tells you “we decided to use Redis for session storage,” you want to be able to verify that and trace it back to the original context.

MemSearch handles this by returning the source metadata alongside each result. A recall query returns something like:

    Query: "session storage approach"

    Result 1 (score: 0.94): "Redis is used for session storage with 24-hour TTL. Sessions keyed by userId."
    Source: src/session/store.ts | Session: 2024-07-08 | Type: decision

    Result 2 (score: 0.81): "Session invalidation happens on logout and password change via Redis DEL."
    Source: src/auth/logout.ts | Session: 2024-07-09 | Type: code pattern

This turns recall from a black box into a traceable, auditable process.

Connecting MemSearch and Hermes: The Full Flow

The two systems work together through a simple coordination layer.
Here’s the full lifecycle:

Write Path (Storing a Memory)

- Claude Code produces output during a session
- The capture layer (automatic or manual) identifies content worth saving
- Hermes stores the memory entry with full metadata
- MemSearch generates an embedding and stores it in the vector database
- Both systems now have a reference to the same memory: Hermes for structured retrieval, MemSearch for semantic search

Read Path (Injecting Context at Session Start)

- New session begins with a project identifier
- Hermes queries structured storage by project, filters by importance and recency, respects token budget
- Selected memories are formatted and injected into Claude’s opening context
- Session proceeds with relevant history already available

Read Path (Explicit Recall During a Session)

- User asks a recall question (“how did we handle X before?”)
- MemSearch receives the query, generates an embedding, searches the vector store
- Top results returned with source citations
- Results injected into the next Claude turn as retrieved context

The two read paths can run simultaneously: automatic injection for session startup, semantic recall for on-demand queries.

Implementation: Putting It Together with Claude Code

Here’s a practical implementation approach using the MindStudio Agent Skills Plugin, which handles the infrastructure layer so you can focus on the memory logic itself.

Prerequisites

- Claude Code installed and configured
- Node.js 18+ for the coordination layer
- A vector database (Chroma recommended for local setup)
- The @mindstudio-ai/agent npm package

Step 1: Install the Agent Skills Plugin

    npm install @mindstudio-ai/agent

This gives your coordination layer access to MindStudio’s typed capabilities, including search, storage, and workflow execution, without managing API keys or rate limiting yourself.

Step 2: Build the Memory Coordinator

The coordinator is the bridge between Claude Code sessions and your MemSearch/Hermes systems.
A minimal version:

    import { agent } from '@mindstudio-ai/agent';

    // hermesStore, vectorStore, generateEmbedding, and selectWithinBudget are
    // placeholders for your own storage client, vector database client,
    // embedding call, and budget filter; they are not defined here.

    async function storeMemory(content, metadata) {
      // Store in Hermes (structured)
      await hermesStore.insert({ content, ...metadata });

      // Store in MemSearch (semantic). The content goes into the payload so
      // recall results can show the remembered text, not just its metadata.
      const embedding = await generateEmbedding(content);
      await vectorStore.upsert({
        id: metadata.id,
        embedding,
        payload: { content, ...metadata },
      });
    }

    async function recallByMeaning(query, projectId) {
      const embedding = await generateEmbedding(query);
      const results = await vectorStore.search({
        embedding,
        filter: { project: projectId },
        limit: 5,
      });
      return results.map((r) => ({ ...r.payload, score: r.score }));
    }

    async function buildSessionContext(projectId, tokenBudget) {
      const memories = await hermesStore.query({
        project: projectId,
        minImportance: 0.7,
      });
      return selectWithinBudget(memories, tokenBudget);
    }

Step 3: Hook Into Claude Code Sessions

Claude Code supports custom system prompts and pre-session hooks. Use these to call buildSessionContext before each session and inject the formatted memory block.

For explicit recall, you can either:

- Add a /recall <query> command that calls recallByMeaning and returns formatted results
- Configure a background watcher that monitors conversation turns and triggers recall automatically when certain patterns appear

Step 4: Set Retention and Pruning Rules

Memory systems get noisy over time. Define pruning rules upfront:

- TTL-based: memories older than 90 days drop to low importance unless explicitly pinned
- Deduplication: when very similar memories are stored, keep the newer one and update the embedding
- Importance decay: memories that are never retrieved lose importance score over time
- Project archival: when a project is marked inactive, its memories move to cold storage

This keeps the system useful as it scales up.
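Two pieces of the steps above are left open: the selectWithinBudget helper that Step 2 calls, and the TTL and decay rules from Step 4. The sketch below shows one possible shape for each. It assumes entries carry content, importance, and timestamp fields as in the schema earlier. The pinned and lastRetrievedAt fields, the four-characters-per-token estimate, the 0.2 low-importance floor, and the 0.9 decay factor are illustrative assumptions, not values taken from Hermes or MemSearch.

```javascript
// Rough token estimate: about 4 characters per token is a common heuristic
// for English text, not an exact count. Swap in a real tokenizer if needed.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Pick memories for injection: highest importance first, newest first on
// ties, skipping any entry that would push the total past the budget.
function selectWithinBudget(memories, tokenBudget) {
  const ranked = [...memories].sort(
    (a, b) =>
      b.importance - a.importance ||
      Date.parse(b.timestamp) - Date.parse(a.timestamp)
  );
  const selected = [];
  let used = 0;
  for (const memory of ranked) {
    const cost = estimateTokens(memory.content);
    if (used + cost > tokenBudget) continue; // a smaller entry may still fit
    selected.push(memory);
    used += cost;
  }
  return selected;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Apply the TTL and decay rules to one memory and return an updated copy.
// Pinned memories are left untouched.
function applyRetention(memory, now = new Date()) {
  if (memory.pinned) return { ...memory };
  const ageDays = (now.getTime() - Date.parse(memory.timestamp)) / DAY_MS;
  let importance = memory.importance;
  if (ageDays > 90) importance = Math.min(importance, 0.2); // TTL rule
  if (!memory.lastRetrievedAt) importance *= 0.9; // never retrieved: decay
  return { ...memory, importance };
}
```

Running applyRetention as a periodic batch job, rather than on every write, keeps importance scores stable within a session while still letting stale entries fall below the minImportance threshold used at injection time.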

How MindStudio Fits Into This Architecture

If you’re building this kind of memory system for Claude Code, the biggest operational headache isn’t the logic but the infrastructure: managing rate limits across multiple APIs, handling retries when vector store operations fail, wiring together embeddings, storage, and retrieval without everything breaking when one piece changes.

The MindStudio Agent Skills Plugin (https://mindstudio.ai) addresses exactly this. The @mindstudio-ai/agent SDK gives Claude Code (and any other agent runtime) access to 120+ typed capabilities as simple method calls, with the infrastructure layer already handled.

For a memory system specifically, this means:

- Calling agent.searchGoogle to pull in external context worth storing
- Using agent.runWorkflow to trigger memory consolidation or summarization pipelines
- Handling retries and rate limiting automatically, so your memory coordinator doesn’t need defensive code for every API call

MindStudio’s no-code builder also lets you build the memory management UI (a dashboard for browsing stored memories, adjusting importance scores, or manually pinning critical context) without writing frontend code. Teams that work with Claude Code often want visibility into what’s in memory; MindStudio makes that dashboard a 30-minute build, not a side project.

You can try it free at mindstudio.ai (https://mindstudio.ai).

Common Mistakes and How to Avoid Them

Injecting Too Much Context

The most common failure mode is greed: storing lots of memory and injecting most of it every session. This crowds out the actual task content, slows session startup, and often degrades Claude’s performance on the immediate work.

Keep injection selective. Stay within the 20-30% token budget for memory context, and treat it as a ceiling rather than a target. Within that, prioritize importance score over recency.

Mismatched Embedding Models

If you embed at write time with one model and embed queries with another, similarity scores become meaningless.
Lock your embedding model and treat any change as a migration that requires re-embedding your entire memory store.

No Source Tracking

Memory without provenance is hard to trust. If Claude says “we decided X,” and you can’t verify where that came from, you’re flying blind. Build source citations in from day one; retrofitting them is painful.

Forgetting to Test Recall Quality

It’s easy to build a system that stores things correctly but retrieves the wrong ones. After initial setup, run a set of recall test queries against your real stored memories. If the top results aren’t what you’d expect, tune your embedding model, similarity threshold, or metadata filters before relying on the system for real work.

Frequently Asked Questions

What is a hybrid AI memory system?

A hybrid AI memory system combines two complementary approaches: structured storage with metadata filtering (for precise, rule-based retrieval) and semantic vector search (for meaning-based recall). Neither approach alone handles all retrieval needs well. Structured storage can’t find conceptually similar content; vector search loses chronology and structure. Combining them covers the full range of memory access patterns an AI coding agent needs.

How does semantic recall differ from keyword search?

Keyword search finds exact or near-exact text matches. Semantic recall finds content that is about the same thing, even if different words are used. If you stored a memory about “token-based authentication” and query for “how does login work,” semantic search returns the relevant result; keyword search likely misses it. This matters a lot in coding contexts, where the same concept gets described multiple ways across different files and sessions.

Does this work with Claude Code specifically, or any AI coding tool?
The architecture works with any AI coding assistant that accepts a configurable system prompt or pre-session context injection. Claude Code is a good fit because it’s designed for longer-horizon agentic tasks where persistent memory provides the most value. The same pattern applies to multi-agent workflows (https://mindstudio.ai/blog) where multiple agents share a memory pool.

How many memories can the system handle before performance degrades?

Vector search scales well: modern vector databases can handle millions of entries with sub-100ms query times on suitable hardware. The bottleneck is usually the injection layer: how many tokens of memory you can include in context without hurting Claude’s performance on the actual task. A well-tuned system with 100,000+ memory entries can still inject only the most relevant 20-30 entries per session, keeping context tight.

How do I handle sensitive information in stored memories?

Don’t store credentials, API keys, or personally identifiable information in the memory system. Use environment variables for secrets as you normally would, and configure your capture layer to redact or skip content that matches sensitive patterns before storage. For team environments, also consider access controls on the vector database: who can read or write memories should match your existing permissions model.

Can multiple developers share the same memory system?

Yes, with some caveats. Shared memory works well for project-level knowledge: architectural decisions, coding patterns, known bugs, established conventions. Personal preferences and individual workflow patterns should stay in user-scoped memory. Tag memories with both a project identifier and a user identifier, then query both namespaces at session start: project memories injected for everyone, user memories injected only for the relevant user.
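The shared-memory answer above amounts to a scope filter at query time. As a minimal sketch, assuming each entry has a project field and an optional user_id field (a hypothetical name; entries without it are treated as shared project knowledge), the session-start query could look like this:

```javascript
// Return the memories visible to one user in one project: shared project
// memories plus that user's own entries. Other users' entries are excluded.
function memoriesForSession(memories, projectId, userId) {
  return memories.filter(
    (m) =>
      m.project === projectId &&
      (m.user_id == null || m.user_id === userId)
  );
}
```

In a real deployment the same condition would normally be pushed down into the storage query or the vector store's metadata filter instead of being applied in application code.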

Key Takeaways

- Claude Code’s context window limitation is a real constraint for ongoing development work, and a persistent memory system directly addresses it
- Hermes handles structured storage, metadata management, and smart injection based on importance, recency, and token budget
- MemSearch handles semantic recall using vector embeddings, returning results with source citations for traceability
- The hybrid approach covers both structured filtering and meaning-based retrieval; neither alone is sufficient
- Source citations and importance scoring are non-negotiable from day one; retrofitting them is significantly harder
- MindStudio’s Agent Skills Plugin handles the infrastructure layer, letting you focus on memory logic rather than API plumbing

If you’re building with Claude Code and want persistent, intelligent memory without rebuilding the infrastructure from scratch, MindStudio (https://mindstudio.ai) is worth exploring, especially for teams that also need a management interface or want to connect memory workflows to the rest of their tooling.