Why I Built Local-First Agent Memory (And Why Embeddings Were the Wrong Tool)

wpnews.pro

When an AI agent needs to remember what it has done, most product roadmaps point straight to vector embeddings. The promise is simple: turn a conversation into a high-dimensional point, store it, and later ask the index to find the nearest points. In practice, that promise brings a host of hidden costs -- cloud latency, opaque storage, and a dependency on an embedding model that the provider keeps retraining. As a data engineer who spends every day balancing storage efficiency against query speed, I found those trade-offs hard to accept for a tool that should feel as immediate as a local file.

In the first months of building LoreConvo, I set out to prove that a well-tuned full-text search engine could give the same recall quality without the baggage of embeddings. The result is a local-first memory layer that lives in a single SQLite file, works offline, and integrates with every major AI surface I use. Below I walk through the problem with embedding-only memory, explain why SQLite + FTS5 is a practical alternative, share what I observed in real use, and outline the scenarios where a hybrid approach still makes sense.

Embeddings feel powerful, but they hide complexity

Embedding models are attractive because they turn any string into a fixed-size vector. Once you have a vector, you can drop it into a nearest-neighbor index and rank "similar" sessions by dot product or cosine similarity. The mental model is clean, and many cloud providers ship it as a managed service.

The hidden side-effects, however, quickly surface in a production workflow. First, every new session must be sent to the embedding endpoint, which adds network latency and a cost per call. For a team running dozens of agent sessions a day, those costs accumulate. Second, the index lives outside the developer's control. When the provider updates the model, the stored vectors may no longer be comparable, forcing a costly re-index. Third, the vectors themselves are opaque; you cannot inspect them to understand why a particular session was returned, which makes debugging a challenge for anyone who relies on reproducibility.

Finally, embeddings do not respect the natural boundaries of a project. A data pipeline may contain dozens of unrelated sub-tasks, each with its own terminology. A pure vector index will happily return a session that shares a few generic words, even if the context is completely different. The result is recall that feels "close enough" but often misses the precise decision or tool call that the engineer needs.

Full-text search in SQLite gives deterministic recall

SQLite has been the workhorse of embedded storage for decades, and its FTS5 extension adds powerful full-text capabilities. By indexing every token in a session, FTS5 can answer keyword queries with prefix matching, phrase and boolean operators, and BM25 ranking based on term frequency. The engine runs entirely in the same process as the agent, so there is no network hop and no external cost.
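To make the query features concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes the SQLite build behind your Python includes FTS5 (most do); the table name `notes` and the sample rows are invented for illustration and are not LoreConvo's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A single-column FTS5 table; every token in `body` is indexed.
conn.execute("CREATE VIRTUAL TABLE notes USING fts5(body)")
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [
        ("Configured the ingestion pipeline and retry policy",),
        ("Discussed ingest retries with the platform team",),
        ("Refactored the billing report",),
    ],
)


def search(query):
    # bm25() returns lower values for better matches, so ascending
    # order puts the most relevant rows first.
    return conn.execute(
        "SELECT rowid, body FROM notes WHERE notes MATCH ? ORDER BY bm25(notes)",
        (query,),
    ).fetchall()


prefix_hits = search("ingest*")          # prefix match: ingest, ingestion, ...
phrase_hits = search('"retry policy"')   # exact phrase match
```

The prefix query finds both the "ingestion" and the "ingest" rows, while the phrase query only finds the row where the two words appear next to each other.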

When I built the cross-surface session memory for LoreConvo, I stored each conversation as a row in a single SQLite file. The schema captures the raw transcript, a concise summary, the list of tools used, and any tags attached to the session. The auto-save hook runs at the end of every session, extracts a heuristic summary, and writes the row without any user action required. Because the file is owned by the user, backup, migration, and version control are straightforward -- export and import commands produce JSON that can be checked into a repo or moved to a new machine in seconds.
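The row-per-session layout and the end-of-session hook could be sketched roughly as follows. This is an illustrative assumption, not LoreConvo's actual schema: the column names, the `save_session` helper, and the "first non-empty line" summary heuristic are all hypothetical stand-ins.

```python
import json
import sqlite3
import time

# Hypothetical schema: one row per session, lists stored as JSON text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sessions (
    id          INTEGER PRIMARY KEY,
    created_at  REAL NOT NULL,
    transcript  TEXT NOT NULL,
    summary     TEXT NOT NULL,
    tools       TEXT NOT NULL,
    tags        TEXT NOT NULL
);
"""


def init_db(conn):
    conn.executescript(SCHEMA)


def heuristic_summary(transcript, max_chars=200):
    # Stand-in heuristic: use the first non-empty line, truncated.
    for line in transcript.splitlines():
        line = line.strip()
        if line:
            return line[:max_chars]
    return ""


def save_session(conn, transcript, tools, tags, now=None):
    # Intended to be called by an end-of-session hook, with no user action.
    created_at = time.time() if now is None else now
    cur = conn.execute(
        "INSERT INTO sessions (created_at, transcript, summary, tools, tags) "
        "VALUES (?, ?, ?, ?, ?)",
        (
            created_at,
            transcript,
            heuristic_summary(transcript),
            json.dumps(tools),
            json.dumps(tags),
        ),
    )
    conn.commit()
    return cur.lastrowid
```

Storing the tool and tag lists as JSON text keeps everything in one table; a normalized tags table would be a reasonable alternative if you filter on tags heavily.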

The real surprise came from search performance. A keyword query that includes a project tag and a skill name returns the most relevant sessions in well under a second on a laptop CPU. Adding a simple recency boost -- sorting on the session timestamp -- yields results that feel as fresh as a vector index, but with deterministic ranking that can be inspected directly in the CLI. For a direct performance comparison, see the benchmark we ran pitting FTS5 against ChromaDB and other vector-backed alternatives.
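A tag-plus-skill query with a timestamp sort might look like the sketch below. The table layout, the column names, and the `recent_sessions` helper are assumptions made for the example; timestamps are stored as ISO-8601 strings so that a plain sort orders them chronologically.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# created_at is stored alongside the indexed text but not tokenized.
conn.execute(
    "CREATE VIRTUAL TABLE session_fts "
    "USING fts5(summary, tags, skills, created_at UNINDEXED)"
)
conn.executemany(
    "INSERT INTO session_fts (rowid, summary, tags, skills, created_at) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Built staging models", "etl warehouse", "dbt", "2024-01-10T09:00:00"),
        (2, "Fixed incremental model", "etl", "dbt sql", "2024-03-02T14:30:00"),
        (3, "Scheduled nightly DAG", "etl", "airflow", "2024-04-01T08:00:00"),
        (4, "Invoice rollup", "billing", "dbt", "2024-05-05T11:15:00"),
    ],
)


def fts_quote(term):
    # Quote a user-supplied term so FTS5 treats it as a literal string.
    return '"' + term.replace('"', '""') + '"'


def recent_sessions(tag, skill, limit=5):
    # Column filters restrict each term to one column; the ORDER BY
    # is the "recency boost": newest matching session first.
    match = f"tags:{fts_quote(tag)} AND skills:{fts_quote(skill)}"
    rows = conn.execute(
        "SELECT rowid FROM session_fts WHERE session_fts MATCH ? "
        "ORDER BY created_at DESC LIMIT ?",
        (match, limit),
    ).fetchall()
    return [row[0] for row in rows]
```

Because the ordering is a plain ORDER BY, the ranking is fully reproducible: the same query against the same file always returns the same list.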

Because the index lives in the same file, the storage overhead is modest. A SQLite database holding thousands of sessions occupies a fraction of what the same number of high-dimensional vectors would require in a separate service. The file can be compressed, copied, or shared with teammates using the team memory export feature, which merges selected sessions into a new SQLite file without any server.

What I observed running this in production

The patterns that emerged from running LoreConvo against my own agent fleet reinforced the design choices.

Recall quality for keyword-heavy engineering queries was consistently high. When I searched for a specific tool call -- a particular function name, a module path, a configuration flag -- FTS5 returned the right session on the first page almost every time. The cases where it missed were sessions where the call appeared only inside a code block that was tokenized differently; adding a secondary index on code-block content closed most of that gap.
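One way a secondary code-block index could work is sketched below: pull fenced blocks out of the transcript and index them in a separate FTS5 table whose tokenizer keeps `_` and `.` inside tokens, so a dotted identifier is matched as a single unit. The regex, the table name `code_fts`, and the tokenizer choice are assumptions for illustration, not a description of LoreConvo's implementation.

```python
import re
import sqlite3

FENCE = "`" * 3  # a Markdown code fence, built this way to avoid nesting fences
CODE_BLOCK = re.compile(
    re.escape(FENCE) + r"[^\n]*\n(.*?)" + re.escape(FENCE), re.DOTALL
)

conn = sqlite3.connect(":memory:")
# tokenchars keeps identifiers like utils.load_config as one token.
conn.execute(
    """CREATE VIRTUAL TABLE code_fts USING fts5(
           code, tokenize = "unicode61 tokenchars '_.'")"""
)


def extract_code(transcript):
    # Concatenate the bodies of all fenced blocks in the transcript.
    return "\n".join(m.group(1) for m in CODE_BLOCK.finditer(transcript))


def index_session(session_id, transcript):
    conn.execute(
        "INSERT INTO code_fts (rowid, code) VALUES (?, ?)",
        (session_id, extract_code(transcript)),
    )


def search_code(identifier):
    # Quote the identifier so punctuation is not parsed as query syntax.
    phrase = '"' + identifier.replace('"', '""') + '"'
    rows = conn.execute(
        "SELECT rowid FROM code_fts WHERE code_fts MATCH ? ORDER BY rowid",
        (phrase,),
    ).fetchall()
    return [row[0] for row in rows]
```

The trade-off is that this index only matches whole identifiers; a search for `load_config` alone would need a prefix or a second tokenizer configuration.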

Query speed stayed responsive even as the database grew. Because FTS5 runs in-process, there is no cold-start latency. Sessions saved in the morning were immediately searchable in the afternoon without any indexing delay.

Debuggability was the most underrated benefit. The inspect command lets you list sessions, filter by tag, and view the raw transcript. When a query returned an unexpected result, I could open the session directly from the CLI and see the exact line that matched. That level of transparency is not available with a black-box vector store, and it turned debugging from a frustrating hunt into a five-minute exercise.
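The "see the exact line that matched" step can be approximated with FTS5's built-in highlight() function, as in this sketch. The table name, the sample transcript, and the `matching_lines` helper are hypothetical; this is not the inspect command itself.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE transcripts USING fts5(body)")
conn.execute(
    "INSERT INTO transcripts (rowid, body) VALUES (1, ?)",
    ("user: the nightly job failed\n"
     "agent: raised the timeout to 30s\n"
     "user: thanks",),
)


def matching_lines(query, open_mark=">>", close_mark="<<"):
    # highlight() wraps each matched term in the given markers; we then
    # keep only the transcript lines that contain a marker.
    results = []
    rows = conn.execute(
        "SELECT rowid, highlight(transcripts, 0, ?, ?) "
        "FROM transcripts WHERE transcripts MATCH ?",
        (open_mark, close_mark, query),
    )
    for session_id, marked in rows:
        for line_no, line in enumerate(marked.splitlines(), start=1):
            if open_mark in line:
                results.append((session_id, line_no, line))
    return results
```

Each result carries the session id, the line number, and the line with the matched term marked, which is usually enough to explain why a session was returned.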

Cost predictability was a genuine relief. Every session write and every search ran at zero marginal cost. The only API spend was for the optional background summarizer -- a Claude Haiku call that upgrades auto-saved sessions to LLM-quality summaries on Pro -- and that was opt-in with a configurable daily cap.

When embeddings still have a role

I am not arguing that embeddings are useless. There are scenarios where semantic similarity goes beyond keyword overlap. If a team frequently asks abstract questions -- "how did we handle authentication across services?" -- and expects the system to surface sessions that discuss OAuth, JWT, and API keys without the exact keywords, a hybrid approach can help. I wrote more about how to evaluate these trade-offs in benchmark hype vs real memory: what actually matters when you choose a memory tool.

LoreConvo's Pro tier offers a semantic search layer built on LanceDB that combines vector similarity (using the BGE-small-en embedding model) with the existing BM25 ranking from FTS5. The fusion algorithm applies reciprocal rank fusion and a recency decay reranker, delivering results that capture both exact matches and conceptually related sessions. The hybrid index is built once with the rebuild-index command and can be refreshed on demand.
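For readers unfamiliar with the technique, here is a minimal sketch of reciprocal rank fusion followed by a recency decay. The constant k=60 is the value commonly used for RRF, and the exponential half-life decay is one plausible form of recency reranking; both are assumptions here, and LoreConvo's actual parameters and decay function may differ.

```python
def reciprocal_rank_fusion(rankings, k=60):
    # Each ranking is a list of session ids, best first. A session's
    # fused score is the sum of 1 / (k + rank) over every list it is in.
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def apply_recency_decay(scores, age_days, half_life_days=30.0):
    # Halve a session's score for every half_life_days of age.
    return {
        doc_id: score * 0.5 ** (age_days[doc_id] / half_life_days)
        for doc_id, score in scores.items()
    }


def fuse(bm25_ranking, vector_ranking, age_days, k=60, half_life_days=30.0):
    fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking], k=k)
    decayed = apply_recency_decay(fused, age_days, half_life_days)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(decayed, key=lambda doc_id: (-decayed[doc_id], doc_id))
```

RRF only looks at rank positions, so BM25 scores and vector distances never need to be normalized onto a common scale before they are combined.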

The key is to treat embeddings as an augmentation, not a replacement. By keeping the primary memory in SQLite, you retain the deterministic core that works offline and at zero cost. When you need the extra semantic reach, you enable the hybrid layer on top of the same data file, preserving all the backup and export capabilities you already rely on.

Building a memory layer that respects the engineer's workflow

The design decisions behind LoreConvo were guided by three principles: locality, transparency, and control. Locality means the entire memory lives on the developer's machine, eliminating cloud latency and giving instant access to the full history. Transparency comes from CLI tools that let you inspect, export, and delete sessions with a single command. Control is provided by hooks that automatically load relevant context at the start of a session and save a concise summary at the end, without requiring any manual steps.

Cross-vendor MCP compatibility ensures that the same SQLite file can be used from Claude Code, OpenAI Codex, Cursor, and Hermes Agent. The .mcp.json configuration file is placed in the project root, and each client reads it without any per-client setup. For environments that cannot register an MCP server, Python fallback scripts provide direct access to the memory layer.

Team collaboration stays local-first as well. Pro users can export selected sessions to JSON, share the file, and merge it into a teammate's database with the loreconvo merge command. No central server is required, which aligns with the security policies of many data-sensitive organizations.
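A serverless export-and-merge flow can be sketched as below. It is an assumption-laden illustration rather than the behavior of the merge command: the table layout, the helper names, and the idea of deduplicating on a content hash are all hypothetical.

```python
import hashlib
import json
import sqlite3

FIELDS = ("uid", "created_at", "summary", "transcript")


def open_db(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "uid TEXT PRIMARY KEY, created_at TEXT, summary TEXT, transcript TEXT)"
    )
    return conn


def add_session(conn, created_at, summary, transcript):
    # A content-derived id lets two databases recognize the same session.
    uid = hashlib.sha256(f"{created_at}\n{transcript}".encode("utf-8")).hexdigest()
    conn.execute(
        "INSERT OR IGNORE INTO sessions VALUES (?, ?, ?, ?)",
        (uid, created_at, summary, transcript),
    )
    conn.commit()
    return uid


def export_sessions(conn, uids):
    placeholders = ",".join("?" for _ in uids)
    rows = conn.execute(
        f"SELECT uid, created_at, summary, transcript FROM sessions "
        f"WHERE uid IN ({placeholders}) ORDER BY uid",
        list(uids),
    ).fetchall()
    return json.dumps([dict(zip(FIELDS, row)) for row in rows], indent=2)


def merge_sessions(conn, payload):
    # Insert records from a shared JSON file, skipping ones already present.
    records = json.loads(payload)
    before = conn.total_changes
    conn.executemany(
        "INSERT OR IGNORE INTO sessions (uid, created_at, summary, transcript) "
        "VALUES (:uid, :created_at, :summary, :transcript)",
        records,
    )
    conn.commit()
    return conn.total_changes - before
```

Because the merge is idempotent, a teammate can import the same export file twice without creating duplicates.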

By focusing on a single, portable SQLite file, LoreConvo sidesteps the hidden costs of embedding services while still offering an optional semantic layer for those who need it. The result is a memory system that feels immediate, is easy to audit, and scales with the engineer's own hardware rather than a provider's billing cycle.

If you want to try a memory layer that puts you in control, start with the free tier or explore Pro for the hybrid search and team-memory features. Everything is documented at /tools. If you are evaluating memory architecture for an agent fleet and want to think through the trade-offs with someone who has run it in production, reach out at /contact.

source & further reading

labyrinthanalyticsconsulting.com (original article)
Consent-First AI Architectures -- Building Systems Your Team Trusts
Claude Code + LoreConvo vs. Hermes Agent: Picking a Developer Memory Stack
FTS5 vs. ChromaDB on 217 real sessions: our engine got 30%, ChromaDB got 100%