Show HN: Mnemo – local-first AI memory layer for any LLM (Rust, SQLite,petgraph)

wpnews.pro

Local-first AI memory layer for any LLM. Persistent knowledge graph, entity extraction, semantic retrieval — no cloud required.

Most LLMs forget everything the moment a conversation ends. mnemo fixes that.

mnemo is a sidecar service that watches every conversation you feed it, extracts named entities and relationships using an LLM, builds a persistent knowledge graph in SQLite, and injects relevant context back into future prompts — automatically, in under 50ms. It works with Ollama (fully local, free), OpenAI, Anthropic, or any OpenAI-compatible API. It ships as a single static binary with zero cloud dependency.

  your app
     │
     ▼
  POST /ingest ──► entity extraction (LLM) ──► knowledge graph (SQLite + petgraph)
                                                        │
  POST /retrieve ◄── scoring + ranking ◄── graph traversal + full-text search
     │
     ▼
  context_prompt  ──► inject into your LLM prompt

You POST raw text to /ingest

(a conversation turn, a document, a note). - mnemo sends it to your configured LLM and extracts entities (people, tools, places, concepts) and the relationships between them.

Entities are deduplicated by name+type, aliases are merged, and everything is written to SQLite. The in-memory petgraph is updated atomically.
On POST /retrieve

, mnemo runs a 6-stage pipeline: full-text chunk search → entity name search → graph expansion (BFS over the knowledge graph) → relation filter → score+rank → assemble acontext_prompt

string. - You inject context_prompt

into your LLM's system prompt. Done.

git clone https://github.com/zaydmulani09/mnemo
cd mnemo
docker compose up -d

docker exec mnemo-ollama ollama pull llama3

curl http://localhost:8080/health
cargo install --path crates/mnemo-api

export MNEMO_LLM_BASE_URL=http://localhost:11434/v1
mnemo-api

export MNEMO_LLM_BASE_URL=https://api.openai.com/v1
export MNEMO_LLM_API_KEY=sk-...
export MNEMO_LLM_MODEL=gpt-4o-mini
export MNEMO_LLM_PROVIDER=openai
mnemo-api
pip install mnemo-sdk
python
from mnemo import MnemoClient

client = MnemoClient()  # server at http://localhost:8080

client.ingest("I'm building a Rust vector database called vecdb")

print(client.get_context("what am I working on?"))

All endpoints accept and return application/json

. Base URL: http://localhost:8080

.

Method	Path	Description
`GET`
`/health`
Server + DB + LLM status	—	`HealthResponse`
`POST`
`/ingest`
Store text, extract entities	`IngestRequest`
`IngestResponse`
`POST`
`/retrieve`
Retrieve ranked memory context	`RetrievalQuery`
`RetrievalResult`
`GET`
`/entities`
List entities (paginated)	`?limit&offset`
`Entity[]`
`GET`
`/entities/:id`
Get entity by UUID	—	`Entity`
`DELETE`
`/entities/:id`
Delete entity (cascades)	—	`{"deleted":true}`
`GET`
`/entities/:id/neighbors`
Knowledge graph neighbors	`?depth` (max 5)
`GraphNode[]`
`GET`
`/chunks`
List memory chunks (paginated)	`?limit&offset&session_id`
`MemoryChunk[]`
`GET`
`/chunks/:id`
Get chunk by UUID	—	`MemoryChunk`
`DELETE`
`/chunks/:id`
Delete chunk	—	`{"deleted":true}`
`POST`
`/search`
Full-text search entities + chunks	`{"query","limit"}`
`{"entities","chunks"}`
`DELETE`
`/wipe`
Delete all memory (irreversible)	header: `X-Confirm-Wipe: true`
`{"wiped":true}`
`GET`
`/stats`
Entity/chunk/graph counts + uptime	—	`StatsResponse`

Key request/response types:

// IngestRequest
{
  "content": "string",         // required — text to store
  "source":  "string",         // required — e.g. "chat", "email", "cli"
  "session_id": "string|null", // optional — group related chunks
  "metadata": {}               // optional — arbitrary JSON
}

// RetrievalQuery
{
  "text": "string",            // required — query text
  "session_id": "string|null", // optional — filter by session
  "max_chunks": 10,            // default 10
  "max_entities": 20,          // default 20
  "min_confidence": 0.5,       // default 0.5
  "include_graph": true,       // default true — expand via knowledge graph
  "graph_depth": 2             // default 2 — BFS depth for graph expansion
}

Full endpoint documentation with curl examples: docs/api.md

Variable	Default	Description
`MNEMO_DB_PATH`
`mnemo.db`
SQLite database file path
`MNEMO_PORT`
`8080`
API server port
`MNEMO_LLM_BASE_URL`
`http://localhost:11434/v1`
OpenAI-compatible LLM base URL
`MNEMO_LLM_MODEL`
`llama3`
Model name for entity extraction
`MNEMO_LLM_API_KEY`
`ollama`
API key (any value works for Ollama)
`MNEMO_LLM_PROVIDER`
`ollama`
Provider type: `ollama` , `openai` , `anthropic` , `custom`

Pass --config path/to/config.toml

to mnemo-api

. See mnemo.example.toml

:

db_path = "mnemo.db"
port = 8080

[llm]
provider = "ollama"
base_url = "http://localhost:11434/v1"
model = "llama3"
api_key = "ollama"
timeout_secs = 30
max_retries = 3
max_tokens = 2048
temperature = 0.1

Environment variables take precedence over TOML values. The active config source is reported in GET /health

→ config_source

.

Install:

cargo install --path crates/mnemo-cli

Usage:

mnemo ingest "I use Neovim and prefer dark mode"

mnemo search "what editor do I use?"

mnemo entities

mnemo entity <uuid> --neighbors

mnemo chunks

mnemo health

mnemo stats

mnemo wipe

mnemo wipe --yes

mnemo --server http://192.168.1.10:8080 stats

Install:

pip install mnemo-sdk

See sdk/python/README.md for the full API reference.

Async example:

import asyncio
from mnemo import AsyncMnemoClient

async def main():
    async with AsyncMnemoClient() as client:
        await client.ingest(
            "Alice is a principal engineer at Stripe working on payment infrastructure.",
            session_id="session-001",
        )
        context = await client.get_context(
            "what does Alice work on?",
            session_id="session-001",
        )
        print(context)

asyncio.run(main())

A working standalone example: examples/basic_usage.py

Four Rust crates wired together:

Crate	Type	Role
`mnemo-core`
lib	Entity extraction, graph ops, retrieval engine, DB layer
`mnemo-api`
bin	Axum REST API — thin handler layer over mnemo-core
`mnemo-cli`
bin	CLI tool using blocking reqwest against the API
`mnemo-bench`
bin	Performance benchmarks (12 suites)

Full architecture documentation: docs/architecture.md

Benchmarked on Apple M2, SQLite WAL mode, in-memory petgraph. Debug build numbers — release build (--release

) is 3–5× faster.

Operation	Avg latency	Throughput
Entity insert (SQLite)	~0.12 ms	~8,300 ops/s
Entity lookup by ID	~0.08 ms	~12,500 ops/s
Chunk insert	~0.14 ms	~7,100 ops/s
Full-text chunk search	~0.28 ms	~3,500 ops/s
Graph neighbor (depth=1)	~0.21 ms	~4,700 ops/s
Graph neighbor (depth=2)	~0.89 ms	~1,100 ops/s
Full retrieval pipeline	~4.2 ms	~238 ops/s

Run cargo run -p mnemo-bench

to benchmark on your hardware.

cargo test --workspace          # run all 122 tests
make coverage                  # HTML coverage report (requires cargo-llvm-cov)
make coverage-summary          # summary to stdout
cd sdk/python && pytest tests/ -v
cargo run -p mnemo-bench                    # all 12 benchmarks
cargo run -p mnemo-bench -- --filter graph  # graph benchmarks only
cargo run -p mnemo-bench -- --json out.json # save results to JSON

Current test counts: 122 Rust tests · 21 Python tests · 12 benchmarks

PRs welcome. Please run make fmt && make lint

before submitting. Open an issue first for large changes.

See CONTRIBUTING.md for full setup instructions, code style guide, and how to add a new LLM provider.

MIT — see LICENSE

source & further reading

github.com — original article

Show HN: Mnemo – local-first AI memory layer for any LLM (Rust, SQLite,petgraph)

Run your AI side-project on zahid.host