Persistent memory for Ollama, in about five minutes

A developer built Sieve, an open-source proxy that adds persistent, encrypted memory to Ollama's stateless local LLM endpoint. Sieve sits between the client and Ollama, stripping redundant history and injecting relevant facts from a local encrypted store, without requiring any client code changes. The proxy supports both Ollama-native and OpenAI-compatible APIs, and can be installed via pipx in about five minutes.

Originally published on the Sieve blog . Sieve is an open-source Apache 2.0 context-reduction proxy — I work on it. This is a how-to, not a pitch; the steps work against any Ollama setup. Ollama gives you a local LLM endpoint that is fast, private, and completely stateless. Close the chat, and everything you told the model is gone. Keep the chat open, and every turn re-sends a growing history until the context window fills up. Ask a local model about something it was never told, and — depending on the model — it may simply make something up. This guide adds a persistent, encrypted memory to any Ollama setup using Sieve, without changing your client code beyond one URL. Three separate annoyances show up when you run agents or long-lived chats against a local model, and they have a common root. Nothing survives the session. Tell your assistant on Monday that you prefer Python and your deploy target is a Raspberry Pi, and on Tuesday it knows neither. The model has no state; the application has to carry all of it, every time. The payload only grows. The standard workaround is to re-send history: system prompt, tool schemas, every prior turn, on every request. I measured the consequences of that pattern in The hidden cost of context https://llmsieve.dev/blog/2026/06/10/the-hidden-cost-of-context/ — the short version is that per-turn cost grows with conversation length, and on local hardware that growth comes out of your tokens-per-second. Absence becomes fabrication. When a question falls outside the context you did send, smaller models in particular tend to answer anyway. A model that was never told your colleague's name will, often enough, invent one. The common root: the endpoint is stateless and the burden of memory falls on whatever sits in front of it. Most memory frameworks ask you to adopt an SDK and call add / search yourself. The approach here is different — put the memory in the traffic path , so the client stays unchanged. I wrote up why I prefer the proxy shape in Why Sieve https://llmsieve.dev/blog/2026/06/09/why-sieve/ . your client ──► Sieve 127.0.0.1:11435 ──► Ollama 127.0.0.1:11434 │ └── encrypted store at ~/.sieve/memory.db Sieve speaks Ollama's native /api/chat as well as the OpenAI-compatible /v1/chat/completions , so anything that can talk to Ollama can talk to Sieve. On each turn it strips repeated instructions, tool schemas, and stale history from the outbound payload; learns durable facts from the conversation; and injects the relevant ones back in when a later turn actually needs them. The reply comes back to your client unchanged. You need Python 3.11+ and a running Ollama. The recommended installer is pipx https://pipx.pypa.io/ : pipx install llm-sieve sieve --version sieve, version 1.0.0 or later Then run the guided setup: sieve-install If Ollama is running on 127.0.0.1:11434 , the installer auto-detects it, shows you the models you already have pulled, downloads a ~50 MB embedding model one-time , creates the encrypted store, and offers to start the proxy — with optional autostart on reboot. For a scripted, no-prompts install: sieve-install --no-input \ --provider http://127.0.0.1:11434 \ --model qwen3.5:9b Sieve listens on 11435 — deliberately one port up from Ollama's 11434 . Wherever your client points at Ollama, point it at Sieve instead. Ollama-native clients: export OLLAMA HOST=http://127.0.0.1:11435 OpenAI-compatible clients: client = OpenAI base url="http://127.0.0.1:11435/v1", was: http://127.0.0.1:11434/v1 api key="not-used-by-sieve", still forwarded upstream Or just curl it: curl http://127.0.0.1:11435/api/chat \ -H 'Content-Type: application/json' \ -d '{ "model": "qwen3.5:9b", "messages": {"role": "user", "content": "Hi, my name is Alex and I work on embedded firmware."} , "stream": false }' Model names, request shapes, response formats, streaming — all unchanged. The client does not know Sieve exists. Two built-in commands, both sandboxed they never touch your real store : sieve demo runs a short scripted conversation: an identity introduces itself, shares facts, asks for them back, and closes with a question about a person who was never mentioned. What you want to see: recall hits on the seeded facts, and a refusal — not a fabrication — on the trap question. sieve benchmark sends the same 15 messages directly to your model and through Sieve, then prints a delta table: tokens in vs out, facts learned, response times, and the trap verdict. Five to ten minutes depending on your hardware, and the numbers are yours rather than mine. Every response Sieve touches carries diagnostic headers, so you don't have to take the proxy's behaviour on faith: | Header | What it tells you | |---|---| X-Sieve-Inbound-Tokens | Payload size before the trim | X-Sieve-Outbound-Tokens | Payload size actually sent to Ollama | X-Sieve-Phase | OBSERVE / ACCUMULATE / ACTIVATE | X-Sieve-Fact-Count | Facts in the store right now | X-Sieve-Proxy-Us | Sieve's own overhead, in microseconds | The inbound/outbound pair is the one to watch first: it's the per-request answer to "is this actually doing anything?" The full list is in the diagnostic headers https://llmsieve.dev/diagnostic-headers/ reference. One thing to expect: the first few turns feel like pass-through. Sieve activates progressively — it observes before it accumulates, and accumulates before it actively trims and injects. X-Sieve-Phase tells you exactly where it is in that ramp, and sieve status shows the fact count growing. Everything stays on your machine. Facts, entities, and episodes land in a SQLCipher-encrypted SQLite database at ~/.sieve/memory.db , with the keyfile alongside it. There is no cloud component, no account, and no telemetry — the proxy talks to exactly one remote party, and it's the LLM endpoint you configured. If that endpoint is Ollama on localhost, nothing leaves the box at all. The store belongs to you, not to the package: upgrades via pipx upgrade llm-sieve never touch ~/.sieve/ , and the only command that deletes user data is sieve uninstall --hard , which makes you type DELETE first. Small models still have small-model problems. Sieve can put the right facts in front of the model and refuse to let absence turn into invention on the turns it gates, but a 1–3B model under ambiguity is still a 1–3B model. The demo's trap turn is the honest check — run it against the model you actually plan to use. Models in the 8B+ class are where the absence-handling shines. Cold start is real. A memory layer with nothing in it can't save you tokens yet. Budget a handful of turns before the deltas get interesting. Port collisions happen. If something already owns 11435 , run sieve start --port 11436 and point your client there instead. pipx install llm-sieve sieve-install auto-detects Ollama, guided from there export OLLAMA HOST=http://127.0.0.1:11435 sieve demo watch the recall hits and the trap refusal One URL changed, no SDK adopted, no client code rewritten — and your Ollama models stop forgetting who you are between sessions. Sieve is open source under Apache 2.0: github.com/llmsieve/llm-sieve.