{"slug": "persistent-memory-for-ollama-in-about-five-minutes", "title": "Persistent memory for Ollama, in about five minutes", "summary": "A developer built Sieve, an open-source proxy that adds persistent, encrypted memory to Ollama's stateless local LLM endpoint. Sieve sits between the client and Ollama, stripping redundant history and injecting relevant facts from a local encrypted store, without requiring any client code changes. The proxy supports both Ollama-native and OpenAI-compatible APIs, and can be installed via pipx in about five minutes.", "body_md": "Originally published on [the Sieve blog]. Sieve is an open-source (Apache 2.0) context-reduction proxy — I work on it. This is a how-to, not a pitch; the steps work against any Ollama setup.\n\nOllama gives you a local LLM endpoint that is fast, private, and completely stateless. Close the chat, and everything you told the model is gone. Keep the chat open, and every turn re-sends a growing history until the context window fills up. Ask a local model about something it was never told, and — depending on the model — it may simply make something up.\n\nThis guide adds a persistent, encrypted memory to any Ollama setup using Sieve, without changing your client code beyond one URL.\n\nThree separate annoyances show up when you run agents or long-lived chats against a local model, and they have a common root.\n\n**Nothing survives the session.** Tell your assistant on Monday that you prefer Python and your deploy target is a Raspberry Pi, and on Tuesday it knows neither. The model has no state; the application has to carry all of it, every time.\n\n**The payload only grows.** The standard workaround is to re-send history: system prompt, tool schemas, every prior turn, on every request. 
I measured the consequences of that pattern in [The hidden cost of context](https://llmsieve.dev/blog/2026/06/10/the-hidden-cost-of-context/) — the short version is that per-turn cost grows with conversation length, and on local hardware that growth comes out of your tokens-per-second.\n\n**Absence becomes fabrication.** When a question falls outside the context you did send, smaller models in particular tend to answer anyway. A model that was never told your colleague's name will, often enough, invent one.\n\nThe common root: the endpoint is stateless and the burden of memory falls on whatever sits in front of it. Most memory frameworks ask you to adopt an SDK and call `add()`/`search()` yourself. The approach here is different — put the memory *in the traffic path*, so the client stays unchanged. I wrote up why I prefer the proxy shape in [Why Sieve](https://llmsieve.dev/blog/2026/06/09/why-sieve/).\n\n```\nyour client ──► Sieve (127.0.0.1:11435) ──► Ollama (127.0.0.1:11434)\n                 │\n                 └── encrypted store at ~/.sieve/memory.db\n```\n\nSieve speaks Ollama's native `/api/chat` as well as the OpenAI-compatible `/v1/chat/completions`, so anything that can talk to Ollama can talk to Sieve. On each turn it strips repeated instructions, tool schemas, and stale history from the outbound payload; learns durable facts from the conversation; and injects the relevant ones back in when a later turn actually needs them. The reply comes back to your client unchanged.\n\nYou need Python 3.11+ and a running Ollama. 
The recommended installer is [pipx](https://pipx.pypa.io/):\n\n```\npipx install llm-sieve\nsieve --version   # sieve, version 1.0.0 or later\n```\n\nThen run the guided setup:\n\n```\nsieve-install\n```\n\nIf Ollama is running on `127.0.0.1:11434`, the installer auto-detects it, shows you the models you already have pulled, downloads a ~50 MB embedding model (one-time), creates the encrypted store, and offers to start the proxy — with optional autostart on reboot. For a scripted, no-prompts install:\n\n```\nsieve-install --no-input \\\n  --provider http://127.0.0.1:11434 \\\n  --model qwen3.5:9b\n```\n\nSieve listens on `11435` — deliberately one port up from Ollama's `11434`. Wherever your client points at Ollama, point it at Sieve instead.\n\n**Ollama-native clients:**\n\n```\nexport OLLAMA_HOST=http://127.0.0.1:11435\n```\n\n**OpenAI-compatible clients:**\n\n```\nfrom openai import OpenAI\n\nclient = OpenAI(\n    base_url=\"http://127.0.0.1:11435/v1\",  # was: http://127.0.0.1:11434/v1\n    api_key=\"not-used-by-sieve\",           # still forwarded upstream\n)\n```\n\n**Or just curl it:**\n\n```\ncurl http://127.0.0.1:11435/api/chat \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"model\": \"qwen3.5:9b\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hi, my name is Alex and I work on embedded firmware.\"}],\n    \"stream\": false\n  }'\n```\n\nModel names, request shapes, response formats, streaming — all unchanged. The client does not know Sieve exists.\n\nTwo built-in commands, both sandboxed (they never touch your real store):\n\n```\nsieve demo\n```\n\nruns a short scripted conversation: an identity introduces itself, shares facts, asks for them back, and closes with a question about a person who was never mentioned. 
What you want to see: recall hits on the seeded facts, and a *refusal* — not a fabrication — on the trap question.\n\n```\nsieve benchmark\n```\n\nsends the same 15 messages directly to your model and through Sieve, then prints a delta table: tokens in vs out, facts learned, response times, and the trap verdict. Five to ten minutes depending on your hardware, and the numbers are yours rather than mine.\n\nEvery response Sieve touches carries diagnostic headers, so you don't have to take the proxy's behaviour on faith:\n\n| Header | What it tells you |\n|---|---|\n| `X-Sieve-Inbound-Tokens` | Payload size before the trim |\n| `X-Sieve-Outbound-Tokens` | Payload size actually sent to Ollama |\n| `X-Sieve-Phase` | `OBSERVE` / `ACCUMULATE` / `ACTIVATE` |\n| `X-Sieve-Fact-Count` | Facts in the store right now |\n| `X-Sieve-Proxy-Us` | Sieve's own overhead, in microseconds |\n\nThe inbound/outbound pair is the one to watch first: it's the per-request answer to \"is this actually doing anything?\" The full list is in the [diagnostic headers](https://llmsieve.dev/diagnostic-headers/) reference.\n\nOne thing to expect: **the first few turns feel like pass-through.** Sieve activates progressively — it observes before it accumulates, and accumulates before it actively trims and injects. `X-Sieve-Phase` tells you exactly where it is in that ramp, and `sieve status` shows the fact count growing.\n\nEverything stays on your machine. Facts, entities, and episodes land in a SQLCipher-encrypted SQLite database at `~/.sieve/memory.db`, with the keyfile alongside it. There is no cloud component, no account, and no telemetry — the proxy talks to exactly one remote party, and it's the LLM endpoint you configured. 
If that endpoint is Ollama on localhost, nothing leaves the box at all.\n\nThe store belongs to you, not to the package: upgrades via `pipx upgrade llm-sieve` never touch `~/.sieve/`, and the only command that deletes user data is `sieve uninstall --hard`, which makes you type `DELETE` first.\n\n**Small models still have small-model problems.** Sieve can put the right facts in front of the model and refuse to let absence turn into invention on the turns it gates, but a 1–3B model under ambiguity is still a 1–3B model. The demo's trap turn is the honest check — run it against the model you actually plan to use. Models in the 8B+ class are where the absence-handling shines.\n\n**Cold start is real.** A memory layer with nothing in it can't save you tokens yet. Budget a handful of turns before the deltas get interesting.\n\n**Port collisions happen.** If something already owns `11435`, run `sieve start --port 11436` and point your client there instead.\n\n```\npipx install llm-sieve\nsieve-install            # auto-detects Ollama, guided from there\nexport OLLAMA_HOST=http://127.0.0.1:11435\nsieve demo               # watch the recall hits and the trap refusal\n```\n\nOne URL changed, no SDK adopted, no client code rewritten — and your Ollama models stop forgetting who you are between sessions.\n\n*Sieve is open source under Apache 2.0: github.com/llmsieve/llm-sieve.*", "url": "https://wpnews.pro/news/persistent-memory-for-ollama-in-about-five-minutes", "canonical_source": "https://dev.to/azard_tennant-hosein/persistent-memory-for-ollama-in-about-five-minutes-4co5", "published_at": "2026-06-27 07:14:26+00:00", "updated_at": "2026-06-27 07:33:50.142185+00:00", "lang": "en", "topics": ["large-language-models", "developer-tools", "ai-infrastructure"], "entities": ["Sieve", "Ollama", "pipx", "qwen3.5:9b"], "alternates": {"html": "https://wpnews.pro/news/persistent-memory-for-ollama-in-about-five-minutes", "markdown": 
"https://wpnews.pro/news/persistent-memory-for-ollama-in-about-five-minutes.md", "text": "https://wpnews.pro/news/persistent-memory-for-ollama-in-about-five-minutes.txt", "jsonld": "https://wpnews.pro/news/persistent-memory-for-ollama-in-about-five-minutes.jsonld"}}