Building a Private RAG System: Lessons from a Local-First AI Journal

The article details the technical architecture of DiaryGPT, a private, local-first AI journaling application that processes user data entirely on-device by default using Ollama for embeddings and language models. It explains how the system uses Retrieval-Augmented Generation (RAG) to perform semantic search on encrypted diary entries stored in SQLite, retrieving only the most relevant excerpts for AI analysis without sending the full diary to external services. Additionally, the companion mode features a hardcoded crisis detection system that bypasses the LLM entirely to provide reliable emergency resources.

Most AI apps quietly send your data to the cloud. DiaryGPT does the opposite, and this is the full technical story.

The Problem With AI + Private Data

When you write in a journal, you write the things you'd never say out loud. The last thing you want is that text sitting on someone else's server, used to train a model, or exposed in a breach.

But AI is genuinely useful for journaling. It can find patterns you miss, reflect things back to you, ask questions a blank page never would.

The tension is real: you want AI insight without sacrificing privacy. Most apps solve this by trusting a privacy policy. I wanted a technical guarantee.

So I built DiaryGPT: an AI-powered personal journal where, by default, zero data leaves your machine. Here's exactly how it works.

What DiaryGPT Does

Before the architecture, here's what the app gives you:

- AI mood analysis on every entry: mood, themes, a reflective response, and a follow-up question
- RAG-powered chat: ask "when was I most anxious?" and get answers grounded in your actual entries
- Semantic search: find entries by meaning, not keywords ("times I felt lonely" finds entries with "isolated", "disconnected", "blue")
- Weekly reflection: AI summary of your emotional arc across the week
- Personalized journaling prompts: generated from your recent writing patterns
- Writing streaks and memories: "on this day last year you wrote…"
- AI companion mode: CBT/DBT-grounded reflection with built-in crisis detection (not a replacement for a licensed therapist)
- Mood check-ins: 1–10 logging with history chart
- Voice dictation and voice chat: speak entries, hear responses read back
- Full AES-256-GCM encryption at rest: every diary entry, chat message, and note

The Privacy Architecture

DiaryGPT has two modes. You choose in Settings.

🟢 Local Mode (Default)

Everything runs on your machine. The AI model, the search, the analysis: all local via Ollama (https://ollama.com/).

    Your diary entry
      ↓
    Ollama nomic-embed-text → converts to numbers → saved in SQLite
      ↓
    Ollama llama3.2 / qwen2.5 → analyzes mood → saved encrypted

Zero data leaves your machine.

🟡 Cloud Mode (Opt-in)

For users who want higher reasoning quality and are comfortable with API transit. You bring your own API key (Groq, OpenAI, Anthropic, or Gemini). The key is stored locally.

    Your diary entry
      ↓
    Ollama embeddings → still local, nothing sent
      ↓
    Top 5 relevant excerpts → your provider's API → answer streams back

Only a small slice of your diary transits. Never the full thing.

The RAG Pipeline: How the AI "Remembers" Your Life

RAG stands for Retrieval-Augmented Generation. It's the technique that makes the AI feel like it actually knows you, without sending everything you've ever written to a language model on every request.

What is an Embedding?

Every diary entry gets converted into a list of numbers, like GPS coordinates for meaning.

    "I felt anxious today"  → [0.21, 0.83, 0.12, 0.74, ...]
    "I was really stressed" → [0.22, 0.81, 0.14, 0.71, ...]  ← very similar
    "I love hiking"         → [0.91, 0.12, 0.67, 0.23, ...]  ← very different

Similar meaning = similar numbers. This is what makes semantic search work: you search by concept, not exact words.

Phase 1: Writing an Entry

    You write: "Today was rough. Felt anxious about the deadline."
      ↓
    Ollama nomic-embed-text converts text → [0.21, 0.83, 0.12, 0.74, ...]
      ↓
    Saved in SQLite / PostgreSQL:
      entry text  → AES-256-GCM encrypted
      embedding   → stored raw (math requires it)
      mood/themes → analyzed by LLM, stored encrypted

This happens asynchronously: the entry saves immediately, analysis runs in the background.

Phase 2: Asking a Question

    You ask: "When did I feel anxious about work?"
      ↓
    Ollama converts question → numbers
      ↓
    Cosine similarity search runs in YOUR database
    (sqlite-vec or pgvector: pure math, no external call)
      entry A: 0.91 match ✓
      entry B: 0.87 match ✓
      entry C: 0.79 match ✓
      entry D: 0.31 match ✗ skipped
      ↓
    Top 5 entries decrypted in memory
      ↓
    LLM receives: system prompt + diary excerpts + your question
      ↓
    Streams answer word by word (SSE)

The key insight: embeddings find what to read. The LLM decides what to say about it.

The LLM never sees your full diary, only the 5 most relevant entries. Cosine similarity runs entirely on your server. Nothing goes to an external service unless you've opted into cloud mode.

The Companion Pipeline: Safety First

The companion mode is built around one rule: if someone is in crisis, the LLM never runs.

    You type a message
      ↓
    Crisis detection (keyword matching, server-side)
    "suicide", "hurt myself", "want to die", etc.
      ↓
    CRISIS? → Hardcoded response: 988 + Crisis Text Line + findahelpline. LLM never called.
    SAFE?   → LLM runs with CBT/DBT prompt: acknowledges → reflects → one question. Saves encrypted to companion messages.

The crisis response is hardcoded. It cannot be hallucinated, modified, or bypassed by a clever prompt. The companion banner ("This is an AI companion, not a licensed therapist") is also hardcoded in the UI, never AI-generated.
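The gate described above can be sketched in a few lines. This is a minimal illustration, not DiaryGPT's actual code: the phrase list contains only the three examples quoted in the diagram, the response wording is a placeholder, and `isCrisis`, `handleCompanionMessage` and the `callLLM` callback are hypothetical names.

```javascript
// Hypothetical phrase list: only the three examples named in the article.
// A real deployment would need a much longer, reviewed list.
const CRISIS_PHRASES = ["suicide", "hurt myself", "want to die"];

// Hardcoded reply: a constant string, never produced or edited by a model.
// The wording here is a placeholder, not the app's real text.
const CRISIS_RESPONSE =
  "You deserve support right now. Please reach out to 988, " +
  "the Crisis Text Line, or findahelpline.";

// Lowercase and collapse whitespace so "Want  to DIE" still matches.
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Plain substring matching: fast and auditable, but it only catches listed phrases.
function isCrisis(message) {
  const normalized = normalize(message);
  return CRISIS_PHRASES.some((phrase) => normalized.includes(phrase));
}

// callLLM is an assumed async function (message) => reply string.
async function handleCompanionMessage(message, callLLM) {
  if (isCrisis(message)) {
    // Return before the model is ever invoked.
    return { crisis: true, reply: CRISIS_RESPONSE };
  }
  const reply = await callLLM(message);
  return { crisis: false, reply };
}
```

Because the check returns before `callLLM` is reached, a prompt injection inside the message has nothing to act on: the model never receives the text.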
The companion system uses a distinct system prompt built around CBT thought-reframing, DBT skills, and reflective listening. Sessions are saved and resumable.

A real limitation worth naming: keyword detection catches explicit phrases like "I want to die" but will miss oblique crisis language like "I just want it to stop" or "everyone would be better off without me." A small local classifier as a second layer is on the roadmap: keyword filter as the fast, auditable first line, classifier as the safety net for implicit signals.

The Encryption Layer

Every piece of user content goes through AES-256-GCM encryption before hitting the database.

    // Every diary entry, chat message, companion note goes through this
    encrypt(text)  // before DB insert
    decrypt(text)  // after DB read, before sending to LLM or browser

The encryption key is yours: a 64-character hex string you generate and store in your .env. Without it, the database is unreadable. The server never transmits the key.

The one exception: embedding vectors are stored unencrypted. Cosine similarity requires the raw numbers. The chunk text that generated the embedding is stored separately, encrypted. The security boundary lives at the source text, not the derived vector.

The Technical Stack

- Runtime: Node.js + Express
- Frontend: Vanilla JS SPA (no build step, no framework)
- Auth: JWT + Argon2id password hashing
- Encryption: AES-256-GCM (Node.js crypto module)
- Storage: SQLite (local default) or PostgreSQL (multi-device)
- Vector search: sqlite-vec (local) or pgvector (Postgres)
- Embeddings: Ollama nomic-embed-text (local default)
- LLM: Ollama (local default) / Groq / OpenAI / Gemini / Anthropic
- Streaming: SSE (Server-Sent Events) over POST with ReadableStream
- Voice: Browser SpeechRecognition API (free) or Whisper (premium)

The frontend is deliberately no-framework. No React, no build pipeline, no node modules in the browser. It loads instantly and works offline (except for cloud LLM calls).
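For the AES-256-GCM entry in the stack, here is a rough sketch of what an `encrypt` / `decrypt` pair could look like on top of the Node.js crypto module. It is an assumption-heavy illustration rather than the project's implementation: passing the key as an argument, the `loadKey` helper, and the `iv | tag | ciphertext` base64 layout are all hypothetical choices.

```javascript
const crypto = require("node:crypto");

// Assumption: the key arrives as the 64-character hex string from .env (32 bytes).
function loadKey(hex) {
  if (!/^[0-9a-fA-F]{64}$/.test(hex)) {
    throw new Error("Key must be 64 hex characters");
  }
  return Buffer.from(hex, "hex");
}

function encrypt(plaintext, key) {
  const iv = crypto.randomBytes(12); // fresh 96-bit nonce for every message
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag(); // 16-byte authentication tag
  // Hypothetical storage layout: iv | tag | ciphertext, base64-encoded.
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

function decrypt(payload, key) {
  const raw = Buffer.from(payload, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  // final() throws if the ciphertext or tag was modified.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

GCM authenticates as well as encrypts, so a tampered database row fails loudly on read instead of decrypting to garbage. The nonce must never repeat under the same key, which is why it is generated per message and stored alongside the ciphertext.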
LLM Provider Architecture

The LLM layer is a thin factory that routes every call to whatever provider is active:

    // services/llm.js
    const PROVIDERS = { ollama, anthropic, openai, gemini, groq };

    export const streamChat = (history, message, context, onDelta) =>
      PROVIDERS[getConfig().provider].streamChat(history, message, context, onDelta);

Switching providers happens at runtime; no restart needed. Every provider implements the same three-function contract:

    analyzeEntry(text)                              // → { mood, themes, reflection, followUpQuestion }
    generateText(systemPrompt, userMessage)         // → string
    streamChat(history, message, context, onDelta)  // → full string, streams via onDelta

Groq uses the OpenAI SDK pointed at https://api.groq.com/openai/v1. Ollama uses the same SDK pointed at http://localhost:11434/v1. Identical interface, completely different privacy properties.

What I Learned

1. Embeddings and LLMs are completely separate concerns. The model that converts text to numbers has nothing to do with the model that generates answers. You can run Ollama for embeddings and Groq for chat simultaneously. Most people conflate the two.

2. 7B–8B models are good enough for structured diary tasks. Mood detection, theme extraction, journaling prompts: a well-prompted qwen2.5:7b handles all of these reliably. The quality gap versus 70B only shows up in long-form weekly summaries. Use format: "json" mode in Ollama for structured output; without it, small models will eventually return malformed JSON and break your pipeline silently.

3. Cosine similarity belongs in your database, not a vector database. For a personal app with thousands (not millions) of entries, sqlite-vec and pgvector are more than sufficient. No Pinecone, no Weaviate, no extra infra. The math is simple and fast.

4. SSE over POST is the right call for streaming. The standard advice is to use EventSource, but EventSource is GET-only. Chat requires POST (to send the message body). The fix is fetch + ReadableStream on the client: full control over the stream lifecycle, no awkward query-string payloads.

5. Crisis detection must run before the LLM, not inside it. You cannot rely on an LLM to consistently detect crisis language and respond safely. Keyword matching before the LLM call is not elegant, but it is reliable and auditable. An LLM should never be the first line of defense for someone in crisis; it should never even get the message.

6. The hardest engineering decisions in a privacy-first app are about what not to do. No analytics. No telemetry. No "anonymized" usage data. Every one of those is a useful product feature you give up, and giving them up is the point.

Try It

DiaryGPT is open source. Self-host it, read every line, verify the privacy claims.

🔗 GitHub: https://github.com/rahul70-code/diarygpt

Your diary is yours. The AI should work for you, not harvest from you.

Stack: Node.js · Ollama · SQLite · AES-256-GCM · Vanilla JS

Tags: LLM RAG Privacy LocalFirst OpenSource