{"slug": "building-a-local-only-rag-system-with-ollama-and-typescript", "title": "Building a Local-Only RAG System with Ollama and TypeScript", "summary": "A developer built a fully local Retrieval-Augmented Generation (RAG) system using Ollama and TypeScript, requiring no API keys or third-party calls. The 200-line command-line tool indexes `.md` and `.txt` files into a SQLite vector store using `sqlite-vec`, then answers natural language questions via local embedding and language models. The system keeps all data on the user's machine, with SQLite outperforming Chroma or Qdrant for collections under a million chunks.", "body_md": "Most RAG tutorials send your private documents to OpenAI. Here's how to keep them on your laptop.\n\nThis post walks through a complete Retrieval-Augmented Generation pipeline that runs entirely on your machine. No API keys, no third-party calls, no monthly bill. Two hundred lines of TypeScript and a single binary.\n\nA command-line tool that:\n\n- Indexes `.md` or `.txt` files into a local vector store.\n- Answers natural-language questions using local embedding and language models.\n\nBy the end, you'll be able to point it at your engineering wiki, your personal notes, or your codebase, and ask questions in natural language without anything leaving your machine.\n\n`@xenova/transformers`\n\n`sqlite-vec`\n\nWhy SQLite over Chroma or Qdrant? For collections under a million chunks, SQLite is faster, simpler to deploy, and doesn't need a daemon. 
Your \"vector database\" is one file.\n\n```\nollama pull nomic-embed-text       # the embedding model\nollama pull qwen2.5:7b             # the answer model\npnpm add better-sqlite3 sqlite-vec\n```\n\n``` ts\nimport fs from \"node:fs\";\nimport path from \"node:path\";\n\n// Split text into chunks of roughly `size` characters on sentence boundaries,\n// carrying the last `overlap` characters of each chunk into the next one.\nfunction chunk(text: string, size = 800, overlap = 100): string[] {\n  const sentences = text.split(/(?<=[.!?])\\s+/);\n  const chunks: string[] = [];\n  let buffer = \"\";\n  for (const s of sentences) {\n    if ((buffer + \" \" + s).length > size && buffer) {\n      chunks.push(buffer.trim());\n      buffer = buffer.slice(-overlap) + \" \" + s;\n    } else {\n      buffer = buffer ? buffer + \" \" + s : s;\n    }\n  }\n  if (buffer) chunks.push(buffer.trim());\n  return chunks;\n}\n\n// Request an embedding for one piece of text from the local Ollama server.\nasync function embed(text: string): Promise<number[]> {\n  const r = await fetch(\"http://localhost:11434/api/embeddings\", {\n    method: \"POST\",\n    headers: { \"Content-Type\": \"application/json\" },\n    body: JSON.stringify({ model: \"nomic-embed-text\", prompt: text }),\n  });\n  if (!r.ok) throw new Error(`Ollama embeddings request failed: ${r.status}`);\n  const json = (await r.json()) as { embedding: number[] };\n  return json.embedding;\n}\n```\n\n`nomic-embed-text` returns 768-dimensional vectors. 
Fast enough that you can re-index a thousand-document corpus in a few minutes.\n\n``` ts\nimport Database from \"better-sqlite3\";\nimport * as sqliteVec from \"sqlite-vec\";\n\nconst db = new Database(\"rag.db\");\nsqliteVec.load(db);\n\ndb.exec(`\n  CREATE TABLE IF NOT EXISTS chunks (\n    id INTEGER PRIMARY KEY,\n    source TEXT NOT NULL,\n    content TEXT NOT NULL\n  );\n  CREATE VIRTUAL TABLE IF NOT EXISTS vec_chunks USING vec0(\n    id INTEGER PRIMARY KEY,\n    embedding FLOAT[768]\n  );\n`);\n\n// Prepare the insert statements once rather than on every chunk.\nconst insertChunk = db.prepare(\n  \"INSERT INTO chunks (source, content) VALUES (?, ?)\"\n);\nconst insertVec = db.prepare(\n  \"INSERT INTO vec_chunks (id, embedding) VALUES (?, ?)\"\n);\n\nasync function indexFile(filePath: string) {\n  const text = fs.readFileSync(filePath, \"utf8\");\n  for (const piece of chunk(text)) {\n    // Embed first so a failed request does not leave a chunk row without a vector.\n    const vec = await embed(piece);\n    const result = insertChunk.run(filePath, piece);\n    // vec0 primary keys must be bound as integers, so pass the rowid as a BigInt.\n    insertVec.run(BigInt(result.lastInsertRowid), JSON.stringify(vec));\n  }\n}\n```\n\n``` ts\n// Return the k chunks whose embeddings are nearest to the query embedding.\nasync function search(query: string, k = 4) {\n  const queryVec = await embed(query);\n  const rows = db.prepare(`\n    SELECT chunks.source, chunks.content, vec_chunks.distance\n    FROM vec_chunks\n    JOIN chunks ON chunks.id = vec_chunks.id\n    WHERE vec_chunks.embedding MATCH ?\n    ORDER BY distance\n    LIMIT ?\n  `).all(JSON.stringify(queryVec), k) as Array<{\n    source: string;\n    content: string;\n    distance: number;\n  }>;\n  return rows;\n}\n```\n\n`MATCH` triggers `sqlite-vec`'s nearest-neighbour search, which ranks by L2 (Euclidean) distance by default; declare the column as `embedding FLOAT[768] distance_metric=cosine` if you want cosine distance. 
Sub-millisecond on small corpora.\n\n``` ts\n// Retrieve the closest chunks, then ask the local model to answer from them only.\nasync function ask(question: string) {\n  const matches = await search(question, 4);\n\n  const context = matches\n    .map((m, i) => `[${i + 1}] ${m.source}\\n${m.content}`)\n    .join(\"\\n\\n---\\n\\n\");\n\n  const prompt = `Answer the question using only the context provided.\nIf the answer is not in the context, say so.\nCite sources by their number in square brackets.\n\nCONTEXT:\n${context}\n\nQUESTION: ${question}\n\nANSWER:`;\n\n  const r = await fetch(\"http://localhost:11434/v1/chat/completions\", {\n    method: \"POST\",\n    headers: { \"Content-Type\": \"application/json\" },\n    body: JSON.stringify({\n      model: \"qwen2.5:7b\",\n      messages: [{ role: \"user\", content: prompt }],\n      stream: false,\n    }),\n  });\n  if (!r.ok) throw new Error(`Ollama chat request failed: ${r.status}`);\n  const json = await r.json();\n  return {\n    answer: json.choices[0].message.content,\n    sources: matches.map((m) => m.source),\n  };\n}\n```\n\n``` ts\n// Index the .md and .txt files in a folder\nconst files = fs\n  .readdirSync(\"./notes\")\n  .filter((f) => /\\.(md|txt)$/.test(f))\n  .map((f) => path.join(\"./notes\", f));\nfor (const f of files) await indexFile(f);\n\n// Ask\nconst result = await ask(\"What did we decide about the auth refactor?\");\nconsole.log(result.answer);\nconsole.log(\"Sources:\", result.sources);\n```\n\nTotal runtime, indexing 500 markdown files: about three minutes on an M2 MacBook. Per-question latency: under two seconds.\n\nIf your team's documentation has grown past the point where anyone reads it cover to cover (about a hundred pages), local RAG turns that wiki back into something useful. Same applies to:\n\nLast bullet matters: every legal-tech startup right now is building a cloud version of this. Yours runs on your laptop.\n\nThe previous post in this series covered function calling. Combining function calling with RAG gives you a local agent that can read your documents and take actions. Ask it to \"draft an email to legal summarising what our MSA says about data residency\" and it reads the MSA chunks, composes a draft, and calls the email tool.\n\nThat's a real assistant. 
And nothing leaves your machine.\n\nNext post: streaming Ollama responses through Server-Sent Events in Next.js, the production pattern for live UIs.", "url": "https://wpnews.pro/news/building-a-local-only-rag-system-with-ollama-and-typescript", "canonical_source": "https://dev.to/pavelespitia/building-a-local-only-rag-system-with-ollama-and-typescript-430c", "published_at": "2026-05-25 14:47:05+00:00", "updated_at": "2026-05-25 15:06:04.766985+00:00", "lang": "en", "topics": ["large-language-models", "natural-language-processing", "ai-tools", "ai-infrastructure"], "entities": ["Ollama", "TypeScript", "SQLite", "Chroma", "Qdrant", "nomic-embed-text", "qwen2.5:7b", "xenova/transformers"], "alternates": {"html": "https://wpnews.pro/news/building-a-local-only-rag-system-with-ollama-and-typescript", "markdown": "https://wpnews.pro/news/building-a-local-only-rag-system-with-ollama-and-typescript.md", "text": "https://wpnews.pro/news/building-a-local-only-rag-system-with-ollama-and-typescript.txt", "jsonld": "https://wpnews.pro/news/building-a-local-only-rag-system-with-ollama-and-typescript.jsonld"}}