{"slug": "how-to-build-a-high-performance-rag-pipeline-with-ollama-python-and-typescript", "title": "How to Build a High-Performance RAG Pipeline with Ollama, Python and TypeScript", "summary": "A developer built a high-performance RAG pipeline using Ollama, Python, and TypeScript that runs entirely locally, eliminating cloud API latency and data compliance issues. The architecture uses Ollama for embedding generation and model inference, with cosine similarity for document retrieval. The guide provides code examples for both TypeScript and Python implementations.", "body_md": "If you need to spin up a local, privacy-first AI agent that can query your own internal documents without sending data to third-party APIs, this guide covers the exact architecture using TypeScript, Python, and Ollama.\n\nTime to complete: ~15 minutes.\n\nPrerequisites: Python 3.10+ or Node.js installed, basic familiarity with embeddings.\n\nWhen building production-ready LLM features, relying solely on cloud providers introduces two major friction points: variable API latency and data compliance bottlenecks.\n\nBy running embedding generation and model inference locally, we avoid round trips to remote APIs and keep sensitive data inside our own infrastructure.\n\nHere is how the data flows through our system:\n\nFirst, ensure you have Ollama running locally and pull the required models. 
Open your terminal and run:\n\n```\n# Pull the LLM\nollama pull llama3\n\n# Pull the embedding model explicitly\nollama pull nomic-embed-text\n```\n\nChoose your preferred language environment to house the orchestration logic.\n\n``` js\n// index.ts\nimport { Ollama } from 'ollama';\n\n// Point the client at the local Ollama server\nconst ollama = new Ollama({ host: 'http://127.0.0.1:11434' });\n\nasync function generateLocalEmbedding(text: string): Promise<number[]> {\n  const response = await ollama.embeddings({\n    model: 'nomic-embed-text',\n    prompt: text,\n  });\n  return response.embedding;\n}\n```\n\nFor Python, install the official client first: `pip install ollama`\n\n``` python\n# orchestrator.py\nimport asyncio\nfrom ollama import AsyncClient\n\n# Initialize the asynchronous local client\nclient = AsyncClient(host='http://127.0.0.1:11434')\n\nasync def generate_local_embedding(text: str) -> list[float]:\n    response = await client.embed(\n        model='nomic-embed-text',\n        input=text\n    )\n    # The client returns a list of embedding arrays inside 'embeddings'\n    return response['embeddings'][0]\n```\n\nWhen querying the local vector array, we calculate the cosine similarity between the query embedding and each stored embedding to find the most relevant document chunks.\n\n``` js\nfunction cosineSimilarity(vecA: number[], vecB: number[]): number {\n  const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);\n  const normA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));\n  const normB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));\n\n  if (normA === 0 || normB === 0) {\n    return 0; // Prevent division by zero\n  }\n\n  return dotProduct / (normA * normB);\n}\n```\n\n``` python\nimport math\n\ndef cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:\n    dot_product = sum(a * b for a, b in zip(vec_a, vec_b))\n    norm_a = math.sqrt(sum(a * a for a in vec_a))\n    norm_b = math.sqrt(sum(b * b for b in vec_b))\n\n    if not norm_a or not norm_b:\n        return 0.0  # Prevent division by zero\n\n    return dot_product / (norm_a * norm_b)\n```\n\nBuilding local agentic workflows gives you complete control over your 
data lifecycle and cuts API bills down to zero.\n\nLet me know in the comments below: **Are you running your LLMs locally or sticking to cloud APIs for production?**\n\nThis tutorial on [Building Local RAG with Ollama](https://www.youtube.com/watch?v=IM6bq3BUQAI) provides an excellent visual look at parsing document chunks and handling embedding shapes using the official Python library used in this guide.", "url": "https://wpnews.pro/news/how-to-build-a-high-performance-rag-pipeline-with-ollama-python-and-typescript", "canonical_source": "https://dev.to/ussdlover/how-to-build-a-high-performance-rag-pipeline-with-ollama-python-and-typescript-320h", "published_at": "2026-06-14 19:42:18+00:00", "updated_at": "2026-06-14 20:10:55.316681+00:00", "lang": "en", "topics": ["large-language-models", "artificial-intelligence", "developer-tools", "machine-learning"], "entities": ["Ollama", "Python", "TypeScript", "Llama 3", "nomic-embed-text"], "alternates": {"html": "https://wpnews.pro/news/how-to-build-a-high-performance-rag-pipeline-with-ollama-python-and-typescript", "markdown": "https://wpnews.pro/news/how-to-build-a-high-performance-rag-pipeline-with-ollama-python-and-typescript.md", "text": "https://wpnews.pro/news/how-to-build-a-high-performance-rag-pipeline-with-ollama-python-and-typescript.txt", "jsonld": "https://wpnews.pro/news/how-to-build-a-high-performance-rag-pipeline-with-ollama-python-and-typescript.jsonld"}}