How to Build a High-Performance RAG Pipeline with Ollama, Python and TypeScript

[ARTICLE · art-27219] src=dev.to ↗ pub=2026-06-14T19:42Z topic=large-language-models verified=true sentiment=↑ positive

A developer built a high-performance RAG pipeline using Ollama, Python, and TypeScript that runs entirely locally, eliminating cloud API latency and data compliance issues. The architecture uses Ollama for embedding generation and model inference, with cosine similarity for document retrieval. The guide provides code examples for both TypeScript and Python implementations.

2 min read · 25 views · published Jun 14, 2026

If you need to spin up a local, privacy-first AI agent that can query your own internal documents without sending data to third-party APIs, this guide covers the exact architecture using TypeScript, Python, and Ollama.

Time to complete: ~15 minutes.

Prerequisites: Python 3.10+ or Node.js installed, basic familiarity with embeddings.

When building production-ready LLM features, relying solely on cloud providers introduces two major friction points: variable API latency and data compliance bottlenecks.

By running embedding generation and model inference locally, we avoid the round trip to a remote API and keep sensitive data inside our own infrastructure.

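Here is how the data flows through our system: document chunks are embedded with nomic-embed-text and kept in a local vector array; an incoming query is embedded the same way, scored against the stored vectors with cosine similarity, and the best-matching chunks are handed to llama3 as context for the answer.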
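The flow can be sketched as a single function with each stage injected as a callable. This is a minimal sketch, not the author's implementation: the names `answer`, `build_prompt`, `Embed`, `Retrieve`, and `Generate` are hypothetical, and the prompt wording is an assumption.

```python
from typing import Awaitable, Callable

# Hypothetical stage signatures; each stage is injected so it can be swapped or mocked.
Embed = Callable[[str], Awaitable[list[float]]]
Retrieve = Callable[[list[float], int], list[str]]
Generate = Callable[[str], Awaitable[str]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Place the retrieved chunks ahead of the question as grounding context."""
    context = "\n\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


async def answer(question: str, embed: Embed, retrieve: Retrieve,
                 generate: Generate, k: int = 3) -> str:
    """Run one query through embed -> retrieve -> generate."""
    query_vec = await embed(question)        # 1. embed the question
    chunks = retrieve(query_vec, k)          # 2. pick the k most similar chunks
    prompt = build_prompt(question, chunks)  # 3. assemble the grounded prompt
    return await generate(prompt)            # 4. let the chat model answer
```

In a real run, `embed` would wrap the embedding model and `generate` would wrap a chat call to llama3 through the Ollama client; keeping them as parameters makes the flow easy to exercise without a running server.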

First, ensure you have Ollama running locally and pull the required models. Open your terminal and run:

ollama pull llama3

ollama pull nomic-embed-text
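Before wiring up the pipeline, it may help to confirm that both models are actually present. The sketch below assumes the Ollama server's `/api/tags` endpoint, which lists local models as `{"models": [{"name": "llama3:latest"}, ...]}`; the helper names `fetch_tags` and `missing_models` are hypothetical.

```python
import json
from urllib.request import urlopen

REQUIRED_MODELS = ("llama3", "nomic-embed-text")


def fetch_tags(host: str = "http://127.0.0.1:11434") -> dict:
    """Fetch the list of locally available models from a running Ollama server."""
    with urlopen(f"{host}/api/tags", timeout=5) as resp:
        return json.load(resp)


def missing_models(tags: dict, required=REQUIRED_MODELS) -> list[str]:
    """Return the required models that are absent from an /api/tags payload."""
    # Names carry a tag suffix such as "llama3:latest"; compare base names only.
    installed = {m["name"].split(":", 1)[0] for m in tags.get("models", [])}
    return [name for name in required if name not in installed]
```

Calling `missing_models(fetch_tags())` against a running server should return an empty list once both pulls have finished.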

Choose your preferred language environment to house the orchestration logic. For TypeScript, install the official client first: npm install ollama

// index.ts
import { Ollama } from 'ollama';

const ollama = new Ollama({ host: 'http://127.0.0.1:11434' });

async function generateLocalEmbedding(text: string): Promise<number[]> {
  const response = await ollama.embeddings({
    model: 'nomic-embed-text',
    prompt: text,
  });
  return response.embedding;
}

For Python, install the official client first: pip install ollama

import asyncio
from ollama import AsyncClient

client = AsyncClient(host='http://127.0.0.1:11434')

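async def generate_local_embedding(text: str) -> list[float]:
    """Embed a single string with the local nomic-embed-text model."""
    response = await client.embed(
        model='nomic-embed-text',
        input=text
    )
    # embed() returns one vector per input; we sent one string, so take the first.
    return response['embeddings'][0]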
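The embedding helper works on one string at a time, so documents need to be split into chunks before they are indexed. The following is a minimal sketch assuming fixed-size character windows with overlap; `chunk_text`, `build_index`, and the default sizes are hypothetical choices, not something the original article specifies.

```python
from typing import Awaitable, Callable


def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and overlap must be in [0, size)")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


async def build_index(
    chunks: list[str],
    embed: Callable[[str], Awaitable[list[float]]],
) -> list[tuple[str, list[float]]]:
    """Pair each chunk with its embedding to form the local vector array."""
    return [(chunk, await embed(chunk)) for chunk in chunks]
```

Passing `generate_local_embedding` as `embed` would produce the list of `(chunk, vector)` pairs that the similarity search scans later.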

When querying the local vector array, we compute the cosine similarity between the query embedding and each stored chunk embedding, then keep the highest-scoring chunks as the most relevant context.

function cosineSimilarity(vecA: number[], vecB: number[]): number {
  const dotProduct = vecA.reduce((sum, a, i) => sum + a * vecB[i], 0);
  const normA = Math.sqrt(vecA.reduce((sum, a) => sum + a * a, 0));
  const normB = Math.sqrt(vecB.reduce((sum, b) => sum + b * b, 0));

  // Guard against zero-length vectors, which would otherwise yield NaN.
  if (normA === 0 || normB === 0) return 0;

  return dotProduct / (normA * normB);
}
The same function in Python:
import math

def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    dot_product = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))

    if not norm_a or not norm_b:
        return 0.0  # Prevent division by zero

    return dot_product / (norm_a * norm_b)
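With a similarity function in place, retrieval reduces to scoring every stored chunk and keeping the best few. The sketch below assumes the index is a plain list of `(chunk, vector)` pairs; `top_k` is a hypothetical helper, and the cosine function is repeated only so the block runs on its own.

```python
import heapq
import math


def cosine_similarity(vec_a: list[float], vec_b: list[float]) -> float:
    dot_product = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    if not norm_a or not norm_b:
        return 0.0  # Prevent division by zero
    return dot_product / (norm_a * norm_b)


def top_k(
    query_vec: list[float],
    index: list[tuple[str, list[float]]],
    k: int = 3,
) -> list[tuple[float, str]]:
    """Return the k highest-scoring (score, chunk) pairs, best first."""
    scored = ((cosine_similarity(query_vec, vec), text) for text, vec in index)
    # nlargest keeps only k candidates in memory instead of sorting everything.
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])
```

This is a linear scan, which is usually adequate for a few thousand chunks; a larger corpus would likely call for a dedicated vector store instead.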

Building local agentic workflows gives you complete control over your data lifecycle and eliminates per-request API bills.

Let me know in the comments below: Are you running your LLMs locally or sticking to cloud APIs for production?

This tutorial on Building Local RAG with Ollama provides an excellent visual look at parsing document chunks and handling embedding shapes using the official Python libraries used in this guide.

Read original on dev.to → dev.to/ussdlover/how-to-build-a-high-performance…