{"slug": "how-i-rescued-a-rag-assistant-from-memory-leaks-and-got-it-running-on-a-512mb", "title": "How I rescued a RAG assistant from memory leaks and got it running on a 512MB RAM free tier", "summary": "A developer rescued a Retrieval-Augmented Generation (RAG) assistant from memory leaks and deployed it on a Render free-tier instance with a 512MB RAM limit. The engineer re-engineered the pipeline by replacing heavy PyTorch models with FastEmbed, baking models into Docker images, implementing hybrid search, and setting up automated evaluations with MLflow. The resulting production-grade assistant is specifically optimized for German manufacturing compliance and speed requirements.", "body_md": "A few weeks ago, I had a classic \"works on my machine\" moment. I had built a nice RAG prototype locally using Ollama and PyTorch. But when I tried to deploy it for staging on a Render free-tier instance (which has a brutal 512MB RAM limit), the server instantly crashed with out-of-memory (OOM) errors. This post is a step-by-step breakdown of how I re-engineered the pipeline to get a production-ready RAG assistant live: moving from heavy PyTorch models to FastEmbed, baking models into Docker images, implementing hybrid search, and setting up automated evaluations with MLflow.\n\nIn the industrial domain, AI holds massive promise. In Germany's heavy manufacturing sector, home to giants like Siemens, Bosch, and BMW, accessing the right maintenance instructions quickly can mean the difference between a minor schedule adjustment and a multi-million-euro line stoppage. 
However, applying a standard, academic-style Retrieval-Augmented Generation (RAG) pipeline directly to complex technical manuals typically fails.\n\nThis article details how I transformed a broken, slow RAG prototype into a hardened, high-performance, production-grade assistant specifically optimized for German manufacturing compliance and speed requirements.\n\nStandard RAG pipelines follow a basic procedure: chunk a document, run a vector search, pass the top chunks to an LLM, and output the result.\n\nWhen applied to a **200-page compressor manual**, this naive approach collapses due to three factors:\n\nTo solve these challenges, I built a multi-stage retrieval and generation architecture using **LlamaIndex**, **Qdrant**, and **Mistral-7B**.\n\n```mermaid\ngraph TD\n    Query[User Query] -->|HyDE Transformation| HyDE[Hypothetical Doc]\n    HyDE -->|Dense Search| VectorStore[(Qdrant Vector Store)]\n    Query -->|Keyword Search| BM25[BM25 Retriever]\n    VectorStore -->|Top K Chunks| RRF[RRF Hybrid Fusion]\n    BM25 -->|Top K Chunks| RRF\n    RRF -->|Combined Chunks| Reranker[Cross-Encoder Reranker]\n    Reranker -->|Top 3 Chunks| Deduplicator[SHA-256 Deduplication]\n    Deduplicator -->|Ground Truth Chunks| LLM[Mistral-7B Generator]\n    LLM -->|Stream Response| Response[SSE Stream Client]\n```\n\nTechnical queries can be highly variable. A technician might ask *\"What should be done if the compressor's high-pressure warning transducer value approaches the limit?\"* while the manual describes the issue using passive engineering specifications.\n\nI implemented **Hypothetical Document Embeddings (HyDE)**. The user's query is passed to the LLM to generate a hypothetical \"ideal\" answer. 
This hypothetical answer, rich in technical vocabulary, is then embedded and used for dense vector search, so the search vector sits closer to the manual's own wording, which improves retrieval recall.\n\nVector search (dense retrieval) is excellent for conceptual matching but struggles with specific numbers or part identifiers (e.g., \"5 kW\", \"Model-X\").\n\nI built a **Hybrid Retriever** combining dense vector search (via Qdrant) and sparse keyword retrieval (BM25). The results from both retrievers are merged using **Reciprocal Rank Fusion (RRF)**:\n\n$$RRF(d) = \\sum_{m \\in M} \\frac{1}{k + r_m(d)}$$\n\nwhere $M$ is the set of retrievers (here, dense search and BM25), $k = 60$ is a constant, and $r_m(d)$ is the rank of document $d$ in the results of retriever $m$. This fuses semantic alignment with exact keyword precision.\n\nRetrieving 6-10 chunks covers the necessary context but introduces noise and consumes precious context window tokens, slowing down LLM generation.\n\nI integrated a custom **Cross-Encoder Reranker** (`ms-marco-MiniLM-L-6-v2`). While bi-encoders (like BGE) embed queries and documents separately, a cross-encoder performs full self-attention over the query and chunk together, scoring their precise relationship. This allows us to reduce our context from 6 chunks down to the top 3 highly relevant ones without losing critical facts.\n\nIn manuals, certain tables or notices (such as safety warnings) repeat on multiple pages. Keeping duplicate chunks wastes context capacity and creates repetitive LLM answers.\n\nI implemented a postprocessor that normalizes chunk text and deduplicates using a **SHA-256 hash** of the normalized text together with Jaccard text similarity (threshold = 0.85).\n\nYou cannot optimize what you do not measure. 
Rather than relying on sporadic manual \"vibe checks,\" I established a rigorous, automated **LLM-as-a-Judge** evaluation loop using **RAGAS** and **MLflow**.\n\nI curated a production-grade evaluation dataset of **50+ Q&A pairs** directly from real industrial manuals, distributed across:\n\nDuring baseline evaluations, I noticed a critical bottleneck: the local Mistral model was hallucinating safety regulations because of context window truncation.\n\nI designed an experiment comparing `num_ctx` window sizes:\n\n| Context Window (`num_ctx`) | Faithfulness | Context Recall | p95 Latency | Status |\n|---|---|---|---|---|\n| 512 (Baseline) | 0.583 | 0.554 | ~1.9s | ⚠️ High context truncation |\n| 2048 (Optimal) | 0.724 | 0.712 | ~3.2s | ✅ Low truncation, high accuracy |\n| 4096 (Wasteful) | 0.731 | 0.718 | ~5.9s | ❌ Too slow for production |\n\nBy moving to **`num_ctx: 2048`**, the retrieved context fit with little truncation, boosting faithfulness from 0.583 to 0.724 and context recall from 0.554 to 0.712.\n\nTo transition from a developer script to a production service, I re-engineered the FastAPI web service to support high concurrency, real-time streaming, and robust security.\n\nSynchronous Python web handlers block on I/O. I rewrote all FastAPI endpoints to be fully async. I pooled the remote `QdrantClient` thread-safely via a global singleton and instantiated an `AsyncQdrantClient` connection pool, ensuring concurrent database handles are shared efficiently.\n\nTo achieve a p95 latency under the strict **2.0-second limit**, I implemented two layers of caching:\n\nOne layer uses `BGEEmbedder` to cache computed query embedding vectors in a local LRU cache, preventing repetitive tensor computations.\n\nFor long-running generations, keeping a user waiting for a full payload ruins the experience. I created the `/query/stream` endpoint, which returns a real-time token stream using **Server-Sent Events (SSE)**. 
The UI renders each text delta as soon as it is generated.\n\nTo secure the public endpoint, I built:\n\n- `X-API-Key` validation check on all sensitive endpoints.\n- `Retry-After` headers to prevent resource exhaustion.\n\nTo guarantee that \"it works on my machine\" carries over to a cloud environment, I containerized the entire pipeline.\n\n(`UID=1000`) and includes a strict health check monitoring local API latency.\n\nThis project demonstrates the transition from a simple machine learning model to a robust, compliant enterprise-grade system:", "url": "https://wpnews.pro/news/how-i-rescued-a-rag-assistant-from-memory-leaks-and-got-it-running-on-a-512mb", "canonical_source": "https://dev.to/shaikhadibbb/how-i-rescued-a-rag-assistant-from-memory-leaks-and-got-it-running-on-a-512mb-ram-free-tier-4co9", "published_at": "2026-05-29 09:02:07+00:00", "updated_at": "2026-05-29 09:11:28.167079+00:00", "lang": "en", "topics": ["artificial-intelligence", "machine-learning", "large-language-models", "generative-ai", "mlops"], "entities": ["Ollama", "PyTorch", "Render", "FastEmbed", "MLflow", "Siemens", "Bosch", "BMW"], "alternates": {"html": "https://wpnews.pro/news/how-i-rescued-a-rag-assistant-from-memory-leaks-and-got-it-running-on-a-512mb", "markdown": "https://wpnews.pro/news/how-i-rescued-a-rag-assistant-from-memory-leaks-and-got-it-running-on-a-512mb.md", "text": "https://wpnews.pro/news/how-i-rescued-a-rag-assistant-from-memory-leaks-and-got-it-running-on-a-512mb.txt", "jsonld": "https://wpnews.pro/news/how-i-rescued-a-rag-assistant-from-memory-leaks-and-got-it-running-on-a-512mb.jsonld"}}