How I rescued a RAG assistant from memory leaks and got it running on a 512MB RAM free tier

A developer rescued a Retrieval-Augmented Generation (RAG) assistant from memory leaks and deployed it on a Render free-tier instance with a 512MB RAM limit. The engineer re-engineered the pipeline by replacing heavy PyTorch models with FastEmbed, baking models into Docker images, implementing hybrid search, and setting up automated evaluations with MLflow. The resulting production-grade assistant is specifically optimized for German manufacturing compliance and speed requirements.

A few weeks ago, I had a classic "works on my machine" moment. I had built a nice RAG prototype locally using Ollama and PyTorch. But when I tried to deploy it for staging on a Render free-tier instance (which has a brutal 512MB RAM limit), the server instantly crashed with Out-Of-Memory (OOM) errors.

This post is a step-by-step breakdown of how I re-engineered the pipeline to get a production-ready RAG assistant live: moving from heavy PyTorch models to FastEmbed, baking models into Docker images, implementing hybrid search, and setting up automated evaluations with MLflow.

In the industrial domain, AI holds massive promise. In Germany's heavy manufacturing sector, spanning giants like Siemens, Bosch, and BMW, accessing the right maintenance instructions quickly can mean the difference between a minor schedule adjustment and a multi-million-euro line stoppage. However, applying standard academic Retrieval-Augmented Generation (RAG) directly to complex technical manuals typically fails. This article details how I transformed a broken, slow RAG prototype into a hardened, high-performance, production-grade assistant specifically optimized for German manufacturing compliance and speed requirements.

Standard RAG pipelines follow a basic procedure: chunk a document, run standard vector search, pass the top chunks to an LLM, and output the result. When applied to a 200-page compressor manual, this naive approach collapses due to three factors:

To solve these challenges, I built a multi-stage retrieval and generation architecture using LlamaIndex, Qdrant, and Mistral-7B.
```mermaid
graph TD
    Query[User Query] -->|HyDE Transformation| HyDE[Hypothetical Doc]
    HyDE -->|Dense Search| VectorStore[Qdrant Vector Store]
    Query -->|Keyword Search| BM25[BM25 Retriever]
    VectorStore -->|Top K Chunks| RRF[RRF Hybrid Fusion]
    BM25 -->|Top K Chunks| RRF
    RRF -->|Combined Chunks| Reranker[Cross-Encoder Reranker]
    Reranker -->|Top 3 Chunks| Deduplicator[SHA-256 Deduplication]
    Deduplicator -->|Ground Truth Chunks| LLM[Mistral-7B Generator]
    LLM -->|Stream Response| Response[SSE Stream Client]
```

Technical queries can be highly variable. A technician might ask "What should be done if the compressor's high-pressure warning transducer value approaches the limit?" while the manual describes the issue using passive engineering specifications. I implemented Hypothetical Document Embeddings (HyDE). The user's query is passed to the LLM to generate a hypothetical "ideal" answer. This hypothetical answer, rich in technical terminology, is then embedded and used for dense vector search, which increases retrieval recall.

Vector search (dense retrieval) is excellent for conceptual matching but struggles with specific numbers or part names (e.g., "5 kW", "Model-X"). I built a Hybrid Retriever combining dense vector search via Qdrant with sparse keyword retrieval (BM25). The results from both retrievers are merged using Reciprocal Rank Fusion (RRF):

$$\mathrm{RRF}(d) = \sum_{m \in M} \frac{1}{k + r_m(d)}$$

where $k = 60$ is a constant, and $r_m(d)$ is the rank of document $d$ in retriever $m$. This fuses semantic alignment with exact keyword precision.

Retrieving 6-10 chunks covers the necessary context but introduces noise and consumes precious context window tokens, slowing down LLM generation. I integrated a custom Cross-Encoder Reranker (`ms-marco-MiniLM-L-6-v2`). While bi-encoders like BGE embed queries and documents separately, a cross-encoder performs full self-attention over the query and chunk together, scoring their precise relationship.
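The Reciprocal Rank Fusion step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual retriever code: the function name `reciprocal_rank_fusion`, the chunk IDs, and the alphabetical tie-break are assumptions made for the example. Only the formula and $k = 60$ come from the text.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one scored list.

    Each document receives 1 / (k + rank) from every retriever that
    returned it, where rank starts at 1 for the top hit.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID (an arbitrary choice).
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical top-3 results from the dense and the BM25 retriever.
dense_hits = ["chunk_12", "chunk_07", "chunk_31"]
sparse_hits = ["chunk_07", "chunk_44", "chunk_12"]

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Chunks that appear in both lists accumulate two terms, so they rise above chunks that only one retriever found, even if that single retriever ranked them highly.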
Reranking lets me cut the context from 6 chunks down to the 3 most relevant ones without losing critical facts.

In manuals, certain tables or notices such as safety warnings repeat on multiple pages. Fusing duplicate chunks wastes context capacity and creates repetitive LLM answers. I implemented a postprocessor that normalizes chunk text and deduplicates using a SHA-256 hash of the normalized text and a Jaccard text-similarity check (threshold = 0.85).

You cannot optimize what you do not measure. Rather than relying on sporadic manual "vibe checks," I established a rigorous, automated LLM-as-a-Judge evaluation loop using RAGAS and MLflow. I curated a production-grade evaluation dataset of 50+ Q&A pairs directly from real industrial manuals, distributed across:

During baseline evaluations, I noticed a critical bottleneck: the local Mistral model was hallucinating safety regulations because of context window truncation. I designed an experiment comparing `num_ctx` window sizes:

| Context Window (`num_ctx`) | Faithfulness | Context Recall | p95 Latency | Status |
|---|---|---|---|---|
| 512 (Baseline) | 0.583 | 0.554 | ~1.9s | ⚠️ High context truncation |
| 2048 (Optimal) | 0.724 | 0.712 | ~3.2s | ✅ Low truncation, high accuracy |
| 4096 (Wasteful) | 0.731 | 0.718 | ~5.9s | ❌ Too slow for production |

By moving to `num_ctx: 2048`, the retrieved context fit perfectly, boosting faithfulness from 0.583 to 0.724 and context recall from 0.554 to 0.712.

To transition from a developer script to a production service, I re-engineered the FastAPI web service to support high concurrency, real-time streaming, and robust security. Standard Python web apps block on I/O, so I rewrote all FastAPI endpoints to be fully async. I shared the remote `QdrantClient` thread-safely through a global singleton and instantiated an `AsyncQdrantClient` connection pool, so that concurrent requests share database handles efficiently.
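The thread-safe global singleton mentioned above can be sketched with double-checked locking. This is only an illustration of the pattern under stated assumptions: `StubVectorClient` is a placeholder that stands in for a real vector-store client so the example runs without a database, and `get_shared_client` is a hypothetical helper name, not the service's actual code.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


class StubVectorClient:
    """Placeholder for a real vector-store client; counts constructions."""

    instances_created = 0

    def __init__(self) -> None:
        StubVectorClient.instances_created += 1


_client: Optional[StubVectorClient] = None
_client_lock = threading.Lock()


def get_shared_client(factory: Callable[[], StubVectorClient]) -> StubVectorClient:
    """Return the process-wide client, creating it at most once."""
    global _client
    if _client is None:  # fast path: skip the lock once initialised
        with _client_lock:
            if _client is None:  # re-check while holding the lock
                _client = factory()
    return _client


# Simulate many concurrent request handlers asking for the client.
with ThreadPoolExecutor(max_workers=8) as pool:
    handles = list(
        pool.map(lambda _: get_shared_client(StubVectorClient), range(32))
    )

print(StubVectorClient.instances_created)
```

In a real service the factory would construct the actual client instead of the stub. The second `None` check inside the lock is what prevents two threads that both passed the first check from each building their own connection.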
To achieve a p95 latency under the strict 2.0-second limit, I implemented two layers of caching. One of them has `BGEEmbedder` cache computed query embedding vectors in a local LRU cache, preventing repeated tensor computations.

For long-running generations, keeping a user waiting for a full payload ruins the experience. I created the `/query/stream` endpoint, which returns a real-time token stream using Server-Sent Events (SSE). The UI renders each text delta as soon as it is generated.

To secure the public endpoint, I built an `X-API-Key` validation check on all sensitive endpoints and `Retry-After` headers to prevent resource exhaustion.

To guarantee that "it works on my machine" translates to a cloud environment, I containerized the entire pipeline. The container runs as `UID=1000` and includes a strict health check monitoring local API latency.

This project demonstrates the transition from a simple machine learning model to a robust, compliant enterprise-grade system.