[ARTICLE · art-17399] src=dev.to pub= topic=artificial-intelligence verified=true sentiment=↑ positive

How I rescued a RAG assistant from memory leaks and got it running on a 512MB RAM free tier

A developer rescued a Retrieval-Augmented Generation (RAG) assistant from memory leaks and deployed it on a Render free-tier instance with a 512MB RAM limit. The engineer re-engineered the pipeline by replacing heavy PyTorch models with FastEmbed, baking models into Docker images, implementing hybrid search, and setting up automated evaluations with MLflow. The resulting production-grade assistant is specifically optimized for German manufacturing compliance and speed requirements.

5 min read · Published May 29, 2026

A few weeks ago, I had a classic "works on my machine" moment. I had built a nice RAG prototype locally using Ollama and PyTorch. But when I tried to deploy it for staging on a Render free-tier instance (which has a brutal 512MB RAM limit), the server instantly crashed with Out-Of-Memory (OOM) errors. This post is a step-by-step breakdown of how I re-engineered the pipeline—moving from heavy PyTorch models to FastEmbed, baking models into Docker images, implementing hybrid search, and setting up automated evaluations with MLflow—to get a production-ready RAG assistant live.

In the industrial domain, AI holds massive promise. In Germany's heavy manufacturing sector, which spans giants like Siemens, Bosch, and BMW, accessing the right maintenance instructions quickly can mean the difference between a minor schedule adjustment and a multi-million-euro line stoppage. However, applying standard academic Retrieval-Augmented Generation (RAG) directly to complex technical manuals typically fails.

This article details how I transformed a broken, slow RAG prototype into a hardened, high-performance, production-grade assistant specifically optimized for German manufacturing compliance and speed requirements.

Standard RAG pipelines follow a basic procedure: chunk a document, run standard vector search, pass top chunks to an LLM, and output the result.

When applied to a 200-page compressor manual, this naive approach collapses due to three factors:

To solve these challenges, I built a multi-stage retrieval and generation architecture using LlamaIndex, Qdrant, and Mistral-7B.

graph TD
    Query[User Query] -->|HyDE Transformation| HyDE[Hypothetical Doc]
    HyDE -->|Dense Search| VectorStore[(Qdrant Vector Store)]
    Query -->|Keyword Search| BM25[BM25 Retriever]
    VectorStore -->|Top K Chunks| RRF[RRF Hybrid Fusion]
    BM25 -->|Top K Chunks| RRF
    RRF -->|Combined Chunks| Reranker[Cross-Encoder Reranker]
    Reranker -->|Top 3 Chunks| Deduplicator[SHA-256 Deduplication]
    Deduplicator -->|Ground Truth Chunks| LLM[Mistral-7B Generator]
    LLM -->|Stream Response| Response[SSE Stream Client]

Technical queries can be highly variable. A technician might ask "What should be done if the compressor's high-pressure warning transducer value approaches the limit?" while the manual describes the issue using passive engineering specifications.

I implemented Hypothetical Document Embeddings (HyDE). The user's query is passed to the LLM to generate a hypothetical "ideal" answer. This hypothetical answer, rich in technical syntax, is then embedded and used for dense vector search, drastically increasing our retrieval recall.
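The HyDE step can be sketched roughly as follows. This is a minimal illustration, not the article's actual code: `hyde_query_vector`, the prompt wording, and the injected `generate` and `embed` callables are all hypothetical names standing in for the LLM and the embedding model.

```python
from typing import Callable, Sequence


def hyde_query_vector(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
) -> list[float]:
    """Embed a hypothetical answer to the query instead of the raw query.

    `generate` and `embed` are hypothetical callables (assumptions): one
    wraps the LLM, the other wraps the embedding model.
    """
    # Prompt wording is an assumption chosen for illustration.
    prompt = (
        "Write a short passage from a technical manual that answers "
        f"the following question.\n\nQuestion: {query}\n\nPassage:"
    )
    hypothetical_doc = generate(prompt)
    # Fall back to the raw query if the LLM returns nothing usable.
    text_to_embed = hypothetical_doc.strip() or query
    return list(embed(text_to_embed))


# Stub callables so the sketch runs without a model.
def fake_generate(prompt: str) -> str:
    return "  If discharge pressure nears the limit, stop the compressor. "


def fake_embed(text: str) -> list[float]:
    return [float(len(text))]


vector = hyde_query_vector("What if pressure is too high?", fake_generate, fake_embed)
```

The resulting vector would then be sent to the dense retriever in place of the embedding of the original question.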

Vector search (dense retrieval) is excellent for conceptual matching but struggles with specific numbers or parts (e.g., "5 kW", "Model-X").

I built a Hybrid Retriever combining dense vector search (via Qdrant) and sparse keyword retrieval (BM25). The results from both retrievers are merged using Reciprocal Rank Fusion (RRF):

$$RRF(d) = \sum_{m \in M} \frac{1}{k + r_m(d)}$$

where $k = 60$ is a constant, and $r_m(d)$ is the rank of document $d$ in retriever $m$. This fuses semantic alignment with exact keyword precision.
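The formula above can be illustrated with a few lines of plain Python. This is a sketch under stated assumptions: the function name, the chunk IDs, and the two example rankings are invented for illustration, ranks start at 1, and a document missing from a retriever's list simply contributes nothing for that retriever.

```python
from collections import defaultdict
from typing import Hashable, Iterable, Sequence


def reciprocal_rank_fusion(
    rankings: Iterable[Sequence[Hashable]], k: int = 60
) -> list[tuple[Hashable, float]]:
    """Fuse several ranked ID lists into one list sorted by RRF score."""
    scores: dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit of each retriever adds 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk IDs returned by the dense and BM25 retrievers.
dense_hits = ["c7", "c2", "c9"]
bm25_hits = ["c2", "c4", "c7"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Chunks that appear in both lists accumulate two terms, so they rise above chunks that only one retriever found.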

Retrieving 6-10 chunks covers the necessary context but introduces noise and consumes precious context window tokens, slowing down LLM generation.

I integrated a custom Cross-Encoder Reranker (`ms-marco-MiniLM-L-6-v2`). While bi-encoders (like BGE) embed queries and documents separately, a cross-encoder performs full self-attention over the query and chunk together, scoring their precise relationship. This allows us to reduce our context from 6 chunks down to the top 3 highly relevant chunks without losing critical facts.
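The reranking step reduces to "score every (query, chunk) pair, keep the best few". The sketch below keeps the scorer injectable so it runs without downloading a model; `rerank_top_n` and `toy_overlap_scorer` are hypothetical names. It is an assumption, not something the article states, that the real scorer would be something like the `predict` method of `sentence_transformers.CrossEncoder`, which takes a list of (query, passage) pairs.

```python
from typing import Callable, Sequence


def rerank_top_n(
    query: str,
    chunks: Sequence[str],
    score_pairs: Callable[[list[tuple[str, str]]], Sequence[float]],
    top_n: int = 3,
) -> list[str]:
    """Score each (query, chunk) pair jointly and keep the top_n chunks."""
    pairs = [(query, chunk) for chunk in chunks]
    # Assumption: in production, score_pairs would be a cross-encoder,
    # e.g. CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict
    scores = score_pairs(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda cs: cs[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]


def toy_overlap_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    """Stand-in scorer: counts shared lowercase words (NOT a cross-encoder)."""
    return [
        float(len(set(q.lower().split()) & set(c.lower().split())))
        for q, c in pairs
    ]


candidates = [
    "clean the air filter",
    "reset the pressure valve slowly",
    "pressure readings",
    "valve reset checklist",
]
top_chunks = rerank_top_n("reset pressure valve", candidates, toy_overlap_scorer)
```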

In manuals, certain tables or notices (such as safety warnings) repeat on multiple pages. Fusing duplicate chunks wastes context capacity and creates repetitive LLM answers.

I implemented a postprocessor that normalizes chunk text and deduplicates in two passes: exact duplicates are caught with a SHA-256 hash of the normalized text, and near-duplicates with Jaccard text similarity (threshold = 0.85).
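A minimal sketch of such a postprocessor is shown below. Several details are assumptions made for illustration: normalization is lowercasing plus whitespace collapsing, Jaccard similarity is computed over word sets (the original may use shingles or another unit), and a chunk is dropped when similarity is greater than or equal to the threshold.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (assumed normalization)."""
    return re.sub(r"\s+", " ", text).strip().lower()


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets; two empty sets count as equal."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(chunks: list[str], threshold: float = 0.85) -> list[str]:
    """Keep the first occurrence of each chunk; drop exact and near duplicates."""
    seen_hashes: set[str] = set()
    kept_token_sets: list[set[str]] = []
    kept: list[str] = []
    for chunk in chunks:
        norm = normalize(chunk)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # exact duplicate after normalization
        tokens = set(norm.split())
        if any(jaccard(tokens, prev) >= threshold for prev in kept_token_sets):
            continue  # near duplicate of a chunk we already kept
        seen_hashes.add(digest)
        kept_token_sets.append(tokens)
        kept.append(chunk)
    return kept


# Hypothetical chunks: a warning, a re-spaced copy, a near copy, and new text.
sample = [
    "warning disconnect mains power before servicing compressor",
    "WARNING  disconnect mains power before servicing compressor",
    "warning disconnect mains power before servicing compressor immediately",
    "check oil level weekly",
]
unique_chunks = deduplicate(sample)
```

The hash check is cheap and catches verbatim repeats first, so the pairwise Jaccard comparison only runs for chunks that are not byte-identical after normalization.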

You cannot optimize what you do not measure. Rather than relying on sporadic manual "vibe checks," I established a rigorous, automated LLM-as-a-Judge evaluation loop using RAGAS and MLflow.

I curated a production-grade evaluation dataset of 50+ Q&A pairs directly from real industrial manuals, distributed across:

During baseline evaluations, I noticed a critical bottleneck: the local Mistral model was hallucinating safety regulations because of context window (`num_ctx`) truncation.

I designed an experiment comparing `num_ctx` window sizes:

| Context Window (`num_ctx`) | Faithfulness | Context Recall | p95 Latency | Status |
|---|---|---|---|---|
| 512 (Baseline) | 0.583 | 0.554 | ~1.9s | ⚠️ High context truncation |
| 2048 (Optimal) | 0.724 | 0.712 | ~3.2s | ✅ Low truncation, high accuracy |
| 4096 (Wasteful) | 0.731 | 0.718 | ~5.9s | ❌ Too slow for production |

By moving to **`num_ctx: 2048`**, the retrieved context fit with little truncation, boosting faithfulness from 0.583 to 0.724 and context recall from 0.554 to 0.712.
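For readers unfamiliar with the setting: in Ollama's REST API, `num_ctx` is passed inside the `options` object of a generate request. The sketch below only builds the request body and does not send it. The helper name `build_ollama_request`, the model tag `"mistral"`, and the placeholder prompt are assumptions, and the article does not confirm that the deployed service talks to Ollama this way.

```python
import json


def build_ollama_request(
    prompt: str, num_ctx: int = 2048, model: str = "mistral"
) -> dict:
    """Build a JSON body for Ollama's /api/generate endpoint (not sent here)."""
    return {
        "model": model,  # assumed model tag
        "prompt": prompt,
        "stream": True,
        # num_ctx sets the context window size in tokens for this request.
        "options": {"num_ctx": num_ctx},
    }


payload = build_ollama_request("Context: <retrieved chunks>\n\nQuestion: <user query>")
body = json.dumps(payload)
```

Keeping `num_ctx` as a parameter makes it straightforward to sweep 512, 2048, and 4096 in an evaluation run like the one in the table.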

To transition from a developer script to a production service, I re-engineered the FastAPI web service to support high concurrency, real-time streaming, and robust security.

Synchronous Python web handlers block on I/O, so I rewrote all FastAPI endpoints to be fully async. I pooled the remote `QdrantClient` thread-safely via a global singleton and instantiated an `AsyncQdrantClient` connection pool, ensuring concurrent database handles are shared efficiently.

To achieve a p95 latency under the strict 2.0-second limit, I implemented two layers of caching:

- `BGEEmbedder` caches computed query embedding vectors in a local LRU cache, preventing repeated tensor computations.

For long-running generations, keeping a user waiting for a full payload ruins the experience. I created the `/query/stream` endpoint, which returns a real-time token stream using Server-Sent Events (SSE). The UI renders each text delta as soon as it is generated.
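On the wire, an SSE stream is just text frames of the form `data: ...` followed by a blank line. The generator below sketches that framing under a few assumptions: the `delta` JSON key and the closing `done` event are hypothetical conventions, not taken from the article. In FastAPI, a generator like this could be wrapped in `StreamingResponse(..., media_type="text/event-stream")`.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Turn a stream of text deltas into Server-Sent Events frames."""
    for token in tokens:
        # JSON-encode the delta so a newline inside a token cannot
        # terminate the frame early.
        yield f"data: {json.dumps({'delta': token})}\n\n"
    # Hypothetical end-of-stream marker so the client knows when to stop.
    yield "event: done\ndata: {}\n\n"


# Hypothetical token stream, as an LLM client might produce it.
frames = list(sse_events(["Hel", "lo\n"]))
```

Each frame can be flushed to the client as soon as the model emits the token, which is what lets the UI render text incrementally.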

To secure the public endpoint, I built:

- An `X-API-Key` validation check on all sensitive endpoints.
- `Retry-After` headers to prevent resource exhaustion.

To guarantee "it works on my machine" translates perfectly to a cloud environment, I containerized the entire pipeline.

The container runs as a non-root user (`UID=1000`) and includes a strict health check monitoring local API latency.

This project demonstrates the transition from a simple machine learning model to a robust, compliant enterprise-grade system.
