Building a RAG Pipeline From Scratch: What SmartQueue Taught Me About Retrieval

A developer built a RAG pipeline for SmartQueue, a Go-based distributed task queue for IT support tickets, using BM25 search instead of vector search to ground LLM responses in internal knowledge. The pipeline checks for prompt injection, retrieves top matches from 10 runbooks via BM25, and streams answers from Groq's LLaMA 3.3 70B model. The developer replaced ChromaDB with a 50-line BM25 implementation to avoid deployment issues in a resource-constrained container on Hugging Face Spaces.

When I set out to add an AI assistant to SmartQueue, a distributed task queue I'd already built in Go for handling IT support tickets, the obvious move was to bolt on an LLM and call it done. Type a question, get an answer. But a generic LLM doesn't know your company's password reset procedure, your P1 outage runbook, or that refunds need manager approval above $500. It needed grounding in actual internal knowledge. That's the job retrieval-augmented generation (RAG) is built for: pull the relevant facts out of your own documents first, then hand them to the model as context instead of trusting it to know your business.

This post walks through how that pipeline actually works, the architectural decision I reversed midway through (and why), the numbers I picked for things like retrieval depth and temperature, and an honest take on whether any of it counts as "real" RAG.

SmartQueue Bot lives inside the Queue Health and AI Bot tabs of the dashboard. An agent picks a ticket, asks a question like "what are the immediate steps for this database outage," and the bot streams back an answer token by token, grounded in a small internal knowledge base of IT runbooks. The request flow looks like this:

```
agent question
      |
      v
prompt-injection check (regex guardrails)
      |
      v
BM25 search over 10 runbooks --> top 4 matches
      |
      v
system prompt assembled: ticket context + runbook excerpts
      |
      v
Groq LLaMA 3.3 70B (streamed via SSE, with last 10 turns of session history)
      |
      v
response streamed to client + written back to Redis session memory
```

Three things happen before any text reaches the model: the user's message is checked for prompt injection attempts, the message is used as a query against the knowledge base, and the top matches get woven into a system prompt alongside the ticket's category, priority, and description. The model never sees raw documents without that framing. It sees a structured brief.
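To make the guardrail and prompt-assembly steps concrete, here is a minimal sketch. It is not SmartQueue's actual code: the regex patterns, the function names `looks_like_injection` and `build_system_prompt`, and the prompt wording are all assumptions for illustration, and the retrieved excerpts are passed in by hand rather than coming from a real search.

```python
import re

# Hypothetical patterns: the post does not list the real guardrail regexes.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"reveal (your |the )?system prompt", re.I),
    re.compile(r"you are now\b", re.I),
]


def looks_like_injection(message: str) -> bool:
    """Return True if the message matches any known injection pattern."""
    return any(p.search(message) for p in INJECTION_PATTERNS)


def build_system_prompt(ticket: dict, excerpts: list[str]) -> str:
    """Wrap ticket fields and retrieved runbook excerpts in one structured brief."""
    context = "\n\n".join(
        f"[Runbook {i}]\n{text}" for i, text in enumerate(excerpts, 1)
    )
    return (
        "You are an IT support assistant. "
        "Answer only from the runbook excerpts below.\n"
        f"Ticket category: {ticket['category']}\n"
        f"Ticket priority: {ticket['priority']}\n"
        f"Ticket description: {ticket['description']}\n\n"
        f"Runbook excerpts:\n{context}"
    )


# Usage: reject suspicious input first, then assemble the brief.
ticket = {
    "category": "database",
    "priority": "P1",
    "description": "Primary DB unreachable",
}
question = "What are the immediate steps for this database outage?"
if not looks_like_injection(question):
    prompt = build_system_prompt(
        ticket, ["Fail over to the replica.", "Page the on-call DBA."]
    )
```

A regex check like this only catches known phrasings, so it is best read as a cheap first filter in front of the model, not a complete defense.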
The decision I reversed: ChromaDB, then BM25

The first version of the knowledge base used ChromaDB with its default ONNX embedding function: proper vector search, no torch dependency, queried through a thread pool so it wouldn't block the event loop. That's the textbook RAG setup, and it worked locally. It fell apart the moment I tried to deploy the whole stack as a single container on Hugging Face Spaces.

The deployment used supervisord to run Redis, the Go API, two Go worker replicas, and the FastAPI AI service all inside one container, and originally a separate ChromaDB process alongside them. That's five long-running processes competing for a small amount of memory and CPU in a free-tier container, with supervisord responsible for starting them in the right order and keeping them alive. ChromaDB was the one that kept causing startup races and silent failures. After enough commits with messages like "fix: remove ChromaDB from supervisord" and "fix: replace ChromaDB with in-memory BM25 search," I made the call to rip it out entirely.

The replacement is about 50 lines of pure Python, with no embedding model, no external process, and no network call:

```python
from collections import Counter

# CORPUS (the tokenized runbooks) and the _idf helper are defined
# elsewhere in the module and are not shown in this excerpt.


def bm25_score(query_tokens, doc_tokens, k1=1.5, b=0.75):
    # Average document length, recomputed on every call
    avg_dl = sum(len(d) for d in CORPUS) / len(CORPUS)
    tf = Counter(doc_tokens)
    score = 0.0
    for term in query_tokens:
        if term not in tf:
            continue
        idf = _idf(term, CORPUS)
        dl = len(doc_tokens)
        # Term-frequency saturation (k1) with length normalization (b)
        score += idf * (tf[term] * (k1 + 1)) / (tf[term] + k1 * (1 - b + b * dl / avg_dl))
    return score
```

This is the standard Okapi BM25 formula, computed fresh against the in-memory runbook corpus on every query. No index to build, no daemon to keep alive, no embedding latency on cold start. The trade-off is real: BM25 only matches on term overlap, so a query phrased very differently from the runbook's wording (synonyms, paraphrasing) won't score well.
But for a fixed set of 10 short, keyword-dense IT runbooks where users are typically searching with the same vocabulary the runbooks use ("VPN," "password reset," "outage"), that weakness barely shows up in practice. The thing that mattered more than retrieval quality at this scale was that the service now starts reliably every single time.

A few of the constants in this pipeline were deliberate tuning decisions rather than defaults I left untouched. None of this is a RAGAS-style evaluation with precision/recall/faithfulness scores. There's no eval harness here, just systems-level tuning based on the constraints I was working under (a free-tier LLM provider, a single demo container, and a knowledge base that doesn't change).

| Constant | Value | Why |
|---|---|---|
| Retrieved docs (k) | 4 | Enough runbook context to usually cover the right answer without bloating the prompt against the 800-token response budget |
| BM25 k1 / b | 1.5 / 0.75 | Standard Robertson defaults, since with only 10 documents there isn't enough signal to meaningfully tune these per-corpus |
| Bot temperature | 0.2 | Troubleshooting answers should be literal and repeatable, not creative |
| Classifier temperature | 0.1 | Output is parsed as JSON; near-deterministic reduces malformed responses |
| Recommender temperature | 0.3 | Slightly more room since it's reasoning over queue state, not just extracting fields |
| Bot max tokens | 800 | Long enough for multi-step troubleshooting guidance, short enough to keep streaming snappy |
| Classifier max tokens | 250 | The schema is small, just eight short fields and no prose |
| Session history window | last 10 turns, capped at 20 stored, 1-hour TTL in Redis | Enough continuity for a real troubleshooting conversation without memory growing unbounded |
| Rate limit | 30 requests/minute per session | Protects the free Groq quota from being burned by a single runaway client |
| LLM client retries | 0, with a 10s timeout | Every caller already has its own fallback (keyword classifier, rule-based recommender, canned bot response), so retrying into the same failure just adds latency before falling back anyway |

That last one is worth dwelling on. Every AI-backed endpoint in this system has a non-LLM fallback path. If Groq is rate-limited or down, the classifier falls back to keyword matching, the recommender falls back to threshold-based rules on queue depth, and the bot falls back to a templated response built from the same retrieved runbook excerpts. The system was designed to degrade, not fail, which matters a lot more when you're running on a free API tier than it would on a paid, SLA-backed one.

Does any of this count as "real" RAG? Strictly, yes: it retrieves before it generates, and the generation is conditioned on what's retrieved. But it's a narrow slice of what RAG can mean. There's no chunking (each runbook is indexed as one flat document), no re-ranking step, no hybrid retrieval, and no evaluation loop telling me whether the right runbook actually got surfaced for a given question. It's RAG sized correctly for the problem: a small, static, keyword-friendly knowledge base where the cost of building anything more elaborate would have outweighed the benefit.

Whether BM25-over-ChromaDB was "better" depends on what you're optimizing for. For retrieval quality on a larger, more varied corpus, an embedding-based approach would win, since BM25 degrades once questions stop reusing the document's own vocabulary. But for this deployment, with this knowledge base size and this hosting constraint, dropping the vector store was unambiguously the right call: it eliminated an entire class of deployment failures and removed a dependency for a problem that ten short documents don't actually need solved with embeddings.
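The vocabulary weakness is easy to see in a toy example. The sketch below is a self-contained BM25 ranker over three made-up runbook snippets. The snippets, the `tokenize`, `idf`, and `search` helpers, and the smoothed IDF variant are all assumptions for illustration; none of them are SmartQueue's real corpus or code.

```python
import math
import re
from collections import Counter

# Toy stand-ins for the runbooks: the real corpus is not shown in the post.
RUNBOOKS = {
    "vpn": "VPN connection fails: restart the VPN client and check the certificate expiry",
    "password": "Password reset: verify identity then issue a temporary password reset link",
    "outage": "Database outage: fail over to the replica and page the on-call engineer",
}


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


DOCS = {name: tokenize(text) for name, text in RUNBOOKS.items()}
AVG_DL = sum(len(d) for d in DOCS.values()) / len(DOCS)


def idf(term):
    """Smoothed IDF (a common variant that never goes negative)."""
    n = sum(1 for d in DOCS.values() if term in d)
    return math.log(1 + (len(DOCS) - n + 0.5) / (n + 0.5))


def bm25(query_tokens, doc_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of one document for one tokenized query."""
    tf = Counter(doc_tokens)
    norm = k1 * (1 - b + b * len(doc_tokens) / AVG_DL)
    return sum(
        idf(t) * tf[t] * (k1 + 1) / (tf[t] + norm)
        for t in query_tokens
        if t in tf
    )


def search(query, k=4):
    """Return up to k (runbook, score) pairs, dropping zero-score documents."""
    q = tokenize(query)
    scored = sorted(((bm25(q, d), name) for name, d in DOCS.items()), reverse=True)
    return [(name, score) for score, name in scored[:k] if score > 0]


# Shares "vpn" and "client" with the VPN runbook, so that runbook ranks first.
print(search("vpn client will not connect"))
# A paraphrase with no overlapping terms: nothing scores above zero.
print(search("cannot log in remotely"))
```

The second query is a plausible way to describe a VPN problem, but it shares no tokens with any snippet, so the result list is empty. An embedding-based retriever is designed to close exactly that gap.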
If I were extending this rather than rebuilding it, the next real upgrades would be a basic retrieval eval (even just "did the correct runbook end up in the top 4 for a labeled set of test questions"), splitting the longer runbooks into smaller chunks so the model gets more relevant text per retrieved slot, and a hybrid approach once the knowledge base grows past roughly fifty documents. Somewhere around that scale, pure keyword overlap stops being enough to catch the paraphrased queries vector search handles for free.

That's also roughly the direction I went on a separate project, AskMyDoc, where I paired BM25 with ChromaDB in a hybrid retriever, added HyDE-style query rewriting to bridge the vocabulary gap, and built a RAGAS-based evaluation harness to actually measure retrieval quality instead of eyeballing it.

SmartQueue's BM25-only pipeline was the right tool for a ten-document, single-container helpdesk demo. It's not the pipeline I'd reach for if the knowledge base were a thousand documents instead of ten, but knowing the difference, and being able to justify it with a real deployment failure rather than a hunch, is the actual lesson this project taught me.
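As a footnote to the first upgrade listed above, a top-k retrieval check needs very little code. The sketch below is hypothetical: `hit_rate_at_k`, the toy retriever, and the labeled questions are illustrative stand-ins, not something SmartQueue ships.

```python
from typing import Callable, Iterable


def hit_rate_at_k(
    labeled: Iterable[tuple[str, str]],
    retrieve: Callable[[str, int], list[str]],
    k: int = 4,
) -> float:
    """Fraction of questions whose expected runbook id appears in the top k."""
    pairs = list(labeled)
    hits = sum(1 for question, expected in pairs if expected in retrieve(question, k))
    return hits / len(pairs) if pairs else 0.0


# Toy keyword index and retriever, purely illustrative.
TOY_INDEX = {"vpn": {"vpn", "client"}, "password": {"password", "reset"}}


def toy_retrieve(question: str, k: int) -> list[str]:
    """Rank runbook ids by how many index terms the question shares with them."""
    words = set(question.lower().split())
    ranked = sorted(
        TOY_INDEX, key=lambda doc_id: len(words & TOY_INDEX[doc_id]), reverse=True
    )
    return [doc_id for doc_id in ranked if words & TOY_INDEX[doc_id]][:k]


LABELED = [
    ("vpn client keeps dropping", "vpn"),
    ("how do I reset my password", "password"),
    ("cannot sign in from home", "vpn"),  # paraphrase with no shared terms
]

print(hit_rate_at_k(LABELED, toy_retrieve, k=4))  # 2 of the 3 questions hit
```

Because `retrieve` is just a callable, the same labeled set could be run against a BM25-only retriever and a hybrid one to compare them on equal terms.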