Building a Self-Hosted RAG Chatbot with a Dual-Agent LLM Pipeline (and Automatic LLM Failover)

A developer built a self-hosted Retrieval-Augmented Generation (RAG) chatbot with a dual-agent LLM pipeline, role-based access for users and admins, and automatic failover to a backup LLM when the primary is rate-limited. The system uses a vector database, two-stage reasoning (tool selection then answer refinement), and an admin console for knowledge base management to ensure grounded, hallucination-free responses.

Over the past few weeks I built a Retrieval-Augmented Generation (RAG) chatbot from the ground up: one that answers strictly from a knowledge base I control, supports role-based access for regular users vs. admins, and never makes things up. In this post I want to walk through the architecture, the design decisions, and one piece that took real engineering effort: automatically switching to a backup LLM when the primary one is busy or rate-limited, so a query never just fails.

This isn’t a toy demo. It’s a full pipeline with authentication, a vector database, a two-stage reasoning process, and an admin console for managing the knowledge base. Here’s how it all fits together.

Most “just call an LLM API” chatbots have two issues:

I wanted a system where:

The system has three user-facing surfaces (the signup page, the user view, and the admin view) and one processing core (the query pipeline).

Here’s the high-level flow:

```mermaid
flowchart TD
    A[Signup Page - Enter Email] --> B{Role Check}
    B -->|Regular Email| C[User View - Query Only]
    B -->|Admin Email| D[Admin View - Manage + Query]
    C --> E[Query Pipeline]
    D --> E
    E --> F["Agent 1: Tool & Argument Selection"]
    F --> G[Execute Retrieval Tool]
    G --> H[Raw Retrieved Chunks]
    H --> I["Agent 2: Refinement"]
    I --> J[Final Answer Returned]
```

If your blogging platform doesn’t render Mermaid diagrams (Medium, for instance, won’t render this natively), here’s the same flow as plain text:

```
              Signup Page
                   |
                   v
      Role Check (User or Admin?)
                   |
        |---------------------|
        v                     v
    User View            Admin View
   (Query Only)       (Manage + Query)
        |---------------------|
                   |
                   v
            Query Pipeline
                   v
  Agent 1: Tool & Argument Selection
                   v
        Execute Retrieval Tool
                   v
         Raw Retrieved Chunks
                   v
         Agent 2: Refinement
                   v
        Final Answer Returned
```

A single LLM call that has to decide both what to retrieve and how to phrase the final answer tends to produce messier output. So I split the reasoning into two sequential agents, orchestrated as a small graph of steps rather than one big prompt.

Agent 1: Tool Decision. Takes the raw user query and decides which retrieval tool to call and with what arguments.
It returns a small, strict structure like:

```json
{"name": "retrieval_tool", "arguments": {"query": "your parsed query here"}}
```

A lightweight Python handler parses this output, extracts the tool name and arguments, and manually executes the retrieval step. There’s no need to trust the LLM to “just run the function,” which keeps things deterministic and debuggable.

Agent 2: Refinement. Takes whatever raw chunks came back from retrieval and turns them into a clear, well-formatted, professional answer. This is also the layer that keeps responses grounded: it’s instructed to work only with what was retrieved, not to add outside knowledge.

```mermaid
sequenceDiagram
    participant U as User/Admin
    participant A1 as Agent 1 (Tool Decision)
    participant T as Retrieval Tool
    participant V as Vector Store
    participant A2 as Agent 2 (Refinement)

    U->>A1: Submits query
    A1->>A1: Decide tool + arguments
    A1->>T: Call retrieval tool
    T->>V: Similarity search
    V-->>T: Top-K relevant chunks
    T-->>A2: Raw chunks + sources
    A2->>A2: Refine into polished answer
    A2-->>U: Final answer
```

Documents and URLs go through a standard but carefully tuned pipeline:

Two tunable parameters matter a lot here: the score threshold (the minimum similarity a chunk needs in order to be returned) and k (how many top-scoring chunks are retrieved, the “Top-K” in the diagram).

Admins get a dedicated panel to manage the knowledge base directly:

Regular users never see any of this; they only get a query box. Role detection is based on a simple, server-side check against a list of approved admin emails at signup time, and every admin-only route re-checks that role before doing anything destructive or data-modifying.

Here’s the piece I want to focus on, since it’s the part most tutorials skip. If you’ve ever called a single LLM provider in production, you’ve hit this: rate limits, timeouts, momentary outages, or a model that’s just slow to respond under load. If your whole app depends on one model, one bad moment takes everything down with it.

The fix is a failover-aware LLM router sitting in front of both agents.
Instead of hardcoding “always call Model A,” every LLM call goes through a small dispatcher that:

Practically, here’s the setup that works well:

1. Define a prioritized list of models, not a single model.

```python
LLM_PRIORITY = [
    {"name": "primary_llm", "provider": "provider_a", "model": "model-a-large"},
    {"name": "secondary_llm", "provider": "provider_b", "model": "model-b-large"},
    {"name": "fallback_llm", "provider": "provider_c", "model": "model-c-small"},
]
```

Order them by quality/cost first, reliability second. The first entry is your “ideal” answer quality; the rest exist purely so a request never just dies.

2. Wrap every call in a router function that catches specific failure types.

```python
# call_provider, log_model_usage, and the exception classes below are
# placeholders for your own provider wrappers and error types.
def call_llm_with_failover(prompt, models=LLM_PRIORITY, max_retries_per_model=1):
    last_error = None
    for model_config in models:
        # One initial attempt plus up to max_retries_per_model retries.
        for attempt in range(max_retries_per_model + 1):
            try:
                response = call_provider(
                    provider=model_config["provider"],
                    model=model_config["model"],
                    prompt=prompt,
                    timeout=15,  # seconds; don't let one model hang the whole pipeline
                )
                # Success: record which model actually answered
                log_model_usage(model_config["name"])
                return response
            except RateLimitError:
                last_error = "rate_limited"
                break  # don't retry the same rate-limited model, move to the next one
            except TimeoutError:
                last_error = "timeout"
                continue  # maybe worth one retry on the same model
            except ServerBusyError:
                last_error = "busy"
                break
    raise AllModelsUnavailableError(f"All models exhausted. Last error: {last_error}")
```

3. Use this wrapper for both agents, not just one. Agent 1 (tool decision) and Agent 2 (refinement) should each go through the same failover router independently. It’s entirely possible for the model serving Agent 1 to be busy while the model serving Agent 2 is fine, or vice versa.

4. Add a circuit breaker so you don’t keep hammering a model that’s clearly down.
A simple in-memory counter works for small-to-medium traffic:

```python
from collections import defaultdict
from time import time

failure_counts = defaultdict(list)  # model name -> timestamps of recent failures
COOLDOWN_SECONDS = 60
FAILURE_THRESHOLD = 3

def is_model_in_cooldown(model_name):
    now = time()
    # Keep only failures that happened inside the cooldown window.
    recent_failures = [t for t in failure_counts[model_name] if now - t < COOLDOWN_SECONDS]
    failure_counts[model_name] = recent_failures
    return len(recent_failures) >= FAILURE_THRESHOLD

def record_failure(model_name):
    failure_counts[model_name].append(time())
```

Check `is_model_in_cooldown` before attempting a model in the priority list. If it’s in cooldown, skip straight to the next one instead of wasting a request and a timeout window on a model you already know is struggling.

5. Make the model list configurable, not hardcoded. Keep it in an environment variable or a small config file so you can reorder priority, swap providers, or add a new model without touching code:

```
# .env
LLM_PRIORITY_ORDER=primary_llm,secondary_llm,fallback_llm
PRIMARY_LLM_PROVIDER=provider_a
SECONDARY_LLM_PROVIDER=provider_b
FALLBACK_LLM_PROVIDER=provider_c
REQUEST_TIMEOUT_SECONDS=15
COOLDOWN_SECONDS=60
```

This is the difference between “I have to redeploy to change providers” and “I edit one line in .env and restart the service.”

With this in place, a query never just hangs or errors out because one provider happened to be under load at that exact second. The user experience stays consistent: they get an answer, possibly from a slightly different underlying model, but they’re never staring at a spinner that times out. And because every fallback event is logged, you get visibility into how often your primary model is actually struggling, which is useful data for deciding whether to renegotiate rate limits, add a fourth fallback, or just accept the current setup.
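Steps 2 and 4 are easier to follow when they run together. Here is a minimal, self-contained sketch of one way the cooldown check could be wired into the failover loop. It is an illustration under stated assumptions, not the exact code from this project: the provider call is injected as a plain function (`fake_provider` is a hypothetical stand-in for a real SDK wrapper), the exception classes are defined locally for the demo, and the model names are the same placeholders used in the priority list.

```python
import time
from collections import defaultdict

# Hypothetical exception types for the demo; a real provider SDK raises its own.
class RateLimitError(Exception): pass
class ServerBusyError(Exception): pass
class AllModelsUnavailableError(Exception): pass

COOLDOWN_SECONDS = 60
FAILURE_THRESHOLD = 3
failure_counts = defaultdict(list)  # model name -> timestamps of recent failures


def is_model_in_cooldown(name):
    now = time.time()
    recent = [t for t in failure_counts[name] if now - t < COOLDOWN_SECONDS]
    failure_counts[name] = recent
    return len(recent) >= FAILURE_THRESHOLD


def record_failure(name):
    failure_counts[name].append(time.time())


def call_with_failover(prompt, models, call_provider, retries_on_timeout=1):
    """Try models in priority order, skipping any that are in cooldown.

    Returns (model_name, response) so the caller can log which model answered.
    """
    last_error = "no models configured"
    for cfg in models:
        name = cfg["name"]
        if is_model_in_cooldown(name):
            last_error = f"{name} in cooldown"
            continue  # skip without spending a request on it
        for _attempt in range(retries_on_timeout + 1):
            try:
                return name, call_provider(cfg, prompt)
            except TimeoutError:
                record_failure(name)
                last_error = f"{name} timeout"
                # fall through: retry the same model if attempts remain
            except (RateLimitError, ServerBusyError) as exc:
                record_failure(name)
                last_error = f"{name} {type(exc).__name__}"
                break  # retrying immediately will not help; move on
    raise AllModelsUnavailableError(f"All models exhausted. Last error: {last_error}")


# --- demo with a fake provider (assumption: primary is always rate-limited) ---
MODELS = [
    {"name": "primary_llm", "provider": "provider_a", "model": "model-a-large"},
    {"name": "secondary_llm", "provider": "provider_b", "model": "model-b-large"},
]

def fake_provider(cfg, prompt):
    if cfg["name"] == "primary_llm":
        raise RateLimitError("429")
    return f"[{cfg['model']}] answer to: {prompt}"

for _ in range(4):
    used, answer = call_with_failover("What is the refund policy?", MODELS, fake_provider)
```

Running the demo loop four times trips the breaker: each of the first three calls spends one request on the rate-limited primary before falling through to the secondary, and the fourth call skips the primary entirely because it has reached the failure threshold within the cooldown window.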
```mermaid
flowchart TD
    Q[Incoming Query] --> P[Try Primary LLM]
    P -->|Success| R[Return Answer]
    P -->|Rate Limited / Busy / Timeout| S[Try Secondary LLM]
    S -->|Success| R
    S -->|Failure| T[Try Fallback LLM]
    T -->|Success| R
    T -->|Failure| U[Raise Controlled Error]
```

For anyone curious about the stack, at a glance:

A few things on the roadmap:

The biggest lesson from this project: a RAG chatbot is “easy” to build to a demo-quality bar, and genuinely hard to build to a doesn’t-fall-over-in-production bar. The retrieval and refinement pipeline gets you accurate, grounded answers. The failover router is what keeps the lights on when your LLM provider has a bad five minutes. Both matter, and most tutorials only show you the first one.

If you’re building something similar, start with the dual-agent retrieval pipeline to get accuracy right, then treat your LLM calls as a resource pool rather than a single dependency from day one. Retrofitting failover later is a lot more painful than designing for it up front.
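As a closing illustration of the Agent 1 handoff described earlier, here is a rough sketch of a handler that parses the model’s JSON tool call and runs retrieval itself instead of letting the LLM “run the function.” Everything in it is a hypothetical stand-in: the `TOOLS` registry, the body of `retrieval_tool` (a toy word-overlap ranking over three made-up sample chunks, where a real version would query the vector store), and the hardcoded `agent1_output` string that plays the role of the model’s response.

```python
import json

# Made-up sample chunks standing in for the vector store contents.
SAMPLE_CHUNKS = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available on weekdays.",
    "The refund policy covers digital goods.",
]


def retrieval_tool(query, k=2):
    """Toy retrieval: rank chunks by how many query words they share."""
    words = set(query.lower().split())
    ranked = sorted(
        SAMPLE_CHUNKS,
        key=lambda chunk: -len(words & set(chunk.lower().split())),
    )
    return ranked[:k]


# Only tools registered here can ever be executed.
TOOLS = {"retrieval_tool": retrieval_tool}


def execute_tool_call(raw_output):
    """Parse Agent 1's output and run the named tool with its arguments."""
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Agent 1 did not return valid JSON: {exc}") from exc
    if not isinstance(call, dict):
        raise ValueError("Tool call must be a JSON object")
    name = call.get("name")
    arguments = call.get("arguments", {})
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name!r}")
    if not isinstance(arguments, dict):
        raise ValueError("Tool arguments must be a JSON object")
    return TOOLS[name](**arguments)


# Pretend this string came back from Agent 1.
agent1_output = '{"name": "retrieval_tool", "arguments": {"query": "refund policy"}}'
chunks = execute_tool_call(agent1_output)
```

Because the handler looks the tool up in an explicit registry and validates the shape of the payload first, a malformed or unexpected response from Agent 1 fails loudly with a `ValueError` rather than executing something unintended. The resulting `chunks` list is what would be passed on to Agent 2 for refinement.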