{"slug": "building-a-self-hosted-rag-chatbot-with-a-dual-agent-llm-pipeline-and-automatic", "title": "Building a Self-Hosted RAG Chatbot with a Dual-Agent LLM Pipeline (and Automatic LLM Failover)", "summary": "A developer built a self-hosted Retrieval-Augmented Generation (RAG) chatbot with a dual-agent LLM pipeline, role-based access for users and admins, and automatic failover to a backup LLM when the primary is rate-limited. The system uses a vector database, two-stage reasoning (tool selection then answer refinement), and an admin console for knowledge base management to ensure grounded, hallucination-free responses.", "body_md": "Over the past few weeks I built a Retrieval-Augmented Generation (RAG) chatbot from the ground up — one that answers strictly from a knowledge base I control, supports role-based access for regular users vs. admins, and never makes things up. In this post I want to walk through the architecture, the design decisions, and one piece that took real engineering effort: **automatically switching to a backup LLM when the primary one is busy or rate-limited**, so a query never just fails.\n\nThis isn’t a toy demo — it’s a full pipeline with authentication, a vector database, a two-stage reasoning process, and an admin console for managing the knowledge base. 
Here’s how it all fits together.\n\nMost “just call an LLM API” chatbots have two issues:\n\nI wanted a system where:\n\nThe system has three user-facing surfaces and one processing core:\n\nHere’s the high-level flow:\n\n``` mermaid\nflowchart TD\n    A[Signup Page - Enter Email] --> B{Role Check}\n    B -->|Regular Email| C[User View - Query Only]\n    B -->|Admin Email| D[Admin View - Manage + Query]\n    C --> E[Query Pipeline]\n    D --> E\n    E --> F[Agent 1: Tool & Argument Selection]\n    F --> G[Execute Retrieval Tool]\n    G --> H[Raw Retrieved Chunks]\n    H --> I[Agent 2: Refinement]\n    I --> J[Final Answer Returned]\n```\n\nIf your blogging platform doesn’t render Mermaid diagrams (Medium, for instance, won’t render this natively), here’s the same flow as plain text:\n\n```\nSignup Page\n   |\n   v\nRole Check (User or Admin?)\n   |---------------------|\n   v                      v\nUser View             Admin View\n(Query Only)        (Manage + Query)\n   |---------------------|\n              v\n        Query Pipeline\n              v\n   Agent 1: Tool & Argument Selection\n              v\n      Execute Retrieval Tool\n              v\n       Raw Retrieved Chunks\n              v\n       Agent 2: Refinement\n              v\n        Final Answer Returned\n```\n\nA single LLM call that has to decide both *what to retrieve* and *how to phrase the final answer* tends to produce messier output. So I split the reasoning into two sequential agents, orchestrated as a small graph of steps rather than one big prompt:\n\n**Agent 1: Tool Decision.** Takes the raw user query and decides which retrieval tool to call and with what arguments. 
It returns a small, strict structure like:\n\n```\n[{\"name\": \"retrieval_tool\", \"arguments\": {\"query\": \"your parsed query here\"}}]\n```\n\nA lightweight Python handler parses this output, extracts the tool name and arguments, and executes the retrieval step itself. There is no need to trust the LLM to “just run the function,” which keeps things deterministic and debuggable.\n\n**Agent 2: Refinement.** Takes whatever raw chunks came back from retrieval and turns them into a clear, well-formatted, professional answer. This is also the layer that keeps responses grounded: it’s instructed to work only with what was retrieved, not to add outside knowledge.\n\n``` mermaid\nsequenceDiagram\n    participant U as User/Admin\n    participant A1 as Agent 1 (Tool Decision)\n    participant T as Retrieval Tool\n    participant V as Vector Store\n    participant A2 as Agent 2 (Refinement)\n\n    U->>A1: Submits query\n    A1->>A1: Decide tool + arguments\n    A1->>T: Call retrieval tool\n    T->>V: Similarity search\n    V-->>T: Top-K relevant chunks\n    T-->>A2: Raw chunks + sources\n    A2->>A2: Refine into polished answer\n    A2-->>U: Final answer\n```\n\nDocuments and URLs go through a standard but carefully tuned pipeline:\n\nTwo tunable parameters matter a lot here:\n\n`score_threshold`\n\n`k`\n\nAdmins get a dedicated panel to manage the knowledge base directly:\n\nRegular users never see any of this; they only get a query box. Role detection is based on a simple, server-side check against a list of approved admin emails at signup time, and every admin-only route re-checks that role before doing anything destructive or data-modifying.\n\nHere’s the piece I want to focus on, since it’s the part most tutorials skip. If you’ve ever called a single LLM provider in production, you’ve hit this: rate limits, timeouts, momentary outages, or a model that’s just slow to respond under load. 
If your whole app depends on one model, one bad moment takes everything down with it.\n\nThe fix is a **failover-aware LLM router** sitting in front of both agents. Instead of hardcoding “always call Model A,” every LLM call goes through a small dispatcher that:\n\nPractically, here’s the setup that works well:\n\n**1. Define a prioritized list of models, not a single model.**\n\n``` python\nLLM_PRIORITY = [\n    {\"name\": \"primary_llm\", \"provider\": \"provider_a\", \"model\": \"model-a-large\"},\n    {\"name\": \"secondary_llm\", \"provider\": \"provider_b\", \"model\": \"model-b-large\"},\n    {\"name\": \"fallback_llm\", \"provider\": \"provider_c\", \"model\": \"model-c-small\"},\n]\n```\n\nOrder them by quality/cost first, reliability second. The first entry is your “ideal” answer quality; the rest exist purely so a request never just dies.\n\n**2. Wrap every call in a router function that catches specific failure types.**\n\n``` python\n# call_provider(), log_model_usage(), RateLimitError, ServerBusyError and\n# AllModelsUnavailableError are app-specific names assumed to be defined elsewhere.\ndef call_llm_with_failover(prompt, models=LLM_PRIORITY, max_retries_per_model=1):\n    last_error = None\n    for model_config in models:\n        for attempt in range(max_retries_per_model):\n            try:\n                response = call_provider(\n                    provider=model_config[\"provider\"],\n                    model=model_config[\"model\"],\n                    prompt=prompt,\n                    timeout=15,  # seconds; don't let one model hang the whole pipeline\n                )\n                # Success: record which model actually answered\n                log_model_usage(model_config[\"name\"])\n                return response\n            except RateLimitError:\n                last_error = \"rate_limited\"\n                break  # don't retry a rate-limited model, move to the next one\n            except TimeoutError:\n                last_error = \"timeout\"\n                continue  # retries the same model only if max_retries_per_model > 1\n            except ServerBusyError:\n                last_error = \"busy\"\n             
   break\n    raise AllModelsUnavailableError(f\"All models exhausted. Last error: {last_error}\")\n```\n\n**3. Use this wrapper for both agents**, not just one. Agent 1 (tool decision) and Agent 2 (refinement) should each go through the same failover router independently: it’s entirely possible for the model serving Agent 1 to be busy while the model serving Agent 2 is fine, or vice versa.\n\n**4. Add a circuit breaker so you don’t keep hammering a model that’s clearly down.** A simple in-memory counter works for small-to-medium traffic:\n\n``` python\nfrom collections import defaultdict\nfrom time import time\n\nfailure_counts = defaultdict(list)\nCOOLDOWN_SECONDS = 60\nFAILURE_THRESHOLD = 3\n\n\ndef is_model_in_cooldown(model_name):\n    now = time()\n    # Keep only the failures that happened inside the cooldown window\n    recent_failures = [t for t in failure_counts[model_name] if now - t < COOLDOWN_SECONDS]\n    failure_counts[model_name] = recent_failures\n    return len(recent_failures) >= FAILURE_THRESHOLD\n\n\ndef record_failure(model_name):\n    failure_counts[model_name].append(time())\n```\n\nCheck `is_model_in_cooldown()` before attempting a model in the priority list. If it’s in cooldown, skip straight to the next one instead of wasting a request and a timeout window on a model you already know is struggling. Call `record_failure()` from each `except` branch of the router so the counter actually fills up.\n\n**5. Make the model list configurable, not hardcoded.** Keep it in an environment variable or a small config file so you can reorder priority, swap providers, or add a new model without touching code:\n\n```\n# .env\nLLM_PRIORITY_ORDER=primary_llm,secondary_llm,fallback_llm\nPRIMARY_LLM_PROVIDER=provider_a\nSECONDARY_LLM_PROVIDER=provider_b\nFALLBACK_LLM_PROVIDER=provider_c\nREQUEST_TIMEOUT_SECONDS=15\nCOOLDOWN_SECONDS=60\n```\n\nThis is the difference between “I have to redeploy to change providers” and “I edit one line in `.env` and restart the service.”\n\nWith this in place, a query no longer hangs or errors out just because one provider happened to be under load at that exact second. 
The user experience stays consistent: they get an answer, possibly from a slightly different underlying model, but they’re not left staring at a spinner that times out. And because every fallback event is logged, you get visibility into how often your primary model is actually struggling, which is useful data for deciding whether to renegotiate rate limits, add a fourth fallback, or just accept the current setup.\n\n``` mermaid\nflowchart TD\n    Q[Incoming Query] --> P[Try Primary LLM]\n    P -->|Success| R[Return Answer]\n    P -->|Rate Limited / Busy / Timeout| S[Try Secondary LLM]\n    S -->|Success| R\n    S -->|Failure| T[Try Fallback LLM]\n    T -->|Success| R\n    T -->|Failure| U[Raise Controlled Error]\n```\n\nFor anyone curious about the stack, at a glance:\n\nA few things on the roadmap:\n\nThe biggest lesson from this project: a RAG chatbot is “easy” to build to a demo-quality bar, and genuinely hard to build to a doesn’t-fall-over-in-production bar. The retrieval and refinement pipeline gets you accurate, grounded answers. The failover router is what keeps the lights on when your LLM provider has a bad five minutes. Both matter, and most tutorials only show you the first one.\n\nIf you’re building something similar, start with the dual-agent retrieval pipeline to get accuracy right, then treat your LLM calls as a *resource pool* rather than a single dependency from day one. 
Retrofitting failover later is a lot more painful than designing for it up front.", "url": "https://wpnews.pro/news/building-a-self-hosted-rag-chatbot-with-a-dual-agent-llm-pipeline-and-automatic", "canonical_source": "https://discuss.huggingface.co/t/building-a-self-hosted-rag-chatbot-with-a-dual-agent-llm-pipeline-and-automatic-llm-failover/176986#post_1", "published_at": "2026-06-20 00:00:49+00:00", "updated_at": "2026-06-20 00:13:37.721544+00:00", "lang": "en", "topics": ["large-language-models", "ai-agents", "ai-infrastructure"], "entities": [], "alternates": {"html": "https://wpnews.pro/news/building-a-self-hosted-rag-chatbot-with-a-dual-agent-llm-pipeline-and-automatic", "markdown": "https://wpnews.pro/news/building-a-self-hosted-rag-chatbot-with-a-dual-agent-llm-pipeline-and-automatic.md", "text": "https://wpnews.pro/news/building-a-self-hosted-rag-chatbot-with-a-dual-agent-llm-pipeline-and-automatic.txt", "jsonld": "https://wpnews.pro/news/building-a-self-hosted-rag-chatbot-with-a-dual-agent-llm-pipeline-and-automatic.jsonld"}}