Building a Self-Hosted RAG Chatbot with a Dual-Agent LLM Pipeline (and Automatic LLM Failover)

wpnews.pro

Over the past few weeks I built a Retrieval-Augmented Generation (RAG) chatbot from the ground up — one that answers strictly from a knowledge base I control, supports role-based access for regular users vs. admins, and never makes things up. In this post I want to walk through the architecture, the design decisions, and one piece that took real engineering effort: automatically switching to a backup LLM when the primary one is busy or rate-limited, so a query never just fails.

This isn’t a toy demo — it’s a full pipeline with authentication, a vector database, a two-stage reasoning process, and an admin console for managing the knowledge base. Here’s how it all fits together.

Most “just call an LLM API” chatbots have two issues:

I wanted a system where:

The system has three user-facing surfaces and one processing core:

Here’s the high-level flow:

flowchart TD
    A[Signup Page - Enter Email] --> B{Role Check}
    B -->|Regular Email| C[User View - Query Only]
    B -->|Admin Email| D[Admin View - Manage + Query]
    C --> E[Query Pipeline]
    D --> E
    E --> F[Agent 1: Tool & Argument Selection]
    F --> G[Execute Retrieval Tool]
    G --> H[Raw Retrieved Chunks]
    H --> I[Agent 2: Refinement]
    I --> J[Final Answer Returned]

If your blogging platform doesn’t render Mermaid diagrams (Medium, for instance, won’t render this natively), here’s the same flow as plain text:

Signup Page
   |
   v
Role Check (User or Admin?)
   |---------------------|
   v                      v
User View             Admin View
(Query Only)        (Manage + Query)
   |---------------------|
              v
        Query Pipeline
              v
   Agent 1: Tool & Argument Selection
              v
      Execute Retrieval Tool
              v
       Raw Retrieved Chunks
              v
       Agent 2: Refinement
              v
        Final Answer Returned

A single LLM call trying to both decide what to retrieve and how to phrase the final answer tends to produce messier output. So I split the reasoning into two sequential agents, orchestrated as a small graph of steps rather than one big prompt:

Agent 1 (Tool Decision): Takes the raw user query and decides which retrieval tool to call and with what arguments. It returns a small, strict structure like:

[{"name": "retrieval_tool", "arguments": {"query": "your parsed query here"}}]

A lightweight Python handler parses this output, extracts the tool name and arguments, and manually executes the retrieval step — no need to trust the LLM to “just run the function,” which keeps things deterministic and debuggable.
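A minimal sketch of such a handler might look like the following. The `TOOLS` registry, the stub `retrieval_tool`, and the `run_tool_calls` name are placeholders assumed for illustration, not the project's actual code; the only thing taken from the post is the JSON shape Agent 1 emits.

```python
import json

def retrieval_tool(query):
    """Stand-in for the real vector search; returns placeholder chunks."""
    return [f"chunk matching '{query}'"]

# Hypothetical registry: only tools listed here can ever be executed.
TOOLS = {"retrieval_tool": retrieval_tool}

def run_tool_calls(raw_output):
    """Parse Agent 1's JSON output and execute each requested tool by hand."""
    try:
        calls = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Agent 1 returned invalid JSON: {exc}") from exc
    if not isinstance(calls, list):
        raise ValueError("Expected a JSON list of tool calls")

    results = []
    for call in calls:
        if not isinstance(call, dict):
            raise ValueError("Each tool call must be a JSON object")
        name = call.get("name")
        arguments = call.get("arguments", {})
        if name not in TOOLS:
            raise ValueError(f"Unknown tool requested: {name!r}")
        if not isinstance(arguments, dict):
            raise ValueError("Tool arguments must be a JSON object")
        results.append(TOOLS[name](**arguments))
    return results
```

Rejecting unknown tool names and malformed output up front means a bad generation fails loudly at the parsing step instead of silently producing an empty retrieval.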

Agent 2 (Refinement): Takes whatever raw chunks came back from retrieval and turns them into a clear, well-formatted, professional answer. This is also the layer that keeps responses grounded: it’s instructed to work only with what was retrieved, not to add outside knowledge.
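One way to enforce that grounding is in how the refinement prompt is assembled. The sketch below is an assumption about how this could be done, not the post's actual prompt: `NO_ANSWER`, `build_refinement_prompt`, and `refine` are hypothetical names, and `llm` is any callable that takes a prompt string and returns text.

```python
NO_ANSWER = "I couldn't find that in the knowledge base."

def build_refinement_prompt(question, chunks):
    """Assemble Agent 2's prompt from the retrieved chunks only."""
    context = "\n\n".join(
        f"[{i}] {text}" for i, text in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using ONLY the numbered context below. "
        f"If the context does not contain the answer, reply exactly: {NO_ANSWER}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

def refine(question, chunks, llm):
    """Skip the LLM call entirely when retrieval came back empty."""
    if not chunks:
        return NO_ANSWER
    return llm(build_refinement_prompt(question, chunks))
```

Short-circuiting on empty retrieval removes the case where the model is most tempted to improvise: being asked a question with no context at all.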

sequenceDiagram
    participant U as User/Admin
    participant A1 as Agent 1 (Tool Decision)
    participant T as Retrieval Tool
    participant V as Vector Store
    participant A2 as Agent 2 (Refinement)
    U->>A1: Submits query
    A1->>A1: Decide tool + arguments
    A1->>T: Call retrieval tool
    T->>V: Similarity search
    V-->>T: Top-K relevant chunks
    T-->>A2: Raw chunks + sources
    A2->>A2: Refine into polished answer
    A2-->>U: Final answer

Documents and URLs go through a standard but carefully tuned pipeline:

Two tunable parameters matter a lot here:

score_threshold

k
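To make these two knobs concrete, here is a minimal sketch that assumes the usual meaning of the names: `score_threshold` as the minimum similarity a chunk must reach, and `k` as the maximum number of chunks returned. The cosine scoring, the default values, and the in-memory list of `(text, vector)` pairs are illustrative assumptions, not the project's actual vector store.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, indexed_chunks, k=4, score_threshold=0.5):
    """Return up to k (score, text) pairs scoring at least score_threshold.

    indexed_chunks is a list of (text, embedding_vector) pairs.
    """
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed_chunks]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return kept[:k]
```

Under these assumptions, raising `score_threshold` trades recall for precision (fewer but more relevant chunks), while `k` caps how much context Agent 2 has to read.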

Admins get a dedicated panel to manage the knowledge base directly:

Regular users never see any of this — they only get a query box. Role detection is based on a simple, server-side check against a list of approved admin emails at signup time, and every admin-only route re-checks that role before doing anything destructive or data-modifying.
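The allowlist check and the per-route re-check could be sketched roughly like this. Everything named here (`ADMIN_EMAILS`, `role_for`, `admin_only`, `Forbidden`, `delete_document`) is hypothetical, and a real app would take the caller's email from an authenticated session rather than a function argument.

```python
from functools import wraps

# Hypothetical allowlist; in practice this would come from config, not source code.
ADMIN_EMAILS = {"admin@example.com"}

class Forbidden(Exception):
    """Raised when a non-admin calls an admin-only handler."""

def role_for(email):
    """Server-side role lookup: normalize the email, then check the allowlist."""
    return "admin" if email.strip().lower() in ADMIN_EMAILS else "user"

def admin_only(handler):
    """Re-check the caller's role on every call, not just at signup."""
    @wraps(handler)
    def wrapper(current_email, *args, **kwargs):
        if role_for(current_email) != "admin":
            raise Forbidden(f"{handler.__name__} requires an admin account")
        return handler(current_email, *args, **kwargs)
    return wrapper

@admin_only
def delete_document(current_email, doc_id):
    """Placeholder for a destructive knowledge-base operation."""
    return f"deleted {doc_id}"
```

Putting the check in a decorator keeps it next to each route definition, so a new admin endpoint without the guard stands out in review.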

Here’s the piece I want to focus on, since it’s the part most tutorials skip. If you’ve ever called a single LLM provider in production, you’ve hit this: rate limits, timeouts, momentary outages, or a model that’s just slow to respond under load. If your whole app depends on one model, one bad moment takes everything down with it.

The fix is a failover-aware LLM router sitting in front of both agents. Instead of hardcoding “always call Model A,” every LLM call goes through a small dispatcher that:

Practically, here’s the setup that works well:

1. Define a prioritized list of models, not a single model.

LLM_PRIORITY = [
    {"name": "primary_llm", "provider": "provider_a", "model": "model-a-large"},
    {"name": "secondary_llm", "provider": "provider_b", "model": "model-b-large"},
    {"name": "fallback_llm", "provider": "provider_c", "model": "model-c-small"},
]

Order them by quality/cost first, reliability second. The first entry is your “ideal” answer quality; the rest exist purely so a request never just dies.

2. Wrap every call in a router function that catches specific failure types.

def call_llm_with_failover(prompt, models=LLM_PRIORITY, max_retries_per_model=1):
    # call_provider, log_model_usage, and the custom exception classes are
    # placeholders assumed to be defined elsewhere in the project.
    last_error = None
    for model_config in models:
        # One initial attempt plus up to max_retries_per_model retries.
        for attempt in range(max_retries_per_model + 1):
            try:
                response = call_provider(
                    provider=model_config["provider"],
                    model=model_config["model"],
                    prompt=prompt,
                    timeout=15,  # seconds; don't let one model hang the whole pipeline
                )
                log_model_usage(model_config["name"])
                return response
            except RateLimitError:
                last_error = "rate_limited"
                break  # don't retry the same rate-limited model, move to next one
            except TimeoutError:
                last_error = "timeout"
                continue  # retry the same model while attempts remain
            except ServerBusyError:
                last_error = "busy"
                break
    raise AllModelsUnavailableError(f"All models exhausted. Last error: {last_error}")

3. Use this wrapper for both agents, not just one. Agent 1 (tool decision) and Agent 2 (refinement) should each go through the same failover router independently — it’s entirely possible for the model serving Agent 1 to be busy while the model serving Agent 2 is fine, or vice versa.

4. Add a circuit breaker so you don’t keep hammering a model that’s clearly down. A simple in-memory counter works for small-to-medium traffic:

from collections import defaultdict
from time import time

failure_counts = defaultdict(list)  # model name -> timestamps of recent failures
COOLDOWN_SECONDS = 60
FAILURE_THRESHOLD = 3

def is_model_in_cooldown(model_name):
    now = time()
    # Keep only failures that happened within the cooldown window.
    recent_failures = [t for t in failure_counts[model_name] if now - t < COOLDOWN_SECONDS]
    failure_counts[model_name] = recent_failures
    return len(recent_failures) >= FAILURE_THRESHOLD

def record_failure(model_name):
    failure_counts[model_name].append(time())

Check is_model_in_cooldown() before attempting a model in the priority list, and call record_failure() whenever a model errors out so the counter actually fills. If a model is in cooldown, skip straight to the next one instead of wasting a request and a timeout window on a model you already know is struggling.
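Putting the cooldown check and the router together might look like the sketch below. It is a simplified, self-contained variant rather than the post's exact router: `call_with_breaker` and `ProviderError` are hypothetical names, models are plain strings, and `call_provider(name, prompt)` is whatever function actually talks to the provider.

```python
from collections import defaultdict
from time import time

COOLDOWN_SECONDS = 60
FAILURE_THRESHOLD = 3
failure_times = defaultdict(list)  # model name -> timestamps of recent failures

class ProviderError(Exception):
    """Hypothetical base class for rate-limit / busy errors from a provider."""

class AllModelsUnavailableError(Exception):
    """Raised when every model either failed or was skipped."""

def is_model_in_cooldown(name):
    now = time()
    recent = [t for t in failure_times[name] if now - t < COOLDOWN_SECONDS]
    failure_times[name] = recent
    return len(recent) >= FAILURE_THRESHOLD

def record_failure(name):
    failure_times[name].append(time())

def call_with_breaker(prompt, models, call_provider):
    """Try models in priority order, skipping any that are in cooldown.

    Returns (model_name, response) so the caller can log which model answered.
    """
    last_error = None
    for name in models:
        if is_model_in_cooldown(name):
            last_error = f"{name} in cooldown"
            continue
        try:
            return name, call_provider(name, prompt)
        except (ProviderError, TimeoutError) as exc:
            record_failure(name)  # feeds the cooldown check on later calls
            last_error = repr(exc)
    raise AllModelsUnavailableError(f"All models exhausted. Last error: {last_error}")
```

With these settings, a model that fails three times within a minute stops receiving traffic until its oldest failures age out of the window. Because the state lives in process memory, each worker process keeps its own counts.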

5. Make the model list configurable, not hardcoded. Keep it in an environment variable or a small config file so you can reorder priority, swap providers, or add a new model without touching code:

LLM_PRIORITY_ORDER=primary_llm,secondary_llm,fallback_llm
PRIMARY_LLM_PROVIDER=provider_a
SECONDARY_LLM_PROVIDER=provider_b
FALLBACK_LLM_PROVIDER=provider_c
REQUEST_TIMEOUT_SECONDS=15
COOLDOWN_SECONDS=60

This is the difference between “I have to redeploy to change providers” and “I edit one line in .env and restart the service.”
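Reading that configuration back into a priority list could be sketched as follows. The variable names match the example above; the `<NAME>_PROVIDER` lookup convention and the `load_llm_settings` function are assumptions made for illustration.

```python
import os

def load_llm_settings(env=None):
    """Build the model priority list and request timeout from environment variables.

    Assumes each name in LLM_PRIORITY_ORDER has a matching <NAME>_PROVIDER variable,
    e.g. primary_llm -> PRIMARY_LLM_PROVIDER.
    """
    env = os.environ if env is None else env
    order = env.get("LLM_PRIORITY_ORDER", "")
    names = [name.strip() for name in order.split(",") if name.strip()]
    if not names:
        raise ValueError("LLM_PRIORITY_ORDER is empty or unset")

    models = []
    for name in names:
        provider_key = f"{name.upper()}_PROVIDER"
        if provider_key not in env:
            raise KeyError(f"Missing {provider_key} for model {name!r}")
        models.append({"name": name, "provider": env[provider_key]})

    timeout = float(env.get("REQUEST_TIMEOUT_SECONDS", "15"))
    return models, timeout
```

Failing at startup on a missing variable is deliberate in this sketch: a typo in the priority order should surface when the service boots, not on the first failover.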

With this in place, a query never just hangs or errors out because one provider happened to be under load at that exact second. The user experience stays consistent — they get an answer, possibly from a slightly different underlying model, but they’re never staring at a spinner that times out. And because every fallback event is logged, you get visibility into how often your primary model is actually struggling, which is useful data for deciding whether to renegotiate rate limits, add a fourth fallback, or just accept the current setup.

flowchart TD
    Q[Incoming Query] --> P[Try Primary LLM]
    P -->|Success| R[Return Answer]
    P -->|Rate Limited / Busy / Timeout| S[Try Secondary LLM]
    S -->|Success| R
    S -->|Failure| T[Try Fallback LLM]
    T -->|Success| R
    T -->|Failure| U[Raise Controlled Error]

For anyone curious about the stack, at a glance:

A few things on the roadmap:

The biggest lesson from this project: a RAG chatbot is “easy” to build to a demo-quality bar, and genuinely hard to build to a doesn’t-fall-over-in-production bar. The retrieval and refinement pipeline gets you accurate, grounded answers. The failover router is what keeps the lights on when your LLM provider has a bad five minutes. Both matter — and most tutorials only show you the first one.

If you’re building something similar, start with the dual-agent retrieval pipeline to get accuracy right, then treat your LLM calls as a resource pool rather than a single dependency from day one. Retrofitting failover later is a lot more painful than designing for it up front.

source & further reading

discuss.huggingface.co (original article)
