[ARTICLE · art-18989] src=dev.to pub= topic=large-language-models verified=true sentiment=↑ positive

I Built a Production-Oriented Multi-Provider AI Chatbot in Rust — Here's How

A developer built a production-oriented multi-provider AI chatbot backend in Rust, unifying Claude, OpenAI, and Ollama behind a single interface. The project features a clean five-module architecture with a `ChatClient` that dispatches provider-specific API calls, using `Arc<Mutex<Vec<Message>>>` for safe shared conversation state across async handlers. The backend includes a Web UI, CLI mode, and Docker support, with the key insight being separate code paths for Anthropic's native API format versus the OpenAI-compatible schema used by OpenAI and Ollama.

5 min read · Published May 31, 2026

Most AI chatbot tutorials reach for Python. FastAPI, LangChain, a quick requests.post, and you're done in 20 minutes. And that's fine for prototyping. But when I wanted to build something I'd actually put behind a real API, with proper async concurrency, typed errors, and zero GC pauses, I reached for Rust instead.

This is a writeup of chatbot, a production-oriented Rust backend that unifies Claude, OpenAI, and Ollama behind a single interface, with a Web UI, CLI mode, and Docker support baked in.

It's a fair question. LLM API calls are network-bound, so why does the backend language even matter?

A few reasons:

For LLM apps specifically: yes, 95% of your wall-clock time is waiting for the model to respond. But the other 5% — routing, state management, provider selection, connection handling — is all yours to control. Rust makes that part bulletproof.

The Web UI is served at http://localhost:8080 by default, or the app runs in CLI mode.

User (Browser or CLI)
        │
        ▼
  Axum HTTP Server (web.rs)
        │
        ├──▶ Conversation State (Arc<Mutex<Vec<Message>>>)
        │
        └──▶ Runtime Config (config.rs)
                    │
                    ▼
            ChatClient (client.rs)
                    │
           ┌────────┼────────┐
           ▼        ▼        ▼
        Claude   OpenAI   Ollama
         API      API      API

The project has a clean five-module layout in src/:

src/
├── main.rs          # Startup routing + CLI loop
├── config.rs        # Provider enum + env/runtime config
├── client.rs        # Provider-specific HTTP clients
├── conversation.rs  # In-memory chat state model
└── web.rs           # Axum routes, connect flow, chat API

Each module has exactly one responsibility. No god objects, no tangled imports.

The heart of the project is client.rs. Instead of sprinkling provider-specific logic everywhere, all outbound AI calls go through a single ChatClient that dispatches based on the active provider.

The key insight: Claude uses the Anthropic native API format, while OpenAI and Ollama both speak the OpenAI-compatible schema. Separating these two code paths keeps the provider logic honest — you're not faking compatibility where there isn't any.
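To make that split concrete, here is a rough sketch of how the two wire contracts differ before any JSON is encoded. It is not taken from the project: `RequestPlan` and `plan_request` are hypothetical names, and appending the path to `base_url` is an assumption about how the URL is assembled. The endpoint paths and header names follow the public Anthropic Messages and OpenAI Chat Completions APIs.

```rust
// Sketch only: `RequestPlan` and `plan_request` are hypothetical, not project code.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Provider {
    Claude,
    OpenAI,
    Ollama,
}

/// The parts of a request that differ between the two formats.
#[derive(Debug, PartialEq)]
pub struct RequestPlan {
    pub url: String,
    /// (header name, header value) pairs for auth and API versioning.
    pub headers: Vec<(String, String)>,
    /// Anthropic takes the system prompt as a top-level `system` field.
    pub top_level_system: Option<String>,
    /// (role, content) pairs for the `messages` array.
    pub messages: Vec<(String, String)>,
}

pub fn plan_request(
    provider: Provider,
    base_url: &str,
    api_key: Option<&str>,
    system_prompt: &str,
    history: &[(String, String)],
) -> RequestPlan {
    // Assumption: the configured base URL has the endpoint path appended to it.
    let base = base_url.trim_end_matches('/');
    match provider {
        Provider::Claude => RequestPlan {
            url: format!("{base}/v1/messages"),
            headers: vec![
                ("x-api-key".to_string(), api_key.unwrap_or_default().to_string()),
                ("anthropic-version".to_string(), "2023-06-01".to_string()),
            ],
            top_level_system: Some(system_prompt.to_string()),
            messages: history.to_vec(),
        },
        Provider::OpenAI | Provider::Ollama => {
            // OpenAI-compatible servers take the system prompt as the first message.
            let mut messages = vec![("system".to_string(), system_prompt.to_string())];
            messages.extend_from_slice(history);
            // A local Ollama server needs no key, so the auth header is optional.
            let headers: Vec<(String, String)> = match api_key {
                Some(key) => vec![("Authorization".to_string(), format!("Bearer {key}"))],
                None => Vec::new(),
            };
            RequestPlan {
                url: format!("{base}/v1/chat/completions"),
                headers,
                top_level_system: None,
                messages,
            }
        }
    }
}
```

The responses differ too: Anthropic returns the reply text under `content[0].text`, while OpenAI-compatible servers return it under `choices[0].message.content`, which is another reason to keep the two paths separate rather than share one parser.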

// Simplified concept from client.rs
pub enum Provider {
    Claude,
    OpenAI,
    Ollama,
}

pub struct ChatClient {
    pub provider: Provider,
    pub model: String,
    pub base_url: String,
    pub api_key: Option<String>,
    pub max_tokens: u32,
    pub system_prompt: String,
    pub http: reqwest::Client,
}

When you send a message, the client picks the right HTTP contract:

pub async fn send(&self, messages: &[Message]) -> Result<String> {
    match self.provider {
        Provider::Claude => self.send_claude(messages).await,
        Provider::OpenAI | Provider::Ollama => self.send_openai_compat(messages).await,
    }
}

This means adding a new provider (Gemini, Cohere, etc.) in the future is a matter of adding one arm and one method — the rest of the application stays untouched.
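As an illustration of the "one arm, one method" claim, here is a simplified, synchronous sketch. The `Gemini` variant and the stub method bodies are hypothetical and stand in for real async HTTP calls. Because the `match` has no wildcard arm, adding a variant without handling it is a compile error, so the compiler points at every place that needs updating.

```rust
// Sketch only: `Gemini` and the stub bodies are hypothetical; the real
// methods would be async and perform HTTP requests.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Provider {
    Claude,
    OpenAI,
    Ollama,
    Gemini, // the newly added provider
}

pub struct ChatClient {
    pub provider: Provider,
}

impl ChatClient {
    pub fn send(&self, prompt: &str) -> Result<String, String> {
        // Exhaustive match: every variant must be listed explicitly.
        match self.provider {
            Provider::Claude => self.send_claude(prompt),
            Provider::OpenAI | Provider::Ollama => self.send_openai_compat(prompt),
            Provider::Gemini => self.send_gemini(prompt), // the one new arm
        }
    }

    fn send_claude(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[anthropic-native] {prompt}"))
    }

    fn send_openai_compat(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[openai-compat] {prompt}"))
    }

    // The one new method.
    fn send_gemini(&self, prompt: &str) -> Result<String, String> {
        Ok(format!("[gemini] {prompt}"))
    }
}
```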

Multi-turn chat requires persistent message history. In Rust's async model, sharing state across request handlers requires explicit synchronization. The project does this with Arc<Mutex<...>>, Rust's standard pattern for shared mutable state:

// conversation.rs - shared across all handlers
pub type SharedConversation = Arc<Mutex<Vec<Message>>>;

#[derive(Clone, Debug, Serialize, Deserialize)]
pub struct Message {
    pub role: String,    // "user" or "assistant"
    pub content: String,
}

The Arc lets every Axum handler hold a cheap, reference-counted handle to the same conversation (each handler runs in its own async task), and the Mutex ensures only one handler touches the history at a time. Data races are ruled out by the type system.
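Here is a minimal, std-only sketch of the same pattern, using OS threads in place of Axum tasks so it runs without any dependencies. The helper names (`push_message`, `snapshot`, `concurrent_pushes`) are hypothetical, and the serde derives from the original `Message` are omitted.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

#[derive(Clone, Debug, PartialEq)]
pub struct Message {
    pub role: String,
    pub content: String,
}

pub type SharedConversation = Arc<Mutex<Vec<Message>>>;

/// Append one message; the lock is held only for the duration of the push.
pub fn push_message(conv: &SharedConversation, role: &str, content: &str) {
    let mut history = conv.lock().expect("conversation mutex poisoned");
    history.push(Message {
        role: role.to_string(),
        content: content.to_string(),
    });
} // the guard is dropped here, releasing the lock

/// Copy the history out so no lock is held while a slow provider call runs.
pub fn snapshot(conv: &SharedConversation) -> Vec<Message> {
    conv.lock().expect("conversation mutex poisoned").clone()
}

/// Hypothetical demo: `n` writers each append one message concurrently.
pub fn concurrent_pushes(n: usize) -> Vec<Message> {
    let conv: SharedConversation = Arc::new(Mutex::new(Vec::new()));
    let handles: Vec<_> = (0..n)
        .map(|i| {
            // Each writer gets its own handle to the same underlying Vec.
            let conv = Arc::clone(&conv);
            thread::spawn(move || push_message(&conv, "user", &format!("message {i}")))
        })
        .collect();
    for handle in handles {
        handle.join().expect("writer thread panicked");
    }
    snapshot(&conv)
}
```

Taking a snapshot before calling the provider is a deliberate choice in this sketch: the model may take seconds to answer, and holding the lock for that long would block every other request that touches the conversation.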

The app launches in Web UI mode by default, but also supports a terminal workflow:

# Web UI (default)
cargo run

# CLI mode
cargo run -- cli

# Web UI on a custom port
PORT=3000 cargo run -- web

This is useful in different contexts — the Web UI for demos and sharing, the CLI for scripting and piping into other tools.
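One way the startup routing could be written is sketched below. `Mode` and `parse_mode` are hypothetical names rather than the project's actual main.rs, and the function takes the PORT value as a parameter so the logic can be exercised without touching the real environment.

```rust
// Sketch only: `Mode` and `parse_mode` are hypothetical names.
#[derive(Debug, PartialEq)]
pub enum Mode {
    Web { port: u16 },
    Cli,
}

/// `args` are the arguments after the program name;
/// `port_env` is the raw value of PORT, if it is set.
pub fn parse_mode(args: &[String], port_env: Option<&str>) -> Result<Mode, String> {
    match args.first().map(String::as_str) {
        Some("cli") => Ok(Mode::Cli),
        // No argument means Web UI mode, matching the default described above.
        None | Some("web") => {
            let port = match port_env {
                Some(raw) => raw
                    .parse::<u16>()
                    .map_err(|e| format!("invalid PORT {raw:?}: {e}"))?,
                None => 8080,
            };
            Ok(Mode::Web { port })
        }
        Some(other) => Err(format!("unknown mode {other:?}, expected `web` or `cli`")),
    }
}
```

In a real binary the inputs would come from `std::env::args().skip(1)` and `std::env::var("PORT").ok()`.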

Download the prebuilt Windows executable (v1.0.1) and run it:

.\chatbot.exe

Or build from source. Install Rust first:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

git clone https://github.com/MihirMohapatra/chatbot.git
cd chatbot
cp .env.example .env

cargo run

Set your provider in .env:

# Claude
PROVIDER=claude
ANTHROPIC_API_KEY=sk-ant-...

# OpenAI
PROVIDER=openai
OPENAI_API_KEY=sk-...

# Ollama (pull the model first: ollama pull llama3.2)
PROVIDER=ollama
MODEL=llama3.2

# OpenRouter, through the OpenAI-compatible path
PROVIDER=openai
OPENAI_API_KEY=sk-or-...
BASE_URL=https://openrouter.ai/api
MODEL=anthropic/claude-sonnet-4
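A hedged sketch of how settings like these might be resolved into a typed config follows. `Config` and `load_config` are hypothetical names, not the project's config.rs. The only defaults it hard-codes are the documented ones (PROVIDER falls back to claude, MAX_TOKENS to 1024); the per-provider model and endpoint defaults are left as `None` because their values are not stated here. Reading the .env file itself is out of scope, so the lookup is passed in as a closure.

```rust
// Sketch only: `Config` and `load_config` are hypothetical names.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Provider {
    Claude,
    OpenAI,
    Ollama,
}

#[derive(Debug, PartialEq)]
pub struct Config {
    pub provider: Provider,
    pub api_key: Option<String>,
    pub model: Option<String>,         // None = use the provider's default
    pub base_url: Option<String>,      // None = use the provider's default
    pub max_tokens: u32,
    pub system_prompt: Option<String>, // None = use the built-in prompt
}

/// `get` looks up one variable by name, e.g. `|k| std::env::var(k).ok()`.
pub fn load_config(get: impl Fn(&str) -> Option<String>) -> Result<Config, String> {
    let raw_provider = get("PROVIDER").unwrap_or_else(|| "claude".to_string());
    let provider = match raw_provider.as_str() {
        "claude" => Provider::Claude,
        "openai" => Provider::OpenAI,
        "ollama" => Provider::Ollama,
        other => return Err(format!("unknown PROVIDER {other:?}")),
    };

    // Keys are required for the hosted providers only.
    let api_key = match provider {
        Provider::Claude => Some(
            get("ANTHROPIC_API_KEY")
                .ok_or_else(|| "ANTHROPIC_API_KEY is required for Claude".to_string())?,
        ),
        Provider::OpenAI => Some(
            get("OPENAI_API_KEY")
                .ok_or_else(|| "OPENAI_API_KEY is required for OpenAI".to_string())?,
        ),
        Provider::Ollama => None,
    };

    let max_tokens = match get("MAX_TOKENS") {
        Some(raw) => raw
            .parse::<u32>()
            .map_err(|e| format!("invalid MAX_TOKENS {raw:?}: {e}"))?,
        None => 1024,
    };

    Ok(Config {
        provider,
        api_key,
        model: get("MODEL"),
        base_url: get("BASE_URL"),
        max_tokens,
        system_prompt: get("SYSTEM_PROMPT"),
    })
}
```

Failing at startup on a missing key or an unparsable MAX_TOKENS surfaces a bad .env immediately instead of on the first chat request.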

All environment variables:

Variable           Default           Description
PROVIDER           claude            claude, openai, ollama
ANTHROPIC_API_KEY  (none)            Required for Claude
OPENAI_API_KEY     (none)            Required for OpenAI
MODEL              provider default  Override the model name
BASE_URL           provider default  Override the API endpoint
MAX_TOKENS         1024              Response token cap
SYSTEM_PROMPT      built-in          Custom assistant behavior

Or run it with Docker:

docker build -t chatbot .
docker run -it --rm -v "$(pwd)/.env:/data/.env" chatbot

For Ollama with host networking:

docker run -it --rm --network host -v "$(pwd)/.env:/data/.env" chatbot

For backend workloads like this (concurrent HTTP + JSON + state management), Rust generally outperforms Python on the metrics that matter in production. The comparison below is qualitative, not a benchmark:

Metric                Rust              Python             Why It Matters
Throughput (req/sec)  Higher            Lower              More concurrent users per instance
P95/P99 latency       Lower under load  Higher under load  More stable response times
Memory per worker     Lower             Higher             Better infra cost and density
CPU efficiency        Higher            Lower              More headroom before scaling out

Note: For LLM apps, model/API network time dominates total latency. But Rust still wins on concurrency behavior, memory footprint, and server efficiency — which directly impacts cost and reliability at scale.

tracing crate + OpenTelemetry integration

The project is open source and MIT licensed:

👉 github.com/MihirMohapatra/chatbot

If you're exploring Rust for backend systems, or building something that needs to talk to multiple AI providers without writing boilerplate for each one, this is a good starting point. Issues and PRs welcome.

Built with Rust 1.80+, Tokio, Axum, reqwest, serde, and anyhow.
