How I built a 3-provider LLM fallback system in production (and what actually broke)

A pre-final year student built Socra, a multi-agent LLM SaaS that interrogates startup ideas using five specialist AI personas, and deployed it on Railway with a three-provider fallback chain (Anthropic → Google → Groq). The initial single-provider setup broke under real traffic due to Groq's free tier rate limits, causing 429 errors on three out of five parallel agent calls. The fix was to implement a priority-based routing system that checks for API keys in order, defaulting to Google Gemini 2.0 Flash for its 150× higher free-tier headroom, which resolved the production failures.

I'm a pre-final year student. I built Socra (https://socra-production.up.railway.app/), a multi-agent LLM SaaS that interrogates your startup idea using 5 specialist AI personas before generating an architecture masterplan. It has paying users. It runs on Railway. And for the first two weeks of production, it was quietly broken in a way I didn't notice until real users hit it.

This is the story of how I built the 3-provider fallback chain (Anthropic → Google → Groq), what broke along the way, and the actual code that runs in production today.

When I first deployed Socra, the LLM routing was simple: one provider, one model, one API key. It worked fine in development. Then real users started using it.

Groq's free tier is 6,000 tokens per minute. A single Socra masterplan pipeline (5 specialist agents running in parallel, each with ~1,500 input tokens and ~400 output tokens) consumes roughly 9,500 tokens in one burst. The math: 5 × ~1,900 ≈ 9,500 tokens requested against a 6,000 TPM limit. 3 out of 5 agents were returning `Error code: 429` on every session with any real traffic.

The app was showing agent cards to users. Some said "Error" in amber text. I thought it was a race condition. It wasn't. It was me naively assuming one free-tier API could handle a multi-agent pipeline.

The fix wasn't to optimize; it was to add redundancy. The final production routing order:

1. Anthropic Claude Haiku, if `ANTHROPIC_API_KEY` is set
2. Google Gemini 2.0 Flash, if `GOOGLE_API_KEY` is set ← production default
3. Groq LLaMA 3.1 8B, if `GROQ_API_KEY` is set ← fallback
4. Stub mode: demo scenarios, no API key needed

Why this order? Cost and rate limits, not model quality:

| Provider | Model | Input $/MTok | Output $/MTok | Free tier TPM |
|---|---|---|---|---|
| Anthropic | claude-haiku-4-5 | $0.80 | $4.00 | None |
| Google | gemini-2.0-flash | $0.075 | $0.30 | 1,000,000 |
| Groq | llama-3.1-8b-instant | $0.06 | $0.06 | 6,000 |

Google's free tier offers more than 150× the headroom of Groq's for a pipeline that fires 5 LLM calls simultaneously.
For a student-built SaaS where LLM cost needs to be near zero while testing, that's not a small difference: it's the difference between the app working and not working.

Every LLM call in the system goes through one of two entrypoints: `call_llm` (non-streaming, for structured JSON) and `stream_llm_tokens` (streaming, for conversation text). Both use the same routing logic:

```python
# backend/llm_client.py

async def call_llm(system: str, messages: list[dict], max_tokens: int,
                   json_mode: bool = False) -> str:
    # First configured key wins: Anthropic, then Google, then Groq, then stub.
    if settings.anthropic_api_key:
        return await call_anthropic(system, messages, max_tokens, json_mode)
    elif settings.google_api_key:
        return await call_google(system, messages, max_tokens, json_mode)
    elif settings.groq_api_key:
        return await call_groq(system, messages, max_tokens, json_mode)
    else:
        return stub_response(messages)
```

Dead simple. The routing is just: which key is set? The first match wins.

Google AI Studio exposes an OpenAI-compatible endpoint. This means you don't need the Google SDK; just point the OpenAI SDK at a different base URL:

```python
async def call_google(system: str, messages: list[dict], max_tokens: int,
                      json_mode: bool = False) -> str:
    from openai import AsyncOpenAI

    client = AsyncOpenAI(
        api_key=settings.google_api_key,
        base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
    )
    kwargs = {
        "model": "gemini-2.0-flash",
        "max_tokens": max_tokens,
        # Prepend the system prompt to the conversation history.
        "messages": [{"role": "system", "content": system}, *messages],
    }
    if json_mode:
        kwargs["response_format"] = {"type": "json_object"}
    response = await client.chat.completions.create(**kwargs)
    return response.choices[0].message.content or ""
```

The same pattern works for streaming: pass `stream=True` and iterate with `async for chunk in stream`.

This is a pattern worth knowing: Groq, Azure OpenAI, and Google AI Studio all support the OpenAI-compatible endpoint format. If you write against the OpenAI SDK with a configurable `base_url` and `api_key`, you get multi-provider support with almost no extra code.

Here's where it got messy.
After the multi-agent pipeline runs and generates a masterplan, Socra needs structured JSON back from the LLM: eval scores, assumption tracking, quick reply choices. The original approach was a separator in the stream:

```
Stream: "Here are my questions... JSON {"eval_delta": {...}, "choices": [...]}"
```

This worked fine with Anthropic (Claude follows formatting instructions reliably). It broke completely with smaller models. The 8B Groq model would occasionally include the separator, occasionally not, occasionally put it in the middle of a sentence. Parsing failed silently and `choices` came back empty, so users saw no quick reply options after the first message.

The fix: two separate calls.

```python
# Call 1: Stream plain text, no format requirements
async for token in stream_llm_tokens(system, messages):
    yield token
    full_message += token

# Call 2: After streaming ends, get structured data separately
eval_data = await call_llm(
    system=eval_system_prompt,
    messages=messages + [{"role": "assistant", "content": full_message}],
    json_mode=True,
)
```

The Anthropic path still uses the separator (it's reliable there and saves one API call). The Groq and Google paths use two calls. A bit more latency, zero parsing failures.

This one cost me 45 minutes. After deploying to Railway, every LLM call was failing with `Illegal header value`. The API key was correct; I'd copied it straight from the Groq console. Except I hadn't. I'd pasted it into Railway's Variables tab and there was an invisible `\n` at the end.

The fix was two things: remove the stray newline from the Railway variable, and `.strip()` defensively in `config.py`:

```python
class Settings(BaseSettings):
    groq_api_key: str = ""
    anthropic_api_key: str = ""
    google_api_key: str = ""

    # pre=True runs before type validation; @validator is the pydantic v1-style
    # decorator (deprecated in v2 in favor of field_validator).
    @validator('groq_api_key', 'anthropic_api_key', 'google_api_key', pre=True)
    def strip_keys(cls, v):
        return v.strip() if v else v
```

Now the app is defensive against copy-paste mistakes. The `.strip()` costs nothing and prevents a class of errors that are genuinely hard to debug.
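
As a small, network-free illustration of the trailing-newline problem above, the sketch below checks a pasted key for the characters that HTTP header values cannot contain (CR and LF) and shows that stripping removes them. The helper names `clean_api_key` and `has_unsafe_header_chars` and the key value are hypothetical, not part of Socra's code.

```python
from typing import Optional


def clean_api_key(raw: Optional[str]) -> str:
    """Return the key without surrounding whitespace; empty string if unset."""
    return (raw or "").strip()


def has_unsafe_header_chars(value: str) -> bool:
    """True if the value contains CR or LF, which are not allowed in header values."""
    return "\r" in value or "\n" in value


# Hypothetical key pasted with an invisible trailing newline.
pasted_key = "gsk_example_key\n"

if has_unsafe_header_chars(pasted_key):
    print("Raw key would produce an invalid Authorization header")

safe_key = clean_api_key(pasted_key)
print(repr(safe_key))
```

A check like this could also run once at startup to log a warning when a configured key needed stripping, which would point at the environment variable rather than at the provider.
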

After adding Google as the second provider, I pushed to Railway and checked the logs. They said:

```
Using Groq LLaMA for LLM calls
```

But I'd set `GOOGLE_API_KEY`. For two days I thought Google wasn't working. It was. The startup log was wrong.

The `main.py` lifespan check had a bug:

```python
# Before: skipped Google entirely
if settings.anthropic_api_key:
    logger.info("Using Anthropic Claude")
elif settings.groq_api_key:  # ← checked Groq before Google
    logger.info("Using Groq LLaMA")
```

The actual routing in `call_llm` was correct (Google checked second, before Groq). But the log check had a different order, so if Groq was also set (it was), it logged "Using Groq" even though every actual call was going to Google.

Fix: mirror the routing logic exactly in the startup log.

Running 5 parallel specialist agents against Groq's 6k TPM free tier: the math never worked and I was pretending it did.

Each agent gets ~1,500 input tokens + generates ~400 output tokens = ~1,900 tokens per call. 5 parallel calls = 9,500 tokens launched simultaneously. Groq's rate limiter sees all 9,500 in the same minute window and rejects the overflow.

Three approaches I tried, in order:

Approach 1: Retry with backoff. Added 3-attempt retry with 4s/8s exponential backoff on 429 errors. Helped slightly. Didn't fix the underlying math.

Approach 2: Sequential execution with delays. Switched from `asyncio.gather` to sequential calls with 1.5s gaps between agents. This spread the token burst across multiple rate-limit windows. Worked on Groq, but added ~7.5s to the masterplan pipeline, which is noticeable.

Approach 3: Switch to Google. Google's free tier is 1,000,000 TPM. Problem disappeared entirely. Now Groq is the fallback, not the primary.

The real lesson: design for the rate limits of your fallback providers, not just your primary. Groq is fast and cheap but not meant for parallel multi-agent workloads on the free tier.
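
To make the burst arithmetic and Approach 1 concrete, here is a minimal sketch under stated assumptions. `RateLimited` is a hypothetical stand-in for a provider's 429 error, and `burst_tokens`, `max_parallel_calls`, `backoff_delays`, and `retry_on_rate_limit` are illustrative names rather than Socra's actual code. The token sum is a rough estimate; a real rate limiter may count tokens differently.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def burst_tokens(agents: int, input_tokens: int, output_tokens: int) -> int:
    """Tokens requested when all agents fire in the same window."""
    return agents * (input_tokens + output_tokens)


def max_parallel_calls(tpm_limit: int, tokens_per_call: int) -> int:
    """How many calls fit in one per-minute token budget (simple estimate)."""
    return tpm_limit // tokens_per_call


def backoff_delays(attempts: int, base: float = 4.0) -> List[float]:
    """Waits between attempts: base, 2*base, ... (attempts - 1 entries)."""
    return [base * (2 ** i) for i in range(attempts - 1)]


async def retry_on_rate_limit(
    call: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base: float = 4.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Retry `call` on RateLimited, waiting 4s then 8s by default."""
    delays = backoff_delays(attempts, base)
    for attempt in range(attempts):
        try:
            return await call()
        except RateLimited:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            await sleep(delays[attempt])
    raise RuntimeError("unreachable")  # loop always returns or raises


# Figures from the text: 5 agents, ~1,500 in + ~400 out, 6,000 TPM free tier.
print(burst_tokens(5, 1500, 400))      # 9500
print(max_parallel_calls(6000, 1900))  # 3
```

Running the two estimate functions before wiring up `asyncio.gather` is the quick check that shows retries alone cannot help here: the burst exceeds the per-minute budget no matter how the individual calls are retried within that window.
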

After switching to Google as the production default, I did a full token and cost breakdown per session:

| Stage | Input tokens | Output tokens |
|---|---|---|
| Conversation (7 turns avg) | ~16,700 | ~3,500 |
| 5 specialist agents | ~24,000 | ~3,500 |
| Synthesis | ~12,700 | ~2,500 |
| Devil's advocate | ~2,800 | ~600 |
| Total per session | ~56,200 | ~10,100 |

At Google Gemini Flash pricing ($0.075 input / $0.30 output per million tokens):

```
Input cost:  56,200 / 1,000,000 × $0.075 = $0.0042
Output cost: 10,100 / 1,000,000 × $0.30  = $0.0030
Total: ~$0.007 per session
```

Socra charges ₹499 (~$6) for a full masterplan session. LLM cost per session: $0.007. That's 99.8% gross margin on the LLM cost alone. Railway hosting is ~$30/month fixed. Break-even is roughly 6 paid sessions per month.

This math only works because of the provider choice. The same session on Anthropic Haiku costs ~$0.085, which is 12× more expensive and would put margins at ~98.6%. Still fine, but the point is: provider selection is a product decision, not just a technical one.

1. Design for multi-provider from day one. I added the fallback chain in Phase 3 after production broke. It should have been in the architecture from the start. The routing abstraction (`call_llm` with provider detection) is simple enough to add in 30 minutes, so there's no reason to start with a single provider.
2. Test the rate limit math before deploying parallel calls. 5 parallel agents × 1,900 tokens = 9,500 tokens in one burst. Groq's free tier is 6,000 TPM. This is elementary arithmetic that I didn't do until users were getting errors.
3. Strip API keys at the config layer. `.strip()` in your settings class is a 5-minute change that eliminates an entire class of deployment bugs.
4. Make your startup log mirror your routing logic exactly. A log that says "Using Groq" when you're actually using Google is worse than no log: it actively misleads debugging.
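
One way to make the last lesson hard to violate is to keep the priority order in a single place and derive both the routing decision and the startup log from it. The sketch below is an assumption about how that could look, not Socra's actual code: the `Settings` dataclass, `PROVIDER_PRIORITY`, `select_provider`, and `startup_message` are all hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Settings:
    """Hypothetical stand-in for the app's settings object."""
    anthropic_api_key: str = ""
    google_api_key: str = ""
    groq_api_key: str = ""


# The priority order is defined exactly once: (provider name, log label).
PROVIDER_PRIORITY = [
    ("anthropic", "Anthropic Claude"),
    ("google", "Google Gemini"),
    ("groq", "Groq LLaMA"),
]


def select_provider(settings: Settings) -> str:
    """Return the first provider whose key is set, or 'stub' if none is."""
    for name, _label in PROVIDER_PRIORITY:
        if getattr(settings, f"{name}_api_key"):
            return name
    return "stub"


def startup_message(settings: Settings) -> str:
    """Build the startup log line from the same selection the router uses."""
    labels = dict(PROVIDER_PRIORITY)
    provider = select_provider(settings)
    return f"Using {labels.get(provider, 'stub mode')} for LLM calls"


# Both Google and Groq keys set, as in the incident described above.
settings = Settings(google_api_key="g-key", groq_api_key="q-key")
print(select_provider(settings))
print(startup_message(settings))
```

With this shape, the router would dispatch on `select_provider(settings)` and the lifespan hook would log `startup_message(settings)`, so the two cannot drift apart when a provider is added or reordered.
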

Socra is built on: FastAPI + React + PostgreSQL + Railway + LangGraph (for the multi-agent pipeline) + Langfuse v4 (for per-call LLM observability) + Clerk (auth) + Razorpay (payments). The LLM fallback chain described here handles all LLM calls across the entire system: conversation, agents, synthesis, pitch deck generation, and the tribunal verdict scoring.

The live app is at https://socra-production.up.railway.app. The approach described here (OpenAI-compatible endpoints, two-call structured output, provider detection at the config layer) is all running in production today.

I'm a pre-final year student at HBTU Kanpur building production ML systems. If you're working on something similar or have questions about the multi-agent architecture, I'm on LinkedIn and GitHub.