{"slug": "how-i-built-a-3-provider-llm-fallback-system-in-production-and-what-actually", "title": "How I built a 3-provider LLM fallback system in production (and what actually broke)", "summary": "A pre-final year student built Socra, a multi-agent LLM SaaS that interrogates startup ideas using five specialist AI personas, and deployed it on Railway with a three-provider fallback chain (Anthropic → Google → Groq). The initial single-provider setup broke under real traffic due to Groq's free tier rate limits, causing 429 errors on three out of five parallel agent calls. The fix was to implement a priority-based routing system that checks for API keys in order, defaulting to Google Gemini 2.0 Flash for its 150× higher free-tier headroom, which resolved the production failures.", "body_md": "I'm a pre-final year student. I built Socra ([https://socra-production.up.railway.app/](https://socra-production.up.railway.app/)) — a multi-agent LLM SaaS that interrogates your startup idea using 5 specialist AI personas before generating an architecture masterplan. It has paying users. It runs on Railway. And for the first two weeks of production, it was quietly broken in a way I didn't notice until real users hit it.\n\nThis is the story of how I built the 3-provider fallback chain (Anthropic → Google → Groq), what broke along the way, and the actual code that runs in production today.\n\nWhen I first deployed Socra, the LLM routing was simple: one provider, one model, one API key. It worked fine in development.\n\nThen real users started using it.\n\nGroq's free tier is **6,000 tokens per minute**. A single Socra masterplan pipeline — 5 specialist agents running in parallel, each with ~1,500 input tokens — consumes roughly 9,500 tokens in one burst. The math: 3 out of 5 agents were returning `Error code: 429` on every session with any real traffic.\n\nThe app was showing agent cards to users. Some said \"Error\" in amber text. I thought it was a race condition. It wasn't. 
It was me naively assuming one free-tier API could handle a multi-agent pipeline.\n\nThe fix wasn't to optimize — it was to add redundancy.\n\nThe final production routing order:\n\n```\n1. Anthropic Claude Haiku   — if ANTHROPIC_API_KEY is set\n2. Google Gemini 2.0 Flash  — if GOOGLE_API_KEY is set  ← production default\n3. Groq LLaMA 3.1 8B        — if GROQ_API_KEY is set    ← fallback\n4. Stub mode                — demo scenarios, no API key needed\n```\n\nWhy this order? Cost and rate limits, not model quality:\n\n| Provider | Model | Input $/MTok | Output $/MTok | Free tier TPM |\n|---|---|---|---|---|\n| Anthropic | claude-haiku-4-5 | $0.80 | $4.00 | None |\n| Google | gemini-2.0-flash | $0.075 | $0.30 | 1,000,000 |\n| Groq | llama-3.1-8b-instant | $0.06 | $0.06 | 6,000 |\n\nGoogle's free tier has **over 150× more headroom than Groq's** for a pipeline that fires 5 LLM calls simultaneously. For a student-built SaaS where LLM cost needs to be near zero while testing, that's not a small difference — it's the difference between the app working and not working.\n\nEvery LLM call in the system goes through one of two entrypoints: `_call_llm` (non-streaming, for structured JSON) and `_stream_llm_tokens` (streaming, for conversation text). Both use the same routing logic:\n\n``` python\n# backend/llm_client.py\n\nasync def _call_llm(system: str, messages: list[dict], max_tokens: int, json_mode: bool = False) -> str:\n    if settings.anthropic_api_key:\n        return await _call_anthropic(system, messages, max_tokens, json_mode)\n    elif settings.google_api_key:\n        return await _call_google(system, messages, max_tokens, json_mode)\n    elif settings.groq_api_key:\n        return await _call_groq(system, messages, max_tokens, json_mode)\n    else:\n        return _stub_response(messages)\n```\n\nDead simple. The routing is just: which key is set? The first match wins.\n\nGoogle AI Studio exposes an OpenAI-compatible endpoint. 
This means you don't need the Google SDK — just point the OpenAI SDK at a different base URL:\n\n``` python\nasync def _call_google(system: str, messages: list[dict], max_tokens: int, json_mode: bool = False) -> str:\n    from openai import AsyncOpenAI\n    client = AsyncOpenAI(\n        api_key=settings.google_api_key,\n        base_url=\"https://generativelanguage.googleapis.com/v1beta/openai/\",\n    )\n    kwargs = {\n        \"model\": \"gemini-2.0-flash\",\n        \"max_tokens\": max_tokens,\n        \"messages\": [{\"role\": \"system\", \"content\": system}, *messages],\n    }\n    if json_mode:\n        kwargs[\"response_format\"] = {\"type\": \"json_object\"}\n    response = await client.chat.completions.create(**kwargs)\n    return response.choices[0].message.content or \"\"\n```\n\nSame pattern works for streaming — just use `stream=True` and iterate `async for chunk in stream`.\n\nThis is a pattern worth knowing: Groq, Azure OpenAI, and Google AI Studio all support the OpenAI-compatible endpoint format. If you write against the OpenAI SDK with configurable `base_url` and `api_key`, you get multi-provider support with almost no extra code.\n\nHere's where it got messy. After the multi-agent pipeline runs and generates a masterplan, Socra needs structured JSON back from the LLM — eval scores, assumption tracking, quick reply choices. The original approach was a separator in the stream:\n\n```\nStream: \"Here are my questions... ###JSON###{\"eval_delta\": {...}, \"choices\": [...]}\"\n```\n\nThis worked fine with Anthropic (Claude follows formatting instructions reliably). It broke completely with smaller models.\n\nThe 8B Groq model would occasionally include the separator, occasionally not, occasionally put it in the middle of a sentence. 
Parsing failed silently and `choices` came back empty — users saw no quick reply options after the first message.\n\n**The fix: two separate calls.**\n\n```\n# Call 1: Stream plain text, no format requirements\nfull_message = \"\"\nasync for token in _stream_llm_tokens(system, messages):\n    yield token\n    full_message += token\n\n# Call 2: After streaming ends, get structured data separately\neval_data = await _call_llm(\n    system=eval_system_prompt,\n    messages=messages + [{\"role\": \"assistant\", \"content\": full_message}],\n    max_tokens=1024,  # illustrative value; _call_llm requires max_tokens\n    json_mode=True\n)\n```\n\nThe Anthropic path still uses the separator (it's reliable there and saves one API call). The Groq and Google paths use two calls. A bit more latency, zero parsing failures.\n\nThis one cost me 45 minutes.\n\nAfter deploying to Railway, every LLM call was failing with `Illegal header value`. The API key was correct — I'd copied it straight from the Groq console. Except I hadn't. I'd pasted it into Railway's Variables tab and there was an invisible `\\n` at the end.\n\nThe fix was two things:\n\nApply `.strip()` defensively in `config.py`:\n\n```\nclass Settings(BaseSettings):\n    groq_api_key: str = \"\"\n    anthropic_api_key: str = \"\"\n    google_api_key: str = \"\"\n\n    @validator('groq_api_key', 'anthropic_api_key', 'google_api_key', pre=True)\n    def strip_keys(cls, v):\n        return v.strip() if v else v\n```\n\nNow the app is defensive against copy-paste mistakes. The `.strip()` costs nothing and prevents a class of errors that are genuinely hard to debug.\n\nAfter adding Google as the second provider, I pushed to Railway and checked the logs. They said:\n\n```\nUsing Groq LLaMA for LLM calls\n```\n\nBut I'd set `GOOGLE_API_KEY`. For two days I thought Google wasn't working. It was. 
The startup log was wrong.\n\nThe `main.py` lifespan check had a bug:\n\n```\n# Before — skipped Google entirely\nif settings.anthropic_api_key:\n    logger.info(\"Using Anthropic Claude\")\nelif settings.groq_api_key:         # ← checked Groq before Google\n    logger.info(\"Using Groq LLaMA\")\n```\n\nThe actual routing in `_call_llm` was correct (Google checked second, before Groq). But the log check had a different order — so if Groq was also set (it was), it logged \"Using Groq\" even though every actual call was going to Google.\n\nFix: mirror the routing logic exactly in the startup log.\n\nRunning 5 parallel specialist agents against Groq's 6k TPM free tier: the math never worked and I was pretending it did.\n\nEach agent gets ~1,500 input tokens + generates ~400 output tokens = ~1,900 tokens per call. 5 parallel calls = 9,500 tokens launched simultaneously. Groq's rate limiter sees all 9,500 in the same minute window and rejects the overflow.\n\nThree approaches I tried, in order:\n\n**Approach 1: Retry with backoff.** Added 3-attempt retry with 4s/8s exponential backoff on 429 errors. Helped slightly. Didn't fix the underlying math.\n\n**Approach 2: Sequential execution with delays.** Switched from `asyncio.gather()` to sequential calls with 1.5s gaps between agents. This spread the token burst across multiple rate-limit windows. Worked on Groq, but added ~7.5s to the masterplan pipeline — noticeable.\n\n**Approach 3: Switch to Google.** Google's free tier is 1,000,000 TPM. Problem disappeared entirely. Now Groq is the fallback, not the primary.\n\nThe real lesson: design for the rate limits of your fallback providers, not just your primary. 
Groq is fast and cheap but not meant for parallel multi-agent workloads on the free tier.\n\nAfter switching to Google as the production default, I did a full token and cost breakdown per session:\n\n| Stage | Input tokens | Output tokens |\n|---|---|---|\n| Conversation (7 turns avg) | ~16,700 | ~3,500 |\n| 5 specialist agents | ~24,000 | ~3,500 |\n| Synthesis | ~12,700 | ~2,500 |\n| Devil's advocate | ~2,800 | ~600 |\n| **Total per session** | **~56,200** | **~10,100** |\n\nAt Google Gemini Flash pricing ($0.075 input / $0.30 output per million tokens):\n\n```\nInput cost:  56,200 / 1,000,000 × $0.075 = $0.0042\nOutput cost: 10,100 / 1,000,000 × $0.30  = $0.0030\nTotal:       ~$0.007 per session\n```\n\nSocra charges ₹499 (~$6) for a full masterplan session. LLM cost per session: **$0.007**. That's **about 99.9% gross margin on the LLM cost alone**.\n\nRailway hosting is ~$30/month fixed. Break-even is roughly 6 paid sessions per month.\n\nThis math only works because of the provider choice. The same session on Anthropic Haiku costs ~$0.085 — 12× more expensive, which would put margins at ~98.6%. Still fine, but the point is: provider selection is a product decision, not just a technical one.\n\n**1. Design for multi-provider from day one.** I added the fallback chain in Phase 3 after production broke. It should have been in the architecture from the start. The routing abstraction (`_call_llm` with provider detection) is simple enough to add in 30 minutes — there's no reason to start with a single provider.\n\n**2. Test the rate limit math before deploying parallel calls.** 5 parallel agents × 1,900 tokens = 9,500 tokens in one burst. Groq's free tier is 6,000 TPM. This is elementary arithmetic that I didn't do until users were getting errors.\n\n**3. Strip API keys at the config layer.** `.strip()` in your settings class is a 5-minute change that eliminates an entire class of deployment bugs.\n\n**4. 
Make your startup log mirror your routing logic exactly.** A log that says \"Using Groq\" when you're actually using Google is worse than no log — it actively misleads debugging.\n\nSocra is built on: FastAPI + React + PostgreSQL + Railway + LangGraph (for the multi-agent pipeline) + Langfuse v4 (for per-call LLM observability) + Clerk (auth) + Razorpay (payments). The LLM fallback chain described here handles all LLM calls across the entire system — conversation, agents, synthesis, pitch deck generation, and the tribunal verdict scoring.\n\nThe live app is at [socra-production.up.railway.app](https://socra-production.up.railway.app). The approach described here — OpenAI-compatible endpoints, two-call structured output, provider detection at the config layer — is all running in production today.\n\n*I'm a pre-final year student at HBTU Kanpur building production ML systems. If you're working on something similar or have questions about the multi-agent architecture, I'm on LinkedIn and GitHub.*", "url": "https://wpnews.pro/news/how-i-built-a-3-provider-llm-fallback-system-in-production-and-what-actually", "canonical_source": "https://dev.to/ayush_notsogreat_b673d5/how-i-built-a-3-provider-llm-fallback-system-in-production-and-what-actually-broke-46jk", "published_at": "2026-06-17 21:06:14+00:00", "updated_at": "2026-06-17 21:21:17.395171+00:00", "lang": "en", "topics": ["large-language-models", "ai-infrastructure", "developer-tools", "ai-products", "ai-startups"], "entities": ["Socra", "Anthropic", "Google", "Groq", "Railway", "Claude Haiku", "Gemini 2.0 Flash", "LLaMA 3.1 8B"], "alternates": {"html": "https://wpnews.pro/news/how-i-built-a-3-provider-llm-fallback-system-in-production-and-what-actually", "markdown": "https://wpnews.pro/news/how-i-built-a-3-provider-llm-fallback-system-in-production-and-what-actually.md", "text": "https://wpnews.pro/news/how-i-built-a-3-provider-llm-fallback-system-in-production-and-what-actually.txt", "jsonld": 
"https://wpnews.pro/news/how-i-built-a-3-provider-llm-fallback-system-in-production-and-what-actually.jsonld"}}